Commit dba02a39 authored by Eric Larsen, committed by Frederic


Correct Theano's tutorial: add cross-references, correct some more typos, improve style a little, correct the logical structure some more
Parent c86c72f4
......@@ -6,15 +6,15 @@ Extending Theano
================
This advanced tutorial is for users who want to extend Theano with new Types, new
Operations (Ops), and new graph optimizations.
Along the way, it also introduces many aspects of how Theano works, so it is
also good for you if you are interested in getting more under the hood with
Theano itself.
Before tackling this more advanced presentation, it is highly recommended to read the
introductory :ref:`Tutorial<tutorial>`.
The first few pages will walk you through the definition of a new :ref:`type`,
``double``, and basic arithmetic :ref:`operations <op>` on that Type. We
......
.. _libdoc_compile_mode:
======================================
:mod:`mode` -- controlling compilation
======================================
......
.. currentmodule:: tensor
.. _libdoc_basic_tensor:
===========================
Basic Tensor Functionality
===========================
......
......@@ -7,7 +7,7 @@ Baby Steps - Algebra
Adding two Scalars
==================
To get us started with Theano and get a feel of what we're working with,
let's make a simple function: add two numbers together. Here is how you do
it:
......@@ -38,7 +38,7 @@ to add. Note that from now on, we will use the term
If you are following along and typing into an interpreter, you may have
noticed that there was a slight delay in executing the ``function``
instruction. Behind the scenes, ``f`` was being compiled into C code.
.. note::
......@@ -80,13 +80,14 @@ TensorType(float64, scalar)
>>> x.type is T.dscalar
True
By calling ``T.dscalar`` with a string argument, you create a
*Variable* representing a floating-point scalar quantity with the
given name. If you provide no argument, the symbol will be unnamed. Names
are not required, but they can help debugging.
More will be said in a moment regarding Theano's inner structure. You
could also learn more by looking into :ref:`graphstructures`.
**Step 2**
......@@ -112,9 +113,8 @@ and giving ``z`` as output:
The first argument to :func:`function <function.function>` is a list of Variables
that will be provided as inputs to the function. The second argument
is a single Variable *or* a list of Variables. For either case, the second
argument is what we want to see as output when we apply the function. ``f`` may
then be used like a normal Python function.
Adding two Matrices
......@@ -132,14 +132,14 @@ from the previous example is that you need to instantiate ``x`` and
>>> z = x + y
>>> f = function([x, y], z)
``dmatrix`` is the Type for matrices of doubles. Then we can use
our new function on 2D arrays:
>>> f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
array([[ 11., 22.],
[ 33., 44.]])
The variable is a NumPy array. We can also use NumPy arrays directly as
inputs:
>>> import numpy
......@@ -160,8 +160,8 @@ The following types are available:
* **double**: dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
* **complex**: cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
The previous list is not exhaustive and a guide to all types compatible
with NumPy arrays may be found here: :ref:`tensor creation<libdoc_tensor_creation>`.
.. note::
......
......@@ -84,7 +84,7 @@ subsequently make to ``np_array`` have no effect on our shared variable.
If we are running this with the CPU as the device,
then changes we make to ``np_array`` will show up *right away* in
``s_true.get_value()``
because NumPy arrays are mutable, and ``s_true`` is using the ``np_array``
object as its internal buffer.
However, this aliasing of ``np_array`` and ``s_true`` is not guaranteed to occur,
......@@ -137,15 +137,15 @@ But both of these calls might create copies of the internal memory.
The reason that ``borrow=True`` might still make a copy is that the internal
representation of a shared variable might not be what you expect. When you
create a shared variable by passing a NumPy array for example, then ``get_value()``
must return a NumPy array too. That's how Theano can make the GPU use
transparent. But when you are using a GPU (or in the future perhaps a remote machine), then the numpy.ndarray
is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never cast the internal object (it always returns in
constant time), but might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the NumPy array).
.. code-block:: python
......
......@@ -11,9 +11,9 @@ IfElse vs Switch
- Both Ops build a condition over symbolic variables.
- ``IfElse`` takes a `boolean` condition and two variables as inputs.
- ``Switch`` takes a `tensor` as condition and two variables as inputs.
``switch`` is an elementwise operation and is thus more general than ``ifelse``.
- Whereas ``switch`` evaluates both 'output' variables, ``ifelse`` is lazy and only
evaluates one variable with respect to the condition.
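The difference between the two evaluation strategies can be sketched in plain Python (an illustration only; ``switch_like`` and ``ifelse_like`` are hypothetical stand-ins, not Theano's API):

```python
def switch_like(cond, a, b):
    # Elementwise and eager: both input sequences are fully materialized.
    return [x if c else y for c, x, y in zip(cond, a, b)]

def ifelse_like(cond, then_thunk, else_thunk):
    # Lazy: only the branch selected by the scalar condition is evaluated.
    return then_thunk() if cond else else_thunk()

print(switch_like([True, False, True], [1, 2, 3], [10, 20, 30]))  # [1, 20, 3]
print(ifelse_like(False, lambda: [1, 2, 3], lambda: [10, 20, 30]))  # [10, 20, 30]
```

The laziness of ``ifelse_like`` is what saves the cost of the unused branch in the timing example below.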
**Example**
......@@ -52,8 +52,8 @@ IfElse vs Switch
f_lazyifelse(val1, val2, big_mat1, big_mat2)
print 'time spent evaluating one value %f sec'%(time.clock()-tic)
In this example, the ``IfElse`` Op spends about half as much time as ``Switch``,
since it computes only one variable out of the two.
.. code-block:: python
......@@ -62,10 +62,10 @@ since it computes only one variable instead of both.
time spent evaluating one value 0.3500 sec
Unless ``linker='vm'`` or ``linker='cvm'`` are used, ``ifelse`` will compute both
variables and take the same computation time as ``switch``. Although the linker
is not currently set by default to 'cvm', it will be in the near future.
There is no automatic optimization replacing a ``switch`` with a
broadcasted scalar to an ``ifelse``, as this is not always faster. See
this `ticket <http://www.assembla.com/spaces/theano/tickets/764>`_.
......@@ -22,7 +22,7 @@ Using Test Values
-----------------
As of v.0.4.0, Theano has a new mechanism by which graphs are executed
on-the-fly, before a ``theano.function`` is ever compiled. Since optimizations
haven't been applied at this stage, it is easier for the user to locate the
source of some bug. This functionality is enabled through the config flag
``theano.config.compute_test_value``. Its use is best shown through the
......@@ -131,12 +131,12 @@ The compute_test_value mechanism works as follows:
`compute_test_value` can take the following values:
* ``off``: Default behavior. This debugging mechanism is inactive.
* ``raise``: Compute test values on the fly. Any variable for which a test
value is required, but not provided by the user, is treated as an error. An
exception is raised accordingly.
* ``warn``: Idem, but a warning is issued instead of an Exception.
* ``ignore``: Silently ignore the computation of intermediate test values, if a
variable is missing a test value.
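For instance, assuming the standard ``.theanorc`` configuration file format, the mechanism could be switched on like this (it may equally be set from Python with ``theano.config.compute_test_value = 'raise'``):

```ini
[global]
compute_test_value = raise
```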
.. note::
......@@ -181,6 +181,8 @@ precise inspection of what's being computed where, when, and how, see the
How do I Print a Graph (before or after compilation)?
----------------------------------------------------------
.. TODO: dead links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
......@@ -203,8 +205,14 @@ Apply nodes, and which Ops are eating up your CPU cycles.
Tips:
* Use the flag ``floatX=float32`` to use *float32* instead of *float64*
  for the Theano types matrix(), vector(), ... (if you used dmatrix() or
  dvector(), they stay at *float64*).
* Check in the profile that there is no Dot operation when you are
  multiplying two matrices of the same type: Dot should be optimized to
  dot22 when the inputs are matrices of the same type. The optimization
  can fail when using floatX=float32 if something in the graph makes one
  of the inputs *float64*.
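For example, the flag can be passed through the ``THEANO_FLAGS`` environment variable when launching the profiling run (``my_profiled_script.py`` is a hypothetical script name):

```shell
THEANO_FLAGS=floatX=float32 python my_profiled_script.py
```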
.. _faq_wraplinker:
......@@ -239,7 +247,7 @@ along with its position in the graph, the arguments to the ``perform`` or
Admittedly, this may be a huge amount of
output to read through if you are using big tensors... but you can choose to
put logic inside of the *print_eval* function that would, for example, only
print something out if a certain kind of Op was used, at a certain program
position, or if a particular value shows up in one of the inputs or outputs.
Use your imagination :)
......@@ -247,7 +255,7 @@ Use your imagination :)
.. TODO: documentation for link.WrapLinkerMany
This can be a really powerful debugging tool.
Note the call to *fn* inside the call to *print_eval*; without it, the graph wouldn't get computed at all!
How to Use pdb?
----------------
......@@ -296,7 +304,7 @@ of the error. There's the script where the compiled function was called --
but if you're using (improperly parameterized) prebuilt modules, the error
might originate from ops in these modules, not this script. The last line
tells us about the Op that caused the exception. In this case it's a "mul"
involving variables with names "a" and "b". But suppose we instead had an
intermediate result to which we hadn't given a name.
After learning a few things about the graph structure in Theano, we can use
......@@ -329,7 +337,7 @@ explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it
abstractly). To get information about the actual parameters, you explore the
"thunks" objects, which bind the storage for the inputs (and outputs) with
"thunk" objects, which bind the storage for the inputs (and outputs) with
the function itself (a "thunk" is a concept related to closures). Here, to
get the current node's first input's shape, you'd therefore do "p
thunk.inputs[0][0].shape", which prints out "(3, 4)".
......
......@@ -5,6 +5,14 @@
More Examples
=============
At this point it would be wise to begin familiarizing yourself
more systematically with Theano's fundamental objects and operations by browsing
this section of the library: :ref:`libdoc_basic_tensor`.
As the tutorial unfolds, you should also gradually acquaint yourself with the other
relevant areas of the library and with the relevant subjects of the documentation
entry page.
Logistic Function
=================
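Before building the symbolic version, the logistic function :math:`s(x) = \frac{1}{1 + e^{-x}}` discussed in this section can be checked numerically in plain Python (no Theano required):

```python
import math

def logistic(x):
    # s(x) = 1 / (1 + exp(-x)): maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))  # 0.5
# symmetry check: s(-x) = 1 - s(x)
print(abs(logistic(-2.0) - (1.0 - logistic(2.0))) < 1e-9)  # True
```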
......@@ -82,7 +90,7 @@ squared difference between two matrices ``a`` and ``b`` at the same time:
shortcut for allocating symbolic variables that we will often use in the
tutorials.
When we use the function ``f``, it returns the three variables (the printing
was reformatted for readability):
>>> f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
......@@ -119,7 +127,7 @@ give a default value of 1 for ``y`` by creating a ``Param`` instance with
its ``default`` field set to 1.
Inputs with default values must follow inputs without default
values (like Python's functions). There can be multiple inputs with default values. These parameters can
be set positionally or by name, as in standard Python:
......@@ -146,10 +154,13 @@ array(33.0)
attributes (set by ``dscalars`` in the example above) and *these* are the
names of the keyword parameters in the functions that we build. This is
the mechanism at work in ``Param(y, default=1)``. In the case of ``Param(w,
default=2, name='w_by_name')``, we override the symbolic variable's name
attribute with a name to be used for this function.
You may like to see :ref:`Function<usingfunction>` in the library for more detail.
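The behavior is analogous to Python's own default parameter values; a plain-Python analogue of ``Param(y, default=1)`` and ``Param(w, default=2, name='w_by_name')`` (illustrative only, not Theano code) would be:

```python
def f(x, y=1, w_by_name=2):
    # y and w_by_name play the roles of Param(y, default=1) and
    # Param(w, default=2, name='w_by_name') in the Theano example
    return (x + y) * w_by_name

print(f(33))               # 68 -- both defaults applied: (33 + 1) * 2
print(f(33, 2))            # 70 -- positional override of y
print(f(33, w_by_name=6))  # 204 -- keyword override through the given name
```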
.. _functionstateexample:
Using Shared Variables
......@@ -172,9 +183,9 @@ internal state, and returns the old state value.
>>> accumulator = function([inc], state, updates=[(state, state+inc)])
This code introduces a few new concepts. The ``shared`` function constructs
so-called :ref:`shared variables<libdoc_compile_shared>`.
These are hybrid symbolic and non-symbolic variables whose value may be shared
between multiple functions. Shared variables can be used in symbolic expressions just like
the objects returned by ``dmatrices(...)`` but they also have an internal
value that defines the value taken by this symbolic variable in *all* the
functions that use it. It is called a *shared* variable because its value is
......@@ -189,7 +200,7 @@ will replace the ``.value`` of each shared variable with the result of the
corresponding expression". Above, our accumulator replaces the ``state``'s value with the sum
of the state and the increment amount.
Let's try it out!
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_8
......@@ -214,7 +225,7 @@ array(-1)
array(2)
As we mentioned above, you can define more than one function to use the same
shared variable. These functions can all update the value.
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_8
......@@ -226,7 +237,7 @@ array(2)
array(0)
You might be wondering why the updates mechanism exists. You can always
achieve a similar result by returning the new expressions, and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
......@@ -278,7 +289,9 @@ RandomStream object (a random number generator) for each such
variable, and draw from it as necessary. We will call this sort of
sequence of random numbers a *random stream*. *Random streams* are at
their core shared variables, so the observations on shared variables
hold here as well. Theano's random objects are defined and implemented in
:ref:`RandomStreams<libdoc_tensor_shared_randomstreams>` and, at a lower level,
in :ref:`RandomStreamsBase<libdoc_tensor_raw_random>`.
Brief Example
-------------
......@@ -301,7 +314,9 @@ Here's a brief example. The setup code is:
Here, 'rv_u' represents a random stream of 2x2 matrices of draws from a uniform
distribution. Likewise, 'rv_n' represents a random stream of 2x2 matrices of
draws from a normal distribution. The distributions that are implemented are
defined in :class:`RandomStreams` and, at a lower level, in :ref:`raw_random<libdoc_tensor_raw_random>`.
.. TODO: repair the latter reference on RandomStreams
Now let's use these objects. If we call f(), we get random uniform numbers.
The internal state of the random number generator is automatically updated,
......@@ -312,7 +327,7 @@ so we get different random numbers every time.
When we add the extra argument ``no_default_updates=True`` to
``function`` (as in ``g``), then the random number generator state is
not affected by calling the returned function. So, for example, calling
``g`` multiple times will return the same numbers.
>>> g_val0 = g() # different numbers from f_val0 and f_val1
......@@ -374,7 +389,7 @@ There are :ref:`other distributions implemented <libdoc_tensor_raw_random>`.
A Real Example: Logistic Regression
===================================
The preceding elements are featured in this more realistic example. It will be used repeatedly.
.. code-block:: python
......@@ -401,7 +416,8 @@ The preceding elements are put to work in this more realistic example. It will b
prediction = p_1 > 0.5 # The prediction thresholded
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01*(w**2).sum() # The cost to minimize
gw,gb = T.grad(cost, [w,b]) # Compute the gradient of the cost:
# we shall return to this
# Compile
train = theano.function(
......
......@@ -119,7 +119,7 @@ includes your op.
The :func:`__str__` method is useful in order to provide a more meaningful
string representation of your Op.
The :func:`R_op` method is needed if you want ``theano.tensor.Rop`` to
work with your op.
Op Example
......@@ -211,16 +211,16 @@ exception. You can use the `assert` keyword to automatically raise an
**Testing the infer_shape**
When a class inherits from the ``InferShapeTester`` class, it gets the
``self._compile_and_check`` method that tests the Op ``infer_shape``
method. It tests that the Op gets optimized out of the graph if only
the shape of the output is needed and not the output
itself. Additionally, it checks that such an optimized graph computes
the correct shape, by comparing it to the actual shape of the computed
output.
``self._compile_and_check`` compiles a Theano function. It takes as
parameters the lists of input and output Theano variables, as would be
provided to ``theano.function``, and a list of real values to pass to the
compiled function (don't use shapes that are symmetric, e.g. (3, 3),
as they can easily hide errors). It also takes the Op class to
verify that no Ops of that type appear in the shape-optimized graph.
......@@ -264,6 +264,8 @@ the multiplication by 2).
**Testing the Rop**
.. TODO: repair defective links in the following paragraph
The class :class:`RopLop_checker` provides the functions
:func:`RopLop_checker.check_mat_rop_lop`,
:func:`RopLop_checker.check_rop_lop` and
......@@ -316,6 +318,9 @@ if the NVIDIA driver works correctly with our sum reduction code on the
GPU.
A more extensive discussion than this section's may be found in the advanced
tutorial :ref:`Extending Theano<extending>`.
**Exercise**
......
.. _gpu_data_convert:
===================================
PyCUDA/CUDAMat/Gnumpy compatibility
===================================
PyCUDA
......@@ -10,7 +10,7 @@ PyCUDA
Currently, PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called CudaNdarray and supports
*strides*. It also only supports the *float32* dtype. PyCUDA's implementation
is called GPUArray and doesn't support *strides*. However, it can deal with
all NumPy and CUDA dtypes.
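The notion of *strides* can be illustrated with NumPy on the CPU (assuming only NumPy is available); this mirrors why a strided CudaNdarray may need a copy before conversion to a GPUArray:

```python
import numpy

a = numpy.arange(6, dtype='float32').reshape(2, 3)
t = a.T  # a transposed view: same memory, but non-contiguous strides

print(a.flags['C_CONTIGUOUS'])  # True
print(t.flags['C_CONTIGUOUS'])  # False: t is a strided view

# Making a contiguous copy is the analogue of passing copyif=True
# to to_gpuarray: the copy no longer shares t's memory region.
c = numpy.ascontiguousarray(t)
print(c.flags['C_CONTIGUOUS'])  # True
```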
......@@ -21,20 +21,20 @@ use both objects in the same script.
Transfer
--------
You can use the ``theano.misc.pycuda_utils`` module to convert GPUArray to and
from CudaNdarray. The functions ``to_cudandarray(x, copyif=False)`` and
``to_gpuarray(x)`` return a new object that occupies the same memory space
as the original. Otherwise it raises a ValueError. Because GPUArrays don't
support *strides*, a strided CudaNdarray may be copied to obtain a
non-strided version. The resulting GPUArray won't share the same
memory region. If you want this behavior, set ``copyif=True`` in
``to_gpuarray``.
Compiling with PyCUDA
---------------------
You can use PyCUDA to compile CUDA functions that work directly on
CudaNdarrays. Here is an example from the file ``theano/misc/tests/test_pycuda_theano_simple.py``:
.. code-block:: python
......@@ -73,10 +73,10 @@ CudaNdarrays. Here is an example from the file `theano/misc/tests/test_pycuda_th
assert (numpy.asarray(dest) == a * b).all()
Theano Op using a PyCUDA function
---------------------------------
You can use a GPU function compiled with PyCUDA in a Theano op:
.. code-block:: python
......@@ -120,15 +120,15 @@ You can use a GPU function compiled with PyCUDA in a Theano op. Here is an examp
CUDAMat
=======
There are functions for conversion between CUDAMat objects and Theano's CudaNdarray objects.
They obey the same principles as Theano's PyCUDA functions and can be found in
``theano.misc.cudamat_utils.py``.
WARNING: There is a strange problem associated with stride/shape with those converters.
In order to work, the test needs a transpose and reshape...
Gnumpy
======
There are conversion functions between Gnumpy ``garray`` objects and Theano CudaNdarray objects.
They are also similar to Theano's PyCUDA functions and can be found in ``theano.misc.gnumpy_utils.py``.
......@@ -33,7 +33,7 @@ array(8.0)
>>> f(94.2)
array(188.40000000000001)
In this example, we can see from ``pp(gy)`` that we are computing
the correct symbolic gradient.
``fill((x ** 2), 1.0)`` means to make a matrix of the same shape as
``x ** 2`` and fill it with 1.0.
......@@ -72,10 +72,10 @@ array([[ 0.25 , 0.19661193],
[ 0.19661193, 0.10499359]])
In general, for any **scalar** expression ``s``, ``T.grad(s, w)`` provides
the Theano expression for computing :math:`\frac{\partial s}{\partial w}`. In
this way Theano can be used for doing **efficient** symbolic differentiation
(as the expression returned by ``T.grad`` will be optimized during compilation), even for
functions with many inputs. (see `automatic differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_ for a description
of symbolic differentiation).
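As a sanity check of what the symbolic gradient computes, one can compare against a numerical finite-difference approximation of :math:`\frac{d(x^2)}{dx} = 2x` in plain Python (no Theano required):

```python
def f(x):
    return x ** 2

def numeric_grad(f, x, eps=1e-6):
    # central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# The symbolic gradient of x**2 is 2*x, so at x = 94.2 we expect 188.4,
# matching the value returned by the compiled Theano function above:
print(round(numeric_grad(f, 94.2), 3))  # 188.4
```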
.. note::
......@@ -86,8 +86,11 @@ of symbolic differentiation).
``T.grad`` with respect to the *i*-th element of the list given as second argument.
The first argument of ``T.grad`` has to be a scalar (a tensor
of size 1). For more information on the semantics of the arguments of
``T.grad`` and details about the implementation, see
:ref:`this<libdoc_gradient>` section of the library.
Additional information on the inner workings of differentiation may also be
found in the more advanced tutorial :ref:`Extending Theano<extending>`.
Computing the Jacobian
======================
......@@ -106,9 +109,8 @@ do is to loop over the entries in ``y`` and compute the gradient of
``scan`` is a generic op in Theano that allows writing in a symbolic
manner all kinds of recurrent equations. While creating
symbolic loops (and optimizing them for performance) is a hard task,
effort is being made to improve the performance of ``scan``. We
shall return to ``scan`` in a moment.
>>> x = T.dvector('x')
>>> y = x**2
at each step, we compute the gradient of element ``y[i]`` with respect to
matrix which corresponds to the Jacobian.
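For ``y = x**2`` elementwise, the Jacobian is a diagonal matrix with ``2*x[i]`` on the diagonal; a plain-Python version of the row-by-row loop that ``scan`` performs symbolically (a sketch, not the Theano code):

```python
def jacobian_of_square(x):
    # dy[i]/dx[j] = 2*x[i] if i == j else 0, since y = x**2 elementwise
    n = len(x)
    return [[2.0 * x[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]

print(jacobian_of_square([4.0, 4.0]))  # [[8.0, 0.0], [0.0, 8.0]]
```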
.. note::
There are some pitfalls to be aware of regarding ``T.grad``. One of them is that you
cannot re-write the above expression of the Jacobian as
``theano.scan(lambda y_i,x: T.grad(y_i,x), sequences=y,
non_sequences=x)``, even though from the documentation of scan this
......@@ -170,8 +172,8 @@ performance gains. A description of one such algorithm can be found here:
Computation, 1994*
While in principle we would want Theano to identify these patterns automatically for us,
in practice, implementing such optimizations in a generic manner is extremely
difficult. Therefore, we provide special functions dedicated to these tasks.
R-operator
......@@ -182,7 +184,7 @@ vector, namely :math:`\frac{\partial f(x)}{\partial x} v`. The formulation
can be extended even to the case where `x` is a matrix, or a tensor in general,
in which case the Jacobian also becomes a tensor and the product becomes some kind
of tensor product. Because in practice we end up needing to compute such
expressions in terms of weight matrices, Theano supports this more generic
form of the operation. In order to evaluate the *R-operation* of
expression ``y``, with respect to ``x``, multiplying the Jacobian with ``v``
you need to do something similar to this:
......@@ -202,11 +204,10 @@ array([ 2., 2.])
L-operator
----------
Similarly to the *R-operator*, the *L-operator* computes a *row* vector times
the Jacobian. The mathematical formula would be :math:`v \frac{\partial
f(x)}{\partial x}`. The *L-operator* is also supported for generic tensors
(not only for vectors). Similarly, it can be implemented as follows:
>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
......@@ -220,24 +221,24 @@ array([[ 0., 0.],
.. note::
`v`, the point of evaluation, differs between the *L-operator* and the *R-operator*.
For the *L-operator*, the point of evaluation needs to have the same shape
as the output, whereas for the *R-operator* this point should
have the same shape as the input parameter. Furthermore, the results of these two
operations differ. The result of the *L-operator* is of the same shape
as the input parameter, while the result of the *R-operator* has the same shape
as the output.
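The shape bookkeeping in this note can be made concrete with small plain-Python matrix products, where ``J`` stands for a Jacobian of shape (outputs, inputs):

```python
def matvec(J, v):
    # R-operator pattern: J (m x n) times v (length n) -> length m (output shape)
    return [sum(J[i][j] * v[j] for j in range(len(v))) for i in range(len(J))]

def vecmat(v, J):
    # L-operator pattern: v (length m) times J (m x n) -> length n (input shape)
    return [sum(v[i] * J[i][j] for i in range(len(v))) for j in range(len(J[0]))]

J = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]              # 2 outputs, 3 inputs
print(matvec(J, [1.0, 1.0, 1.0]))  # [6.0, 15.0] -- same length as the output
print(vecmat([1.0, 1.0], J))       # [5.0, 7.0, 9.0] -- same length as the input
```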
Hessian times a Vector
======================
If you need to compute the Hessian times a vector, you can make use of the
above-defined operators to do it more efficiently than actually computing
the exact Hessian and then performing the product. Due to the symmetry of the
Hessian matrix, you have two options that will
give you the same result, though they might exhibit different performance.
Hence, we suggest profiling the methods before using either one of the two:
>>> x = T.dvector('x')
......@@ -266,14 +267,14 @@ Final Pointers
==============
* The ``grad`` function works symbolically: it receives and returns Theano variables.
* ``grad`` can be compared to a macro since it can be applied repeatedly.
* Scalar costs only can be directly handled by ``grad``. Arrays are handled through repeated applications.
* Built-in functions allow efficient computation of *vector times Jacobian* and *vector times Hessian* products.
* Work is in progress on the optimizations required to compute efficiently the full
Jacobian and Hessian matrices and the *Jacobian times vector* expression.
Tutorial
========
Let us start an interactive session (e.g. with ``python`` or ``ipython``) and import Theano.
>>> from theano import *
Several of the symbols you will need to use are in the ``tensor`` subpackage
of Theano. Let us import that subpackage under a handy name like
``T`` (the tutorials will frequently use this convention).
>>> import theano.tensor as T
If that succeeded, you are ready for the tutorial, otherwise check your
installation (see :ref:`install`).
Throughout the tutorial, bear in mind that there is a :ref:`glossary` to help
you out.
gradients
modes
loading_and_saving
conditions
loop
sparse
using_gpu
gpu_data_convert
aliasing
shape_info
remarks
debug_faq
extending_theano
faq
Scan
====
- You 'scan' a function along some input sequence, producing an output at each time-step.
- The function can see the *previous K time-steps* of your function.
- ``sum()`` could be computed by scanning the ``z + x(i)`` function over a list, given an initial state of ``z=0``.
- Often a *for* loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over *for* loops:
- The number of iterations can be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Theano function.
- Can lower the overall memory usage by detecting the actual amount of memory needed.
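The accumulator pattern behind ``scan`` can be sketched in plain Python (this mimics the semantics only; it is not the Theano API):

```python
# A pure-Python sketch of scan's accumulator semantics: apply fn along a
# sequence while threading a state, collecting each intermediate state.
def scan_like(fn, sequence, initial_state):
    state = initial_state
    outputs = []
    for x_i in sequence:
        state = fn(x_i, state)
        outputs.append(state)
    return outputs

# sum() as a "scan": the z + x(i) function with initial state z = 0
partial_sums = scan_like(lambda x_i, z: z + x_i, [1, 2, 3, 4], 0)
```

Here ``partial_sums`` is ``[1, 3, 6, 10]``; ``theano.scan`` builds the equivalent symbolic loop and can also compute gradients through it.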
The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
The ``config`` module contains several ``attributes`` that modify Theano's behavior. Many of these
attributes are examined during the import of the ``theano`` module and several are assumed to be
read-only.
*As a rule, the attributes in the* ``config`` *module should not be modified by user code.*
Theano's code comes with default values for these attributes, but you can
override them from your .theanorc file, and override those values in turn
through the :envvar:`THEANO_FLAGS` environment variable.
The order of precedence is:

1. an assignment to theano.config.<property>
2. an assignment in :envvar:`THEANO_FLAGS`
3. an assignment in the .theanorc file (or the file indicated in :envvar:`THEANORC`)
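For instance, precedence rule 2 can be exercised from the shell; the script name below is hypothetical, while ``floatX`` and ``device`` are standard Theano flags:

```shell
# Override the .theanorc values for a single run via THEANO_FLAGS
THEANO_FLAGS='floatX=float32,device=cpu' python my_script.py
```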
You can display the current/effective configuration at any time by printing
``theano.config``. For example, to see a list of all active configuration
variables, type this from the command-line:
python -c 'import theano; print theano.config' | less
For more detail, see :ref:`Configuration <libdoc_config>` in the library.
-------------------------------------------
**Exercise**
Theano defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
- ``'DEBUG_MODE'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes,
but can identify many kinds of problems.
- ``'PROFILE_MODE'``: Same optimizations as FAST_RUN, but print some profiling information.
Here is a table to compare the different linkers.
============= ========= ================= ========= ===
linker        gc [#gc]_ Raise error by op Overhead  Definition
============= ========= ================= ========= ===
c|py [#cpy1]_ yes       yes               "+++"     Try C code. If none exist for an op, use Python
c|py_nogc     no        yes               "++"      As c|py, but without gc
c             no        yes               "+"       Use only C code (if none available for an op, raise an error)
py            yes       yes               "+++"     Use only Python code
c&py [#cpy2]_ no        yes               "+++++"   Use C and Python code
ProfileMode   no        no                "++++"    Compute some extra profiling info
DebugMode     no        yes               VERY HIGH Make many checks on what Theano computes
============= ========= ================= ========= ===
.. [#cpy2] Deprecated
For more detail, see :ref:`Mode<libdoc_compile_mode>` in the library.
.. _using_debugmode:
Using DebugMode
===============
If you instantiate DebugMode using the constructor (see :class:`DebugMode`)
rather than the keyword ``DEBUG_MODE`` you can configure its behaviour via
constructor arguments. The keyword version of DebugMode (which you get by using ``mode='DEBUG_MODE'``)
is quite strict.
For more detail, see :ref:`DebugMode<libdoc_compile_mode>` in the library.
.. _using_profilemode:
ProfileMode
===========
In the *Op-wise summary*, the execution times of all Apply nodes executing
the same Op are grouped together and the total execution time per Op
there corresponding to the sum of the time spent in each of them).
Finally, notice that the ProfileMode also shows which Ops were running a C
implementation.
For more detail, see :ref:`ProfileMode<libdoc_compile_mode>` in the library.
How Shape Information is Handled by Theano
==========================================
It is not possible to strictly enforce the shape of a Theano variable when
building a graph since the particular value provided at run-time for a parameter of a
Theano function may determine the shape of the Theano variables in its graph.
Currently, information regarding shape is used in two ways in Theano:
- When the exact output shape is known, to generate faster C code for
the 2d convolution on the CPU and GPU.
- To generate faster C code for the 2d convolution on the CPU and the GPU,
when the exact output shape is known in advance.
- To remove computations in the graph when we only want to know the
  shape, but not the actual value of a variable. This is done with the
  ``Op.infer_shape`` method.
Shape Inference Problem
=======================
Theano propagates information about shapes in the graph. Sometimes this
can lead to errors. For example:
.. code-block:: python
An inferred shape is computed directly, without executing
the computation itself (there is no ``join`` in the first output or debugprint).
This makes the computation of the shape faster, but it can also hide errors. In
the example, the computation of the shape of the output of ``join`` is done only
based on the first input Theano variable, which leads to an error.
This might happen with other Ops, such as ``elemwise`` and ``dot``.
Indeed, to perform some optimizations (for speed or stability, for instance),
Theano assumes that the computation is correct and consistent
in the first place, as it does here.
You can detect those problems by running the code without this
optimization, with the Theano flag
``optimizer_excluding=local_shape_to_shape_i``. You can also obtain the
same effect by running in mode FAST_COMPILE (it will not apply this
optimization, nor most other optimizations) or DEBUG_MODE (it will test
before and after all optimizations (much slower)).
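Concretely, disabling that optimization from the command line looks like this (the script name is hypothetical):

```shell
# Recompute shapes from the actual computation to surface shape errors
THEANO_FLAGS='optimizer_excluding=local_shape_to_shape_i' python my_script.py
```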
Specifying Exact Shape
======================
Currently, specifying a shape is not as easy and flexible as we wish and we plan some
upgrades. Here is the current state of what can be done:
- You can pass the shape info directly to the ``ConvOp`` created
  when calling ``conv2d``. You simply add the parameters ``image_shape``
  and ``filter_shape`` to the call. They must be tuples of 4
  elements. For example:
.. code-block:: python
theano.tensor.nnet.conv2d(..., image_shape=(7,3,5,5), filter_shape=(2,3,4,4))
- You can use the SpecifyShape op to add shape information anywhere in the
  graph. This enables some additional optimizations. In the following example,
  this makes it possible to precompute the Theano function to a constant.
.. code-block:: python
Future Plans
============
The parameter "constant shape" will be added to ``theano.shared()``. This is probably
the most frequent case with ``shared variables``. This will make the code
simpler and will make it possible to check that the shape does not change when
updating the shared variable.
Theano Graphs
=============
Debugging or profiling code written in Theano is not that simple if you
do not know what goes on under the hood. This chapter is meant to
introduce you to a required minimum of the inner workings of Theano.
For more detail, see :ref:`extending`.
The first step in writing Theano code is to write down all mathematical
relations using symbolic placeholders (**variables**).
Using the chain rule,
these gradients can be composed in order to obtain the expression of the
gradient of the graph's output with respect to the graph's inputs.
A coming section of this tutorial will address the topic of differentiation
in greater detail.
Optimizations
=============
Optimizations can, for instance, avoid computing the same subexpression
twice or reformulate parts of the graph to a GPU-specific version.
For example, one (simple) optimization that Theano uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
Further information regarding the optimization :ref:`process<optimization>` and the
specific :ref:`optimizations<optimizations>` that are applicable is available in the
library and on the entrance page of the documentation, respectively.
**Example**
Symbolic programming involves a change of paradigm: it will become clearer
as we apply it. Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
One of Theano's design goals is to specify computations at an
abstract level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take advantage of
this flexibility is in carrying out calculations on an Nvidia graphics card when
there is a CUDA-enabled device present in the computer.
Setting Up CUDA
----------------
The program just computes the exp() of a bunch of random numbers.
Note that we use the `shared` function to
make sure that the input `x` is stored on the graphics device.
If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
whereas on the GPU it takes just over 0.4 seconds. The GPU will not always produce the exact
Note that for now GPU operations in Theano require floatX to be float32 (see also below).
Returning a Handle to Device-Allocated Data
-------------------------------------------
The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in device=gpu, but
if you don't mind being less portable, you might prefer to see a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The gpu_from_host
Op means "copy the input from the host to the GPU" and it is optimized away
after the T.exp(x) is replaced by a GPU version of exp().
.. If you modify this code, also change :
To really get maximum performance in this simple example, we need to use an :class:`Out`
instance to tell Theano not to copy the output it returns to us. Theano allocates memory for
internal use like a working buffer, but by default it will never return a result that is
allocated in the working buffer. This is normally what you want, but our example is so simple
that it has the un-wanted side-effect of really slowing things down.
that it has the unwanted side-effect of really slowing things down.
..
TODO:
* Indexing,
  dimension-shuffling and constant-time reshaping will be equally fast on GPU
  as on CPU.
* Summation
  over rows/columns of tensors can be a little slower on the GPU than on the CPU.
* Copying
of large quantities of data to and from a device is relatively slow, and
  often cancels most of the advantage of one or two accelerated functions on
  that data.
Tips for Improving Performance on GPU
-------------------------------------
* Consider
adding ``floatX=float32`` to your .theanorc file if you plan to do a lot of
GPU work.
* Prefer
constructors like 'matrix', 'vector' and 'scalar' to 'dmatrix', 'dvector' and
'dscalar' because the former will give you float32 variables when
floatX=float32.
* Ensure
mode='PROFILE_MODE'. This should print some timing information at program
termination (atexit). Is time being used sensibly? If an Op or Apply is
taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.sandbox.cuda.
Check the line like 'Spent Xs(X%) in cpu Op, Xs(X%) in gpu Op and Xs(X%) transfert Op'
that can tell you if not enough of your graph is on the GPU or if there
is too much memory transfer.
Consider the logistic regression:
    print 'Used the cpu'
elif any([x.op.__class__.__name__ == 'GpuGemm' for x in
          train.maker.fgraph.toposort()]):
    print 'Used the GPU'
else:
    print 'ERROR, not able to tell if theano used the cpu or the GPU'
    print train.maker.fgraph.toposort()
What can be done to further increase the speed of the GPU version?
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Theano, which is a meta-programmer, there are:
* CUDA: C extension by NVIDIA
* Convenience: Makes it easy to do GPU meta-programming from within Python. Helpful documentation.
  (abstractions to compile low-level CUDA code from Python: ``pycuda.driver.SourceModule``)
* Completeness: Binding to all of CUDA's driver API.
* Speed: PyCUDA's base layer is written in C++.
* Good memory management of GPU objects:
  Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
  Makes it much easier to write correct, leak- and crash-free code.
  PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory allocated in it is also freed).
  (GPU memory buffer: ``pycuda.gpuarray.GPUArray``)
* PyOpenCL: PyCUDA for OpenCL
Run the preceding example.
Modify and execute it to work for a matrix of shape (20, 10).
-------------------------------------------
To test it:
Run the preceding example.
Modify and execute the example to multiply two matrices: x * y.
Modify and execute the example to return two outputs: x + y and x - y.
(Currently, elemwise fusion generates computation with only 1 output.)
Modify and execute the example to support *strides* (i.e. so as not to constrain the input to be C contiguous).
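As a reminder of what the last exercise is about, here is how striding breaks C-contiguity in NumPy (plain NumPy, independent of the GPU code):

```python
import numpy as np

# Slicing with a step yields a strided, non-contiguous view of the data
a = np.arange(20.0).reshape(4, 5)
b = a[:, ::2]                       # every other column

assert a.flags['C_CONTIGUOUS']
assert not b.flags['C_CONTIGUOUS']
```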
-------------------------------------------