Commit edfd9f24 authored by Eric Larsen, committed by Frederic

Correct Theano's tutorial: one more round of corrections

Parent dba02a39
......@@ -33,12 +33,12 @@ Let's break this down into several steps. The first step is to define
two symbols (*Variables*) representing the quantities that you want
to add. Note that from now on, we will use the term
*Variable* to mean "symbol" (in other words,
``x``, ``y``, ``z`` are all *Variable* objects). The output of the function
``f`` is a ``numpy.ndarray`` with zero dimensions.
*x*, *y*, *z* are all *Variable* objects). The output of the function
*f* is a ``numpy.ndarray`` with zero dimensions.
If you are following along and typing into an interpreter, you may have
noticed that there was a slight delay in executing the ``function``
instruction. Behind the scene, ``f`` was being compiled into C code.
instruction. Behind the scenes, *f* was being compiled into C code.
.. note:
......@@ -51,9 +51,9 @@ instruction. Behind the scene, ``f`` was being compiled into C code.
>>> x = theano.tensor.ivector()
>>> y = -x
``x`` and ``y`` are both Variables, i.e. instances of the
*x* and *y* are both Variables, i.e. instances of the
``theano.gof.graph.Variable`` class. The
type of both ``x`` and ``y`` is ``theano.tensor.ivector``.
type of both *x* and *y* is ``theano.tensor.ivector``.
**Step 1**
......@@ -65,9 +65,9 @@ In Theano, all symbols must be typed. In particular, ``T.dscalar``
is the type we assign to "0-dimensional arrays (`scalar`) of doubles
(`d`)". It is a Theano :ref:`type`.
``dscalar`` is not a class. Therefore, neither ``x`` nor ``y``
``dscalar`` is not a class. Therefore, neither *x* nor *y*
are actually instances of ``dscalar``. They are instances of
:class:`TensorVariable`. ``x`` and ``y``
:class:`TensorVariable`. *x* and *y*
are, however, assigned the theano Type ``dscalar`` in their ``type``
field, as you can see here:
......@@ -91,13 +91,13 @@ could also learn more by looking into :ref:`graphstructures`.
**Step 2**
The second step is to combine ``x`` and ``y`` into their sum ``z``:
The second step is to combine *x* and *y* into their sum *z*:
>>> z = x + y
``z`` is yet another *Variable* which represents the addition of
``x`` and ``y``. You can use the :ref:`pp <libdoc_printing>`
function to pretty-print out the computation associated to ``z``.
*z* is yet another *Variable* which represents the addition of
*x* and *y*. You can use the :ref:`pp <libdoc_printing>`
function to pretty-print out the computation associated to *z*.
>>> print pp(z)
(x + y)
......@@ -105,15 +105,15 @@ function to pretty-print out the computation associated to ``z``.
**Step 3**
The last step is to create a function taking ``x`` and ``y`` as inputs
and giving ``z`` as output:
The last step is to create a function taking *x* and *y* as inputs
and giving *z* as output:
>>> f = function([x, y], z)
The first argument to :func:`function <function.function>` is a list of Variables
that will be provided as inputs to the function. The second argument
is a single Variable *or* a list of Variables. For either case, the second
argument is what we want to see as output when we apply the function. ``f`` may
argument is what we want to see as output when we apply the function. *f* may
then be used like a normal Python function.
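Putting these steps together, a minimal end-to-end sketch of this first example looks like the following (the expected output is shown in a comment):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dscalar('x')
    y = T.dscalar('y')
    z = x + y
    f = theano.function([x, y], z)
    print f(2, 3)     # -> array(5.0)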
......@@ -121,8 +121,8 @@ Adding two Matrices
===================
You might already have guessed how to do this. Indeed, the only change
from the previous example is that you need to instantiate ``x`` and
``y`` using the matrix Types:
from the previous example is that you need to instantiate *x* and
*y* using the matrix Types:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_adding.test_adding_2
......@@ -153,12 +153,12 @@ by :ref:`broadcasting <libdoc_tensor_broadcastable>`.
The following types are available:
* **byte**: bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4
* **32-bit integers**: iscalar, ivector, imatrix, irow, icol, itensor3, itensor4
* **64-bit integers**: lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4
* **float**: fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4
* **double**: dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
* **complex**: cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
* **byte**: ``bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4``
* **32-bit integers**: ``iscalar, ivector, imatrix, irow, icol, itensor3, itensor4``
* **64-bit integers**: ``lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4``
* **float**: ``fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4``
* **double**: ``dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4``
* **complex**: ``cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4``
The previous list is not exhaustive and a guide to all types compatible
with NumPy arrays may be found here: :ref:`tensor creation<libdoc_tensor_creation>`.
......
......@@ -5,11 +5,11 @@
Understanding Memory Aliasing for Speed and Correctness
=======================================================
The aggressive reuse of memory is one of the ways Theano makes code fast, and
it's important for the correctness and speed of your program that you understand
which buffers Theano might alias to which other.
The aggressive reuse of memory is one of the ways through which Theano makes code fast, and
it is important for the correctness and speed of your program that you understand
how Theano might alias buffers.
This section describes the principles based on which Theano treats memory, and explains
This section describes the principles based on which Theano handles memory, and explains
when you might want to alter the default behaviour of some functions and
methods for faster performance.
......@@ -17,7 +17,7 @@ methods for faster performance.
The Memory Model: Two Spaces
============================
There are some simple principles that guide Theano's treatment of memory. The
There are some simple principles that guide Theano's handling of memory. The
main idea is that there is a pool of memory managed by Theano, and Theano tracks
changes to values in that pool.
......@@ -26,14 +26,14 @@ changes to values in that pool.
* Theano functions only modify buffers that are in Theano's memory space.
* Theano's memory space includes the buffers allocated to store shared
* Theano's memory space includes the buffers allocated to store ``shared``
variables and the temporaries used to evaluate functions.
* Physically, Theano's memory space may be spread across the host and one or more GPU
devices, and in the future may even include objects on a remote machine.
* The memory allocated for a shared variable buffer is unique: it is never
aliased to another shared variable.
* The memory allocated for a ``shared`` variable buffer is unique: it is never
aliased to another ``shared`` variable.
* Theano's managed memory is constant while Theano functions are not running
and Theano's library code is not running.
......@@ -42,11 +42,10 @@ changes to values in that pool.
outputs, and to expect user-space values for inputs.
The distinction between Theano-managed memory and user-managed memory can be
broken down by some Theano functions (e.g. shared, get_value and the
constructors for In and Out) by using
a ``borrow=True`` flag. This can make those methods faster (by avoiding copy
operations) at the expense of risking subtle bugs in the overall program (by
aliasing memory).
broken down by some Theano functions (e.g. ``shared``, ``get_value`` and the
constructors for ``In`` and ``Out``) by using a ``borrow=True`` flag.
This can make those methods faster (by avoiding copy operations) at the expense
of risking subtle bugs in the overall program (by aliasing memory).
The rest of this section is aimed at helping you to understand when it is safe
to use the ``borrow=True`` argument and reap the benefits of faster code.
......@@ -69,9 +68,9 @@ A ``borrow`` argument can be provided to the shared-variable constructor.
s_false = theano.shared(np_array, borrow=False)
s_true = theano.shared(np_array, borrow=True)
By default (``s_default``) and when explicitly setting ``borrow=False``, the
shared variable we construct gets a [deep] copy of ``np_array``. So changes we
subsequently make to ``np_array`` have no effect on our shared variable.
By default (*s_default*) and when explicitly setting ``borrow=False``, the
shared variable we construct gets a [deep] copy of *np_array*. So changes we
subsequently make to *np_array* have no effect on our shared variable.
.. code-block:: python
......@@ -82,31 +81,30 @@ subsequently make to ``np_array`` have no effect on our shared variable.
s_true.get_value() # -> array([2.0, 2.0])
If we are running this with the CPU as the device,
then changes we make to np_array *right away* will show up in
then changes we make to *np_array* *right away* will show up in
``s_true.get_value()``
because NumPy arrays are mutable, and ``s_true`` is using the ``np_array``
because NumPy arrays are mutable, and *s_true* is using the *np_array*
object as its internal buffer.
However, this aliasing of ``np_array`` and ``s_true`` is not guaranteed to occur,
However, this aliasing of *np_array* and *s_true* is not guaranteed to occur,
and may occur only temporarily even if it occurs at all.
It is not guaranteed to occur because if Theano is using a GPU device, then the
borrow flag has no effect.
It may occur only temporarily because
if we call a Theano function that updates the value of ``s_true`` the aliasing
``borrow`` flag has no effect. It may occur only temporarily because
if we call a Theano function that updates the value of *s_true*, the aliasing
relationship *may* or *may not* be broken (the function is allowed to
update the shared variable by modifying its buffer, which will preserve
update the ``shared`` variable by modifying its buffer, which will preserve
the aliasing, or by changing which buffer the variable points to, which
will terminate the aliasing).
*Take home message:*
It is safe practice (and a good idea) to use ``borrow=True`` in a shared
variable constructor when the shared variable stands for a large object (in
It is a safe practice (and a good idea) to use ``borrow=True`` in a ``shared``
variable constructor when the ``shared`` variable stands for a large object (in
terms of memory footprint) and you do not want to create copies of it in
memory.
It is not a reliable technique to use ``borrow=True`` to modify shared variables
by side-effect, because with some devices (e.g. GPU devices) this technique will
It is not a reliable technique to use ``borrow=True`` to modify ``shared`` variables
through side-effect, because with some devices (e.g. GPU devices) this technique will
not work.
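As an illustrative sketch of this advice (the array size is arbitrary and only meant to suggest a large object):

.. code-block:: python

    import numpy
    import theano

    big_array = numpy.zeros((10000, 1000))
    # reuse big_array's buffer instead of deep-copying it (CPU only; see above)
    s = theano.shared(big_array, borrow=True)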
Borrowing when Accessing Value of Shared Variables
......@@ -115,7 +113,8 @@ Borrowing when Accessing Value of Shared Variables
Retrieving
----------
A ``borrow`` argument can also be used to control how a shared variable's value is retrieved.
A ``borrow`` argument can also be used to control how a ``shared`` variable's value is
retrieved.
.. If you modify this code, also change :
......@@ -136,11 +135,11 @@ When ``borrow=True`` is passed to ``get_value``, it means that the return value
But both of these calls might create copies of the internal memory.
The reason that ``borrow=True`` might still make a copy is that the internal
representation of a shared variable might not be what you expect. When you
create a shared variable by passing a NumPy array for example, then ``get_value()``
representation of a ``shared`` variable might not be what you expect. When you
create a ``shared`` variable by passing a NumPy array for example, then ``get_value()``
must return a NumPy array too. That's how Theano can make the GPU use
transparent. But when you are using a GPU (or in the future perhaps a remote machine), then the numpy.ndarray
is not the internal representation of your data.
transparent. But when you are using a GPU (or in the future perhaps a remote machine),
then the numpy.ndarray is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never cast the internal object (always return in
......@@ -156,28 +155,28 @@ It is possible to use ``borrow=False`` in conjunction with
This is primarily for internal debugging, not for typical use.
For the transparent use of the different types of optimization Theano can make,
there is the policy that get_value() always return by default the same object type
it received when the shared variable was created. So if you created manually data on
the gpu and create a shared variable on the gpu with this data, get_value will always
return gpu data even when return_internal_type=False.
there is the policy that ``get_value()`` always returns by default the same object type
it received when the ``shared`` variable was created. So if you manually created data on
the GPU and created a ``shared`` variable on the GPU with this data, ``get_value`` will always
return gpu data even when ``return_internal_type=False``.
*Take home message:*
It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
your code does not modify the return value. *Do not use this to modify a shared
your code does not modify the return value. *Do not use this to modify a ``shared``
variable by side-effect* because it will make your code device-dependent.
Modification of GPU variables by this sort of side-effect is impossible.
Modification of GPU variables through this sort of side-effect is impossible.
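For instance, a sketch of the safe, read-only use of ``get_value(borrow=True)`` described above:

.. code-block:: python

    import numpy
    import theano

    s = theano.shared(numpy.ones(3))
    v = s.get_value(borrow=True)   # may alias the internal buffer
    print v.sum()                  # fine, as long as v is never modified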
Assigning
---------
Shared variables also have a ``set_value`` method that can accept an optional ``borrow=True`` argument.
The semantics are similar to those of creating a new shared variable -
``borrow=False`` is the default and ``borrow=True`` means that Theano *may*
reuse the buffer you provide as the internal storage for the variable.
``Shared`` variables also have a ``set_value`` method that can accept an optional
``borrow=True`` argument. The semantics are similar to those of creating a new
``shared`` variable - ``borrow=False`` is the default and ``borrow=True`` means
that Theano *may* reuse the buffer you provide as the internal storage for the variable.
A standard pattern for manually updating the value of a shared variable is as
follows.
A standard pattern for manually updating the value of a ``shared`` variable is as
follows:
.. code-block:: python
......@@ -185,49 +184,54 @@ follows.
some_inplace_fn(s.get_value(borrow=True)),
borrow=True)
This pattern works regardless of the compute device, and when the compute device
This pattern works regardless of the computing device, and when the latter
makes it possible to expose Theano's internal variables without a copy, then it
goes as fast as an in-place update.
proceeds as fast as an in-place update.
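For concreteness, here is a self-contained sketch of that pattern; ``some_inplace_fn`` stands for any routine that modifies its argument in place and is purely illustrative:

.. code-block:: python

    import numpy
    import theano

    s = theano.shared(numpy.zeros(5))

    def some_inplace_fn(a):
        a += 1.0          # modifies the buffer in place
        return a

    s.set_value(some_inplace_fn(s.get_value(borrow=True)), borrow=True)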
When shared variables are allocated on the GPU, the transfers to and from GPU device memory can
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Theano 0.3.1, set_value did not work in-place on the GPU. This meant that sometimes,
* Prior to Theano 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily. *Solution*: update to a newer version of Theano.
unnecessarily.
* If you are going to swap several chunks of data in and out of a shared variable repeatedly,
*Solution*: update to a newer version of Theano.
* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
*Solution*: upgrade to a recent version of Theano (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
* It is also worth mentioning that current GPU copying routines support only contiguous memory.
So Theano must make the ``value`` you provide ``c_contiguous`` prior to copying it.
This can require an extra copy of the data on the host. *Solution*: make sure that the value
you assign to a CudaNdarraySharedVariable is *already* ``c_contiguous``.
So Theano must make the value you provide *C-contiguous* prior to copying it.
This can require an extra copy of the data on the host.
*Solution*: make sure that the value
you assign to a CudaNdarraySharedVariable is *already* *C-contiguous*.
(Further remarks on the current implementation of the GPU version of set_value() can be found
(Further information on the current implementation of the GPU version of ``set_value()`` can be found
here: :ref:`libdoc_cuda_var`)
Retrieving and Assigning via the .value Property
Retrieving and Assigning via the ``.value`` Property
-----------------------------------------------------
Shared variables have a ``.value`` property that is connected to ``get_value``
``Shared`` variables have a ``.value`` property that is connected to ``get_value``
and ``set_value``. The borrowing behaviour of the property is controlled by a
boolean configuration variable ``config.shared.value_borrows``, which currently
defaults to ``True``. If that variable is ``True`` then an assignment like ``s.value=v``
defaults to *True*. If that variable is *True* then an assignment like ``s.value=v``
is equivalent to ``s.set_value(v, borrow=True)``, and a retrieval like ``print
s.value`` is equivalent to ``print s.get_value(borrow=True)``. Likewise,
if ``config.shared.value_borrows`` is ``False``, then the borrow parameter that the ``.value`` property
passes to ``set_value`` and ``get_value`` is ``False``.
if ``config.shared.value_borrows`` is *False*, then the borrow parameter that the ``.value`` property
passes to ``set_value`` and ``get_value`` is *False*.
The ``True`` default value of ``config.shared.value_borrows`` means that
The *True* default value of ``config.shared.value_borrows`` means that
aliasing can sometimes happen and sometimes not, which can be confusing.
Be aware that the default value may be changed to ``False`` sometime in the
Be aware that the default value may be changed to *False* sometime in the
not-too-distant future. This change will create more copies, and potentially slow
down code that accesses ``.value`` attributes inside tight loops. To avoid this
potential impact on your code, use the ``.get_value`` and ``.set_value`` methods
......@@ -238,7 +242,7 @@ Borrowing when Constructing Function Objects
============================================
A ``borrow`` argument can also be provided to the ``In`` and ``Out`` objects
that control how ``theano.function`` handles its arguments and return value[s].
that control how ``theano.function`` handles its argument[s] and return value[s].
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_aliasing.test_aliasing_3
......@@ -259,17 +263,17 @@ course of evaluating that function (e.g. ``f``).
Borrowing an output means that Theano will not insist on allocating a fresh
output buffer every time you call the function. It will possibly reuse the same one as
a previous call, and overwrite the old contents. Consequently, it may overwrite
old return values by side effect.
on a previous call, and overwrite the old content. Consequently, it may overwrite
old return values through side-effect.
Those return values may also be overwritten in
the course of evaluating *another compiled function* (for example, the output
may be aliased to a shared variable). So be careful to use a borrowed return
may be aliased to a ``shared`` variable). So be careful to use a borrowed return
value right away before calling any more Theano functions.
The default is of course to *not borrow* internal results.
It is also possible to pass an ``return_internal_type=True`` flag to the ``Out``
It is also possible to pass a ``return_internal_type=True`` flag to the ``Out``
variable which has the same interpretation as the ``return_internal_type`` flag
to the shared variable's ``get_value`` function. Unlike ``get_value()``, the
to the ``shared`` variable's ``get_value`` function. Unlike ``get_value()``, the
combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
``Out()`` is not guaranteed to avoid copying an output value. They are just
hints that give more flexibility to the compilation and optimization of the
......@@ -277,11 +281,11 @@ graph.
*Take home message:*
When an input ``x`` to a function is not needed after the function returns and you
When an input *x* to a function is not needed after the function returns and you
would like to make it available to Theano as additional workspace, then consider
marking it with ``In(x, borrow=True)``. It may make the function faster and
reduce its memory requirement.
When a return value ``y`` is large (in terms of memory footprint), and you only need to read from it once, right
When a return value *y* is large (in terms of memory footprint), and you only need to read from it once, right
away when it's returned, then consider marking it with an ``Out(y,
borrow=True)``.
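A minimal sketch combining both hints (this assumes ``In`` and ``Out`` are reachable as ``theano.In`` and ``theano.Out``; whether a copy is actually avoided remains device-dependent, as explained above):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dmatrix('x')
    y = x ** 2
    # x's buffer may be reused as workspace; y may come back in a reused buffer
    f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))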
......@@ -8,11 +8,11 @@ IfElse vs Switch
================
- Both Ops build a condition over symbolic variables.
- ``IfElse`` takes a `boolean` condition and two variables as inputs.
- ``Switch`` takes a `tensor` as condition and two variables as inputs.
- Both ops build a condition over symbolic variables.
- ``IfElse`` takes a *boolean* condition and two variables as inputs.
- ``Switch`` takes a *tensor* as condition and two variables as inputs.
``switch`` is an elementwise operation and is thus more general than ``ifelse``.
- Whereas ``switch`` evaluates both 'output' variables, ``ifelse`` is lazy and only
- Whereas ``switch`` evaluates both *output* variables, ``ifelse`` is lazy and only
evaluates one variable with respect to the condition.
**Example**
......@@ -52,7 +52,7 @@ IfElse vs Switch
f_lazyifelse(val1, val2, big_mat1, big_mat2)
print 'time spent evaluating one value %f sec'%(time.clock()-tic)
In this example, the ``IfElse`` Op spends less time (about half as much) than ``Switch``
In this example, the ``IfElse`` op spends less time (about half as much) than ``Switch``
since it computes only one variable out of the two.
.. code-block:: python
......@@ -64,7 +64,7 @@ since it computes only one variable out of the two.
Unless ``linker='vm'`` or ``linker='cvm'`` are used, ``ifelse`` will compute both
variables and take the same computation time as ``switch``. Although the linker
is not currently set by default to 'cvm', it will be in the near future.
is not currently set by default to ``cvm``, it will be in the near future.
There is no automatic optimization replacing a ``switch`` with a
broadcasted scalar condition by an ``ifelse``, as this is not always faster. See
......
......@@ -6,15 +6,15 @@ Debugging Theano: FAQ and Troubleshooting
=========================================
There are many kinds of bugs that might come up in a computer program.
This page is structured as a FAQ. It should provide recipes to tackle common
problems, and introduce some of the tools that we use to find problems in our
Theano code, and even (it happens) in Theano's internals, such as
This page is structured as a FAQ. It provides recipes to tackle common
problems, and introduces some of the tools that we use to find problems in our
own Theano code, and even (it happens) in Theano's internals, in
:ref:`using_debugmode`.
Isolating the Problem/Testing Theano Compiler
---------------------------------------------
You can run your Theano function in a DebugMode(:ref:`using_debugmode`).
You can run your Theano function in a :ref:`DebugMode<using_debugmode>`.
This tests the Theano optimizations and helps to find where NaN, inf and other problems come from.
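For example, a minimal sketch (assuming the mode can be requested by its string name):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dvector('x')
    # the compiled function now runs DebugMode's extra self-checks on every call
    f = theano.function([x], T.log(x), mode='DebugMode')
    f([0.5, 2.0])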
......@@ -87,9 +87,9 @@ Running the above code generates the following error message:
_dot22(x, <TensorType(float64, matrix)>), [_dot22.0],
_dot22(x, InplaceDimShuffle{1,0}.0), 'Sequence id of Apply node=4')
Needless to say the above is not very informative and does not provide much in
Needless to say, the above is not very informative and does not provide much in
the way of guidance. However, by instrumenting the code ever so slightly, we
can get Theano to give us the exact source of the error.
can get Theano to reveal the exact source of the error.
.. code-block:: python
......@@ -103,12 +103,12 @@ can get Theano to give us the exact source of the error.
# provide Theano with a default test-value
x.tag.test_value = numpy.random.rand(5,10)
In the above, we are tagging the symbolic matrix ``x`` with a special test
In the above, we are tagging the symbolic matrix *x* with a special test
value. This allows Theano to evaluate symbolic expressions on-the-fly (by
calling the ``perform`` method of each Op), as they are being defined. Sources
calling the ``perform`` method of each op), as they are being defined. Sources
of error can thus be identified with much more precision and much earlier in
the compilation pipeline. For example, running the above code yields the
following error message, which properly identifies line 23 as the culprit.
following error message, which properly identifies *line 23* as the culprit.
.. code-block:: bash
......@@ -121,33 +121,33 @@ following error message, which properly identifies line 23 as the culprit.
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: ('matrices are not aligned', (5, 10), (20, 10))
The compute_test_value mechanism works as follows:
The ``compute_test_value`` mechanism works as follows:
* Theano ``constants`` and ``shared variables`` are used as is. No need to instrument them.
* A Theano ``variable`` (i.e. ``dmatrix``, ``vector``, etc.) should be
* Theano ``constants`` and ``shared`` variables are used as is. No need to instrument them.
* A Theano *variable* (i.e. ``dmatrix``, ``vector``, etc.) should be
given a special test value through the attribute ``tag.test_value``.
* Theano automatically instruments intermediate results. As such, any quantity
derived from ``x`` will be given a `tag.test_value` automatically.
derived from *x* will be given a ``tag.test_value`` automatically.
`compute_test_value` can take the following values:
``compute_test_value`` can take the following values:
* ``off``: Default behavior. This debugging mechanism is inactive.
* ``raise``: Compute test values on the fly. Any variable for which a test
value is required, but not provided by the user, is treated as an error. An
exception is raised accordingly.
* ``warn``: Idem, but a warning is issued instead of an Exception.
* ``warn``: Idem, but a warning is issued instead of an *Exception*.
* ``ignore``: Silently ignore the computation of intermediate test values, if a
variable is missing a test value.
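A small sketch of the mechanism, under the conventions above (shapes are arbitrary):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    theano.config.compute_test_value = 'raise'

    x = T.dmatrix('x')
    x.tag.test_value = numpy.random.rand(5, 10)
    y = T.dmatrix('y')
    y.tag.test_value = numpy.random.rand(10, 3)

    z = T.dot(x, y)   # evaluated on the test values as soon as it is defined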
.. note::
This feature is currently incompatible with ``Scan`` and also with Ops
This feature is currently incompatible with ``Scan`` and also with ops
which do not implement a ``perform`` method.
How do I Print an Intermediate Value in a Function/Method?
----------------------------------------------------------
Theano provides a 'Print' Op to do this.
Theano provides a 'Print' op to do this.
.. code-block:: python
......@@ -166,8 +166,8 @@ Theano provides a 'Print' Op to do this.
Since Theano runs your program in a topological order, you won't have precise
control over the order in which multiple Print() Ops are evaluted. For a more
precise inspection of what's being computed where, when, and how, see the
control over the order in which multiple ``Print()`` ops are evaluated. For a more
precise inspection of what's being computed where, when, and how, see the discussion
:ref:`faq_wraplinker`.
.. warning::
......@@ -178,8 +178,8 @@ precise inspection of what's being computed where, when, and how, see the
to remove them to know if this is the cause or not.
How do I Print a Graph (before or after compilation)?
----------------------------------------------------------
"How do I Print a Graph?" (before or after compilation)
-------------------------------------------------------
.. TODO: dead links in the next paragraph
......@@ -193,31 +193,33 @@ You can read about them in :ref:`libdoc_printing`.
The Function I Compiled is Too Slow, what's up?
-----------------------------------------------
First, make sure you're running in FAST_RUN mode.
FAST_RUN is the default mode, but make sure by passing ``mode='FAST_RUN'``
"The Function I Compiled is Too Slow, what's up?"
-------------------------------------------------
First, make sure you're running in ``FAST_RUN`` mode. Even though
``FAST_RUN`` is the default mode, insist by passing ``mode='FAST_RUN'``
to ``theano.function`` (or ``theano.make``) or by setting :attr:`config.mode`
to ``FAST_RUN``.
Second, try the theano :ref:`using_profilemode`. This will tell you which
Apply nodes, and which Ops are eating up your CPU cycles.
Second, try the Theano :ref:`using_profilemode`. This will tell you which
``Apply`` nodes, and which ops are eating up your CPU cycles.
Tips:
* use the flags ``floatX=float32`` to use *float32* instead of *float64*
for the theano type matrix(),vector(),...(if you used dmatrix, dvector()
they stay at *float64*).
* Check that in the profile mode that there is no Dot operation and you're
multiplying two matrices of the same type. Dot should be optimized to
dot22 when the inputs are matrices and of the same type. This can happen
when using floatX=float32 and something in the graph makes one of the
inputs *float64*.
* Use the flag ``floatX=float32`` to require type *float32* instead of *float64*;
use the Theano constructors ``matrix()``, ``vector()``, ... instead of ``dmatrix()``, ``dvector()``, ...
since the former use the configured ``floatX`` type while the latter always use *float64*.
* Check in the ``profile`` mode that there is no ``Dot`` op in the post-compilation
graph while you are multiplying two matrices of the same type. ``Dot`` should be
optimized to ``dot22`` when the inputs are matrices and of the same type. This can
still happen when using ``floatX=float32`` if one of the inputs of the graph is
of type *float64*.
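A hedged sketch of both suggestions (the exact profiling output and how it is reported are described in :ref:`using_profilemode`):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.fmatrix('x')
    # make sure the fast mode is really used
    f = theano.function([x], T.exp(x).sum(), mode='FAST_RUN')
    # compile a second copy under the profiler to see which ops dominate
    g = theano.function([x], T.exp(x).sum(), mode='ProfileMode')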
.. _faq_wraplinker:
How do I Step through a Compiled Function with the WrapLinker?
--------------------------------------------------------------
"How do I Step through a Compiled Function with the WrapLinker?"
----------------------------------------------------------------
This is not exactly a FAQ, but the doc is here for now...
It's pretty easy to roll-your-own evaluation mode.
......@@ -234,9 +236,9 @@ Check out this one:
wrap_linker = theano.gof.WrapLinkerMany([theano.gof.OpWiseCLinker()], [print_eval])
super(PrintEverythingMode, self).__init__(wrap_linker, optimizer='fast_run')
When you use ``mode=PrintEverythingMode()`` as the mode for Function or Method,
then you should see [potentially a lot of] output. Every Apply node will be printed out,
along with its position in the graph, the arguments to the ``perform`` or
When you use ``mode=PrintEverythingMode()`` as the mode for ``Function`` or ``Method``,
then you should see [potentially a lot of] output. Every ``Apply`` node will be printed out,
along with its position in the graph, the arguments to the functions ``perform`` or
``c_code`` and the output it computed.
>>> x = T.dscalar('x')
......@@ -247,15 +249,15 @@ along with its position in the graph, the arguments to the ``perform`` or
Admittedly, this may be a huge amount of
output to read through if you are using big tensors... but you can choose to
put logic inside of the *print_eval* function that would, for example, only
print something out if a certain kind of Op was used, at a certain program
position, or if a particular value shows up in one of the inputs or outputs.
put logic inside of the *print_eval* function that would, for example, print
something out only if a certain kind of op were used, at a certain program
position, or only if a particular value showed up in one of the inputs or outputs.
Use your imagination :)
.. TODO: documentation for link.WrapLinkerMany
This can be a really powerful debugging tool.
Note the call to *fn* inside the call to *print_eval*; without it, the graph wouldn't get computed at all!
This can be a really powerful debugging tool. Note the call to *fn* inside the call to
*print_eval*; without it, the graph wouldn't get computed at all!
How to Use pdb?
----------------
......@@ -264,7 +266,7 @@ In the majority of cases, you won't be executing from the interactive shell
but from a set of Python scripts. In such cases, the use of the Python
debugger can come in handy, especially as your models become more complex.
Intermediate results don't necessarily have a clear name and you can get
exceptions which are hard to decipher, due to the "compiled" nature of
exceptions which are hard to decipher, due to the "compiled" nature of the
functions.
Consider this example script ("ex.py"):
......@@ -287,7 +289,7 @@ Consider this example script ("ex.py"):
f(mat1, mat2)
This is actually so simple the debugging could be done easily, but it's for
illustrative purposes. As the matrices can't be element-wise multiplied
illustrative purposes. As the matrices can't be multiplied element-wise
(unsuitable shapes), we get the following exception:
.. code-block:: text
......@@ -299,11 +301,11 @@ illustrative purposes. As the matrices can't be element-wise multiplied
File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute ValueError: ('Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains a few useful informations to trace back the source
The call stack contains some useful information to trace back the source
of the error. There's the script where the compiled function was called --
but if you're using (improperly parameterized) prebuilt modules, the error
might originate from ops in these modules, not this script. The last line
tells us about the Op that caused the exception. In this case it's a "mul"
tells us about the op that caused the exception. In this case it's a "mul"
involving variables with names "a" and "b". But suppose we instead had an
intermediate result to which we hadn't given a name.
......
......@@ -74,7 +74,7 @@ Computing More than one Thing at the Same Time
Theano supports functions with multiple outputs. For example, we can
compute the :ref:`elementwise <libdoc_tensor_elementwise>` difference, absolute difference, and
squared difference between two matrices ``a`` and ``b`` at the same time:
squared difference between two matrices *a* and *b* at the same time:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_3
......@@ -123,7 +123,7 @@ array(35.0)
This makes use of the :ref:`Param <function_inputs>` class which allows
you to specify properties of your function's parameters with greater detail. Here we
give a default value of 1 for ``y`` by creating a ``Param`` instance with
give a default value of 1 for *y* by creating a ``Param`` instance with
its ``default`` field set to 1.
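A compact sketch of this mechanism (the numeric values simply echo the ones used above):

.. code-block:: python

    import theano
    import theano.tensor as T

    x, y = T.dscalars('x', 'y')
    z = x + y
    f = theano.function([x, theano.Param(y, default=1)], z)
    f(33)      # -> array(34.0)
    f(33, 2)   # -> array(35.0)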
Inputs with default values must follow inputs without default
......@@ -149,7 +149,7 @@ array(34.0)
array(33.0)
.. note::
``Param`` does not know the name of the local variables ``y`` and ``w``
``Param`` does not know the name of the local variables *y* and *w*
that are passed as arguments. The symbolic variable objects have name
attributes (set by ``dscalars`` in the example above) and *these* are the
names of the keyword parameters in the functions that we build. This is
......@@ -171,7 +171,7 @@ example, let's say we want to make an accumulator: at the beginning,
the state is initialized to zero. Then, on each function call, the state
is incremented by the function's argument.
First let's define the ``accumulator`` function. It adds its argument to the
First let's define the *accumulator* function. It adds its argument to the
internal state, and returns the old state value.
.. If you modify this code, also change :
......@@ -187,13 +187,13 @@ so-called :ref:`shared variables<libdoc_compile_shared>`.
These are hybrid symbolic and non-symbolic variables whose value may be shared
between multiple functions. Shared variables can be used in symbolic expressions just like
the objects returned by ``dmatrices(...)`` but they also have an internal
value, that defines the value taken by this symbolic variable in *all* the
value that defines the value taken by this symbolic variable in *all* the
functions that use it. It is called a *shared* variable because its value is
shared between many functions. The value can be accessed and modified by the
``.get_value()`` and ``.set_value()`` methods. We will come back to this soon.
The other new thing in this code is the ``updates`` parameter of function.
The updates is a list of pairs of the form (shared-variable, new expression).
The other new thing in this code is the ``updates`` parameter of ``function``.
``updates`` must be supplied with a list of pairs of the form (shared-variable, new expression).
It can also be a dictionary whose keys are shared-variables and values are
the new expressions. Either way, it means "whenever this function runs, it
will replace the ``.value`` of each shared variable with the result of the
......@@ -241,9 +241,9 @@ achieve a similar result by returning the new expressions, and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, theano has more control over where and how shared variables are
updates). Also, Theano has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the GPU.
on the :ref:`GPU<using_gpu>`.
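For reference, a minimal accumulator written with the ``updates`` mechanism just described (a sketch consistent with the description above):

.. code-block:: python

    import theano
    import theano.tensor as T

    state = theano.shared(0)
    inc = T.iscalar('inc')
    accumulator = theano.function([inc], state, updates=[(state, state + inc)])

    accumulator(1)       # returns the old state, 0
    state.get_value()    # -> 1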
It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the
......@@ -326,16 +326,16 @@ so we get different random numbers every time.
>>> f_val1 = f() #different numbers from f_val0
When we add the extra argument ``no_default_updates=True`` to
``function`` (as in ``g``), then the random number generator state is
``function`` (as in *g*), then the random number generator state is
not affected by calling the returned function. So, for example, calling
``g`` multiple times will return the same numbers.
*g* multiple times will return the same numbers.
>>> g_val0 = g() # different numbers from f_val0 and f_val1
>>> g_val1 = g() # same numbers as g_val0!
An important remark is that a random variable is drawn at most once during any
single function execution. So the ``nearly_zeros`` function is guaranteed to
return approximately 0 (except for rounding error) even though the ``rv_u``
single function execution. So the *nearly_zeros* function is guaranteed to
return approximately 0 (except for rounding error) even though the *rv_u*
random variable appears three times in the output expression.
>>> nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
......@@ -363,8 +363,8 @@ Sharing Streams Between Functions
---------------------------------
As usual for shared variables, the random number generators used for random
variables are common between functions. So our ``nearly_zeros`` function will
update the state of the generators used in function ``f`` above.
variables are common between functions. So our *nearly_zeros* function will
update the state of the generators used in function *f* above.
For example:
......@@ -416,8 +416,9 @@ The preceding elements are featured in this more realistic example. It will be
prediction = p_1 > 0.5 # The prediction thresholded
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01*(w**2).sum() # The cost to minimize
gw,gb = T.grad(cost, [w,b]) # Compute the gradient of the cost:
# we shall return to this
gw,gb = T.grad(cost, [w,b]) # Compute the gradient of the cost
# (we shall return to this in a
# following section of this tutorial)
# Compile
train = theano.function(
......
......@@ -8,12 +8,12 @@ Extending Theano
Theano Graphs
-------------
- Theano works with symbolic graphs
- Those graphs are bi-partite graphs (graph with 2 types of nodes)
- The 2 types of nodes are Apply and Variable nodes
- Each Apply node has a link to the Op that it executes
- Theano works with symbolic graphs.
- Those graphs are bi-partite graphs (graph with 2 types of nodes).
- The 2 types of nodes are Apply and Variable nodes.
- Each Apply node has a link to the op that it executes.
Inputs and Outputs are lists of Theano variables
Inputs and Outputs are lists of Theano variables.
.. image:: ../hpcs2011_tutorial/pics/apply_node.png
:width: 500 px
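A small sketch of how these pieces can be inspected from Python (the names are illustrative):

.. code-block:: python

    import theano.tensor as T

    x = T.dmatrix('x')
    y = x * 2

    apply_node = y.owner      # the Apply node that produced y
    apply_node.op             # the op it executes (an elementwise multiplication)
    apply_node.inputs         # the list of input Variables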
......@@ -93,12 +93,12 @@ The first one is :func:`make_node`. The second one
would describe the computations that are required to be done
at run time. Currently there are 2 different possibilities:
implement the :func:`perform`
and/or :func:`c_code <Op.c_code>` (and other related :ref:`c methods
<cop>`), or the :func:`make_thunk` method. The ``perform`` allows
to easily wrap an existing Python function into Theano. The ``c_code``
and/or :func:`c_code <Op.c_code>` methods (and other related :ref:`c methods
<cop>`), or the :func:`make_thunk` method. ``perform`` allows you
to easily wrap an existing Python function into Theano. ``c_code``
and related methods allow the op to generate C code that will be
compiled and linked by Theano. On the other hand, the ``make_thunk``
method will be called only once during compilation and should generate
compiled and linked by Theano. On the other hand, ``make_thunk``
will be called only once during compilation and should generate
a ``thunk``: a standalone function that when called will do the wanted computations.
This is useful if you want to generate code and compile it yourself. For
example, this allows you to use PyCUDA to compile GPU code.
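As a stripped-down sketch of the ``perform`` route (omitting ``__eq__``, ``__hash__`` and other niceties a real op should define), an op that doubles its input might look like this:

.. code-block:: python

    import theano
    import theano.tensor as T

    class DoubleOp(theano.Op):
        """Toy op returning twice its input, implemented with perform only."""

        def make_node(self, x):
            x = T.as_tensor_variable(x)
            return theano.Apply(self, [x], [x.type()])

        def perform(self, node, inputs, output_storage):
            x, = inputs
            output_storage[0][0] = x * 2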
......@@ -117,7 +117,7 @@ The :func:`grad` method is required if you want to differentiate some cost whose
includes your op.
The :func:`__str__` method is useful in order to provide a more meaningful
string representation of your Op.
string representation of your op.
The :func:`R_op` method is needed if you want ``theano.tensor.Rop`` to
work with your op.
......@@ -185,9 +185,9 @@ in a file and execute it with the ``nosetests`` program.
**Basic Tests**
Basic tests are done by you just by using the Op and checking that it
Basic tests are done by you just by using the op and checking that it
returns the right answer. If you detect an error, you must raise an
exception. You can use the `assert` keyword to automatically raise an
*exception*. You can use the ``assert`` keyword to automatically raise an
``AssertionError``.
.. code-block:: python
......@@ -211,10 +211,10 @@ exception. You can use the `assert` keyword to automatically raise an
**Testing the infer_shape**
When a class inherits from the ``InferShapeTester`` class, it gets the
``self._compile_and_check`` method that tests the Op ``infer_shape``
method. It tests that the Op gets optimized out of the graph if only
``self._compile_and_check`` method that tests the op's ``infer_shape``
method. It tests that the op gets optimized out of the graph if only
the shape of the output is needed and not the output
itself. Additionally, it checks that such an optimized graph computes
itself. Additionally, it checks that the optimized graph computes
the correct shape, by comparing it to the actual shape of the computed
output.
......@@ -222,8 +222,8 @@ output.
parameters the lists of input and output Theano variables, as would be
provided to ``theano.function``, and a list of real values to pass to the
compiled function (don't use shapes that are symmetric, e.g. (3, 3),
as they can easily to hide errors). It also takes the Op class to
verify that no Ops of that type appear in the shape-optimized graph.
as they can easily hide errors). It also takes the op class as a parameter to
verify that no instance of it appears in the shape-optimized graph.
If there is an error, the function raises an exception. If you want to
see it fail, you can implement an incorrect ``infer_shape``.
......@@ -248,7 +248,7 @@ see it fail, you can implement an incorrect ``infer_shape``.
**Testing the gradient**
The function :ref:`verify_grad <validating_grad>`
verifies the gradient of an Op or Theano graph. It compares the
verifies the gradient of an op or Theano graph. It compares the
analytic (symbolically computed) gradient and the numeric
gradient (computed through the Finite Difference Method).
......@@ -266,13 +266,12 @@ the multiplication by 2).
.. TODO: repair defective links in the following paragraph
The class :class:`RopLop_checker`, give the functions
:func:`RopLop_checker.check_mat_rop_lop`,
:func:`RopLop_checker.check_rop_lop` and
:func:`RopLop_checker.check_nondiff_rop` that allow to test the
implementation of the Rop method of one Op.
The class :class:`RopLop_checker` defines the functions
:func:`RopLop_checker.check_mat_rop_lop`, :func:`RopLop_checker.check_rop_lop` and
:func:`RopLop_checker.check_nondiff_rop`. These allow one to test the
implementation of the Rop method of a particular op.
To verify the Rop method of the DoubleOp, you can use this:
For instance, to verify the Rop method of the DoubleOp, you can use this:
.. code-block:: python
......@@ -290,7 +289,7 @@ Running your tests
You can run ``nosetests`` in the Theano folder to run all of Theano's
tests, including yours if they are somewhere in the directory
structure. You can run ``nosetests test_file.py`` to run only the
structure. For instance, you can run ``nosetests test_file.py`` to run only the
tests in that file. You can run ``nosetests
test_file.py:test_DoubleRop`` to run only the tests inside that test
class. You can run ``nosetests
......@@ -298,7 +297,7 @@ test_file.py:test_DoubleRop.test_double_op`` to run only one
particular test. More `nosetests
<http://readthedocs.org/docs/nose/en/latest/>`_ documentation.
You can also add this at the end of the test file:
You can also add this block at the end of the test file and run the file:
.. code-block:: python
......@@ -311,14 +310,13 @@ You can also add this at the end of the test file:
**Testing GPU Ops**
Ops that execute on the GPU should inherit from the
``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows Theano
to make the distinction between both. Currently, we use this to test
if the NVIDIA driver works correctly with our sum reduction code on the
Ops to be executed on the GPU should inherit from the ``theano.sandbox.cuda.GpuOp``
and not ``theano.Op``. This allows Theano to distinguish them. Currently, we
use this to test if the NVIDIA driver works correctly with our sum reduction code on the
GPU.
A more extensive discussion than this section's may be found in the advanced
A more extensive discussion of this section's topic may be found in the advanced
tutorial :ref:`Extending Theano<extending>`
-------------------------------------------
......
......@@ -8,19 +8,17 @@ Frequently Asked Questions
TypeError: object of type 'TensorVariable' has no len()
-------------------------------------------------------
If you receive this error:
If you receive the following error, it is because the Python function *__len__* cannot
be implemented on Theano variables:
.. code-block:: python
TypeError: object of type 'TensorVariable' has no len()
We can't implement the __len__ function on Theano Variables. This is
because Python requires that this function returns an integer, but we
can't do this as we are working with symbolic variables. You can use
`var.shape[0]` as a workaround.
Python requires that *__len__* return an integer, yet this cannot be done for Theano's symbolic variables. However, `var.shape[0]` can be used as a workaround.
Also we can't change the above error message into a more explicit one
because of some other Python internal behavior that can't be modified.
This error message cannot be made more explicit because the relevant aspects of Python's
internals cannot be modified.
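A minimal sketch of the workaround:

.. code-block:: python

    import theano.tensor as T

    v = T.dvector('v')
    # len(v) would raise the TypeError above; use the symbolic shape instead
    n = v.shape[0]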
Faster gcc optimization
......
......@@ -9,13 +9,13 @@ PyCUDA
Currently, PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called CudaNdarray and supports
Theano's implementation is called *CudaNdarray* and supports
*strides*. It also only supports the *float32* dtype. PyCUDA's implementation
is called GPUArray and doesn't support *strides*. However, it can deal with
is called *GPUArray* and doesn't support *strides*. However, it can deal with
all NumPy and CUDA dtypes.
We are currently working on having the same base object that will
mimic Numpy. Until this is ready, here is some information on how to
We are currently working on a common base object for both implementations that will
also mimic NumPy. Until this is ready, here is some information on how to
use both objects in the same script.
Transfer
......@@ -24,8 +24,8 @@ Transfer
You can use the ``theano.misc.pycuda_utils`` module to convert GPUArray to and
from CudaNdarray. The functions ``to_cudandarray(x, copyif=False)`` and
``to_gpuarray(x)`` return a new object that occupies the same memory space
as the original. Otherwise it raises a ValueError. Because GPUArrays don't
support *strides*, if the CudaNdarray is strided, we could copy it to
as the original. Otherwise it raises a *ValueError*. Because GPUArrays don't
support strides, if the CudaNdarray is strided, we could copy it to
have a non-strided copy. The resulting GPUArray won't share the same
memory region. If you want this behavior, set ``copyif=True`` in
``to_gpuarray``.
......@@ -122,13 +122,15 @@ CUDAMat
There are functions for conversion between CUDAMat objects and Theano's CudaNdArray objects.
They obey the same principles as Theano's PyCUDA functions and can be found in
theano.misc.cudamat_utils.py
``theano.misc.cudamat_utils.py``.
WARNING: There is a strange problem associated with stride/shape with those converters.
In order to work, the test needs a transpose and reshape...
.. TODO: this statement is unclear:
WARNING: There is a peculiar problem associated with stride/shape with those converters.
In order to work, the test needs a *transpose* and *reshape*...
Gnumpy
======
There are conversion functions between Gnumpy ``garray`` objects and Theano CudaNdArray objects.
They are also similar to Theano's PyCUDA functions and can be found in theano.misc.gnumpy_utils.py.
There are conversion functions between Gnumpy *garray* objects and Theano CudaNdArray objects.
They are also similar to Theano's PyCUDA functions and can be found in ``theano.misc.gnumpy_utils.py``.
......@@ -10,12 +10,14 @@ Computing Gradients
===================
Now let's use Theano for a slightly more sophisticated task: create a
function which computes the derivative of some expression ``y`` with
respect to its parameter ``x``. To do this we will use the macro ``T.grad``.
function which computes the derivative of some expression *y* with
respect to its parameter *x*. To do this we will use the macro ``T.grad``.
For instance, we can compute the
gradient of :math:`x^2` with respect to :math:`x`. Note that:
:math:`d(x^2)/dx = 2 \cdot x`.
.. TODO: fix the vertical positioning of the expressions in the preceding paragraph
Here is the code to compute this gradient:
.. If you modify this code, also change :
......@@ -36,7 +38,7 @@ array(188.40000000000001)
In this example, we can see from ``pp(gy)`` that we are computing
the correct symbolic gradient.
``fill((x ** 2), 1.0)`` means to make a matrix of the same shape as
``x ** 2`` and fill it with 1.0.
``x ** 2`` and fill it with 1.0.
.. note::
The optimizer simplifies the symbolic gradient expression. You can see
......@@ -56,7 +58,7 @@ logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.
.. figure:: dlogistic.png
A plot of the gradient of the logistic function, with x on the x-axis
A plot of the gradient of the logistic function, with *x* on the x-axis
and :math:`ds(x)/dx` on the y-axis.
......@@ -71,17 +73,17 @@ logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.
array([[ 0.25 , 0.19661193],
[ 0.19661193, 0.10499359]])
In general, for any **scalar** expression ``s``, ``T.grad(s, w)`` provides
In general, for any **scalar** expression *s*, ``T.grad(s, w)`` provides
the Theano expression for computing :math:`\frac{\partial s}{\partial w}`. In
this way Theano can be used for doing **efficient** symbolic differentiation
(as the expression return by ``T.grad`` will be optimized during compilation), even for
(as the expression returned by ``T.grad`` will be optimized during compilation), even for
functions with many inputs (see `automatic differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_ for a description
of symbolic differentiation).
.. note::
The second argument of ``T.grad`` can be a list, in which case the
output is also a list. The order in both lists is important, element
output is also a list. The order in both lists is important: element
*i* of the output list is the gradient of the first argument of
``T.grad`` with respect to the *i*-th element of the list given as second argument.
The first argument of ``T.grad`` has to be a scalar (a tensor
......@@ -95,14 +97,17 @@ of symbolic differentiation).
Computing the Jacobian
======================
Theano implements :func:`theano.gradient.jacobian` macro that does all
what is needed to compute the Jacobian. The following text explains how
In Theano's parlance, the term *Jacobian* designates the tensor comprising the
first partial derivatives of the output of a function with respect to its inputs.
(This is a generalization of the so-called Jacobian matrix in mathematics.)
Theano implements the :func:`theano.gradient.jacobian` macro that does all
that is needed to compute the Jacobian. The following text explains how
to do it manually.
In order to manually compute the Jacobian of some function ``y`` with
respect to some parameter ``x`` we need to use ``scan``. What we
do is to loop over the entries in ``y`` and compute the gradient of
``y[i]`` with respect to ``x``.
In order to manually compute the Jacobian of some function *y* with
respect to some parameter *x* we need to use ``scan``. What we
do is to loop over the entries in *y* and compute the gradient of
*y[i]* with respect to *x*.
.. note::
......@@ -110,7 +115,7 @@ do is to loop over the entries in ``y`` and compute the gradient of
manner all kinds of recurrent equations. While creating
symbolic loops (and optimizing them for performance) is a hard task,
effort is being made to improve the performance of ``scan``. We
shall return to ``scan`` in a moment.
shall return to ``scan`` later in this tutorial.
>>> x = T.dvector('x')
>>> y = x**2
......@@ -120,31 +125,33 @@ do is to loop over the entries in ``y`` and compute the gradient of
array([[ 8., 0.],
[ 0., 8.]])
What we do in this code is to generate a sequence of ints from ``0`` to
What we do in this code is to generate a sequence of *ints* from *0* to
``y.shape[0]`` using ``T.arange``. Then we loop through this sequence, and
at each step, we compute the gradient of element ``y[[i]`` with respect to
``x``. ``scan`` automatically concatenates all these rows, generating a
at each step, we compute the gradient of element *y[i]* with respect to
*x*. ``scan`` automatically concatenates all these rows, generating a
matrix which corresponds to the Jacobian.
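For completeness, a sketch of the ``scan``-based Jacobian just described (it reproduces the result shown above):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dvector('x')
    y = x ** 2
    J, updates = theano.scan(lambda i, y, x: T.grad(y[i], x),
                             sequences=T.arange(y.shape[0]),
                             non_sequences=[y, x])
    f = theano.function([x], J, updates=updates)
    f([4, 4])   # -> [[ 8.  0.]
                #     [ 0.  8.]]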
.. note::
There are some pitfalls to be aware of regarding ``T.grad``. One of them is that you
cannot re-write the above expression of the jacobian as
cannot re-write the above expression of the Jacobian as
``theano.scan(lambda y_i,x: T.grad(y_i,x), sequences=y,
non_sequences=x)``, even though from the documentation of scan this
seems possible. The reason is that ``y_i`` will not be a function of
``x`` anymore, while ``y[i]`` still is.
seems possible. The reason is that *y_i* will not be a function of
*x* anymore, while *y[i]* still is.
Computing the Hessian
=====================
Theano implements :func:`theano.gradient.hessian` macro that does all
In Theano, the term *Hessian* has the usual mathematical meaning: it is the
matrix comprising the second-order partial derivatives of a function with scalar
output and vector input. Theano implements the :func:`theano.gradient.hessian` macro that does all
that is needed to compute the Hessian. The following text explains how
to do it manually.
You can compute the Hessian manually similarly to the Jacobian. The only
difference is that now, instead of computing the Jacobian of some expression
*y*, we compute the Jacobian of ``T.grad(cost,x)``, where *cost* is some
scalar.
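A minimal sketch of this recipe (again assuming the standard ``theano.scan`` interface; the names are illustrative):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dvector('x')
    cost = T.sum(x ** 2)
    gy = T.grad(cost, x)
    # The Hessian is the Jacobian of the gradient: loop over the entries of gy.
    H, updates = theano.scan(lambda i, gy, x: T.grad(gy[i], x),
                             sequences=T.arange(gy.shape[0]),
                             non_sequences=[gy, x])
    f = theano.function([x], H, updates=updates)
    print f([4, 4])
    # expected:
    # [[ 2.  0.]
    #  [ 0.  2.]]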
......@@ -181,12 +188,12 @@ R-operator
The *R operator* is built to evaluate the product between a Jacobian and a
vector, namely :math:`\frac{\partial f(x)}{\partial x} v`. The formulation
can be extended even to the case where *x* is a matrix, or a tensor in general, in
which case the Jacobian also becomes a tensor and the product becomes some kind
of tensor product. Because in practice we end up needing to compute such
expressions in terms of weight matrices, Theano supports this more generic
form of the operation. In order to evaluate the *R-operation* of
expression *y*, with respect to *x*, multiplying the Jacobian with *v*
you need to do something similar to this:
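The actual snippet is elided in this diff; a minimal sketch using ``T.Rop``, with illustrative variable names:

.. code-block:: python

    import theano
    import theano.tensor as T

    W = T.dmatrix('W')
    V = T.dmatrix('V')
    x = T.dvector('x')
    y = T.dot(x, W)
    # R-operator: the Jacobian of y with respect to W, multiplied by V.
    JV = T.Rop(y, W, V)
    f = theano.function([W, V, x], JV)
    print f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0, 1])
    # expected: [ 2.  2.]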
......@@ -221,19 +228,19 @@ array([[ 0., 0.],
.. note::
`v`, the *point of evaluation*, differs between the *L-operator* and the *R-operator*.
For the *L-operator*, the point of evaluation needs to have the same shape
as the output, whereas for the *R-operator* this point should
have the same shape as the input parameter. Furthermore, the results of these two
operations differ. The result of the *L-operator* is of the same shape
as the input parameter, while the result of the *R-operator* has a shape similar
to that of the output.
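To make the shape difference concrete, a minimal sketch of the *L-operator* via ``T.Lop`` (illustrative names; the printed values are what such a run is expected to produce):

.. code-block:: python

    import theano
    import theano.tensor as T

    W = T.dmatrix('W')
    v = T.dvector('v')
    x = T.dvector('x')
    y = T.dot(x, W)
    # L-operator: v has the shape of the output y; the result has the shape of W.
    VJ = T.Lop(y, W, v)
    f = theano.function([v, x], VJ)
    print f([2, 2], [0, 1])
    # expected:
    # [[ 0.  0.]
    #  [ 2.  2.]]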
Hessian times a Vector
======================
If you need to compute the *Hessian times a vector*, you can make use of the
above-defined operators to do it more efficiently than actually computing
the exact Hessian and then performing the product. Due to the symmetry of the
Hessian matrix, you have two options that will
......@@ -267,7 +274,7 @@ Final Pointers
==============
* The ``grad`` function works symbolically: it receives and returns Theano variables.
* ``grad`` can be compared to a macro since it can be applied repeatedly.
......@@ -276,5 +283,5 @@ Final Pointers
* Built-in functions allow efficient computation of *vector times Jacobian* and *vector times Hessian*.
* Work is in progress on the optimizations required to efficiently compute the full
Jacobian and the Hessian matrix as well as the *Jacobian times vector*.
......@@ -6,8 +6,8 @@ Loading and Saving
==================
Python's standard way of saving class instances and reloading them
is the pickle_ mechanism. Many Theano objects can be *serialized* (and
*deserialized*) by ``pickle``; however, a limitation of ``pickle`` is that
it does not save the code or data of a class along with the instance of
the class being serialized. As a result, reloading objects created by a
previous version of a class can be really problematic.
......@@ -126,7 +126,7 @@ maybe defining the attributes you want to save, rather than the ones you
don't.
For instance, if the only parameters you want to save are a weight
matrix *W* and a bias *b*, you can define:
.. code-block:: python
......@@ -138,8 +138,8 @@ matrix ``W`` and a bias ``b``, you can define:
self.W = W
self.b = b
If at some point in time *W* is renamed to *weights* and *b* to
*bias*, the older pickled files will still be usable, if you update these
functions to reflect the change in name:
.. code-block:: python
......@@ -152,6 +152,6 @@ functions to reflect the change in name:
self.weights = W
self.bias = b
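The bodies of these methods are elided in this diff; a minimal sketch of the pattern being described, with a hypothetical class name, could look as follows:

.. code-block:: python

    class Model(object):
        # Save only the parameters we care about (hypothetical attribute names).
        def __getstate__(self):
            return (self.weights, self.bias)

        # Map the saved values onto the (renamed) attributes when unpickling.
        def __setstate__(self, state):
            W, b = state
            self.weights = W
            self.bias = b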
For more information on advanced use of ``pickle`` and its internals, see Python's
pickle_ documentation.
......@@ -9,10 +9,10 @@ Scan
====
- A general form of *recurrence*, which can be used for looping.
- *Reduction* and *map* (loop over the leading dimensions) are special cases of ``scan``.
- You ``scan`` a function along some input sequence, producing an output at each time-step.
- The function can see the *previous K time-steps* of your function.
- ``sum()`` could be computed by scanning the *z + x(i)* function over a list, given an initial state of *z=0*.
- Often a *for* loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over *for* loops:
......@@ -30,6 +30,7 @@ The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False
k = T.iscalar("k"); A = T.vector("A")
......@@ -54,8 +55,10 @@ The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
.. code-block:: python
import numpy
import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x"); max_coefficients_supported = 10000
......
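The ``scan`` examples above are truncated by the diff. As a self-contained illustration of the ``sum()`` bullet, a minimal sketch (assuming the standard ``theano.scan`` interface) of scanning the *z + x(i)* function over a vector:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.dvector('x')
    # Running sum: at each step the new state is z + x_i, starting from z = 0.
    results, updates = theano.scan(fn=lambda x_i, z: z + x_i,
                                   sequences=x,
                                   outputs_info=T.as_tensor_variable(
                                       numpy.asarray(0., dtype=x.dtype)))
    f = theano.function([x], results[-1], updates=updates)
    print f([1, 2, 3, 4])   # expected: 10.0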
......@@ -9,14 +9,14 @@ Configuration Settings and Compiling Modes
Configuration
=============
The ``config`` module contains several *attributes* that modify Theano's behavior. Many of these
attributes are examined during the import of the ``theano`` module and several are assumed to be
read-only.
*As a rule, the attributes in the* ``config`` *module should not be modified inside the user code.*
Theano's code comes with default values for these attributes, but you can
override them from your ``.theanorc`` file, and override those values in turn by
the :envvar:`THEANO_FLAGS` environment variable.
The order of precedence is:
......@@ -110,6 +110,8 @@ time the execution using the command line ``time python file.py``.
.. TODO: To be resolved:
.. Solution said:
.. You will need to use: ``theano.config.floatX`` and ``ndarray.astype("str")``
.. Why the latter portion?
......@@ -119,10 +121,10 @@ time the execution using the command line ``time python file.py``.
* Apply the Theano flag ``floatX=float32`` through (``theano.config.floatX``) in your code.
* Cast inputs before storing them into a shared variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
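To illustrate the manual cast around the mean operator, a minimal sketch (assuming ``floatX=float32``):

.. code-block:: python

    import theano.tensor as T

    x = T.matrix('x', dtype='float32')
    # mean() divides by the length (an int64), which would upcast the result
    # to float64; casting it back keeps the graph in float32.
    m = T.cast(x.mean(), 'float32')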
-------------------------------------------
......@@ -130,7 +132,7 @@ time the execution using the command line ``time python file.py``.
Mode
====
Every time :func:`theano.function <function.function>` is called,
the symbolic relationships between the input and output Theano *variables*
are optimized and compiled. The way this compilation occurs
is controlled by the value of the ``mode`` parameter.
......@@ -139,9 +141,9 @@ Theano defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
- ``'DEBUG_MODE'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes, but can identify
several kinds of problems.
- ``'PROFILE_MODE'``: Same optimizations as ``FAST_RUN``, but print some profiling information.
The default mode is typically ``FAST_RUN``, but it can be controlled via
......@@ -152,18 +154,18 @@ which can be overridden by passing the keyword argument to
================= =============================================================== ===============================================================================
short name Full constructor What does it do?
================= =============================================================== ===============================================================================
``FAST_COMPILE`` ``compile.mode.Mode(linker='py', optimizer='fast_compile')`` Python implementations only, quick and cheap graph transformations
``FAST_RUN`` ``compile.mode.Mode(linker='c|py', optimizer='fast_run')`` C implementations where available, all available graph transformations.
``DEBUG_MODE`` ``compile.debugmode.DebugMode()`` Both implementations where available, all available graph transformations.
``PROFILE_MODE`` ``compile.profilemode.ProfileMode()`` C implementations where available, all available graph transformations, print profile information.
================= =============================================================== ===============================================================================
Linkers
=======
A mode is composed of 2 things: an optimizer and a linker. Some modes,
like ``PROFILE_MODE`` and ``DEBUG_MODE``, add logic around the optimizer and
linker. ``PROFILE_MODE`` and ``DEBUG_MODE`` use their own linker.
You can select which linker to use with the Theano flag :attr:`config.linker`.
Here is a table to compare the different linkers.
......@@ -184,8 +186,8 @@ DebugMode no yes VERY HIGH Make many checks on what
.. [#gc] Garbage collection of intermediate results during computation.
Otherwise, the memory space used by the ops is kept between
Theano function calls, in order not to
reallocate memory, and lower the overhead (make it faster...).
.. [#cpy1] Default
.. [#cpy2] Deprecated
......@@ -201,10 +203,10 @@ While normally you should use the ``FAST_RUN`` or ``FAST_COMPILE`` mode,
it is useful at first (especially when you are defining new kinds of
expressions or new optimizations) to run your code using the DebugMode
(available via ``mode='DEBUG_MODE'``). The DebugMode is designed to
run several self-checks and assertions that can help diagnose
possible programming errors leading to incorrect output. Note that
``DEBUG_MODE`` is much slower than ``FAST_RUN`` or ``FAST_COMPILE`` so
use it only during development (not when you launch 1000 processes on a
cluster!).
......@@ -225,14 +227,16 @@ DebugMode is used as follows:
If any problem is detected, DebugMode will raise an exception according to
what went wrong, either at call time (*f(5)*) or compile time (
``f = theano.function([x], 10*x, mode='DEBUG_MODE')``). These exceptions
should *not* be ignored; talk to your local Theano guru or email the
users list if you cannot make the exception go away.
Some kinds of errors can only be detected for certain input value combinations.
In the example above, there is no way to guarantee that a future call to, say,
*f(-1)*, won't cause a problem. DebugMode is not a silver bullet.
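The compile-and-call snippet referred to in the surrounding text is elided in this diff; a minimal sketch:

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dvector('x')
    f = theano.function([x], 10 * x, mode='DEBUG_MODE')

    # DebugMode re-runs its checks on every call, for these particular inputs.
    f([5])
    f([0])
    f([7])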
.. TODO: repair the following link
If you instantiate DebugMode using the constructor (see :class:`DebugMode`)
rather than the keyword ``DEBUG_MODE`` you can configure its behaviour via
......@@ -277,7 +281,7 @@ implementation only, should use the gof.PerformLinker (or "py" for
short). On the other hand, a user wanting to profile his graph using C
implementations wherever possible should use the ``gof.OpWiseCLinker``
(or "c|py"). For testing the speed of your code we would recommend
using the ``fast_run`` optimizer and the ``gof.OpWiseCLinker`` linker.
Compiling your Graph with ProfileMode
-------------------------------------
......@@ -300,7 +304,7 @@ the desired timing information, indicating where your graph is spending most
of its time. This is best shown through an example. Let's use our logistic
regression example.
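The setup code is elided here; a minimal sketch of compiling with ``ProfileMode`` (the constructor arguments are those accepted by older Theano releases, and the cost expression is only illustrative):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    w = theano.shared(numpy.random.randn(784), name='w')
    cost = T.nnet.sigmoid(T.dot(x, w)).sum()

    # Compile with a ProfileMode instance, call the function, then print timings.
    profmode = theano.ProfileMode(optimizer='fast_run',
                                  linker=theano.gof.OpWiseCLinker())
    f = theano.function([x], cost, mode=profmode)
    f(numpy.random.randn(400, 784))
    profmode.print_summary()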
Compiling the module with ``ProfileMode`` and calling ``profmode.print_summary()``
generates the following output:
.. code-block:: python
......@@ -352,14 +356,14 @@ generates the following output:
This output has two components. In the first section called
*Apply-wise summary*, timing information is provided for the worst
offending ``Apply`` nodes. This corresponds to individual op applications
within your graph which took longest to execute (so if you use
``dot`` twice, you will see two entries there). In the second portion,
the *Op-wise summary*, the execution times of all ``Apply`` nodes executing
the same op are grouped together and the total execution time per op
is shown (so if you use ``dot`` twice, you will see only one entry
there corresponding to the sum of the time spent in each of them).
Finally, notice that the ``ProfileMode`` also shows which ops were running a C
implementation.
......
.. _tutorial_printing_drawing:
==============================
Printing/Drawing Theano graphs
==============================
.. TODO: repair the defective links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
Theano also provides :func:`pydotprint` that creates a *png* image of the function.
You can read about them in :ref:`libdoc_printing`.
Consider again the logistic regression, but notice the additional printing instructions.
The following output depicts the pre- and post-compilation graphs.
.. code-block:: python
    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))       # Probability of having a one
    prediction = p_1 > 0.5                        # The prediction that is done: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)     # Cross-entropy
    cost = xent.mean() + 0.01*(w**2).sum()        # The cost to optimize
    gw, gb = T.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
                inputs=[x, y],
                outputs=[prediction, xent],
                updates={w: w-0.01*gw, b: b-0.01*gb},
                name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([x.op.__class__.__name__ == 'Gemv' for x in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([x.op.__class__.__name__ == 'GpuGemm' for x in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])

    # Print the picture graphs
    # after compilation
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    # before compilation
    theano.printing.pydotprint_variables(prediction,
                                         outfile="pics/logreg_pydotprint_prediction.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)
Pretty Printing
===============
``theano.printing.pprint(variable)``
>>> theano.printing.pprint(prediction) # (pre-compilation)
gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),TensorConstant{0.5})
Debug Printing
==============
``theano.printing.debugprint({fct, variable, list of variables})``
>>> theano.printing.debugprint(prediction) # (pre-compilation)
Elemwise{gt,no_inplace} [@181772236] ''
|Elemwise{true_div,no_inplace} [@181746668] ''
| |InplaceDimShuffle{x} [@181746412] ''
| | |TensorConstant{1} [@181745836]
| |Elemwise{add,no_inplace} [@181745644] ''
| | |InplaceDimShuffle{x} [@181745420] ''
| | | |TensorConstant{1} [@181744844]
| | |Elemwise{exp,no_inplace} [@181744652] ''
| | | |Elemwise{sub,no_inplace} [@181744012] ''
| | | | |Elemwise{neg,no_inplace} [@181730764] ''
| | | | | |dot [@181729676] ''
| | | | | | |x [@181563948]
| | | | | | |w [@181729964]
| | | | |InplaceDimShuffle{x} [@181743788] ''
| | | | | |b [@181730156]
|InplaceDimShuffle{x} [@181771788] ''
| |TensorConstant{0.5} [@181771148]
>>> theano.printing.debugprint(predict) # (post-compilation)
Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] '' 2
|dot [@183018796] '' 1
| |x [@183000780]
| |w [@183000812]
|InplaceDimShuffle{x} [@183133580] '' 0
| |b [@183000876]
|TensorConstant{[ 0.5]} [@183084108]
Picture Printing
================
>>> theano.printing.pydotprint_variables(prediction) # (pre-compilation)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_prediction.png
:width: 800 px
Notice that ``pydotprint()`` requires *Graphviz* and Python's ``pydot``.
>>> theano.printing.pydotprint(predict) # (post-compilation)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_predic.png
:width: 800 px
>>> theano.printing.pydotprint(train) # This is a small train example!
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_train.png
:width: 1500 px
......@@ -5,15 +5,19 @@
Some general Remarks
=====================
.. TODO: This discussion is awkward. Even with this beneficial reordering (28 July 2012) its purpose and message are unclear.
Limitations
-----------
Theano offers a good amount of flexibility, but has some limitations too.
How then can you write your algorithm to make the most of what Theano can do?
- *While*- or *for*-Loops within an expression graph are supported, but only via
the :func:`theano.scan` op (which puts restrictions on how the loop body can
interact with the rest of the graph).
- Neither *goto* nor *recursion* is supported or planned within expression graphs.
......@@ -18,7 +18,7 @@ Currently, information regarding shape is used in two ways in Theano:
`Op.infer_shape <http://deeplearning.net/software/theano/extending/cop.html#Op.infer_shape>`_
method.
Example:
.. code-block:: python
......@@ -40,7 +40,7 @@ Shape Inference Problem
=======================
Theano propagates information about shape in the graph. Sometimes this
can lead to errors. Consider this example:
.. code-block:: python
......@@ -90,19 +90,19 @@ example), an inferred shape is computed directly, without executing
the computation itself (there is no ``join`` in the first output or debugprint).
This makes the computation of the shape faster, but it can also hide errors. In
this example, the computation of the shape of the output of ``join`` is done only
based on the first input Theano variable, which leads to an error.
This might happen with other ops such as ``elemwise`` and ``dot``, for example.
Indeed, to perform some optimizations (for speed or stability, for instance),
Theano assumes that the computation is correct and consistent
in the first place, as it does here.
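The example code is elided in this diff; a minimal sketch of the kind of graph being discussed, where the shape of a ``join`` with incompatible inputs is inferred from the first input only:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    y = T.matrix('y')
    z = T.join(0, x, y)

    xv = numpy.random.rand(5, 4)
    yv = numpy.random.rand(3, 3)    # incompatible with x along axis 1

    f = theano.function([x, y], z.shape)
    # The shape is inferred from x alone, so no error is raised here,
    # even though computing z itself would fail.
    print f(xv, yv)                 # expected: [8 4]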
You can detect those problems by running the code without this
optimization, using the Theano flag
``optimizer_excluding=local_shape_to_shape_i``. You can also obtain the
same effect by running in the modes ``FAST_COMPILE`` (it will not apply this
optimization, nor most other optimizations) or ``DEBUG_MODE`` (it will test
before and after all optimizations (much slower)).
......@@ -113,15 +113,15 @@ Currently, specifying a shape is not as easy and flexible as we wish and we plan
upgrade. Here is the current state of what can be done:
- You can pass the shape info directly to the ``ConvOp`` created
when calling ``conv2d``. You simply set the parameters ``image_shape``
and ``filter_shape`` inside the call. They must be tuples of 4
elements. For example:
.. code-block:: python
theano.tensor.nnet.conv2d(..., image_shape=(7,3,5,5), filter_shape=(2,3,4,4))
- You can use the SpecifyShape op to add shape information anywhere in the
- You can use the ``SpecifyShape`` op to add shape information anywhere in the
graph. This allows Theano to perform some optimizations. In the following example,
this makes it possible to precompute the Theano function to a constant.
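The example itself is elided in this diff; a minimal sketch using ``theano.tensor.specify_shape``, the user-level interface to the ``SpecifyShape`` op:

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.matrix('x')
    x_specified = T.specify_shape(x, (2, 2))
    # With the shape known, the shape of the result can be precomputed
    # to the constant [2 2] at compile time.
    f = theano.function([x], (x_specified ** 2).shape)
    theano.printing.debugprint(f)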
......@@ -138,6 +138,6 @@ Future Plans
============
The parameter "constant shape" will be added to ``theano.shared()``. This is probably
the most frequent occurrence with ``shared`` variables. It will make the code
simpler and will make it possible to check that the shape does not change when
updating the ``shared`` variable.
......@@ -19,7 +19,7 @@ relations using symbolic placeholders (**variables**). When writing down
these expressions you use operations like ``+``, ``-``, ``**``,
``sum()``, ``tanh()``. All these are represented internally as **ops**.
An **op** represents a certain computation on some type of inputs
producing some type of output. You can see it as a *function definition*
in most programming languages.
Theano builds internally a graph structure composed of interconnected
......@@ -69,15 +69,15 @@ Take for example the following code:
x = T.dmatrix('x')
y = x*2.
If you enter ``type(y.owner)`` you get ``<class 'theano.gof.graph.Apply'>``,
which is the apply node that connects the op and the inputs to get this
output. You can now print the name of the op that is applied to get
*y*:
>>> y.owner.op.name
'Elemwise{mul,no_inplace}'
Hence, an elementwise multiplication is used to compute *y*. This
multiplication is done between the inputs:
>>> len(y.owner.inputs)
......@@ -89,7 +89,7 @@ InplaceDimShuffle{x,x}.0
Note that the second input is not 2 as we would have expected. This is
because 2 was first :term:`broadcasted <broadcasting>` to a matrix of
the same shape as *x*. This is done using the ``DimShuffle`` op:
>>> type(y.owner.inputs[1])
<class 'theano.tensor.basic.TensorVariable'>
......@@ -122,7 +122,7 @@ Using the
these gradients can be composed in order to obtain the expression of the
gradient of the graph's output with respect to the graph's inputs.
A following section of this tutorial will examine the topic of differentiation
in greater detail.
......@@ -131,7 +131,7 @@ Optimizations
When compiling a Theano function, what you give to the
:func:`theano.function <function.function>` is actually a graph
(starting from the output variables you can traverse the graph up to
the input variables). While this graph structure shows how to compute
the output from the input, it also offers the possibility to improve the
way this computation is carried out. The way optimizations work in
......
......@@ -5,11 +5,14 @@
Using the GPU
=============
For an introductory discussion of *Graphics Processing Units* (GPUs) and their use for
intensive parallel computation purposes, see `GPGPU <http://en.wikipedia.org/wiki/GPGPU>`_.
One of Theano's design goals is to specify computations at an
abstract level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take advantage of
this flexibility is in carrying out calculations on an Nvidia graphics card when
the device present in the computer is CUDA-enabled.
Setting Up CUDA
----------------
......@@ -52,11 +55,11 @@ file and run it.
else:
print 'Used the gpu'
The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to
make sure that the input *x* is stored on the graphics device.
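The test program is elided in this diff; a minimal sketch along the lines described (timing ``exp()`` over a large ``shared`` *float32* vector; the GPU check at the end is only a heuristic):

.. code-block:: python

    import time
    import numpy
    import theano
    import theano.tensor as T

    vlen = 10 * 30 * 768
    iters = 1000

    rng = numpy.random.RandomState(22)
    # The shared variable keeps the data on the device when device=gpu is used.
    x = theano.shared(numpy.asarray(rng.rand(vlen), theano.config.floatX))
    f = theano.function([], T.exp(x))

    t0 = time.time()
    for i in xrange(iters):
        r = f()
    print 'Looping %d times took %f seconds' % (iters, time.time() - t0)

    if any(n.op.__class__.__name__.startswith('Gpu')
           for n in f.maker.fgraph.toposort()):
        print 'Used the gpu'
    else:
        print 'Used the cpu'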
If I run this program (in thing.py) with ``device=cpu``, my computer takes a little over 7 seconds,
whereas on the GPU it takes just over 0.4 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.
......@@ -71,18 +74,18 @@ same floating-point numbers as the CPU. As a benchmark, a loop that calls ``nump
Looping 1000 times took 0.418929815292 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Note that, for now, GPU operations in Theano require ``floatX`` to be *float32* (see also below).
Returning a Handle to Device-Allocated Data
-------------------------------------------
The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in ``device=gpu``, but
if you don't mind less portability, you might gain a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The ``gpu_from_host``
op means "copy the input from the host to the GPU" and it is optimized away
after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
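A minimal sketch of this idea (it assumes a CUDA-enabled Theano installation; otherwise the import below fails):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T
    from theano.sandbox.cuda.basic_ops import gpu_from_host

    x = theano.shared(numpy.random.rand(1000).astype('float32'))
    # gpu_from_host marks the value as living on the GPU; after optimization
    # the copy is removed and exp() runs (and stays) on the device.
    f = theano.function([], gpu_from_host(T.exp(x)))
    r = f()    # r is a CudaNdarray handle, not a NumPy ndarray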
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_2
......@@ -131,12 +134,16 @@ NumPy casting mechanism.
Running the GPU at Full Speed
------------------------------
.. TODO: the discussion of this section is unintelligible to a beginner
To really get maximum performance in this simple example, we need to use an :class:`Out`
instance to tell Theano not to copy the output it returns to us. Theano allocates memory for
internal use like a working buffer, but by default it will never return a result that is
allocated in the working buffer. This is normally what you want, but our example is so simple
that it has the unwanted side-effect of really slowing things down.
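The corresponding snippet is elided in this diff; a minimal sketch combining ``theano.Out(..., borrow=True)`` with the ``gpu_from_host`` op from above (again assuming a CUDA-enabled installation):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T
    from theano.sandbox.cuda.basic_ops import gpu_from_host

    x = theano.shared(numpy.random.rand(1000).astype('float32'))
    # borrow=True lets Theano hand back its internal buffer instead of a copy.
    f = theano.function([],
                        theano.Out(gpu_from_host(T.exp(x)), borrow=True))
    r = f()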
..
TODO:
The story here about copying and working buffers is misleading and potentially not correct
......@@ -181,12 +188,11 @@ the CPU implementation!
Result is <CudaNdarray object at 0x31eeaf0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
This version of the code using ``borrow=True`` is slightly less safe because if we had saved
the *r* returned from one function call, we would have to take care and remember that its value might
be over-written by a subsequent function call. Although ``borrow=True`` makes a dramatic difference
in this example, be careful! The advantage of ``borrow=True`` is much weaker in larger graphs, and
there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.
What Can Be Accelerated on the GPU?
......@@ -197,8 +203,8 @@ implementations, and vary from device to device, but to give a rough idea of
what to expect right now:
* Only computations
with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but
*float64* computations are still relatively slow (Jan 2010).
* Matrix
multiplication, convolution, and large element-wise operations can be
accelerated a lot (5-50x) when arguments are large enough to keep 30
......@@ -219,35 +225,35 @@ Tips for Improving Performance on GPU
-------------------------------------
* Consider
adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
GPU work.
* Prefer
constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
``dscalar`` because the former will give you *float32* variables when
``floatX=float32``.
* Ensure
that your output variables have a *float32* dtype and not *float64*. The
more *float32* variables are in your graph, the more work the GPU can do for
you.
* Minimize
transfers to the GPU device by using ``shared`` *float32* variables to store
frequently-accessed data (see :func:`shared()<shared.shared>`). When using
the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try building your functions with
``mode='PROFILE_MODE'``. This should print some timing information at program
termination (at exit). Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.sandbox.cuda.
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
This can tell you if not enough of your graph is on the GPU or if there
is too much memory transfer.
Changing the Value of Shared Variables
--------------------------------------
To change the value of a shared variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
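For instance, a minimal sketch of swapping in new data:

.. code-block:: python

    import numpy
    import theano

    state = theano.shared(numpy.zeros(3, dtype='float32'), name='state')
    # Replace the contents in place, e.g. when a new batch of data arrives.
    state.set_value(numpy.ones(3, dtype='float32'))
    print state.get_value()   # expected: [ 1.  1.  1.]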
......@@ -321,31 +327,31 @@ Consider the logistic regression:
Modify and execute this example to run on GPU with ``floatX=float32`` and
time it using the command line ``time python file.py``.
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``ProfileMode``)
What can be done to further increase the speed of the GPU version?
.. Note::
* Only 32-bit floats are currently supported (development is in progress).
* ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process.
* Use the Theano flag ``device=gpu`` to require use of the GPU device.
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
* Apply the Theano flag ``floatX=float32`` through (``theano.config.floatX``) in your code.
* Cast inputs before storing them into a ``shared`` variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
......@@ -354,21 +360,21 @@ What can be done to further increase the speed of the GPU version?
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Theano, which is a meta-programmer, there are:
* **CUDA**: C extension by NVIDIA
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) are maturing.
* **OpenCL**: multi-vendor version of CUDA
* More general, standardized.
* Fewer libraries, less widespread.
* **PyCUDA**: Python bindings to the CUDA driver interface, which allow access to Nvidia's CUDA parallel
computation API from Python
* Convenience: Makes it easy to do GPU meta-programming from within Python. Helpful documentation.
......@@ -389,9 +395,9 @@ Leaving aside Theano which is a meta-programmer, there is:
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory allocated in it is also freed).
(GPU memory buffer: ``pycuda.gpuarray.GPUArray``)
* **PyOpenCL**: PyCUDA for OpenCL
**Example: PyCUDA**
......@@ -496,12 +502,12 @@ To test it:
Run the preceding example.
Modify and execute to multiply two matrices: *x* * *y*.
Modify and execute to return two outputs: *x + y* and *x - y*.
(Currently, *elemwise fusion* generates computation with only 1 output.)
Modify and execute to support *stride* (i.e. so as not to constrain the input to be C-contiguous).
-------------------------------------------
......