Commit 00183e72 authored by Olivier Delalleau

Merge pull request #905 from nouiz/add_exerc_docu_rebase

Documentation improvements
@@ -19,7 +19,7 @@ I wrote a new optimization, but it's not getting used...
 Remember that you have to register optimizations with the :ref:`optdb`
 for them to get used by the normal modes like FAST_COMPILE, FAST_RUN,
-and DEBUG_MODE.
+and DebugMode.
 I wrote a new optimization, and it changed my results even though I'm pretty sure it is correct.
...
@@ -168,7 +168,7 @@ not modify any of the inputs.
 TODO: EXPLAIN DESTROYMAP and VIEWMAP BETTER AND GIVE EXAMPLE.
 When developing an Op, you should run computations in DebugMode, by using
-argument ``mode='DEBUG_MODE'`` to ``theano.function``. DebugMode is
+argument ``mode='DebugMode'`` to ``theano.function``. DebugMode is
 slow, but it can catch many common violations of the Op contract.
 TODO: Like what? How? Talk about Python vs. C too.
...
@@ -6,15 +6,15 @@ Extending Theano
 ================
-This documentation is for users who want to extend Theano with new Types, new
+This advanced tutorial is for users who want to extend Theano with new Types, new
 Operations (Ops), and new graph optimizations.
 Along the way, it also introduces many aspects of how Theano works, so it is
 also good for you if you are interested in getting more under the hood with
 Theano itself.
-Before tackling this tutorial, it is highly recommended to read the
-:ref:`tutorial`.
+Before tackling this more advanced presentation, it is highly recommended to read the
+introductory :ref:`Tutorial<tutorial>`.
 The first few pages will walk you through the definition of a new :ref:`type`,
 ``double``, and basic arithmetic :ref:`operations <op>` on that Type. We
...
@@ -289,7 +289,7 @@ Example:
 f = T.function([a,b],[c],mode='FAST_RUN')
 m = theano.Module()
-minstance = m.make(mode='DEBUG_MODE')
+minstance = m.make(mode='DebugMode')
 Whenever possible, unit tests should omit this parameter. Leaving
 out the mode will ensure that unit tests use the default mode.
@@ -306,7 +306,7 @@ type this:
 THEANO_FLAGS='mode=FAST_COMPILE' nosetests
 THEANO_FLAGS='mode=FAST_RUN' nosetests
-THEANO_FLAGS='mode=DEBUG_MODE' nosetests
+THEANO_FLAGS='mode=DebugMode' nosetests
 .. _random_value_in_tests:
...
 .. _glossary:
-Glossary of terminology
-=======================
+Glossary
+========
 .. glossary::
...
@@ -190,12 +190,10 @@ Here is the state of that vision as of 24 October 2011 (after Theano release
 * Will provide better support for GPU on Windows and use an OpenCL backend on CPU.
 * Loops work, but not all related optimizations are currently done.
-* The cvm linker allows lazy evaluation. It works, but some work is still
-  needed before enabling it by default.
-* All tests pass with linker=cvm?
-* How to have `DEBUG_MODE` check it? Right now, DebugMode checks the computation non-lazily.
-* The profiler used by cvm is less complete than `PROFILE_MODE`.
+* The cvm linker allows lazy evaluation. It is the current default linker.
+* How to have `DebugMode` check it? Right now, DebugMode checks the computation non-lazily.
+* The profiler used by cvm is less complete than `ProfileMode`.
 * SIMD parallelism on the CPU comes from the compiler.
 * Multi-core parallelism is only supported for gemv and gemm, and only
...
@@ -29,7 +29,7 @@ DebugMode can be used as follows:
 x = tensor.dvector('x')
-f = theano.function([x], 10*x, mode='DEBUG_MODE')
+f = theano.function([x], 10*x, mode='DebugMode')
 f(5)
 f(0)
@@ -42,7 +42,7 @@ It can also be used by passing a DebugMode instance as the mode, as in
 If any problem is detected, DebugMode will raise an exception according to
 what went wrong, either at call time (``f(5)``) or compile time (
-``f = theano.function(x, 10*x, mode='DEBUG_MODE')``). These exceptions
+``f = theano.function(x, 10*x, mode='DebugMode')``). These exceptions
 should *not* be ignored; talk to your local Theano guru or email the
 users list if you cannot make the exception go away.
@@ -51,7 +51,7 @@ In the example above, there is no way to guarantee that a future call to, say,
 ``f(-1)`` won't cause a problem. DebugMode is not a silver bullet.
 If you instantiate DebugMode using the constructor ``compile.DebugMode``
-rather than the keyword ``DEBUG_MODE`` you can configure its behaviour via
+rather than the keyword ``DebugMode`` you can configure its behaviour via
 constructor arguments.
 Reference
@@ -133,7 +133,7 @@ Reference
-The keyword version of DebugMode (which you get by using ``mode='DEBUG_MODE'``)
+The keyword version of DebugMode (which you get by using ``mode='DebugMode'``)
 is quite strict, and can raise several different Exception types.
 The following are DebugMode exceptions you might encounter:
@@ -200,7 +200,7 @@ The following are DebugMode exceptions you might encounter:
 in the same order when run several times in a row. This can happen if any
 steps are ordered by ``id(object)`` somehow, such as via the default object
 hash function. A stochastic optimization invalidates the pattern of work
-whereby we debug in DEBUG_MODE and then run the full-size jobs in FAST_RUN.
+whereby we debug in DebugMode and then run the full-size jobs in FAST_RUN.
 .. class:: InvalidValueError(DebugModeError)
...
+.. _libdoc_compile_mode:
 ======================================
 :mod:`mode` -- controlling compilation
 ======================================
@@ -17,9 +20,10 @@ Theano defines the following modes by name:
 - ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
 - ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
-- ``'DEBUG_MODE'``: Verify the correctness of all optimizations, and compare C and python
-  implementations. This mode can take much longer than the other modes,
-  but can identify many kinds of problems.
+- ``'DebugMode'``: A mode for debugging. See :ref:`DebugMode <debugmode>` for details.
+- ``'ProfileMode'``: A mode for profiling. See :ref:`ProfileMode <profilemode>` for details.
+- ``'DEBUG_MODE'``: Deprecated. Use the string DebugMode.
+- ``'PROFILE_MODE'``: Deprecated. Use the string ProfileMode.
 The default mode is typically ``FAST_RUN``, but it can be controlled via the
 configuration variable :attr:`config.mode`, which can be
...
@@ -13,7 +13,7 @@
 Guide
 =====
-The config module contains many attributes that modify Theano's behavior. Many of these
+The config module contains many ``attributes`` that modify Theano's behavior. Many of these
 attributes are consulted during the import of the ``theano`` module and many are assumed to be
 read-only.
...
@@ -13,7 +13,7 @@
 .. toctree::
    :maxdepth: 1
-   fgraph
+   fg
    toolbox
    type
...
@@ -12,18 +12,18 @@
 Guide
 ======
-Symbolic printing: the Print() Op
----------------------------------
+Printing during execution
+-------------------------
 Intermediate values in a computation cannot be printed in
 the normal python way with the print statement, because Theano has no *statements*.
-Instead there is the `Print` Op.
+Instead there is the :class:`Print` Op.
 >>> x = T.dvector()
->>> hello_world_op = Print('hello world')
+>>> hello_world_op = printing.Print('hello world')
 >>> printed_x = hello_world_op(x)
 >>> f = function([x], printed_x)
->>> f([1,2,3])
+>>> f([1, 2, 3])
 >>> # output: "hello world __str__ = [ 1.  2.  3.]"
 If you print more than one thing in a function like `f`, they will not
@@ -39,15 +39,15 @@ Printing graphs
 ---------------
 Theano provides two functions (:func:`theano.pp` and
-:func:`theano.debugprint`) to print a graph to the terminal before or after
+:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
 compilation. These two functions print expression graphs in different ways:
 :func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
-Theano also provides :func:`pydotprint` that creates a png image of the function.
+Theano also provides :func:`theano.printing.pydotprint` that creates a png image of the function.
 1) The first is :func:`theano.pp`.
 >>> x = T.dscalar('x')
->>> y = x**2
+>>> y = x ** 2
 >>> gy = T.grad(y, x)
 >>> pp(gy) # print out the gradient prior to optimization
 '((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
@@ -71,56 +71,63 @@ iteration number or other kinds of information in the name.
 To make graphs legible, :func:`pp` hides some Ops that are actually in the graph. For example,
 automatic DimShuffles are not shown.
-2) The second function to print a graph is :func:`theano.printing.debugprint(variable_or_function, depth=-1)`
+2) The second function to print a graph is :func:`theano.printing.debugprint`
 >>> theano.printing.debugprint(f.maker.fgraph.outputs[0])
-Elemwise{mul,no_inplace} 46950805397392
- 2.0 46950805310800
- x 46950804895504
+Elemwise{mul,no_inplace} [@A] ''
+ |TensorConstant{2.0} [@B]
+ |x [@C]
 Each line printed represents a Variable in the graph.
-The line `` x 46950804895504`` means the variable named 'x' at memory
-location 46950804895504. If you accidentally have two variables called 'x' in
-your graph, their different memory locations will be your clue.
+The line ``|x [@C]`` means the variable named ``x`` with debugprint identifier
+[@C] is an input of the Elemwise. If you accidentally have two variables called ``x`` in
+your graph, their different debugprint identifiers will be your clue.
-The line `` 2.0 46950805310800`` means that there is a constant 2.0 at the
-given memory location.
+The line ``|TensorConstant{2.0} [@B]`` means that there is a constant 2.0
+with this debugprint identifier.
-The line `` Elemwise{mul,no_inplace} 46950805397392`` is indented less than
+The line ``Elemwise{mul,no_inplace} [@A] ''`` is indented less than
 the other ones, because it means there is a variable computed by multiplying
 the other (more indented) ones together.
+The ``|`` symbols are just there to help read big graphs. They group
+together the inputs to a node.
 Sometimes, you'll see a Variable but not the inputs underneath. That can
 happen when that Variable has already been printed. Where else has it been
-printed? Look for the memory address using the Find feature of your text
+printed? Look for the debugprint identifier using the Find feature of your text
 editor.
 >>> theano.printing.debugprint(gy)
-Elemwise{mul} 46950804894224
- Elemwise{mul} 46950804735120
-  Elemwise{second,no_inplace} 46950804626128
-   Elemwise{pow,no_inplace} 46950804625040
-    x 46950658736720
-    2 46950804039760
-   1.0 46950804625488
-  2 46950804039760
- Elemwise{pow} 46950804737616
-  x 46950658736720
-  Elemwise{sub} 46950804736720
-   2 46950804039760
-   InplaceDimShuffle{} 46950804736016
-    1 46950804735760
+Elemwise{mul} [@A] ''
+ |Elemwise{mul} [@B] ''
+ | |Elemwise{second,no_inplace} [@C] ''
+ | | |Elemwise{pow,no_inplace} [@D] ''
+ | | | |x [@E]
+ | | | |TensorConstant{2} [@F]
+ | | |TensorConstant{1.0} [@G]
+ | |TensorConstant{2} [@F]
+ |Elemwise{pow} [@H] ''
+ | |x [@E]
+ | |Elemwise{sub} [@I] ''
+ | | |TensorConstant{2} [@F]
+ | | |InplaceDimShuffle{} [@J] ''
+ | | | |TensorConstant{1} [@K]
 >>> theano.printing.debugprint(gy, depth=2)
-Elemwise{mul} 46950804894224
- Elemwise{mul} 46950804735120
- Elemwise{pow} 46950804737616
+Elemwise{mul} [@A] ''
+ |Elemwise{mul} [@B] ''
+ |Elemwise{pow} [@C] ''
 If the depth parameter is provided, it limits the number of levels that are
 shown.
-3) The function :func:`theano.printing.pydotprint(fct, outfile=SOME_DEFAULT_VALUE)` will print a compiled theano function to a png file.
+3) The function :func:`theano.printing.pydotprint` will print a compiled theano function to a png file.
 In the image, Apply nodes (the applications of ops) are shown as boxes and variables are shown as ovals.
 The number at the end of each label indicates graph position.
@@ -170,10 +177,13 @@ Reference
 running the function will print the value that `x` takes in the graph.
-.. function:: theano.printing.pp(*args)
-   TODO
-.. autofunction:: theano.printing.debugprint
+.. autofunction:: theano.printing.debugprint
+.. function:: theano.pp(*args)
+   Just a shortcut to :func:`theano.printing.pp`
+.. autofunction:: theano.printing.pp(*args)
+.. autofunction:: theano.printing.pydotprint
@@ -136,19 +136,35 @@ arange must have its length specified at creation time.
 Simple accumulation into a scalar, ditching lambda
 --------------------------------------------------
-This should be fairly self-explanatory.
+Although this example would seem almost self-explanatory, it stresses a
+pitfall to be careful of: the initial output state that is supplied, that is
+``outputs_info``, must be of a **shape similar to that of the output variable**
+generated at each iteration and, moreover, it **must not involve an implicit
+downcast** of the latter.
 .. code-block:: python
+    import numpy as np
+    import theano
+    import theano.tensor as T
     up_to = T.iscalar("up_to")
     # define a named function, rather than using lambda
     def accumulate_by_adding(arange_val, sum_to_date):
        return sum_to_date + arange_val
+    seq = T.arange(up_to)
+    # An unauthorized implicit downcast from the dtype of 'seq', to that of
+    # 'T.as_tensor_variable(0)' which is of dtype 'int8' by default would occur
+    # if this instruction were to be used instead of the next one:
+    # outputs_info = T.as_tensor_variable(0)
+    outputs_info = T.as_tensor_variable(np.asarray(0, seq.dtype))
     scan_result, scan_updates = theano.scan(fn=accumulate_by_adding,
-                                            outputs_info=T.as_tensor_variable(0),
-                                            sequences=T.arange(up_to))
+                                            outputs_info=outputs_info,
+                                            sequences=seq)
     triangular_sequence = theano.function(inputs=[up_to], outputs=scan_result)
     # test
@@ -157,7 +173,6 @@ Simple accumulation into a scalar, ditching lambda
 print [n * (n + 1) // 2 for n in xrange(some_num)]
 Another simple example
 ----------------------
...
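The downcast pitfall this hunk documents can be reproduced without Theano. Below is a plain NumPy sketch: the loop only mimics what ``scan`` computes above (the helper names mirror the snippet, but the loop itself is an illustration, not Theano code). With an ``int32`` accumulator the partial sums are exact, while an ``int8`` accumulator — the dtype an untyped ``0`` would get — wraps once the running sum passes 127.

```python
import numpy as np

def accumulate_by_adding(arange_val, sum_to_date):
    return sum_to_date + arange_val

def triangular_sequence_np(up_to, dtype):
    # Re-apply the step function, collecting each partial sum,
    # with the sequence and the accumulator both held in `dtype`.
    acc = dtype(0)
    out = []
    for v in np.arange(up_to, dtype=dtype):
        acc = accumulate_by_adding(v, acc)
        out.append(acc)
    return np.array(out)

# Correctly typed initial state: exact partial sums (0 + 1 + ... + 19 = 190).
print(triangular_sequence_np(20, np.int32)[-1])  # 190

# The pitfall: with an int8 accumulator, every partial sum is forced back
# into int8, so the total wraps around (190 mod 256 interpreted as signed).
with np.errstate(over='ignore'):
    bad = triangular_sequence_np(20, np.int8)
print(bad[-1])  # -66, not 190
```

This is exactly why the patched snippet builds ``outputs_info`` with ``np.asarray(0, seq.dtype)`` instead of a bare ``0``.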
 .. currentmodule:: tensor
+.. _libdoc_basic_tensor:
 ===========================
  Basic Tensor Functionality
 ===========================
@@ -532,7 +534,7 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
-.. function:: shape_padright(x,n_ones = 1)
+.. function:: shape_padright(x, n_ones=1)
 Reshape `x` by right padding the shape with `n_ones` 1s. Note that all
 these new dimensions will be broadcastable. To make them non-broadcastable
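The reshape described here can be previewed with plain NumPy (a sketch of the semantics only; ``shape_padright_np`` is a hypothetical helper written for this example, not part of Theano):

```python
import numpy as np

def shape_padright_np(x, n_ones=1):
    # Append `n_ones` length-1 axes on the right, mirroring what
    # tensor.shape_padright does symbolically.
    return x.reshape(x.shape + (1,) * n_ones)

a = np.arange(6).reshape(2, 3)
print(shape_padright_np(a).shape)     # (2, 3, 1)
print(shape_padright_np(a, 2).shape)  # (2, 3, 1, 1)
```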
@@ -597,7 +599,7 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
 Create a matrix by filling the shape of `a` with `b`
-.. function:: eye(n, m = None, k = 0, dtype=theano.config.floatX)
+.. function:: eye(n, m=None, k=0, dtype=theano.config.floatX)
 :param n: number of rows in output (value or theano scalar)
 :param m: number of columns in output (value or theano scalar)
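For concrete arguments the result matches ``numpy.eye``, which can be used to preview the effect of the ``k`` offset (a NumPy stand-in for the symbolic op):

```python
import numpy as np

# n=3 rows, m=4 columns, k=1 places the ones on the first superdiagonal.
print(np.eye(3, 4, k=1))
# [[0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```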
@@ -1065,11 +1067,11 @@ Mathematical
 Returns a variable representing the exponential of a, i.e. e^a.
-.. function:: maximum(a,b)
+.. function:: maximum(a, b)
 Returns a variable representing the maximum element by element of a and b
-.. function:: minimum(a,b)
+.. function:: minimum(a, b)
 Returns a variable representing the minimum element by element of a and b
...
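The element-by-element behaviour described here is the same as NumPy's, so a quick concrete check is possible (NumPy stands in for the symbolic `tensor` versions):

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([4, 2, 3])
# Each output element is the max (resp. min) of the corresponding pair.
print(np.maximum(a, b))  # [4 5 3]
print(np.minimum(a, b))  # [1 2 3]
```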
 .. _adding:
-========================================
-Baby steps - Adding two numbers together
-========================================
+====================
+Baby Steps - Algebra
+====================
-Adding two Scalars
+Adding two scalars
 ==================
-So, to get us started with Theano and get a feel of what we're working with,
+To get us started with Theano and get a feel of what we're working with,
 let's make a simple function: add two numbers together. Here is how you do
 it:
...
@@ -34,12 +33,12 @@ Let's break this down into several steps. The first step is to define
 two symbols (*Variables*) representing the quantities that you want
 to add. Note that from now on, we will use the term
 *Variable* to mean "symbol" (in other words,
-``x``, ``y``, ``z`` are all *Variable* objects). The output of the function
-``f`` is a ``numpy.ndarray`` with zero dimensions.
+*x*, *y*, *z* are all *Variable* objects). The output of the function
+*f* is a ``numpy.ndarray`` with zero dimensions.
 If you are following along and typing into an interpreter, you may have
 noticed that there was a slight delay in executing the ``function``
-instruction. Behind the scenes, ``f`` was being compiled into C code.
+instruction. Behind the scenes, *f* was being compiled into C code.
 .. note::
@@ -52,12 +51,10 @@ instruction. Behind the scenes, *f* was being compiled into C code.
 >>> x = theano.tensor.ivector()
 >>> y = -x
-``x`` and ``y`` are both Variables, i.e. instances of the
+*x* and *y* are both Variables, i.e. instances of the
 ``theano.gof.graph.Variable`` class. The
-type of both ``x`` and ``y`` is ``theano.tensor.ivector``.
+type of both *x* and *y* is ``theano.tensor.ivector``.
--------------------------------------------
 **Step 1**
@@ -68,9 +65,9 @@ In Theano, all symbols must be typed. In particular, ``T.dscalar``
 is the type we assign to "0-dimensional arrays (`scalar`) of doubles
 (`d`)". It is a Theano :ref:`type`.
-``dscalar`` is not a class. Therefore, neither ``x`` nor ``y``
+``dscalar`` is not a class. Therefore, neither *x* nor *y*
 are actually instances of ``dscalar``. They are instances of
-:class:`TensorVariable`. ``x`` and ``y``
+:class:`TensorVariable`. *x* and *y*
 are, however, assigned the theano Type ``dscalar`` in their ``type``
 field, as you can see here:
@@ -83,52 +80,49 @@ TensorType(float64, scalar)
 >>> x.type is T.dscalar
 True
-You can learn more about the structures in Theano in :ref:`graphstructures`.
 By calling ``T.dscalar`` with a string argument, you create a
 *Variable* representing a floating-point scalar quantity with the
 given name. If you provide no argument, the symbol will be unnamed. Names
 are not required, but they can help debugging.
+More will be said in a moment regarding Theano's inner structure. You
+could also learn more by looking into :ref:`graphstructures`.
--------------------------------------------
 **Step 2**
-The second step is to combine ``x`` and ``y`` into their sum ``z``:
+The second step is to combine *x* and *y* into their sum *z*:
 >>> z = x + y
-``z`` is yet another *Variable* which represents the addition of
-``x`` and ``y``. You can use the :ref:`pp <libdoc_printing>`
-function to pretty-print out the computation associated to ``z``.
+*z* is yet another *Variable* which represents the addition of
+*x* and *y*. You can use the :ref:`pp <libdoc_printing>`
+function to pretty-print out the computation associated to *z*.
 >>> print pp(z)
 (x + y)
--------------------------------------------
 **Step 3**
-The last step is to create a function taking ``x`` and ``y`` as inputs
-and giving ``z`` as output:
+The last step is to create a function taking *x* and *y* as inputs
+and giving *z* as output:
 >>> f = function([x, y], z)
 The first argument to :func:`function <function.function>` is a list of Variables
 that will be provided as inputs to the function. The second argument
 is a single Variable *or* a list of Variables. For either case, the second
-argument is what we want to see as output when we apply the function.
-``f`` may then be used like a normal Python function.
+argument is what we want to see as output when we apply the function. *f* may
+then be used like a normal Python function.
-Adding two Matrices
+Adding two matrices
 ===================
 You might already have guessed how to do this. Indeed, the only change
-from the previous example is that you need to instantiate ``x`` and
-``y`` using the matrix Types:
+from the previous example is that you need to instantiate *x* and
+*y* using the matrix Types:
 .. If you modify this code, also change :
 .. theano/tests/test_tutorial.py:T_adding.test_adding_2
@@ -138,14 +132,14 @@ from the previous example is that you need to instantiate *x* and
 >>> z = x + y
 >>> f = function([x, y], z)
-``dmatrix`` is the Type for matrices of doubles. And then we can use
+``dmatrix`` is the Type for matrices of doubles. Then we can use
 our new function on 2D arrays:
 >>> f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
 array([[ 11.,  22.],
        [ 33.,  44.]])
-The variable is a numpy array. We can also use numpy arrays directly as
+The variable is a NumPy array. We can also use NumPy arrays directly as
 inputs:
 >>> import numpy
@@ -159,18 +153,36 @@ by :ref:`broadcasting <libdoc_tensor_broadcastable>`.
 The following types are available:
-* **byte**: bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4
-* **32-bit integers**: iscalar, ivector, imatrix, irow, icol, itensor3, itensor4
-* **64-bit integers**: lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4
-* **float**: fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4
-* **double**: dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
-* **complex**: cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
+* **byte**: ``bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4``
+* **16-bit integers**: ``wscalar, wvector, wmatrix, wrow, wcol, wtensor3, wtensor4``
+* **32-bit integers**: ``iscalar, ivector, imatrix, irow, icol, itensor3, itensor4``
+* **64-bit integers**: ``lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4``
+* **float**: ``fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4``
+* **double**: ``dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4``
+* **complex**: ``cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4``
-The previous list is not exhaustive. A guide to all types compatible
-with numpy arrays may be found :ref:`here <libdoc_tensor_creation>`.
+The previous list is not exhaustive and a guide to all types compatible
+with NumPy arrays may be found here: :ref:`tensor creation<libdoc_tensor_creation>`.
 .. note::
    You, the user---not the system architecture---have to choose whether your
    program will use 32- or 64-bit integers (``i`` prefix vs. the ``l`` prefix)
    and floats (``f`` prefix vs. the ``d`` prefix).
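The dtype behind each letter prefix can be spelled out with NumPy dtypes (an illustrative mapping assembled from the list above; the dictionary is this example's own, not a Theano API):

```python
import numpy as np

# Letter prefix -> element dtype for the constructors listed above.
prefix_dtype = {
    'b': np.int8,       # byte
    'w': np.int16,      # 16-bit integer
    'i': np.int32,      # 32-bit integer
    'l': np.int64,      # 64-bit integer
    'f': np.float32,    # float
    'd': np.float64,    # double
    'c': np.complex64,  # complex
}
for prefix, dt in sorted(prefix_dtype.items()):
    print(prefix, np.dtype(dt).name)
```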
+-------------------------------------------
+**Exercise**
+.. code-block:: python
+    import theano
+    a = theano.tensor.vector()  # declare variable
+    out = a + a ** 10           # build symbolic expression
+    f = theano.function([a], out)  # compile function
+    print f([0, 1, 2])  # prints `array([0, 2, 1026])`
+Modify and execute this code to compute this expression: a ** 2 + b ** 2 + 2 * a * b.
+:download:`Solution<adding_solution_1.py>`
+#!/usr/bin/env python
+# Theano tutorial
+# Solution to Exercise in section 'Baby Steps - Algebra'
+import theano
+a = theano.tensor.vector()  # declare variable
+b = theano.tensor.vector()  # declare variable
+out = a ** 2 + b ** 2 + 2 * a * b  # build symbolic expression
+f = theano.function([a, b], out)   # compile function
+print f([1, 2], [4, 5])  # prints [ 25.  49.]
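The exercise and its solution can be sanity-checked numerically with plain NumPy, since ``a ** 2 + b ** 2 + 2 * a * b`` equals ``(a + b) ** 2`` elementwise (NumPy stands in for the compiled Theano function here):

```python
import numpy as np

# The starter expression: a + a ** 10 on [0, 1, 2] gives 0, 2 and 1026.
a = np.array([0., 1., 2.])
print(a + a ** 10)

# The solution expression on the tutorial's test inputs gives 25 and 49,
# matching the printed result of f([1, 2], [4, 5]).
a = np.array([1., 2.])
b = np.array([4., 5.])
out = a ** 2 + b ** 2 + 2 * a * b
print(out)
print(np.allclose(out, (a + b) ** 2))  # the binomial identity holds
```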
...@@ -5,53 +5,52 @@ ...@@ -5,53 +5,52 @@
Understanding Memory Aliasing for Speed and Correctness
=======================================================

The aggressive reuse of memory is one of the ways through which Theano makes
code fast, and it is important for the correctness and speed of your program
that you understand how Theano might alias buffers.

This section describes the principles based on which Theano handles memory,
and explains when you might want to alter the default behaviour of some
functions and methods for faster performance.

The Memory Model: Two Spaces
============================

There are some simple principles that guide Theano's handling of memory. The
main idea is that there is a pool of memory managed by Theano, and Theano
tracks changes to values in that pool.

* Theano manages its own memory space, which typically does not overlap with
  the memory of normal Python variables that non-Theano code creates.
* Theano functions only modify buffers that are in Theano's memory space.
* Theano's memory space includes the buffers allocated to store ``shared``
  variables and the temporaries used to evaluate functions.
* Physically, Theano's memory space may be spread across the host, one or
  more GPU devices, and in the future may even include objects on a remote
  machine.
* The memory allocated for a ``shared`` variable buffer is unique: it is
  never aliased to another ``shared`` variable.
* Theano's managed memory is constant while Theano functions are not running
  and Theano's library code is not running.
* The default behaviour of a function is to return user-space values for
  outputs, and to expect user-space values for inputs.

The distinction between Theano-managed memory and user-managed memory can be
broken down by some Theano functions (e.g. ``shared``, ``get_value`` and the
constructors for ``In`` and ``Out``) by using a ``borrow=True`` flag.
This can make those methods faster (by avoiding copy operations) at the
expense of risking subtle bugs in the overall program (by aliasing memory).

The rest of this section is aimed at helping you to understand when it is
safe to use the ``borrow=True`` argument and reap the benefits of faster code.
Borrowing when Creating Shared Variables
========================================

A ``borrow`` argument can be provided to the shared-variable constructor.

.. code-block:: python

    s_default = theano.shared(np_array)
    s_false = theano.shared(np_array, borrow=False)
    s_true = theano.shared(np_array, borrow=True)
By default (*s_default*) and when explicitly setting ``borrow=False``, the
shared variable we construct gets a [deep] copy of *np_array*. So changes we
subsequently make to *np_array* have no effect on our shared variable.
.. code-block:: python

    s_true.get_value()  # -> array([2.0, 2.0])
If we are running this with the CPU as the device,
then changes we make to *np_array* *right away* will show up in
``s_true.get_value()``
because NumPy arrays are mutable, and *s_true* is using the *np_array*
object as its internal buffer.

However, this aliasing of *np_array* and *s_true* is not guaranteed to occur,
and may occur only temporarily even if it occurs at all.

It is not guaranteed to occur because if Theano is using a GPU device, then
the ``borrow`` flag has no effect. It may occur only temporarily because
if we call a Theano function that updates the value of *s_true* the aliasing
relationship *may* or *may not* be broken (the function is allowed to
update the ``shared`` variable by modifying its buffer, which will preserve
the aliasing, or by changing which buffer the variable points to, which
will terminate the aliasing).

*Take home message:*

It is a safe practice (and a good idea) to use ``borrow=True`` in a
``shared`` variable constructor when the ``shared`` variable stands for a
large object (in terms of memory footprint) and you do not want to create
copies of it in memory.

It is not a reliable technique to use ``borrow=True`` to modify ``shared``
variables through side-effect, because with some devices (e.g. GPU devices)
this technique will not work.
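On a CPU device, the ``borrow`` flag behaves like the difference between copying and aliasing a NumPy array. A plain-NumPy sketch of the *s_false* / *s_true* behaviour above (illustration only, no Theano involved):

```python
import numpy as np

np_array = np.ones(2)
s_false = np_array.copy()  # like borrow=False: gets its own buffer
s_true = np_array          # like borrow=True on CPU: shares the buffer

np_array += 1              # modify the original in place

print(s_false)  # unchanged: still an array of 1.0s
print(s_true)   # aliased: now an array of 2.0s
```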
Borrowing when Accessing Value of Shared Variables
==================================================

Retrieving
----------

A ``borrow`` argument can also be used to control how a ``shared`` variable's
value is retrieved.

.. If you modify this code, also change :
But both of these calls might create copies of the internal memory.

The reason that ``borrow=True`` might still make a copy is that the internal
representation of a ``shared`` variable might not be what you expect. When
you create a ``shared`` variable by passing a NumPy array for example, then
``get_value()`` must return a NumPy array too. That's how Theano can make the
GPU use transparent. But when you are using a GPU (or in the future perhaps a
remote machine), then the ``numpy.ndarray`` is not the internal
representation of your data.

If you really want Theano to return its internal representation *and never
copy it* then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never cast the internal object (always return in
constant time), but might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the NumPy array).
.. code-block:: python
It is possible to use ``borrow=False`` in conjunction with
``return_internal_type=True``.
This is primarily for internal debugging, not for typical use.
For the transparent use of the different types of optimization Theano can
make, the policy is that ``get_value()`` by default always returns the same
object type it received when the ``shared`` variable was created. So if you
manually created data on the GPU and created a ``shared`` variable on the GPU
with this data, ``get_value`` will always return GPU data even when
``return_internal_type=False``.
*Take home message:*

It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
your code does not modify the return value. Do not use this to modify a
``shared`` variable by side-effect, because it will make your code
device-dependent. Modification of GPU variables through this sort of
side-effect is impossible.
Assigning
---------

``Shared`` variables also have a ``set_value`` method that can accept an
optional ``borrow=True`` argument. The semantics are similar to those of
creating a new ``shared`` variable: ``borrow=False`` is the default and
``borrow=True`` means that Theano *may* reuse the buffer you provide as the
internal storage for the variable.

A standard pattern for manually updating the value of a ``shared`` variable
is as follows:
.. code-block:: python

    s.set_value(
        some_inplace_fn(s.get_value(borrow=True)),
        borrow=True)
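The in-place intent of this pattern can be pictured with plain NumPy: filling a preallocated array in place reuses its buffer, whereas rebinding the name would allocate a fresh one (an illustrative sketch; the variable names are made up):

```python
import numpy as np

buf = np.empty(1000)   # allocate the workspace once
addr = buf.__array_interface__['data'][0]  # address of the underlying buffer

chunk = np.random.rand(1000)
buf[...] = chunk       # in-place fill: the same buffer is reused

# the data pointer is unchanged, so no new allocation happened
assert buf.__array_interface__['data'][0] == addr
```

Writing ``buf = chunk`` instead would rebind the name and discard the preallocated buffer, which is the allocation churn the pattern above avoids.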
This pattern works regardless of the computing device, and when the latter
makes it possible to expose Theano's internal variables without a copy, then
it proceeds as fast as an in-place update.

When ``shared`` variables are allocated on the GPU, the transfers to and from
the GPU device memory can be costly. Here are a few tips to ensure fast and
efficient use of GPU memory and bandwidth:

* Prior to Theano 0.3.1, ``set_value`` did not work in-place on the GPU. This
  meant that, sometimes, GPU memory for the new value would be allocated
  before the old memory was released. If you're running near the limits of
  GPU memory, this could cause you to run out of GPU memory unnecessarily.

  *Solution*: update to a newer version of Theano.

* If you are going to swap several chunks of data in and out of a ``shared``
  variable repeatedly, you will want to reuse the memory that you allocated
  the first time if possible - it is both faster and more memory efficient.

  *Solution*: upgrade to a recent version of Theano (>0.3.0) and consider
  padding your source data to make sure that every chunk is the same size.
* It is also worth mentioning that current GPU copying routines support only
  contiguous memory. So Theano must make the value you provide *C-contiguous*
  prior to copying it. This can require an extra copy of the data on the
  host.

  *Solution*: make sure that the value you assign to a
  CudaNdarraySharedVariable is *already* C-contiguous.

(Further information on the current implementation of the GPU version of
``set_value()`` can be found here: :ref:`libdoc_cuda_var`)
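Whether a value is C-contiguous can be checked, and fixed, on the NumPy side before assigning it (a small sketch, not Theano-specific):

```python
import numpy as np

a = np.ones((4, 4))
col = a[:, ::2]                # strided view: not C-contiguous
c = np.ascontiguousarray(col)  # makes a C-contiguous copy

print(col.flags['C_CONTIGUOUS'])  # False
print(c.flags['C_CONTIGUOUS'])    # True
```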
Borrowing when Constructing Function Objects
============================================

A ``borrow`` argument can also be provided to the ``In`` and ``Out`` objects
that control how ``theano.function`` handles its argument[s] and return
value[s].

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_aliasing.test_aliasing_3
.. code-block:: python

    import theano, theano.tensor

    x = theano.tensor.matrix()
    y = 2 * x
    f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))
Borrowing an input means that Theano will treat the argument you provide as
if it were part of Theano's pool of temporaries. Consequently, your input may
be reused as a buffer (and overwritten!) in the
course of evaluating that function (e.g. ``f``).
Borrowing an output means that Theano will not insist on allocating a fresh
output buffer every time you call the function. It will possibly reuse the
same one as on a previous call, and overwrite the old content. Consequently,
it may overwrite old return values through side-effect.
Those return values may also be overwritten in
the course of evaluating *another compiled function* (for example, the output
may be aliased to a ``shared`` variable). So be careful to use a borrowed
return value right away before calling any more Theano functions.
The default is of course to *not borrow* internal results.

It is also possible to pass a ``return_internal_type=True`` flag to the
``Out`` variable, which has the same interpretation as the
``return_internal_type`` flag to the ``shared`` variable's ``get_value``
function. Unlike ``get_value()``, the combination of
``return_internal_type=True`` and ``borrow=True`` arguments to ``Out()`` is
not guaranteed to avoid copying an output value. They are just hints that
give more flexibility to the compilation and optimization of the graph.
*Take home message:*

When an input *x* to a function is not needed after the function returns and
you would like to make it available to Theano as additional workspace, then
consider marking it with ``In(x, borrow=True)``. It may make the function
faster and reduce its memory requirement.

When a return value *y* is large (in terms of memory footprint), and you only
need to read from it once, right away when it's returned, then consider
marking it with an ``Out(y, borrow=True)``.
Conditions
==========

IfElse vs Switch
----------------

- Both ops build a condition over symbolic variables.
- ``IfElse`` takes a *boolean* condition and two variables as inputs.
- ``Switch`` takes a *tensor* as condition and two variables as inputs.
  ``switch`` is an elementwise operation and is thus more general than
  ``ifelse``.
- Whereas ``switch`` evaluates both *output* variables, ``ifelse`` is lazy
  and only evaluates one variable with respect to the condition.
**Example**

.. code-block:: python

    from theano import tensor as T
    from theano.ifelse import ifelse
    import theano, time, numpy

    a, b = T.scalars('a', 'b')
    x, y = T.matrices('x', 'y')

    z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
    z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

    f_switch = theano.function([a, b, x, y], z_switch,
                               mode=theano.Mode(linker='vm'))
    f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    val1 = 0.
    val2 = 1.
    big_mat1 = numpy.ones((10000, 1000))
    big_mat2 = numpy.ones((10000, 1000))

    n_times = 10

    tic = time.clock()
    for i in xrange(n_times):
        f_switch(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating both values %f sec' % (time.clock() - tic)

    tic = time.clock()
    for i in xrange(n_times):
        f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
In this example, the ``IfElse`` op spends less time (about half as much) than
``Switch``, since it computes only one of the two variables.
.. code-block:: python

    time spent evaluating one value 0.3500 sec
Unless ``linker='vm'`` or ``linker='cvm'`` is used, ``ifelse`` will compute
both variables and take the same computation time as ``switch``. Although the
linker is not currently set to ``cvm`` by default, it will be in the near
future.

There is no automatic optimization replacing a ``switch`` with a
broadcasted scalar by an ``ifelse``, as this is not always faster. See
this `ticket <http://www.assembla.com/spaces/theano/tickets/764>`_.
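The eager-versus-lazy distinction has a plain-Python analogue (a sketch, not Theano): ``numpy.where`` is like ``switch`` in that both branch values must be computed before it is called, while a Python conditional expression is like ``ifelse`` in that only the chosen branch is ever evaluated:

```python
import numpy as np

a, b = 0.0, 1.0
x = np.ones((100, 100))
y = np.zeros((100, 100))

# eager: both x.mean() and y.mean() are computed before np.where runs
z_switch = np.where(a < b, x.mean(), y.mean())

# lazy: only the selected branch, x.mean(), is computed here
z_lazy = x.mean() if a < b else y.mean()
```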
Debugging Theano: FAQ and Troubleshooting
=========================================
There are many kinds of bugs that might come up in a computer program.
This page is structured as a FAQ. It provides recipes to tackle common
problems, and introduces some of the tools that we use to find problems in
our own Theano code, and even (it happens) in Theano's internals, in
:ref:`using_debugmode`.
Isolating the Problem/Testing Theano Compiler
---------------------------------------------

You can run your Theano function in a :ref:`DebugMode <using_debugmode>`.
This tests the Theano optimizations and helps to find where NaN, inf and
other problems come from.
Using Test Values
-----------------

As of v.0.4.0, Theano has a new mechanism by which graphs are executed
on-the-fly, before a ``theano.function`` is ever compiled. Since optimizations
haven't been applied at this stage, it is easier for the user to locate the
source of some bug. This functionality is enabled through the config flag
``theano.config.compute_test_value``. Its use is best shown through the
following example.
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    # this is the default, meaning on-the-fly evaluation is inactive
    theano.config.compute_test_value = 'off'

    # configure shared variables
    W1val = numpy.random.rand(2, 10, 10).astype(theano.config.floatX)
    W1 = theano.shared(W1val, 'W1')
    W2val = numpy.random.rand(15, 20).astype(theano.config.floatX)
    W2 = theano.shared(W2val, 'W2')

    # input which will be of shape (5, 10)
    x = T.matrix('x')

    # transform the shared variable in some way. Theano does not
    # know off hand that the matrix func_of_W1 has shape (20, 10)
    func_of_W1 = W1.dimshuffle(2, 0, 1).flatten(2).T

    # source of error: dot product of 5x10 with 20x10
    h1 = T.dot(x, func_of_W1)

    # do more stuff
    h2 = T.dot(h1, W2.T)

    # compile and call the actual function
    f = theano.function([x], h2)
    f(numpy.random.rand(5, 10))
Running the above code generates the following error message:
    _dot22(x, <TensorType(float64, matrix)>), [_dot22.0],
    _dot22(x, InplaceDimShuffle{1,0}.0), 'Sequence id of Apply node=4')
Needless to say, the above is not very informative and does not provide much
in the way of guidance. However, by instrumenting the code ever so slightly,
we can get Theano to reveal the exact source of the error.
.. code-block:: python

    ...

    # input which will be of shape (5, 10)
    x = T.matrix('x')

    # provide Theano with a default test-value
    x.tag.test_value = numpy.random.rand(5, 10)
In the above, we are tagging the symbolic matrix *x* with a special test
value. This allows Theano to evaluate symbolic expressions on-the-fly (by
calling the ``perform`` method of each op), as they are being defined. Sources
of error can thus be identified with much more precision and much earlier in
the compilation pipeline. For example, running the above code yields the
following error message, which properly identifies *line 23* as the culprit.
.. code-block:: bash
...@@ -120,33 +121,33 @@ following error message, which properly identifies line 23 as the culprit.
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: ('matrices are not aligned', (5, 10), (20, 10))
The ``compute_test_value`` mechanism works as follows:

* Theano constants and shared variables are used as is. No need to instrument them.
* A Theano *variable* (e.g. ``dmatrix``, ``vector``, etc.) should be
  given a special test value through the attribute ``tag.test_value``.
* Theano automatically instruments intermediate results. As such, any quantity
  derived from *x* will be given a ``tag.test_value`` automatically.
``compute_test_value`` can take the following values:

* ``off``: Default behavior. This debugging mechanism is inactive.
* ``raise``: Compute test values on the fly. Any variable for which a test
  value is required, but not provided by the user, is treated as an error. An
  exception is raised accordingly.
* ``warn``: Idem, but a warning is issued instead of an exception.
* ``ignore``: Silently ignore the computation of intermediate test values, if a
  variable is missing a test value.
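The class of error that test values surface early can be reproduced directly in plain NumPy (an illustrative sketch, independent of Theano's machinery):

```python
import numpy

# the tagged test value for x, shape (5, 10)
x_val = numpy.random.rand(5, 10)
# a weight matrix whose shape (20, 10) is misaligned with x_val
W_val = numpy.random.rand(20, 10)

# numpy.dot raises immediately on misaligned shapes -- the same check
# that compute_test_value='raise' performs while the graph is being built
try:
    numpy.dot(x_val, W_val)
except ValueError as e:
    print("caught:", e)
```

With test values in place, Theano runs this check at graph-construction time rather than after compilation, so the traceback points at the offending line of your script.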
.. note::

   This feature is currently incompatible with ``Scan`` and with ops
   which do not implement a ``perform`` method.
How do I Print an Intermediate Value in a Function/Method?
----------------------------------------------------------

Theano provides a ``Print`` op to do this.
.. code-block:: python
...@@ -158,15 +159,15 @@ Theano provides a 'Print' Op to do this.
f_with_print = theano.function([x], x_printed * 5)
# this runs the graph without any printing
assert numpy.all(f([1, 2, 3]) == [5, 10, 15])
# this runs the graph with the message and value printed
assert numpy.all(f_with_print([1, 2, 3]) == [5, 10, 15])
Since Theano runs your program in a topological order, you won't have precise
control over the order in which multiple ``Print()`` ops are evaluated. For a more
precise inspection of what's being computed where, when, and how, see the
discussion in :ref:`faq_wraplinker`.
.. warning::
...@@ -177,40 +178,50 @@ precise inspection of what's being computed where, when, and how, see the
   to remove them to know if this is the cause or not.
How do I Print a Graph (before or after compilation)?
-----------------------------------------------------

.. TODO: dead links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, while :func:`debugprint` is more verbose.
Theano also provides :func:`theano.printing.pydotprint`, which creates a PNG image of the function.
You can read about them in :ref:`libdoc_printing`.
The Function I Compiled is Too Slow, What's Up?
-----------------------------------------------

First, make sure you're running in ``FAST_RUN`` mode. Even though
``FAST_RUN`` is the default mode, make certain by passing ``mode='FAST_RUN'``
explicitly to ``theano.function`` (or ``theano.make``) or by setting
:attr:`config.mode` to ``FAST_RUN``.
Second, try the Theano :ref:`using_profilemode`. This will tell you which
``Apply`` nodes and which ops are eating up your CPU cycles.

Tips:
* Use the flag ``floatX=float32`` to require type *float32* instead of *float64*;
  use the Theano constructors ``matrix()``, ``vector()``, ... instead of
  ``dmatrix()``, ``dvector()``, ..., since the former respect the ``floatX``
  setting while the latter are fixed to *float64*.
* Check in the profile output that there is no ``Dot`` op in the post-compilation
  graph while you are multiplying two matrices of the same type. ``Dot`` should be
  optimized to ``dot22`` when the inputs are matrices of the same type. This can
  still happen when using ``floatX=float32`` if one of the inputs of the graph is
  of type *float64*.
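The dtype-mixing pitfall from the second tip can be observed directly in NumPy, whose upcasting rule is the one at work here (a plain-NumPy sketch, independent of Theano):

```python
import numpy

a = numpy.random.rand(3, 4).astype('float32')
b = numpy.random.rand(4, 2)  # NumPy defaults to float64

# same-type inputs keep float32 (on the Theano side, dot22 can apply)
print(numpy.dot(a, a.T).dtype)  # float32
# mixing float32 with float64 silently upcasts the whole result
print(numpy.dot(a, b).dtype)    # float64
```

A single stray *float64* input anywhere in the graph is therefore enough to push a multiplication back to *float64* and keep the specialized op from applying.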
.. _faq_wraplinker:

How do I Step through a Compiled Function with the WrapLinker?
--------------------------------------------------------------

This is not exactly an FAQ, but the doc is here for now...

It's pretty easy to roll your own evaluation mode.
Check out this one:
...@@ -225,37 +236,37 @@ Check out this one:
wrap_linker = theano.gof.WrapLinkerMany([theano.gof.OpWiseCLinker()], [print_eval])
super(PrintEverythingMode, self).__init__(wrap_linker, optimizer='fast_run')
When you use ``mode=PrintEverythingMode()`` as the mode for ``Function`` or ``Method``,
then you should see (potentially a lot of) output. Every ``Apply`` node will be printed out,
along with its position in the graph, the arguments to ``perform`` or
``c_code``, and the output it computed.
>>> x = T.dscalar('x')
>>> f = function([x], [5 * x], mode=PrintEverythingMode())
>>> f(3)
>>> # print: 0 Elemwise{mul,no_inplace}(5, x) [array(5, dtype=int8), array(3.0)] [array(15.0)]
>>> # print: [array(15.0)]
Admittedly, this may be a huge amount of
output to read through if you are using big tensors... but you can choose to
put logic inside the *print_eval* function that would, for example, print
something out only if a certain kind of op were used, at a certain program
position, or only if a particular value showed up in one of the inputs or outputs.
Use your imagination :)
.. TODO: documentation for link.WrapLinkerMany

This can be a really powerful debugging tool. Note the call to *fn* inside the
call to *print_eval*; without it, the graph wouldn't get computed at all!
How to Use pdb
--------------
In the majority of cases, you won't be executing from the interactive shell
but from a set of Python scripts. In such cases, the use of the Python
debugger can come in handy, especially as your models become more complex.
Intermediate results don't necessarily have a clear name, and you can get
exceptions which are hard to decipher, due to the "compiled" nature of the
functions.

Consider this example script ("ex.py"):
...@@ -269,16 +280,16 @@ Consider this example script ("ex.py"):
a = T.dmatrix('a')
b = T.dmatrix('b')
f = theano.function([a, b], [a * b])
# matrices chosen so dimensions are unsuitable for multiplication
mat1 = numpy.arange(12).reshape((3, 4))
mat2 = numpy.arange(25).reshape((5, 5))
f(mat1, mat2)
This example is simple enough to debug by hand, but it serves for
illustrative purposes. As the matrices can't be multiplied element-wise
(unsuitable shapes), we get the following exception:
.. code-block:: text
...@@ -290,12 +301,12 @@ illustrative purposes. As the matrices can't be element-wise multiplied
File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute
ValueError: ('Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains some useful information to trace back the source
of the error. There's the script where the compiled function was called --
but if you're using (improperly parameterized) prebuilt modules, the error
might originate from ops in these modules, not this script. The last line
tells us about the op that caused the exception. In this case it's a "mul"
involving variables with names "a" and "b". But suppose we instead had an
intermediate result to which we hadn't given a name.
After learning a few things about the graph structure in Theano, we can use
...@@ -328,7 +339,7 @@ explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it
abstractly). To get information about the actual parameters, you explore the
"thunk" objects, which bind the storage for the inputs (and outputs) with
the function itself (a "thunk" is a concept related to closures). Here, to
get the current node's first input's shape, you'd therefore do "p
thunk.inputs[0][0].shape", which prints out "(3, 4)".
...
...@@ -2,11 +2,19 @@

.. _basictutexamples:

=============
More Examples
=============
At this point it would be wise to begin familiarizing yourself
more systematically with Theano's fundamental objects and operations by browsing
this section of the library: :ref:`libdoc_basic_tensor`.
As the tutorial unfolds, you should also gradually acquaint yourself with the
other relevant areas of the library and with the relevant sections of the
documentation entry page.
Logistic Function
=================
Here's another straightforward example, though a bit more elaborate
...@@ -61,12 +69,12 @@ array([[ 0.5 , 0.73105858],
[ 0.26894142, 0.11920292]])
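For reference, the logistic function is just :math:`s(x) = 1 / (1 + e^{-x})` applied elementwise; a plain-NumPy sketch (an illustrative equivalent, not the Theano graph version) reproduces the values above:

```python
import numpy

def logistic(x):
    # elementwise logistic: s(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + numpy.exp(-x))

out = logistic(numpy.array([[0.0, 1.0], [-1.0, -2.0]]))
print(out)  # approximately [[0.5, 0.73105858], [0.26894142, 0.11920292]]
```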
Computing More Than One Thing at the Same Time
==============================================

Theano supports functions with multiple outputs. For example, we can
compute the :ref:`elementwise <libdoc_tensor_elementwise>` difference, absolute difference, and
squared difference between two matrices *a* and *b* at the same time:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_3
...@@ -82,7 +90,7 @@ squared difference between two matrices ``a`` and ``b`` at the same time:
shortcut for allocating symbolic variables that we will often use in the
tutorials.

When we use the function *f*, it returns the three variables (the printing
was reformatted for readability):

>>> f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
...@@ -94,9 +102,7 @@ was reformatted for readability):
[ 1., 4.]])]
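For comparison, the three outputs can be sketched in plain NumPy (an illustrative equivalent of the compiled function, using the same inputs as above):

```python
import numpy

a = numpy.array([[1., 1.], [1., 1.]])
b = numpy.array([[0., 1.], [2., 3.]])

diff = a - b                # elementwise difference
abs_diff = numpy.abs(diff)  # absolute difference
diff_squared = diff ** 2    # squared difference
print(diff)          # [[ 1.  0.] [-1. -2.]]
print(abs_diff)      # [[ 1.  0.] [ 1.  2.]]
print(diff_squared)  # [[ 1.  0.] [ 1.  4.]]
```

The Theano version computes all three in a single pass over the compiled graph, which is the point of multiple outputs.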
Setting a Default Value for an Argument
=======================================
Let's say you want to define a function that adds two numbers, except
...@@ -117,11 +123,11 @@ array(35.0)
This makes use of the :ref:`Param <function_inputs>` class, which allows
you to specify properties of your function's parameters in greater detail. Here we
give a default value of 1 for *y* by creating a ``Param`` instance with
its ``default`` field set to 1.

Inputs with default values must follow inputs without default
values (as with Python's functions). There can be multiple inputs with default
values. These parameters can be set positionally or by name, as in standard Python:
...@@ -143,18 +149,21 @@ array(34.0)
array(33.0)
.. note::

   ``Param`` does not know the name of the local variables *y* and *w*
   that are passed as arguments. The symbolic variable objects have name
   attributes (set by ``dscalars`` in the example above) and *these* are the
   names of the keyword parameters in the functions that we build. This is
   the mechanism at work in ``Param(y, default=1)``. In the case of ``Param(w,
   default=2, name='w_by_name')``, we override the symbolic variable's name
   attribute with a name to be used for this function.

You may like to see :ref:`Function <usingfunction>` in the library for more detail.
.. _functionstateexample: .. _functionstateexample:
Using Shared Variables
======================
It is also possible to make a function with an internal state. For
...@@ -162,7 +171,7 @@ example, let's say we want to make an accumulator: at the beginning,
the state is initialized to zero. Then, on each function call, the state
is incremented by the function's argument.

First, let's define the *accumulator* function. It adds its argument to the
internal state, and returns the old state value.
.. If you modify this code, also change :
...@@ -174,24 +183,24 @@ internal state, and returns the old state value.
>>> accumulator = function([inc], state, updates=[(state, state+inc)])
This code introduces a few new concepts. The ``shared`` function constructs
so-called :ref:`shared variables <libdoc_compile_shared>`.
These are hybrid symbolic and non-symbolic variables whose value may be shared
between multiple functions. Shared variables can be used in symbolic expressions
just like the objects returned by ``dmatrices(...)``, but they also have an internal
value that defines the value taken by this symbolic variable in *all* the
functions that use it. It is called a *shared* variable because its value is
shared between many functions. The value can be accessed and modified by the
``.get_value()`` and ``.set_value()`` methods. We will come back to this soon.
The other new thing in this code is the ``updates`` parameter of ``function``.
``updates`` must be supplied with a list of pairs of the form (shared-variable, new expression).
It can also be a dictionary whose keys are shared variables and whose values are
the new expressions. Either way, it means "whenever this function runs, it
will replace the ``.value`` of each shared variable with the result of the
corresponding expression". Above, our accumulator replaces ``state``'s value with the sum
of the state and the increment amount.

Let's try it out!
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_8
...@@ -216,7 +225,7 @@ array(-1)
array(2)
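The accumulator's semantics can be mimicked in plain Python with a closure over mutable state (a sketch of the behaviour only; in Theano the new value is applied by the ``updates`` mechanism after the outputs are computed):

```python
def make_accumulator(initial=0):
    state = {'value': initial}  # stands in for the shared variable

    def accumulator(inc):
        old = state['value']        # the function returns the old state...
        state['value'] = old + inc  # ...and then applies state <- state + inc
        return old

    return accumulator

acc = make_accumulator()
print(acc(1))    # 0
print(acc(300))  # 1
```

As with the Theano version, the state persists across calls because it lives outside the function itself.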
As we mentioned above, you can define more than one function to use the same
shared variable. These functions can all update the value.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_8
...@@ -228,13 +237,13 @@ array(2)
array(0)
You might be wondering why the updates mechanism exists. You can always
achieve a similar result by returning the new expressions and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Theano has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU <using_gpu>`.

It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the
...@@ -254,15 +263,15 @@ array(7)
>>> state.get_value()  # old state still there, but we didn't use it
array(0)
The ``givens`` parameter can be used to replace any symbolic variable, not just a
shared variable. You can replace constants and expressions in general. Be
careful, though, not to allow the expressions introduced by a ``givens``
substitution to be co-dependent: the order of substitution is not defined, so
the substitutions have to work in any order.

In practice, a good way of thinking about ``givens`` is as a mechanism
that allows you to replace any part of your formula with a different
expression that evaluates to a tensor of the same shape and dtype.
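The idea behind ``givens`` can be sketched as a substitution over a toy expression tree (a hypothetical `substitute` helper for illustration, not Theano's actual graph machinery):

```python
def substitute(expr, givens):
    # replace a whole sub-expression if it appears in the givens mapping
    if expr in givens:
        return givens[expr]
    # recurse into compound expressions of the form (op, left, right)
    if isinstance(expr, tuple):
        op, left, right = expr
        return (op, substitute(left, givens), substitute(right, givens))
    return expr

# replace the symbolic 'state' with a plain value, leaving 'inc' intact
expr = ('+', ('*', 'state', 2), 'inc')
print(substitute(expr, {'state': 10}))  # ('+', ('*', 10, 2), 'inc')
```

Theano performs the analogous rewrite on the compiled graph, which is why each replacement must evaluate to a tensor of the same shape and dtype as the variable it replaces.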
.. _using_random_numbers:

...@@ -272,17 +281,19 @@ Using Random Numbers

Because in Theano you first express everything symbolically and
afterwards compile this expression to get functions,
using pseudo-random numbers is not as straightforward as it is in
NumPy, though also not too complicated.
The way to think about putting randomness into Theano's computations is
to put random variables in your graph. Theano will allocate a NumPy
RandomStream object (a random number generator) for each such
variable, and draw from it as necessary. We will call this sort of
sequence of random numbers a *random stream*. *Random streams* are at
their core shared variables, so the observations on shared variables
hold here as well. Theano's random objects are defined and implemented in
:ref:`RandomStreams <libdoc_tensor_shared_randomstreams>` and, at a lower level,
in :ref:`RandomStreamsBase <libdoc_tensor_raw_random>`.
Brief Example
-------------

Here's a brief example. The setup code is:
...@@ -303,7 +314,9 @@ Here's a brief example. The setup code is:
Here, *rv_u* represents a random stream of 2x2 matrices of draws from a uniform
distribution. Likewise, *rv_n* represents a random stream of 2x2 matrices of
draws from a normal distribution. The distributions that are implemented are
defined in :class:`RandomStreams` and, at a lower level, in :ref:`raw_random <libdoc_tensor_raw_random>`.

.. TODO: repair the latter reference on RandomStreams
Now let's use these objects. If we call f(), we get random uniform numbers.
The internal state of the random number generator is automatically updated,
...@@ -313,22 +326,22 @@ so we get different random numbers every time.
>>> f_val1 = f()  # different numbers from f_val0
When we add the extra argument ``no_default_updates=True`` to When we add the extra argument ``no_default_updates=True`` to
``function`` (as in ``g``), then the random number generator state is ``function`` (as in *g*), then the random number generator state is
not affected by calling the returned function. So for example, calling not affected by calling the returned function. So, for example, calling
``g`` multiple times will return the same numbers. *g* multiple times will return the same numbers.
>>> g_val0 = g()  # different numbers from f_val0 and f_val1
>>> g_val1 = g()  # same numbers as g_val0!
An important remark is that a random variable is drawn at most once during any
single function execution. So the *nearly_zeros* function is guaranteed to
return approximately 0 (except for rounding error) even though the *rv_u*
random variable appears three times in the output expression.
>>> nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
Seeding Streams
---------------

Random variables can be seeded individually or collectively.
>>> srng.seed(902340)  # seeds rv_u and rv_n with different seeds each
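The effect of seeding can be illustrated with NumPy alone, since the
generators behind a ``RandomStreams`` object are NumPy ``RandomState``
instances under the hood (a minimal sketch, independent of Theano):

```python
import numpy

# Seeding a generator makes its stream of draws reproducible.
rng = numpy.random.RandomState(902340)
first = rng.uniform(size=(2, 2))

rng.seed(902340)  # re-seeding with the same value restarts the stream
second = rng.uniform(size=(2, 2))

assert (first == second).all()
```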
Sharing Streams Between Functions
---------------------------------

As usual for shared variables, the random number generators used for random
variables are common between functions. So our *nearly_zeros* function will
update the state of the generators used in function *f* above.
For example:
>>> v2 = f()  # v2 != v1
Other Random Distributions
--------------------------

There are :ref:`other distributions implemented <libdoc_tensor_raw_random>`.
.. _logistic_regression:
A Real Example: Logistic Regression
===================================
The preceding elements are featured in this more realistic example. It will be used repeatedly.
.. code-block:: python
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
print "Initial model:"
print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability that target = 1
prediction = p_1 > 0.5 # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to minimize
gw, gb = T.grad(cost, [w, b])               # Compute the gradient of the cost
                                            # (we shall return to this in a
                                            # following section of this tutorial)
# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

print "Final model:"
print w.get_value(), b.get_value()
print "target values for D:", D[1]
print "prediction on D:", predict(D[0])
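For comparison, the same model and update rule can be sketched in plain NumPy,
with the gradients derived by hand; this is exactly the work that ``T.grad``
and the ``updates`` dictionary automate above (a sketch using smaller, made-up
sizes):

```python
import numpy

rng = numpy.random.RandomState(0)
N, feats = 40, 8                 # smaller than the Theano example
x = rng.randn(N, feats)
y = rng.randint(size=N, low=0, high=2)
w = rng.randn(feats)
b = 0.0

def cost(w, b):
    p_1 = 1 / (1 + numpy.exp(-(x.dot(w) + b)))
    xent = -y * numpy.log(p_1) - (1 - y) * numpy.log(1 - p_1)
    return xent.mean() + 0.01 * (w ** 2).sum()

before = cost(w, b)
for i in range(100):
    p_1 = 1 / (1 + numpy.exp(-(x.dot(w) + b)))
    gw = x.T.dot(p_1 - y) / N + 0.02 * w   # hand-derived gradient wrt w
    gb = (p_1 - y).mean()                  # hand-derived gradient wrt b
    w -= 0.1 * gw
    b -= 0.1 * gb

assert cost(w, b) < before                 # gradient descent lowered the cost
```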
.. _extending_theano:

================
Extending Theano
================
Theano Graphs
=============
- Theano works with symbolic graphs.
- Those graphs are bipartite graphs (graphs with two types of nodes).
- The two types of nodes are ``Apply`` and ``Variable`` nodes.
- Each ``Apply`` node has a link to the op that it executes.

Inputs and outputs are lists of Theano variables.
.. image:: ../hpcs2011_tutorial/pics/apply_node.png
    :width: 500 px
.. note::

    This tutorial does not cover how to make an op that returns a view or
    modifies the values in its inputs. Thus, all ops created with the
    instructions described here MUST return newly allocated memory or reuse
    the memory provided in the parameter ``output_storage`` of the
    :func:`perform` function. See :ref:`views_and_inplace` for an explanation
    of how to do this.

    If your op returns a view or changes the value of its inputs
    without doing as prescribed in that page, Theano will run, but will
    return correct results for some graphs and wrong results for others.

    It is recommended that you run your tests in DebugMode (Theano flag
    ``mode=DebugMode``), since it verifies whether your op behaves correctly
    in this regard.
.. note::

    See the :ref:`dev_start_guide` for information about the versioning
    framework (git and GitHub), the development workflow, and how to make a
    quality contribution.
Op Contract
===========
.. code-block:: python

        pass

    # C implementation: [see theano web site for other functions]
    def c_code(...):
        # ...
        pass

    # other implementations (PyCUDA, ...):

    def grad(self, inputs, g):
        pass

    def R_op(self, inputs, eval_points):
        pass

    def infer_shape(node, (i0_shapes, ...))
.. ../extending/op.txt
There are two mandatory methods that one needs to implement.
The first one is :func:`make_node`. The second one
describes the computations that are required to be done
at run time. Currently there are two different possibilities:
implement the :func:`perform`
and/or :func:`c_code <Op.c_code>` methods (and other related :ref:`c methods
<cop>`), or the :func:`make_thunk` method. ``perform`` allows
you to easily wrap an existing Python function in Theano. ``c_code``
and the related methods allow the op to generate C code that will be
compiled and linked by Theano. On the other hand, ``make_thunk``
will be called only once during compilation and should generate
a ``thunk``: a standalone function that, when called, performs the wanted
computations. This is useful if you want to generate code and compile it
yourself. For example, this allows you to use PyCUDA to compile GPU code.
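The calling convention of ``perform`` can be sketched without any Theano
machinery: each output gets a one-element list in ``output_storage``, and the
method stores a freshly allocated result in that cell, leaving the inputs
untouched (a minimal illustration of the convention, not Theano's actual
implementation):

```python
def perform(inputs, output_storage):
    # inputs: the computed input values; output_storage: one single-cell
    # list per output, into which the result must be stored.
    x, = inputs
    z = output_storage[0]
    z[0] = x * 2  # newly allocated result; the input is not modified

storage = [[None]]      # one output, not yet computed
perform([21], storage)
assert storage[0][0] == 42
```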
There are also two methods whose implementation is highly recommended. They are
needed in order to merge duplicate computations involving your op. So if you
do not want Theano to execute your op multiple times with the same inputs,
do implement them. Those methods are :func:`__eq__` and
:func:`__hash__`.
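The role of these two methods can be seen with plain Python: instances that
compare equal and hash alike are recognized as duplicates of one another,
which is what lets duplicate applications of a stateless op be merged (a
sketch; Theano's merge optimization itself involves more than this):

```python
class DoubleOp(object):
    """Stateless op: every instance stands for the same computation."""
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

a, b = DoubleOp(), DoubleOp()
assert a == b                # distinct instances compare equal...
assert hash(a) == hash(b)    # ...and hash alike,
assert len({a, b}) == 1      # so duplicates collapse, e.g. in a set
```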
The :func:`infer_shape` method allows Theano to infer the shape of some
variable somewhere in the middle of the computational graph without actually
computing the outputs (when possible).
This can be helpful if one only needs the shape of the output instead of the
actual outputs.
The :func:`grad` method is required if you want to differentiate some cost
whose computation includes your op.
The :func:`__str__` method is useful in order to provide a more meaningful
string representation of your op.
The :func:`R_op` method is needed if you want ``theano.tensor.Rop`` to
work with your op.
Op Example
==========
.. code-block:: python
    def grad(self, inputs, output_grads):
        return [output_grads[0] * 2]

    def R_op(self, inputs, eval_points):
        # R_op can receive None as eval_points.
        # That means there is no differentiable path through that input.
        # If this implies that you cannot compute some outputs,
            return eval_points
        return self.grad(inputs, eval_points)
You can try it as follows:
.. code-block:: python
print inp
print out
How To Test It
==============

Theano has some functionalities to simplify testing. These help test the
``infer_shape``, ``grad`` and ``R_op`` methods. Put the following code
in a file and execute it with the ``theano-nose`` program.
Basic Tests
-----------
Basic tests are done by simply using the op and checking that it
returns the right answer. If you detect an error, you must raise an
exception. You can use the ``assert`` keyword to automatically raise an
``AssertionError``.
.. code-block:: python
# Compare the result computed to the expected value.
assert numpy.allclose(inp * 2, out)
Testing the infer_shape
-----------------------
When a class inherits from the ``InferShapeTester`` class, it gets the
``self._compile_and_check`` method that tests the op's ``infer_shape``
method. It tests that the op gets optimized out of the graph if only
the shape of the output is needed and not the output
itself. Additionally, it checks that the optimized graph computes
the correct shape, by comparing it to the actual shape of the computed
output.

``self._compile_and_check`` compiles a Theano function. It takes as
parameters the lists of input and output Theano variables, as would be
provided to ``theano.function``, and a list of real values to pass to the
compiled function (do not use symmetric shapes, e.g. (3, 3),
as they can easily hide errors). It also takes the op class as a parameter
in order to verify that no instance of it appears in the shape-optimized graph.
If there is an error, the function raises an exception. If you want to
see it fail, you can implement an incorrect ``infer_shape``.
self.op_class)
Testing the gradient
--------------------
The function :ref:`verify_grad <validating_grad>`
verifies the gradient of an op or Theano graph. It compares the
analytic (symbolically computed) gradient and the numeric
gradient (computed through the Finite Difference Method).
[numpy.random.rand(5, 7, 2)])
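The idea behind ``verify_grad`` can be sketched in plain NumPy: compare the
analytic derivative against a centered finite-difference estimate (a
simplified scalar sketch; ``verify_grad`` itself perturbs every element of the
input tensors):

```python
import numpy

def f(x):
    return 2 * x          # the function computed by DoubleOp

def analytic_grad(x):
    return 2.0            # d(2x)/dx, as the op's grad method would report

def numeric_grad(f, x, eps=1e-6):
    # centered finite differences: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.7
assert abs(analytic_grad(x) - numeric_grad(f, x)) < 1e-6
```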
Testing the Rop
---------------
.. TODO: repair defective links in the following paragraph
The class :class:`RopLop_checker` defines the functions
:func:`RopLop_checker.check_mat_rop_lop`, :func:`RopLop_checker.check_rop_lop`
and :func:`RopLop_checker.check_nondiff_rop`. These allow you to test the
implementation of the Rop method of a particular op.
For instance, to verify the Rop method of the DoubleOp, you can use this:
.. code-block:: python
def test_double_rop(self):
    self.check_rop_lop(DoubleRop()(self.x), self.in_shape)
Testing GPU Ops
---------------
Ops to be executed on the GPU should inherit from ``theano.sandbox.cuda.GpuOp``
and not ``theano.Op``. This allows Theano to distinguish them. Currently, we
use this to test if the NVIDIA driver works correctly with our sum reduction code on the
GPU.
Running Your Tests
==================
To perform your tests, you may select either one of the three following methods:

theano-nose
-----------

The method of choice to conduct tests is to run the file ``theano-nose``. In a
regular Theano installation, the latter will be on the operating system's path
and directly accessible from any folder. Otherwise, it can be accessed in the
``Theano/bin`` folder. The following command lines may be used for the
corresponding purposes:
* ``theano-nose --theano``: Run every test found in Theano's path.
* ``theano-nose folder_name``: Run every test found in the folder *folder_name*.
* ``theano-nose test_file.py``: Run every test found in the file *test_file.py*.
The following are particularly useful for development purposes since they call for
particular classes or even for particular tests:
* ``theano-nose test_file.py:test_DoubleRop``: Run every test found inside the class *test_DoubleRop*.
* ``theano-nose test_file.py:test_DoubleRop.test_double_op``: Run only the test *test_double_op*
in the class *test_DoubleRop*.
Help with the use and functionalities of ``theano-nose`` may be obtained by
running it with the command line parameter ``--help`` (``-h``).
nosetests
---------
The command ``nosetests`` can also be used. Although it lacks the useful
functionalities that ``theano-nose`` provides, ``nosetests`` can be called similarly
to ``theano-nose`` from any folder in Python's path like so:
``nosetests [suffix similar to the above]``.
More documentation on ``nosetests`` is available here:
`nosetests <http://readthedocs.org/docs/nose/en/latest/>`_.
In-file
-------
One may also add a block of code similar to the following at the end of the
file containing a specific test of interest and run the file. In this example,
the test *test_double_rop* in the class *test_DoubleRop* would be performed.
.. code-block:: python
t.setUp()
t.test_double_rop()
We recommend that, when you execute a test file directly, you run all the
tests in that file. This can be done by adding this at the end of your test
files:

.. code-block:: python

    if __name__ == '__main__':
        unittest.main()
Exercise
========
Run the code of the *DoubleOp* example above.
Modify and execute to compute: x * y.
Modify and execute the example to return two outputs: x + y and x - y.
You can omit the Rop functions. Try to implement the testing apparatus described above.
(Notice that Theano's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
SciPy
-----
Don't forget to call the parent ``setUp`` function.
For more details see :ref:`random_value_in_tests`.
:download:`Solution<extending_theano_solution_1.py>`
Final Note
==========
A more extensive discussion of this section's content may be found in the
advanced tutorial :ref:`Extending Theano<extending>`.

See :ref:`metadocumentation` for some information on how to generate
the documentation.
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Extending Theano'
import unittest
import theano
# 1. Op returns x * y
class ProdOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x, y):
        x = theano.tensor.as_tensor_variable(x)
        y = theano.tensor.as_tensor_variable(y)
        outdim = x.ndim
        output = (theano.tensor.TensorType
                  (dtype=theano.scalar.upcast(x.dtype, y.dtype),
                   broadcastable=[False] * outdim)())
        return theano.Apply(self, inputs=[x, y], outputs=[output])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z = output_storage[0]
        z[0] = x * y

    def infer_shape(self, node, i0_shapes):
        return [i0_shapes[0]]

    def grad(self, inputs, output_grads):
        return [output_grads[0] * inputs[1], output_grads[0] * inputs[0]]
# 2. Op returns x + y and x - y

class SumDiffOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x, y):
        x = theano.tensor.as_tensor_variable(x)
        y = theano.tensor.as_tensor_variable(y)
        outdim = x.ndim
        output1 = (theano.tensor.TensorType
                   (dtype=theano.scalar.upcast(x.dtype, y.dtype),
                    broadcastable=[False] * outdim)())
        output2 = (theano.tensor.TensorType
                   (dtype=theano.scalar.upcast(x.dtype, y.dtype),
                    broadcastable=[False] * outdim)())
        return theano.Apply(self, inputs=[x, y], outputs=[output1, output2])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z1, z2 = output_storage
        z1[0] = x + y
        z2[0] = x - y

    def infer_shape(self, node, i0_shapes):
        return [i0_shapes[0], i0_shapes[0]]

    def grad(self, inputs, output_grads):
        og1, og2 = output_grads
        if og1 is None:
            og1 = theano.tensor.zeros_like(og2)
        if og2 is None:
            og2 = theano.tensor.zeros_like(og1)
        return [og1 + og2, og1 - og2]
# 3. Testing apparatus
import numpy
from theano.gof import Op, Apply
from theano import tensor, function, printing
from theano.tests import unittest_tools as utt
class TestProdOp(utt.InferShapeTester):

    rng = numpy.random.RandomState(43)

    def setUp(self):
        super(TestProdOp, self).setUp()
        self.op_class = ProdOp  # case 1

    def test_perform(self):
        x = theano.tensor.matrix()
        y = theano.tensor.matrix()
        f = theano.function([x, y], self.op_class()(x, y))
        x_val = numpy.random.rand(5, 4)
        y_val = numpy.random.rand(5, 4)
        out = f(x_val, y_val)
        assert numpy.allclose(x_val * y_val, out)

    def test_gradient(self):
        utt.verify_grad(self.op_class(), [numpy.random.rand(5, 4),
                                          numpy.random.rand(5, 4)],
                        n_tests=1, rng=TestProdOp.rng)

    def test_infer_shape(self):
        x = tensor.dmatrix()
        y = tensor.dmatrix()
        self._compile_and_check([x, y], [self.op_class()(x, y)],
                                [numpy.random.rand(5, 6),
                                 numpy.random.rand(5, 6)],
                                self.op_class)
class TestSumDiffOp(utt.InferShapeTester):

    rng = numpy.random.RandomState(43)

    def setUp(self):
        super(TestSumDiffOp, self).setUp()
        self.op_class = SumDiffOp

    def test_perform(self):
        x = theano.tensor.matrix()
        y = theano.tensor.matrix()
        f = theano.function([x, y], self.op_class()(x, y))
        x_val = numpy.random.rand(5, 4)
        y_val = numpy.random.rand(5, 4)
        out = f(x_val, y_val)
        assert numpy.allclose([x_val + y_val, x_val - y_val], out)

    def test_gradient(self):
        def output_0(x, y):
            return self.op_class()(x, y)[0]

        def output_1(x, y):
            return self.op_class()(x, y)[1]

        utt.verify_grad(output_0, [numpy.random.rand(5, 4),
                                   numpy.random.rand(5, 4)],
                        n_tests=1, rng=TestSumDiffOp.rng)
        utt.verify_grad(output_1, [numpy.random.rand(5, 4),
                                   numpy.random.rand(5, 4)],
                        n_tests=1, rng=TestSumDiffOp.rng)

    def test_infer_shape(self):
        x = tensor.dmatrix()
        y = tensor.dmatrix()
        # adapt the choice of the next instruction to the op under test
        self._compile_and_check([x, y], self.op_class()(x, y),
                                [numpy.random.rand(5, 6),
                                 numpy.random.rand(5, 6)],
                                self.op_class)
if __name__ == "__main__":
    unittest.main()
Frequently Asked Questions
==========================
TypeError: object of type 'TensorVariable' has no len()
-------------------------------------------------------
If you receive the following error, it is because the Python function
*__len__* cannot be implemented on Theano variables:

.. code-block:: python

    TypeError: object of type 'TensorVariable' has no len()

Python requires that *__len__* returns an integer, yet this cannot be done as
Theano's variables are symbolic. However, ``var.shape[0]`` can be used as a
workaround.

This error message cannot be made more explicit because the relevant aspects
of Python's internals cannot be modified.
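Python's constraint can be demonstrated directly: ``len()`` requires
``__len__`` to return an integer, so any object whose length is only known
symbolically cannot satisfy it (a minimal illustration with a hypothetical
class, unrelated to Theano's actual ``TensorVariable``):

```python
class SymbolicLength(object):
    """Stands in for a variable whose length is not a concrete integer."""
    def __len__(self):
        return "n"  # not an integer: Python itself rejects this

try:
    len(SymbolicLength())
    raised = False
except TypeError:
    raised = True

assert raised  # len() refuses a non-integer __len__ result
```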
Faster gcc optimization
-----------------------
You can enable faster gcc optimization with the ``cxxflags``. This list of flags was suggested on the mailing list::

    cxxflags=-march=native -O3 -ffast-math -ftree-loop-distribution -funroll-loops -ftracer

Use it at your own risk. Some people warned that the ``-ftree-loop-distribution`` optimization resulted in wrong results in the past.
Also the ``-march=native`` flag must be used with care if you have NFS. In that case, you MUST set the compiledir to a local path of the computer.
Related Projects
----------------

We try to list other Theano-related projects on this
`wiki page <https://github.com/Theano/Theano/wiki/Related-projects>`_.
"What are Theano's Limitations?"
--------------------------------
Theano offers a good amount of flexibility, but has some limitations too.
You must answer for yourself the following question: How can my algorithm be cleverly written
so as to make the most of what Theano can do?
Here is a list of some of the known limitations:
- *While*- or *for*-loops within an expression graph are supported, but only via
the :func:`theano.scan` op (which puts restrictions on how the loop body can
interact with the rest of the graph).
- Neither *goto* nor *recursion* is supported or planned within expression graphs.
PyCUDA/CUDAMat/Gnumpy compatibility
===================================
PyCUDA
======
Currently, PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called *CudaNdarray*; it supports
*strides*, but only the *float32* dtype. PyCUDA's implementation
is called *GPUArray*; it doesn't support *strides*, but it can deal with
all NumPy and CUDA dtypes.

We are currently working on having the same base object for both that will
also mimic NumPy. Until this is ready, here is some information on how to
use both objects in the same script.
Transfer
--------
You can use the ``theano.misc.pycuda_utils`` module to convert GPUArray to and
from CudaNdarray. The functions ``to_cudandarray(x, copyif=False)`` and
``to_gpuarray(x)`` return a new object that occupies the same memory space
as the original; otherwise, they raise a *ValueError*. Because GPUArrays don't
support strides, if the CudaNdarray is strided, we could copy it to
have a non-strided copy. The resulting GPUArray won't share the same
memory region. If you want this behavior, set ``copyif=True`` in
``to_gpuarray``.
Compiling with PyCUDA
---------------------
You can use PyCUDA to compile CUDA functions that work directly on
CudaNdarrays. Here is an example from the file
``theano/misc/tests/test_pycuda_theano_simple.py``:

.. code-block:: python

import sys

import numpy
import theano
import theano.sandbox.cuda as cuda_ndarray
import theano.misc.pycuda_init
import pycuda
import pycuda.driver as drv
import pycuda.gpuarray
def test_pycuda_theano():
"""Simple example with pycuda function and Theano CudaNdarray object."""
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(100).astype(numpy.float32)
b = numpy.random.randn(100).astype(numpy.float32)
# Test with Theano object
ga = cuda_ndarray.CudaNdarray(a)
gb = cuda_ndarray.CudaNdarray(b)
dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
multiply_them(dest, ga, gb,
block=(400, 1, 1), grid=(1, 1))
assert (numpy.asarray(dest) == a * b).all()
Theano Op using a PyCUDA function
---------------------------------

You can use a GPU function compiled with PyCUDA in a Theano op:

.. code-block:: python

    import numpy
    import theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda


    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)

        def __hash__(self):
            return hash(type(self))

        def __str__(self):
            return self.__class__.__name__

        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
                cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])

        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size) {
            o0[i] = i0[i] * 2;
        }
    }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [storage_map[v] for v in node.inputs]
            outputs = [storage_map[v] for v in node.outputs]

            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape != inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512, 1, 1), grid=grid)
            return thunk
CUDAMat
=======

There are functions for conversion between CUDAMat objects and Theano's
CudaNdArray objects. They obey the same principles as Theano's PyCUDA
functions and can be found in ``theano.misc.cudamat_utils.py``.

.. TODO: this statement is unclear:

WARNING: There is a peculiar problem associated with stride/shape with those
converters. In order to work, the test needs a *transpose* and *reshape*...

Gnumpy
======

There are conversion functions between Gnumpy *garray* objects and Theano
CudaNdArray objects. They are also similar to Theano's PyCUDA functions and
can be found in ``theano.misc.gnumpy_utils.py``.
...@@ -6,24 +6,26 @@
Derivatives in Theano
=====================

Computing Gradients
===================

Now let's use Theano for a slightly more sophisticated task: create a
function which computes the derivative of some expression *y* with
respect to its parameter *x*. To do this we will use the macro ``T.grad``.
For instance, we can compute the
gradient of :math:`x^2` with respect to :math:`x`. Note that:
:math:`d(x^2)/dx = 2 \cdot x`.

.. TODO: fix the vertical positioning of the expressions in the preceding paragraph

Here is the code to compute this gradient:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_4

>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy)  # print out the gradient prior to optimization
'((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
...@@ -33,10 +35,10 @@ array(8.0)

>>> f(94.2)
array(188.40000000000001)

In this example, we can see from ``pp(gy)`` that we are computing
the correct symbolic gradient.
``fill((x ** 2), 1.0)`` means to make a matrix of the same shape as
``x ** 2`` and fill it with ``1.0``.
.. note::

    The optimizer simplifies the symbolic gradient expression. You can see

...@@ -56,7 +58,7 @@ logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.

.. figure:: dlogistic.png

    A plot of the gradient of the logistic function, with *x* on the x-axis
    and :math:`ds(x)/dx` on the y-axis.

...@@ -71,133 +73,137 @@ logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.

array([[ 0.25      ,  0.19661193],
       [ 0.19661193,  0.10499359]])
In general, for any **scalar** expression *s*, ``T.grad(s, w)`` provides
the Theano expression for computing :math:`\frac{\partial s}{\partial w}`. In
this way Theano can be used for doing **efficient** symbolic differentiation
(as the expression returned by ``T.grad`` will be optimized during
compilation), even for functions with many inputs. (See `automatic
differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_
for a description of symbolic differentiation.)
.. note::

    The second argument of ``T.grad`` can be a list, in which case the
    output is also a list. The order in both lists is important: element
    *i* of the output list is the gradient of the first argument of
    ``T.grad`` with respect to the *i*-th element of the list given as second argument.

    The first argument of ``T.grad`` has to be a scalar (a tensor
    of size 1). For more information on the semantics of the arguments of
    ``T.grad`` and details about the implementation, see
    :ref:`this <libdoc_gradient>` section of the library.

    Additional information on the inner workings of differentiation may also be
    found in the more advanced tutorial :ref:`Extending Theano <extending>`.
Computing the Jacobian
======================

In Theano's parlance, the term *Jacobian* designates the tensor comprising the
first partial derivatives of the output of a function with respect to its inputs.
(This is a generalization of the so-called Jacobian matrix in Mathematics.)
Theano implements the :func:`theano.gradient.jacobian` macro that does all
that is needed to compute the Jacobian. The following text explains how
to do it manually.

In order to manually compute the Jacobian of some function *y* with
respect to some parameter *x* we need to use ``scan``. What we
do is to loop over the entries in *y* and compute the gradient of
*y[i]* with respect to *x*.
.. note::

    ``scan`` is a generic op in Theano that allows writing in a symbolic
    manner all kinds of recurrent equations. While creating
    symbolic loops (and optimizing them for performance) is a hard task,
    effort is being done to improve the performance of ``scan``. We
    shall return to :ref:`scan <tutloop>` later in this tutorial.
>>> x = T.dvector('x')
>>> y = x ** 2
>>> J, updates = theano.scan(lambda i, y, x: T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y, x])
>>> f = function([x], J, updates=updates)
>>> f([4, 4])
array([[ 8.,  0.],
       [ 0.,  8.]])
What we do in this code is to generate a sequence of *ints* from *0* to
``y.shape[0]`` using ``T.arange``. Then we loop through this sequence, and
at each step, we compute the gradient of element *y[i]* with respect to
*x*. ``scan`` automatically concatenates all these rows, generating a
matrix which corresponds to the Jacobian.
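To build intuition for what the ``scan`` loop above produces, here is a plain-NumPy sketch (illustration only, not Theano code) that constructs the same Jacobian row by row for the elementwise expression *y = x\*\*2*:

```python
import numpy as np

x = np.array([4.0, 4.0])

# Row i of the Jacobian is the gradient of y[i] = x[i] ** 2 with
# respect to the whole vector x: 2 * x[i] at position i, zero elsewhere.
rows = []
for i in range(x.shape[0]):
    row = np.zeros_like(x)
    row[i] = 2.0 * x[i]
    rows.append(row)
J = np.stack(rows)

print(J)  # [[ 8.  0.]
          #  [ 0.  8.]]
```

Stacking the per-output gradient rows is exactly what ``scan`` does symbolically when it concatenates the results of ``T.grad(y[i], x)``.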
.. note::

    There are some pitfalls to be aware of regarding ``T.grad``. One of them is that you
    cannot re-write the above expression of the Jacobian as
    ``theano.scan(lambda y_i, x: T.grad(y_i, x), sequences=y, non_sequences=x)``,
    even though from the documentation of ``scan`` this
    seems possible. The reason is that *y_i* will not be a function of
    *x* anymore, while *y[i]* still is.
Computing the Hessian
=====================

In Theano, the term *Hessian* has the usual mathematical meaning: it is the
matrix comprising the second-order partial derivatives of a function with scalar
output and vector input. Theano implements the :func:`theano.gradient.hessian`
macro that does all that is needed to compute the Hessian. The following text
explains how to do it manually.

You can compute the Hessian manually, similarly to the Jacobian. The only
difference is that now, instead of computing the Jacobian of some expression
*y*, we compute the Jacobian of ``T.grad(cost, x)``, where *cost* is some
scalar.
>>> x = T.dvector('x')
>>> y = x ** 2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy, x: T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
>>> f = function([x], H, updates=updates)
>>> f([4, 4])
array([[ 2.,  0.],
       [ 0.,  2.]])
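As a numerical sanity check (a NumPy sketch, not part of Theano's API), the same Hessian can be approximated row by row with finite differences of the gradient *g(x) = 2x*, mirroring how the ``scan`` loop differentiates ``gy[i]``:

```python
import numpy as np

def g(xv):
    # gradient of cost = sum(x ** 2)
    return 2.0 * xv

x = np.array([4.0, 4.0])
eps = 1e-6
n = x.size
H = np.empty((n, n))
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    # row i: numerical derivative of the gradient along direction e_i
    H[i] = (g(x + e) - g(x - e)) / (2.0 * eps)

print(H)  # close to [[2. 0.], [0. 2.]]
```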
Jacobian times a Vector
=======================

Sometimes we can express the algorithm in terms of Jacobians times vectors,
or vectors times Jacobians. Compared to evaluating the Jacobian and then
doing the product, there are methods that compute the desired results while
avoiding actual evaluation of the Jacobian. This can bring about significant
performance gains. A description of one such algorithm can be found here:

* Barak A. Pearlmutter, "Fast Exact Multiplication by the Hessian", *Neural
  Computation, 1994*

While in principle we would want Theano to identify these patterns automatically for us,
in practice, implementing such optimizations in a generic manner is extremely
difficult. Therefore, we provide special functions dedicated to these tasks.
R-operator
----------

The *R operator* is built to evaluate the product between a Jacobian and a
vector, namely :math:`\frac{\partial f(x)}{\partial x} v`. The formulation
can be extended even to *x* being a matrix, or a tensor in general, in which
case the Jacobian becomes a tensor and the product becomes some kind
of tensor product. Because in practice we end up needing to compute such
expressions in terms of weight matrices, Theano supports this more generic
form of the operation. In order to evaluate the *R-operation* of
expression *y*, with respect to *x*, multiplying the Jacobian with *v*,
you need to do something similar to this:
>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W, V, x], JV)
>>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0, 1])
array([ 2.,  2.])

:ref:`List <R_op_list>` of Ops that implement Rop.
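For this particular example the R-operation can be checked by hand: since *y = x·W*, the derivative of *y[j]* with respect to *W[i,k]* is *x[i]* when *j = k* and zero otherwise, so multiplying the Jacobian by *V* collapses to the vector-matrix product *x·V*. A small NumPy sketch of this check (illustration only, not Theano code):

```python
import numpy as np

x = np.array([0.0, 1.0])
V = np.array([[2.0, 2.0], [2.0, 2.0]])

# d y[j] / d W[i, k] = x[i] * (j == k), so the Jacobian-times-V
# product reduces to x . V; it has the shape of the output y
JV = x.dot(V)

print(JV)  # [ 2.  2.]
```

This matches the ``array([ 2.,  2.])`` returned by the Theano example above.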
...@@ -205,51 +211,50 @@ array([ 2., 2.])

L-operator
----------

Similarly to the *R-operator*, the *L-operator* computes a *row* vector times
the Jacobian. The mathematical formula would be :math:`v \frac{\partial
f(x)}{\partial x}`. The *L-operator* is also supported for generic tensors
(not only for vectors). Similarly, it can be implemented as follows:
>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([v, x], VJ)
>>> f([2, 2], [0, 1])
array([[ 0.,  0.],
       [ 2.,  2.]])
.. note::

    *v*, the *point of evaluation*, differs between the *L-operator* and the *R-operator*.
    For the *L-operator*, the point of evaluation needs to have the same shape
    as the output, whereas for the *R-operator* this point should
    have the same shape as the input parameter. Furthermore, the results of these two
    operations differ. The result of the *L-operator* is of the same shape
    as the input parameter, while the result of the *R-operator* has a shape similar
    to that of the output.
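The L-operation for the same *y = x·W* can also be verified by hand: contracting *v* with the Jacobian *d y[j] / d W[i,k] = x[i]* (when *j = k*) leaves the outer product of *x* and *v*, which has the shape of the input parameter *W*, as the note above states. A NumPy sketch of this check (illustration only, not Theano code):

```python
import numpy as np

x = np.array([0.0, 1.0])
v = np.array([2.0, 2.0])

# sum_j v[j] * d y[j] / d W[i, k] = x[i] * v[k], i.e. an outer product;
# the result has the same shape as the input parameter W
vJ = np.outer(x, v)

print(vJ)  # [[ 0.  0.]
           #  [ 2.  2.]]
```

This matches the ``array([[ 0.,  0.], [ 2.,  2.]])`` returned by the Theano example above.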
Hessian times a Vector
======================

If you need to compute the *Hessian times a vector*, you can make use of the
above-defined operators to do it more efficiently than actually computing
the exact Hessian and then performing the product. Due to the symmetry of the
Hessian matrix, you have two options that will
give you the same result, though these options might exhibit differing performance.
Hence, we suggest profiling the methods before using either one of the two:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> vH = T.grad(T.sum(gy * v), x)
>>> f = theano.function([x, v], vH)
>>> f([4, 4], [2, 2])
array([ 4.,  4.])
...@@ -257,10 +262,26 @@ or, making use of the *R-operator*:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy, x, v)
>>> f = theano.function([x, v], Hv)
>>> f([4, 4], [2, 2])
array([ 4.,  4.])
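Both variants compute the directional derivative of the gradient along *v*. A numerical sketch in plain NumPy (not Theano code) makes this concrete: for *y = sum(x\*\*2)* the gradient is *2x*, and perturbing it along *v* recovers *Hv = 2v*:

```python
import numpy as np

def grad_y(xv):
    # gradient of y = sum(x ** 2)
    return 2.0 * xv

x = np.array([4.0, 4.0])
v = np.array([2.0, 2.0])
eps = 1e-6

# R-operator applied to the gradient == directional derivative of grad_y along v
Hv = (grad_y(x + eps * v) - grad_y(x - eps * v)) / (2.0 * eps)

print(Hv)  # close to [ 4.  4.]
```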
Final Pointers
==============

* The ``grad`` function works symbolically: it receives and returns Theano variables.
* ``grad`` can be compared to a macro, since it can be applied repeatedly.
* Only scalar costs can be directly handled by ``grad``; arrays are handled through repeated applications.
* Built-in functions allow computing *vector times Jacobian* and *vector times Hessian* efficiently.
* Work is in progress on the optimizations required to compute efficiently the full
  Jacobian and the Hessian matrix, as well as the *Jacobian times vector*.
...@@ -5,20 +5,21 @@

Tutorial
========

Let us start an interactive session (e.g. with ``python`` or ``ipython``) and import Theano.

>>> from theano import *

Several of the symbols you will need to use are in the ``tensor`` subpackage
of Theano. Let us import that subpackage under a handy name like
``T`` (the tutorials will frequently use this convention).

>>> import theano.tensor as T

If that succeeded, you are ready for the tutorial; otherwise, check your
installation (see :ref:`install`).

Throughout the tutorial, bear in mind that there is a :ref:`glossary` as well
as *index* and *modules* links in the upper-right corner of each page to help
you out.
.. toctree::

...@@ -27,18 +28,18 @@ you out.

    numpy
    adding
    examples
    symbolic_graphs
    printing_drawing
    gradients
    modes
    loading_and_saving
    conditions
    loop
    sparse
    using_gpu
    gpu_data_convert
    aliasing
    shape_info
    debug_faq
    extending_theano
    faq
...@@ -6,8 +6,8 @@ Loading and Saving

==================

Python's standard way of saving class instances and reloading them
is the pickle_ mechanism. Many Theano objects can be *serialized* (and
*deserialized*) by ``pickle``; however, a limitation of ``pickle`` is that
it does not save the code or data of a class along with the instance of
the class being serialized. As a result, reloading objects created by a
previous version of a class can be really problematic.

...@@ -24,7 +24,7 @@ as you would in the course of any other Python program.

.. _pickle: http://docs.python.org/library/pickle.html

The Basics of Pickling
======================

The two modules ``pickle`` and ``cPickle`` have the same functionalities, but
...@@ -45,7 +45,7 @@ You can serialize (or *save*, or *pickle*) objects to a file with

.. note::

    If you want your saved object to be stored efficiently, don't forget
    to use ``cPickle.HIGHEST_PROTOCOL``. The resulting file can be
    dozens of times smaller than with the default protocol.
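As a minimal sketch of choosing the protocol (shown with the plain ``pickle`` module; under Python 2 the same calls work with ``cPickle``):

```python
import pickle

# some plain data standing in for a model's parameters
obj = {"W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]}

default = pickle.dumps(obj)                                    # default protocol
compact = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol

# both byte strings deserialize to an equal object
assert pickle.loads(default) == obj
assert pickle.loads(compact) == obj
```

The size advantage of ``HIGHEST_PROTOCOL`` becomes significant for large NumPy arrays such as weight matrices.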
.. note::

...@@ -81,7 +81,7 @@ For more details about pickle's usage, see

`Python documentation <http://docs.python.org/library/pickle.html#usage>`_.
Short-Term Serialization
========================

If you are confident that the class instance you are serializing will be

...@@ -114,7 +114,7 @@ For instance, you can define functions along the lines of:

        self.training_set = cPickle.load(file(self.training_set_file, 'rb'))
Long-Term Serialization
=======================

If the implementation of the class you want to save is quite unstable, for

...@@ -126,7 +126,7 @@ maybe defining the attributes you want to save, rather than the ones you

don't.

For instance, if the only parameters you want to save are a weight
matrix *W* and a bias *b*, you can define:

.. code-block:: python

...@@ -138,8 +138,8 @@ matrix ``W`` and a bias ``b``, you can define:

        self.W = W
        self.b = b
If at some point in time *W* is renamed to *weights* and *b* to
*bias*, the older pickled files will still be usable, if you update these
functions to reflect the change in name:

.. code-block:: python

...@@ -152,6 +152,6 @@ functions to reflect the change in name:

        self.weights = W
        self.bias = b
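Putting the two snippets together, here is a self-contained sketch (a hypothetical ``Model`` class, for illustration) showing that the stored state keeps the old ``(W, b)`` layout while the live attributes use the new names, so the rename stays invisible to pickling:

```python
import pickle

class Model(object):
    def __init__(self, W, b):
        self.weights = W   # attribute renamed from W
        self.bias = b      # attribute renamed from b

    def __getstate__(self):
        # keep pickling the old (W, b) layout so files written
        # before the rename remain loadable
        return (self.weights, self.bias)

    def __setstate__(self, state):
        W, b = state
        self.weights = W
        self.bias = b

m = Model([[1.0, 2.0]], [0.5])
m2 = pickle.loads(pickle.dumps(m, protocol=pickle.HIGHEST_PROTOCOL))
assert m2.weights == [[1.0, 2.0]] and m2.bias == [0.5]
```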
For more information on advanced use of ``pickle`` and its internals, see Python's
pickle_ documentation.

...@@ -4,4 +4,94 @@
Loop
====

Scan
====

- A general form of *recurrence*, which can be used for looping.
- *Reduction* and *map* (loop over the leading dimensions) are special cases of ``scan``.
- You ``scan`` a function along some input sequence, producing an output at each time-step.
- The function can see the *previous K time-steps* of your function.
- ``sum()`` could be computed by scanning the *z + x(i)* function over a list, given an initial state of *z = 0*.
- Often a *for* loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over *for* loops:

  - The number of iterations can be part of the symbolic graph.
  - It minimizes GPU transfers (if a GPU is involved).
  - It computes gradients through sequential steps.
  - It is slightly faster than a *for* loop in Python with a compiled Theano function.
  - It can lower the overall memory usage by detecting the actual amount of memory needed.

The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
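As a mental model only (plain Python, none of Theano's graph machinery), the recurrence that ``scan`` expresses resembles an accumulating loop over a sequence:

```python
def scan_like(fn, sequence, outputs_info):
    # sketch of scan's simplest recurrence: each step sees the
    # previous output and the current element of the sequence
    out = outputs_info
    results = []
    for elem in sequence:
        out = fn(out, elem)
        results.append(out)
    return results

# sum() as a scan: fold z + x(i) over the list, starting from z = 0
partial_sums = scan_like(lambda z, x_i: z + x_i, [1, 2, 3, 4], 0)
print(partial_sums)  # [1, 3, 6, 10]
```

Unlike this Python sketch, the real ``scan`` builds a symbolic graph, so the loop can run on the GPU and be differentiated through.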
**Scan Example: Computing pow(A, k)**

.. code-block:: python

    import theano
    import theano.tensor as T

    theano.config.warn.subtensor_merge_bug = False

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A ** 1 through A ** k.  Keep only the last
    # value. Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]
**Scan Example: Calculating a Polynomial**

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    theano.config.warn.subtensor_merge_bug = False

    coefficients = theano.tensor.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = theano.tensor.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                      coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)
    polynomial = components.sum()
    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
-------------------------------------------

**Exercise**

Run both examples.

Modify and execute the polynomial example to have the reduction done by ``scan``.

:download:`Solution<loop_solution_1.py>`
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Loop'
import numpy
import theano
import theano.tensor as tt
# 1. First example
theano.config.warn.subtensor_merge_bug = False
k = tt.iscalar("k")
A = tt.vector("A")
def inner_fct(prior_result, A):
return prior_result * A
# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
outputs_info=tt.ones_like(A),
non_sequences=A, n_steps=k)
# Scan has provided us with A ** 1 through A ** k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]
power = theano.function(inputs=[A, k], outputs=final_result,
updates=updates)
print power(range(10), 2)
# [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
# 2. Second example
coefficients = tt.vector("coefficients")
x = tt.scalar("x")
max_coefficients_supported = 10000
# Generate the components of the polynomial
full_range = tt.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
coeff * (free_var ** power),
sequences=[coefficients, full_range],
outputs_info=None,
non_sequences=x)
polynomial = components.sum()
calculate_polynomial1 = theano.function(inputs=[coefficients, x],
outputs=polynomial)
test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print calculate_polynomial1(test_coeff, 3)
# 19.0
# 3. Reduction performed inside scan
theano.config.warn.subtensor_merge_bug = False
coefficients = tt.vector("coefficients")
x = tt.scalar("x")
max_coefficients_supported = 10000
# Generate the components of the polynomial
full_range = tt.arange(max_coefficients_supported)
outputs_info = tt.as_tensor_variable(numpy.asarray(0, 'float64'))
components, updates = theano.scan(fn=lambda coeff, power, prior_value, free_var:
prior_value + (coeff * (free_var ** power)),
sequences=[coefficients, full_range],
outputs_info=outputs_info,
non_sequences=x)
polynomial = components[-1]
calculate_polynomial = theano.function(inputs=[coefficients, x],
outputs=polynomial, updates=updates)
test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print calculate_polynomial(test_coeff, 3)
# 19.0
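The third solution above threads a running sum through the loop via ``outputs_info``; the same accumulation pattern looks like this in plain Python (illustrative only):

```python
def polynomial_reduce(coefficients, x):
    # prior_value plays the role of the carried output (outputs_info);
    # each step adds one term to it, and only the last value is kept,
    # mirroring components[-1] in the scan-based solution.
    prior_value = 0.0
    for power, coeff in enumerate(coefficients):
        prior_value = prior_value + coeff * (x ** power)
    return prior_value
```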
.. _using_modes:

==========================================
Configuration Settings and Compiling Modes
==========================================
Configuration
=============
The ``config`` module contains several *attributes* that modify Theano's behavior. Many of these
attributes are examined during the import of the ``theano`` module and several are assumed to be
read-only.
*As a rule, the attributes in the* ``config`` *module should not be modified inside the user code.*
Theano's code comes with default values for these attributes, but you can
override them from your ``.theanorc`` file, and override those values in turn by
the :envvar:`THEANO_FLAGS` environment variable.
The order of precedence is:
1. an assignment to theano.config.<property>
2. an assignment in :envvar:`THEANO_FLAGS`
3. an assignment in the .theanorc file (or the file indicated in :envvar:`THEANORC`)
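The precedence order can be mimicked with a small resolver (a hypothetical sketch for illustration; ``effective_value`` is not part of Theano):

```python
def effective_value(name, runtime=None, env_flags=None, rcfile=None,
                    default=None):
    # The highest-priority source that defines the attribute wins:
    # 1. a theano.config.<property> assignment      (runtime)
    # 2. an assignment in THEANO_FLAGS              (env_flags)
    # 3. an assignment in the .theanorc file        (rcfile)
    # falling back to Theano's built-in default.
    for source in (runtime, env_flags, rcfile):
        if source and name in source:
            return source[name]
    return default
```

For example, a ``floatX`` set in ``THEANO_FLAGS`` overrides the one in ``.theanorc``, but is itself overridden by an assignment in user code.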
You can display the current/effective configuration at any time by printing
theano.config. For example, to see a list of all active configuration
variables, type this from the command-line:
.. code-block:: bash
python -c 'import theano; print theano.config' | less
For more detail, see :ref:`Configuration <libdoc_config>` in the library.
-------------------------------------------
**Exercise**
Consider the logistic regression:
.. code-block:: python
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])
# Compile expressions to functions
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w:w-0.01*gw, b:b-0.01*gb},
name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
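For comparison, the numeric recipe that this graph implements can be written in plain NumPy with hand-derived gradients (a sketch on a smaller random problem, not the Theano code itself; sizes and step count are reduced so it runs quickly):

```python
import numpy

rng = numpy.random.RandomState(0)
N, feats, steps, lr = 50, 100, 200, 0.05
X = rng.randn(N, feats)
y = rng.randint(0, 2, N).astype('float64')
w = 0.01 * rng.randn(feats)
b = 0.0

def cost(w, b):
    p_1 = 1.0 / (1.0 + numpy.exp(-(X.dot(w) + b)))
    p_1 = numpy.clip(p_1, 1e-9, 1 - 1e-9)        # guard the logs
    xent = -y * numpy.log(p_1) - (1 - y) * numpy.log(1 - p_1)
    return xent.mean() + 0.01 * (w ** 2).sum()

initial_cost = cost(w, b)
for _ in range(steps):
    p_1 = 1.0 / (1.0 + numpy.exp(-(X.dot(w) + b)))
    gw = X.T.dot(p_1 - y) / N + 0.02 * w         # grad of mean xent + L2
    gb = (p_1 - y).mean()
    w -= lr * gw                                  # the `updates` dict above
    b -= lr * gb
final_cost = cost(w, b)
```

In the Theano version, ``T.grad`` derives ``gw`` and ``gb`` automatically and the ``updates`` dictionary applies the same descent step.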
Modify and execute this example to run on CPU (the default) with floatX=float32 and
time the execution using the command line ``time python file.py``. Save your code
as it will be useful later on.
.. Note::
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* Cast inputs before storing them into a shared variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
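The upcasting pitfall mentioned above can be seen directly in NumPy: combining *int32* with *float32* promotes to *float64*, so an explicit cast is needed to stay in single precision (this only illustrates the promotion rule, not Theano's casting machinery):

```python
import numpy

a = numpy.arange(3, dtype='int32')
b = numpy.ones(3, dtype='float32')

# int32 combined with float32 silently upcasts to float64
upcast_dtype = (a * b).dtype

# an explicit cast keeps the computation in single precision
kept_dtype = (a.astype('float32') * b).dtype
```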
:download:`Solution<modes_solution_1.py>`
-------------------------------------------
Mode
====

Every time :func:`theano.function <function.function>` is called,
the symbolic relationships between the input and output Theano *variables*
are optimized and compiled. The way this compilation occurs
is controlled by the value of the ``mode`` parameter.
Theano defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
  implementations. This mode can take much longer than the other modes, but can identify
  several kinds of problems.
- ``'ProfileMode'``: Apply the same optimizations as ``FAST_RUN``, but print some profiling information.

The default mode is typically ``FAST_RUN``, but it can be controlled via
the configuration variable :attr:`config.mode`,
which can be overridden by passing the keyword argument to
================= =============================================================== ===============================================================================
short name        Full constructor                                                What does it do?
================= =============================================================== ===============================================================================
``FAST_COMPILE``  ``compile.mode.Mode(linker='py', optimizer='fast_compile')``    Python implementations only, quick and cheap graph transformations
``FAST_RUN``      ``compile.mode.Mode(linker='cvm', optimizer='fast_run')``       C implementations where available, all available graph transformations.
``DebugMode``     ``compile.debugmode.DebugMode()``                               Both implementations where available, all available graph transformations.
``ProfileMode``   ``compile.profilemode.ProfileMode()``                           C implementations where available, all available graph transformations, print profile information.
================= =============================================================== ===============================================================================
Linkers
=======

A mode is composed of 2 things: an optimizer and a linker. Some modes,
like ``ProfileMode`` and ``DebugMode``, add logic around the optimizer and
linker. ``ProfileMode`` and ``DebugMode`` use their own linker.

You can select which linker to use with the Theano flag :attr:`config.linker`.
Here is a table to compare the different linkers.
============= ========= ================= ========= ===
linker        gc [#gc]_ Raise error by op Overhead  Definition
============= ========= ================= ========= ===
cvm           yes       yes               "++"      As c|py, but the runtime algo to execute the code is in C
cvm_nogc      no        yes               "+"       As cvm, but without gc
c|py [#cpy1]_ yes       yes               "+++"     Try C code. If none exists for an op, use Python
c|py_nogc     no        yes               "++"      As c|py, but without gc
c             no        yes               "+"       Use only C code (if none available for an op, raise an error)
py            yes       yes               "+++"     Use only Python code
c&py [#cpy2]_ no        yes               "+++++"   Use C and Python code
ProfileMode   no        no                "++++"    Compute some extra profiling info
DebugMode     no        yes               VERY HIGH Make many checks on what Theano computes
============= ========= ================= ========= ===
.. [#gc] Garbage collection of intermediate results during computation.
   Otherwise, the memory used by the ops is kept between
   Theano function calls, in order not to
   reallocate memory, and lower the overhead (make it faster...).

.. [#cpy1] Default

.. [#cpy2] Deprecated
For more detail, see :ref:`Mode<libdoc_compile_mode>` in the library.
.. _using_debugmode:

Using DebugMode
===============
While normally you should use the ``FAST_RUN`` or ``FAST_COMPILE`` mode,
it is useful at first (especially when you are defining new kinds of
expressions or new optimizations) to run your code using the DebugMode
(available via ``mode='DebugMode'``). The DebugMode is designed to
run several self-checks and assertions that can help diagnose
possible programming errors leading to incorrect output. Note that
``DebugMode`` is much slower than ``FAST_RUN`` or ``FAST_COMPILE``, so
use it only during development (not when you launch 1000 processes on a
cluster!).
DebugMode is used as follows:

.. code-block:: python

    x = T.dvector('x')

    f = theano.function([x], 10 * x, mode='DebugMode')

    f([5])
    f([0])
If any problem is detected, DebugMode will raise an exception according to
what went wrong, either at call time (*f(5)*) or compile time (
``f = theano.function(x, 10 * x, mode='DebugMode')``). These exceptions
should *not* be ignored; talk to your local Theano guru or email the
users list if you cannot make the exception go away.

Some kinds of errors can only be detected for certain input value combinations.
In the example above, there is no way to guarantee that a future call to,
say, *f(-1)*, won't cause a problem. DebugMode is not a silver bullet.
.. TODO: repair the following link
If you instantiate DebugMode using the constructor (see :class:`DebugMode`)
rather than the keyword ``DebugMode``, you can configure its behaviour via
constructor arguments. The keyword version of DebugMode (which you get by
using ``mode='DebugMode'``) is quite strict.
For more detail, see :ref:`DebugMode<debugmode>` in the library.
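The core idea of DebugMode, running more than one implementation and comparing the results, can be sketched as a wrapper (a hypothetical helper for illustration, not Theano's API):

```python
def cross_checked(reference_impl, fast_impl, tol=1e-9):
    # Evaluate both implementations on every call, the way DebugMode
    # compares the Python and C versions of an op, and fail loudly on
    # any disagreement instead of silently returning a bad value.
    def wrapper(x):
        expected = reference_impl(x)
        actual = fast_impl(x)
        if abs(expected - actual) > tol:
            raise ValueError("implementations disagree: %r vs %r"
                             % (expected, actual))
        return actual
    return wrapper

f = cross_checked(lambda x: 10.0 * x, lambda x: x * 10.0)
# f(5.0) returns 50.0; a buggy fast_impl would raise instead
```

As with DebugMode, the cost is running everything at least twice, which is why such checking belongs in development, not production.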
.. _using_profilemode:

ProfileMode
===========
Besides checking for errors, another important task is to profile your
code. For this Theano uses a special mode called ProfileMode which has
to be passed as an argument to :func:`theano.function <function.function>`.
Using the ProfileMode is a three-step process.
.. note::

    To switch the default mode to ProfileMode, set the Theano flag
    :attr:`config.mode` to ProfileMode. In that case, when the Python
    process exits, it will automatically print the profiling
    information on the standard output.

    The memory profile of the output of each ``apply`` node can be enabled with the
    Theano flag :attr:`config.ProfileMode.profile_memory`.
For more detail, see :ref:`ProfileMode <profilemode>` in the library.
Creating a ProfileMode Instance
-------------------------------

First create a ProfileMode instance:
>>> from theano import ProfileMode
>>> profmode = theano.ProfileMode(optimizer='fast_run', linker=theano.gof.OpWiseCLinker())
implementation only, should use the gof.PerformLinker (or "py" for
short). On the other hand, a user wanting to profile his graph using C
implementations wherever possible should use the ``gof.OpWiseCLinker``
(or "c|py"). For testing the speed of your code we would recommend
using the ``fast_run`` optimizer and the ``gof.OpWiseCLinker`` linker.
Compiling your Graph with ProfileMode
-------------------------------------

Once the ProfileMode instance is created, simply compile your graph as you
would normally, by specifying the mode parameter.
>>> minst = m.make(mode=profmode)
Retrieving Timing Information
-----------------------------

Once your graph is compiled, simply run the program or operation you wish to
profile, then call ``profmode.print_summary()``. This will provide you with
the desired timing information, indicating where your graph is spending most
of its time. This is best shown through an example. Let's use our logistic
regression example.
Compiling the module with ``ProfileMode`` and calling ``profmode.print_summary()``
generates the following output:

.. code-block:: python
"""
This output has two components. In the first section, called the
*Apply-wise summary*, timing information is provided for the worst
offending ``Apply`` nodes. This corresponds to individual op applications
within your graph which took longest to execute (so if you use
``dot`` twice, you will see two entries there). In the second portion,
the *Op-wise summary*, the execution times of all ``Apply`` nodes executing
the same op are grouped together and the total execution time per op
is shown (so if you use ``dot`` twice, you will see only one entry
there corresponding to the sum of the time spent in each of them).

Finally, notice that ``ProfileMode`` also shows which ops were running a C
implementation.
For more detail, see :ref:`ProfileMode<libdoc_compile_mode>` in the library.
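The difference between the two summaries can be illustrated with plain Python over made-up timing records (the op names and numbers are hypothetical, not real profiler output):

```python
from collections import defaultdict

# (op name, seconds) for each individual Apply node execution
records = [("dot", 0.80), ("dot", 0.65), ("exp", 0.10), ("add", 0.05)]

# Apply-wise view: each application listed on its own, worst first
apply_wise = sorted(records, key=lambda r: r[1], reverse=True)

# Op-wise view: all applications of the same op summed together
op_wise = defaultdict(float)
for name, seconds in records:
    op_wise[name] += seconds

# "dot" appears twice Apply-wise, but once (0.80 + 0.65 s) Op-wise
```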
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Configuration Settings and Compiling Modes'
import numpy
import theano
import theano.tensor as tt
theano.config.floatX = 'float32'
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = tt.matrix("x")
y = tt.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1) # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b])
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=prediction,
name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
where each example has dimension 5. If this were the input of a
neural network, then the weights from the input to the first hidden
layer would represent a matrix of size (5, #hid).

Consider this array:
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]])

This is a 3x2 matrix, i.e. there are 3 rows and 2 columns.
To access the entry in the 3rd row (row #2) and the 1st column (column #0):

>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])[2, 0]
5.0
array([2., 4., 6.])
The smaller array ``b`` (actually a scalar here, which works like a 0-d array) in this case is *broadcasted* to the same size
as ``a`` during the multiplication. This trick is often useful in
simplifying how expressions are written. More detail about *broadcasting*
can be found in the `numpy user guide <http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html>`__.
.. _tutorial_printing_drawing:
==============================
Printing/Drawing Theano graphs
==============================
.. TODO: repair the defective links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
Theano also provides :func:`pydotprint` that creates a *png* image of the function.
You can read about them in :ref:`libdoc_printing`.
Consider again the logistic regression example, but notice the additional printing instructions.
The following output depicts the pre- and post-compilation graphs.
.. code-block:: python
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1) # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w, b])
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=prediction,
name="predict")
if any( [x.op.__class__.__name__=='Gemv' for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any( [x.op.__class__.__name__=='GpuGemm' for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
# Print the picture graphs
# after compilation
theano.printing.pydotprint(predict,
outfile="pics/logreg_pydotprint_predic.png",
var_with_name_simple=True)
# before compilation
theano.printing.pydotprint_variables(prediction,
outfile="pics/logreg_pydotprint_prediction.png",
var_with_name_simple=True)
theano.printing.pydotprint(train,
outfile="pics/logreg_pydotprint_train.png",
var_with_name_simple=True)
Pretty Printing
===============
``theano.printing.pprint(variable)``
>>> theano.printing.pprint(prediction) # (pre-compilation)
gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),TensorConstant{0.5})
Debug Printing
==============
``theano.printing.debugprint({fct, variable, list of variables})``
>>> theano.printing.debugprint(prediction) # (pre-compilation)
Elemwise{gt,no_inplace} [@181772236] ''
|Elemwise{true_div,no_inplace} [@181746668] ''
| |InplaceDimShuffle{x} [@181746412] ''
| | |TensorConstant{1} [@181745836]
| |Elemwise{add,no_inplace} [@181745644] ''
| | |InplaceDimShuffle{x} [@181745420] ''
| | | |TensorConstant{1} [@181744844]
| | |Elemwise{exp,no_inplace} [@181744652] ''
| | | |Elemwise{sub,no_inplace} [@181744012] ''
| | | | |Elemwise{neg,no_inplace} [@181730764] ''
| | | | | |dot [@181729676] ''
| | | | | | |x [@181563948]
| | | | | | |w [@181729964]
| | | | |InplaceDimShuffle{x} [@181743788] ''
| | | | | |b [@181730156]
|InplaceDimShuffle{x} [@181771788] ''
| |TensorConstant{0.5} [@181771148]
>>> theano.printing.debugprint(predict) # (post-compilation)
Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] '' 2
|dot [@183018796] '' 1
| |x [@183000780]
| |w [@183000812]
|InplaceDimShuffle{x} [@183133580] '' 0
| |b [@183000876]
|TensorConstant{[ 0.5]} [@183084108]
Picture Printing
================
>>> theano.printing.pydotprint_variables(prediction) # (pre-compilation)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_prediction.png
:width: 800 px
Notice that ``pydotprint()`` requires *Graphviz* and Python's ``pydot``.
>>> theano.printing.pydotprint(predict) # (post-compilation)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_predic.png
:width: 800 px
>>> theano.printing.pydotprint(train) # This is a small train example!
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_train.png
:width: 1500 px
Python tutorial
***************

In this documentation, we suppose that the reader knows Python. Here is a small list of Python
tutorials/exercises if you need to learn it or only need a refresher:
* `Python Challenge <http://www.pythonchallenge.com/>`__
* `Dive into Python <http://diveintopython.net/>`__
.. _tutorial_general_remarks:
====================
Some General Remarks
====================
Theano offers quite a bit of flexibility, but has some limitations too.
How should you write your algorithm to make the most of what Theano can do?
Limitations
-----------
- While- or for-Loops within an expression graph are supported, but only via
the :func:`theano.scan` op (which puts restrictions on how the loop body can
interact with the rest of the graph).
- Neither ``goto`` nor recursion is supported or planned within expression graphs.
.. _shape_info:

==========================================
How Shape Information is Handled by Theano
==========================================
It is not possible to strictly enforce the shape of a Theano variable when
building a graph, since the particular value provided at run-time for a parameter of a
Theano function may determine the shape of the Theano variables in its graph.

Currently, information regarding shape is used in two ways in Theano:
- When the exact shape is known, we use it to generate faster c code for - To generate faster C code for the 2d convolution on the CPU and the GPU,
the 2d convolution on the cpu and gpu. when the exact output shape is known in advance.
- To remove computations in the graph when we only want to know the - To remove computations in the graph when we only want to know the
shape, but not the actual value of a variable. This is done with the shape, but not the actual value of a variable. This is done with the
`Op.infer_shape <http://deeplearning.net/software/theano/extending/cop.html#Op.infer_shape>`_ `Op.infer_shape <http://deeplearning.net/software/theano/extending/cop.html#Op.infer_shape>`_
method. method.
Example:

.. code-block:: python

    import theano

    x = theano.tensor.matrix('x')
    f = theano.function([x], (x ** 2).shape)
    theano.printing.debugprint(f)
    #MakeVector [@43860304] '' 2
    # |Shape_i{0} [@43424912] '' 1
@@ -32,15 +32,15 @@
    # |Shape_i{1} [@43797968] '' 0
    # | |x [@43423568]
The output of this compiled function does not contain any multiplication
or power. Theano has removed them to compute directly the shape of the
output.
Shape Inference Problem
=======================
Theano propagates information about shape in the graph. Sometimes this
can lead to errors. Consider this example:

.. code-block:: python

@@ -48,9 +48,9 @@
    import numpy
    import theano

    x = theano.tensor.matrix('x')
    y = theano.tensor.matrix('y')
    z = theano.tensor.join(0, x, y)
    xv = numpy.random.rand(5, 4)
    yv = numpy.random.rand(3, 3)
    f = theano.function([x, y], z.shape)
    theano.printing.debugprint(f)
@@ -83,61 +83,61 @@
    # |y [@44540304]
    f(xv, yv)
    # Raises a dimensions mismatch error.
As you can see, when asking only for the shape of some computation (``join`` in the
example), an inferred shape is computed directly, without executing
the computation itself (there is no ``join`` in the first output of debugprint).
This makes the computation of the shape faster, but it can also hide errors. In
this example, the computation of the shape of the output of ``join`` is done only
based on the first input Theano variable, which leads to an error.
This might happen with other ops such as ``elemwise`` and ``dot``, for example.
Indeed, to perform some optimizations (for speed or stability, for instance),
Theano assumes that the computation is correct and consistent
in the first place, as it does here.
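To see how trusting the first input can hide a mismatch, here is a toy pure-Python sketch (this is *not* Theano's actual ``infer_shape`` code; the helper name ``infer_join_shape`` is invented) of join-style shape inference that sums sizes along the join axis and trusts the first input's sizes everywhere else:

```python
# Toy sketch (not Theano's real infer_shape): infer the shape of a join
# by summing sizes along the join axis and trusting the FIRST input's
# sizes on every other axis -- consistency is assumed, never checked.
def infer_join_shape(axis, input_shapes):
    out = list(input_shapes[0])
    out[axis] = sum(shape[axis] for shape in input_shapes)
    return tuple(out)

# (5, 4) joined with (3, 3) along axis 0: inference happily answers (8, 4),
# although actually executing the join would fail because 4 != 3.
print(infer_join_shape(0, [(5, 4), (3, 3)]))
```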
You can detect those problems by running the code without this
optimization, using the Theano flag
``optimizer_excluding=local_shape_to_shape_i``. You can also obtain the
same effect by running in the modes ``FAST_COMPILE`` (it will not apply this
optimization, nor most other optimizations) or ``DebugMode`` (it will test
before and after all optimizations, which is much slower).
Specifying Exact Shape
======================

Currently, specifying a shape is not as easy and flexible as we wish and we plan some
upgrade. Here is the current state of what can be done:
- You can pass the shape info directly to the ``ConvOp`` created
  when calling ``conv2d``. You simply set the parameters ``image_shape``
  and ``filter_shape`` inside the call. They must be tuples of 4
  elements. For example:

  .. code-block:: python

      theano.tensor.nnet.conv2d(..., image_shape=(7, 3, 5, 5), filter_shape=(2, 3, 4, 4))
- You can use the ``SpecifyShape`` op to add shape information anywhere in the
  graph. This allows some optimizations to be performed. In the following example,
  this makes it possible to precompute the Theano function to a constant.

  .. code-block:: python

      import theano

      x = theano.tensor.matrix()
      x_specify_shape = theano.tensor.specify_shape(x, (2, 2))
      f = theano.function([x], (x_specify_shape ** 2).shape)
      theano.printing.debugprint(f)
      # [2 2] [@72791376]
Future Plans
============

The parameter "constant shape" will be added to ``theano.shared()``. This is probably
the most frequent occurrence with ``shared`` variables. It will make the code
simpler and will make it possible to check that the shape does not change when
updating the ``shared`` variable.
@@ -4,9 +4,6 @@

Sparse
======

In general, *sparse* matrices provide the same functionality as regular
matrices. The difference lies in the way the elements of *sparse* matrices are
represented and stored in memory. Only the non-zero elements of the latter are stored.
@@ -5,27 +5,31 @@

Graph Structures
================

Theano Graphs
=============

Debugging or profiling code written in Theano is not that simple if you
do not know what goes on under the hood. This chapter is meant to
introduce you to a required minimum of the inner workings of Theano.
For more detail see :ref:`extending`.

The first step in writing Theano code is to write down all mathematical
relations using symbolic placeholders (**variables**). When writing down
these expressions you use operations like ``+``, ``-``, ``**``,
``sum()``, ``tanh()``. All these are represented internally as **ops**.
An *op* represents a certain computation on some type of inputs
producing some type of output. You can see it as a *function definition*
in most programming languages.

Theano builds internally a graph structure composed of interconnected
**variable** nodes, **op** nodes and **apply** nodes. An
*apply* node represents the application of an *op* to some
*variables*. It is important to draw the difference between the
definition of a computation represented by an *op* and its application
to some actual data, which is represented by the *apply* node. For more
detail about these building blocks refer to :ref:`variable`, :ref:`op`,
:ref:`apply`. Here is an example of a graph:
**Code**

@@ -50,9 +54,9 @@

   WARNING: hyper-links and ref's seem to break the PDF build when placed
   into this figure caption.

Arrows in this figure represent references to the
Python objects pointed at. The blue
box is an :ref:`Apply` node. Red boxes are :ref:`Variable` nodes. Green
circles are :ref:`Ops <op>`. Purple boxes are :ref:`Types <type>`.
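As a rough illustration of that pointer structure, here is a minimal pure-Python sketch (these are stand-ins, *not* Theano's real classes) of how variable, op and apply nodes reference each other:

```python
# Minimal stand-ins (NOT Theano's classes) for the three node kinds:
class Op(object):
    def __init__(self, name):
        self.name = name

class Variable(object):
    def __init__(self, name, owner=None):
        self.name = name
        self.owner = owner          # the Apply node that produced it, if any

class Apply(object):
    def __init__(self, op, inputs):
        self.op = op
        self.inputs = inputs
        # the op "applied" to the inputs yields new output variables
        self.outputs = [Variable(op.name + '_out', owner=self)]

x = Variable('x')
two = Variable('2')
y = Apply(Op('mul'), [x, two]).outputs[0]
# walking back from the output recovers the op and the inputs
assert y.owner.op.name == 'mul' and y.owner.inputs[0] is x
```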
@@ -63,17 +67,17 @@

Take for example the following code:

.. code-block:: python

    x = T.dmatrix('x')
    y = x * 2.
If you enter ``type(y.owner)`` you get ``<class 'theano.gof.graph.Apply'>``,
which is the apply node that connects the op and the inputs to get this
output. You can now print the name of the op that is applied to get
*y*:

>>> y.owner.op.name
'Elemwise{mul,no_inplace}'

Hence, an elementwise multiplication is used to compute *y*. This
multiplication is done between the inputs:
>>> len(y.owner.inputs)

@@ -85,7 +89,7 @@
InplaceDimShuffle{x,x}.0
Note that the second input is not 2 as we would have expected. This is
because 2 was first :term:`broadcasted <broadcasting>` to a matrix of
the same shape as *x*. This is done by using the op ``DimShuffle``:
>>> type(y.owner.inputs[1])
<class 'theano.tensor.basic.TensorVariable'>
@@ -97,9 +101,9 @@
[2.0]
Starting from this graph structure it is easier to understand how
*automatic differentiation* proceeds and how the symbolic relations
can be *optimized* for performance or stability.
Automatic Differentiation
=========================

@@ -107,16 +111,19 @@
Having the graph structure, computing automatic differentiation is
simple. The only thing :func:`tensor.grad` has to do is to traverse the
graph from the outputs back towards the inputs through all *apply*
nodes (*apply* nodes are those that define which computations the
graph does). For each such *apply* node, its *op* defines
how to compute the *gradient* of the node's outputs with respect to its
inputs. Note that if an *op* does not provide this information,
it is assumed that the *gradient* is not defined.
Using the
`chain rule <http://en.wikipedia.org/wiki/Chain_rule>`_,
these gradients can be composed in order to obtain the expression of the
*gradient* of the graph's output with respect to the graph's inputs.

A following section of this tutorial will examine the topic of :ref:`differentiation <tutcomputinggrads>`
in greater detail.
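The chain-rule composition can be illustrated with a small sketch (a generic reverse-mode walk over a chain of unary ops, *not* :func:`tensor.grad` itself; the function name is invented): each node reports the local derivative of its output with respect to its input, and walking from the output back to the input multiplies them together.

```python
# Toy reverse-mode walk (not tensor.grad): each "op" in a chain reports
# the local derivative of its output w.r.t. its input, and the chain
# rule composes them from the output back towards the input.
def backprop_chain(local_derivatives):
    grad = 1.0
    for d in reversed(local_derivatives):
        grad *= d
    return grad

# f(g(x)) with g(x) = 2x (so g' = 2) and f(u) = u**2 at u = g(3) = 6
# (so f' = 2u = 12): d f(g(x)) / dx = 12 * 2 = 24
print(backprop_chain([2.0, 12.0]))
```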
Optimizations
=============

@@ -124,7 +131,7 @@
When compiling a Theano function, what you give to the
:func:`theano.function <function.function>` is actually a graph
(starting from the output variables you can traverse the graph up to
the input variables). While this graph structure shows how to compute
the output from the input, it also offers the possibility to improve the
way this computation is carried out. The way optimizations work in
@@ -135,4 +142,27 @@
identical subgraphs and ensure that the same values are not computed
twice, or reformulate parts of the graph to a GPU-specific version.

For example, one (simple) optimization that Theano uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
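An optimization of this kind is just a pattern match and replace on the expression graph. Here is a toy sketch (an invented tuple-based expression encoding, *not* Theano's optimizer) of that particular rewrite:

```python
# Toy sketch of a rewrite rule (not Theano's optimizer): on expressions
# encoded as nested tuples, replace the pattern (x * y) / y by x.
def simplify_div_of_mul(expr):
    if (isinstance(expr, tuple) and expr[0] == 'div'
            and isinstance(expr[1], tuple) and expr[1][0] == 'mul'
            and expr[1][2] == expr[2]):
        return expr[1][1]       # (x * y) / y  ->  x
    return expr

print(simplify_div_of_mul(('div', ('mul', 'x', 'y'), 'y')))
```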
Further information regarding the optimization :ref:`process <optimization>`
and the specific :ref:`optimizations <optimizations>` that are applicable
is available in the library documentation and on the entrance page of the documentation, respectively.
**Example**
Symbolic programming involves a change of paradigm: it will become clearer
as we apply it. Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0, 1, 2]) # prints `array([0,2,1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
@@ -5,13 +5,16 @@

Using the GPU
=============

For an introductory discussion of *Graphical Processing Units* (GPU) and their use for
intensive parallel computation purposes, see `GPGPU <http://en.wikipedia.org/wiki/GPGPU>`_.

One of Theano's design goals is to specify computations at an
abstract level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take advantage of
this flexibility is in carrying out calculations on an Nvidia graphics card when
the device present in the computer is CUDA-enabled.

Setting Up CUDA
---------------

If you have not done so already, you will need to install Nvidia's
@@ -41,6 +44,7 @@

file and run it.

.. code-block:: python

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], T.exp(x))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
@@ -52,38 +56,46 @@
    else:
        print 'Used the gpu'
The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to
make sure that the input *x* is stored on the graphics device.
.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
    [Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
    Looping 1000 times took 3.06635117531 seconds
    Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
      1.62323284]
    Used the cpu

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
    Looping 1000 times took 0.638810873032 seconds
    Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
      1.62323296]
    Used the gpu
Note that GPU operations in Theano require for now ``floatX`` to be *float32* (see also below).

Returning a Handle to Device-Allocated Data
-------------------------------------------
The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray, which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in ``device=gpu``, but
if you don't mind less portability, you might gain a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The ``gpu_from_host``
op means "copy the input from the host to the GPU" and it is optimized away
after ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_2

@@ -101,6 +113,7 @@
    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
@@ -117,32 +130,42 @@

The output from this program is

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
    Looping 1000 times took 0.34898686409 seconds
    Result is <CudaNdarray object at 0x6a7a5f0>
    Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
      1.62323296]
    Used the gpu
Here we've shaved off about 50% of the run-time by simply not copying the
resulting array back to the host.

The object returned by each function call is now not a NumPy array but a
"CudaNdarray" which can be converted to a NumPy ndarray by the normal
NumPy casting mechanism.
Running the GPU at Full Speed
------------------------------

To really get maximum performance in this simple example, we need to use an
:class:`Out <function.Out>` instance with the flag ``borrow=True`` to tell Theano not to copy
the output it returns to us. This is because Theano pre-allocates memory for internal use
(like working buffers), and by default will never return a result that is aliased to one of
its internal buffers: instead, it will copy the buffers associated to outputs into newly
allocated memory at each function call. This is to ensure that subsequent function calls will
not overwrite previously computed outputs. Although this is normally what you want, our last
example was so simple that it had the unwanted side-effect of really slowing things down.
..
    TODO:
    The story here about copying and working buffers is misleading and potentially not correct
    ... why exactly does borrow=True cut 75% of the runtime ???
    Answer by Olivier D: it sounds correct to me -- memory allocations must be slow.
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_3

.. code-block:: python

@@ -152,7 +175,7 @@
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
@@ -160,6 +183,7 @@
    f = function([],
            Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
                borrow=True))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
@@ -172,34 +196,51 @@
    else:
        print 'Used the gpu'
Running this version of the code takes just over 0.05 seconds, that is 60x faster than
the CPU implementation!
With the flag ``borrow=False``:

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python using_gpu_solution_1.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
    Looping 1000 times took 0.31614613533 seconds
    Result is <CudaNdarray object at 0x77e9270>
    Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
      1.62323296]
    Used the gpu

With the flag ``borrow=True``:

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python using_gpu_solution_1.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
    Looping 1000 times took 0.0502779483795 seconds
    Result is <CudaNdarray object at 0x83e5cb0>
    Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
      1.62323296]
    Used the gpu
This version of the code including the flag ``borrow=True`` is slightly less safe, because if we had saved
the *r* returned from one function call, we would have to take care and remember that its value might
be over-written by a subsequent function call. Although ``borrow=True`` makes a dramatic difference
in this example, be careful! The advantage of ``borrow=True`` is much weaker in larger graphs, and
there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.
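The aliasing hazard can be reproduced with a deliberately unsafe pure-Python toy (this is *not* Theano code; the buffer and function names are invented): a function that returns its internal working buffer behaves like a ``borrow=True`` output.

```python
# Toy illustration of the borrow=True hazard (plain Python, not Theano):
# this function returns its internal working buffer instead of a copy,
# just like a borrowed Theano output.
_buffer = [0.0, 0.0, 0.0]

def double(x):
    for i, v in enumerate(x):
        _buffer[i] = v * 2   # compute in place into the shared buffer
    return _buffer           # the caller gets an alias, not a copy

r1 = double([1.0, 1.0, 1.0])
saved = list(r1)             # a defensive copy keeps the value safe...
r2 = double([0.0, 1.0, 2.0])
# ...but r1 itself now silently holds the SECOND call's result:
assert r1 is r2 and r1 == [0.0, 2.0, 4.0] and saved == [2.0, 2.0, 2.0]
```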
What Can Be Accelerated on the GPU
----------------------------------
The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
what to expect right now:
* Only computations * Only computations
with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but
float64 computations are still relatively slow (Jan 2010). *float64* computations are still relatively slow (Jan 2010).
* Matrix * Matrix
multiplication, convolution, and large element-wise operations can be multiplication, convolution, and large element-wise operations can be
accelerated a lot (5-50x) when arguments are large enough to keep 30 accelerated a lot (5-50x) when arguments are large enough to keep 30
...@@ -208,7 +249,7 @@ what to expect right now:
  dimension-shuffling and constant-time reshaping will be equally fast on GPU
  as on CPU.
* Summation
  over rows/columns of tensors can be a little slower on the GPU than on the CPU.
* Copying
  of large quantities of data to and from a device is relatively slow, and
  often cancels most of the advantage of one or two accelerated functions on
...@@ -216,38 +257,358 @@ what to expect right now:
  the device pay off.
Tips for Improving Performance on GPU
-------------------------------------
* Consider
  adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
  GPU work.
* Prefer
  constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
  ``dscalar`` because the former will give you *float32* variables when
  ``floatX=float32``.
* Ensure
  that your output variables have a *float32* dtype and not *float64*. The
  more *float32* variables are in your graph, the more work the GPU can do for
  you.
* Minimize
  transfers to the GPU device by using ``shared`` *float32* variables to store
  frequently-accessed data (see :func:`shared()<shared.shared>`). When using
  the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
  eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try building your functions with
  ``mode='ProfileMode'``. This should print some timing information at program
  termination. Is time being used sensibly? If an op or Apply is
  taking more time than its share, and you know something about GPU
  programming, have a look at how it's implemented in theano.sandbox.cuda.
  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
  This can tell you whether not enough of your graph is on the GPU or whether there
  is too much memory transfer.
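As a concrete illustration of the first tip, a minimal ``.theanorc`` might look like the fragment below (``device=gpu`` assumes a CUDA-capable card is available; the section name follows Theano's config-file layout):

```ini
[global]
floatX = float32
device = gpu
```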
Changing the Value of Shared Variables
--------------------------------------

To change the value of a ``shared`` variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
-------------------------------------------
**Exercise**
Consider again the logistic regression:
.. code-block:: python
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])
# Compile expressions to functions
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w:w-0.01*gw, b:b-0.01*gb},
name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
Modify and execute this example to run on GPU with ``floatX=float32`` and
time it using the command line ``time python file.py``. (Of course, you may use some of your answer
to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``ProfileMode``)
What can be done to further increase the speed of the GPU version? Put your ideas to test.
.. Note::
* Only 32 bit floats are currently supported (development is in progress).
* ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process.
* Use the Theano flag ``device=gpu`` to require use of the GPU device.
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* Cast inputs before storing them into a ``shared`` variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
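The *int32* / *float32* promotion mentioned in the note can be observed with plain NumPy, whose array promotion rules agree with Theano's on this point: combining the two dtypes yields *float64*, while a manual cast keeps everything in *float32*.

```python
import numpy

i = numpy.ones(3, dtype='int32')
f = numpy.ones(3, dtype='float32')

print((i + f).dtype)                    # float64: int32 combined with float32 is promoted
print((i.astype('float32') + f).dtype)  # float32: the manual cast avoids the promotion
```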
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Theano, which is a meta-programmer, there are:
* **CUDA**: GPU programming API by NVIDIA based on extension to C (CUDA C)
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) are maturing.
* **OpenCL**: multi-vendor version of CUDA
* More general, standardized.
* Fewer libraries, lesser spread.
* **PyCUDA**: Python bindings to the CUDA driver interface, allowing access to Nvidia's CUDA parallel
  computation API from Python
* Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (``pycuda.driver.SourceModule``).
GPU memory buffer (``pycuda.gpuarray.GPUArray``).
Helpful documentation.
* Completeness: Binding to all of CUDA's driver API.
* Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
* Speed: PyCUDA's base layer is written in C++.
* Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
Makes it much easier to write correct, leak- and crash-free code.
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory
allocated in it is also freed).
(This is adapted from PyCUDA's `documentation <http://documen.tician.de/pycuda/index.html>`_
and Andreas Kloeckner's `website <http://mathema.tician.de/software/pycuda>`_ on PyCUDA.)
* **PyOpenCL**: PyCUDA for OpenCL
Learning to Program with PyCUDA
-------------------------------
If you are already proficient in the C programming language, you
may easily leverage your knowledge by learning, first, to program a GPU with the
CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA
API with a Python wrapper.
The following resources will assist you in this learning process:
* **CUDA API and CUDA C: Introductory**
* `NVIDIA's slides <http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf>`_
* `Stein's (NYU) slides <http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf>`_
* **CUDA API and CUDA C: Advanced**
* `MIT IAP2009 CUDA <https://sites.google.com/site/cudaiap2009/home>`_
(full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
* `Course U. of Illinois <http://courses.engr.illinois.edu/ece498/al/index.html>`_
(full lectures, Kirk-Hwu textbook)
* `NVIDIA's knowledge base <http://www.nvidia.com/content/cuda/cuda-developer-resources.html>`_
(extensive coverage, levels from introductory to advanced)
* `practical issues <http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s>`_
(on the relationship between grids, blocks and threads; see also linked and related issues on same page)
* `CUDA optimisation <http://www.gris.informatik.tu-darmstadt.de/cuda-workshop/slides.html>`_
* **PyCUDA: Introductory**
* `Kloeckner's slides <http://www.gputechconf.com/gtcnew/on-demand-gtc.php?sessionTopic=&searchByKeyword=kloeckner&submit=&select=+&sessionEvent=2&sessionYear=2010&sessionFormat=3>`_
* `Kloeckner's website <http://mathema.tician.de/software/pycuda>`_
* **PyCUDA: Advanced**
* `PyCUDA documentation website <http://documen.tician.de/pycuda/>`_
The following examples give a foretaste of programming a GPU with PyCUDA. Once
you feel competent enough, you may try your hand at the corresponding exercises.
**Example: PyCUDA**
.. code-block:: python
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print dest
-------------------------------------------
**Exercise**
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
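As a hint for the launch-configuration arithmetic in that exercise, here is a sketch under the assumption of one thread per element and the era-typical limit of 512 threads per block: a (20, 10) matrix has 200 elements, so a single block suffices.

```python
import numpy

shape = (20, 10)
size = int(numpy.prod(shape))            # 200 elements, one thread per element
threads_per_block = 512                  # typical per-block limit on hardware of that era
blocks = int(numpy.ceil(size / float(threads_per_block)))
print(size, blocks)                      # 200 elements fit in 1 block
```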
-------------------------------------------
.. _pyCUDA_theano:
**Example: Theano + PyCUDA**
.. code-block:: python
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda
class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i<size){
            o0[i] = i0[i]*2;
        }
    }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk
Use this code to test it:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv=numpy.ones((4,5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
-------------------------------------------
**Exercise**
Run the preceding example.
Modify and execute to multiply two matrices: *x* * *y*.
Modify and execute to return two outputs: *x + y* and *x - y*.
(Notice that Theano's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. so as not to constrain the input to be *C-contiguous*).
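The *C-contiguous* constraint in that last exercise can be illustrated with NumPy alone: slicing with a step produces a strided view, and ``numpy.ascontiguousarray`` is one way to recover a contiguous buffer (at the price of a copy).

```python
import numpy

a = numpy.ones((4, 6), dtype='float32')
b = a[:, ::2]                     # strided view: every other column

print(a.flags['C_CONTIGUOUS'])    # True
print(b.flags['C_CONTIGUOUS'])    # False: elements are no longer adjacent in memory

c = numpy.ascontiguousarray(b)    # explicit copy with a contiguous layout
print(c.flags['C_CONTIGUOUS'])    # True
```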
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'
# 1. Raw results
#
# same code as in mode_solution_1 but run with following command lines:
# THEANO_FLAGS=mode=FAST_RUN,device=gpu time python program_name.py
# THEANO_FLAGS=mode=FAST_RUN,device=cpu time python program_name.py
# for GPU and CPU respectively
# typical time: 20 sec (CPU), 10 sec (GPU)
import numpy
import theano
import theano.tensor as tt
from theano import sandbox, Out
theano.config.floatX = 'float32'
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = tt.matrix("x")
y = tt.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1) # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b])
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
name="predict")
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=prediction,
name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
"""
# 2. Profiling
#
# same code as above but run with following command lines:
# THEANO_FLAGS=mode=ProfileMode,device=gpu python program_name.py
# THEANO_FLAGS=mode=ProfileMode,device=cpu python program_name.py
# for GPU and CPU
# 2.1 Profiling output for CPU computations
$ THEANO_FLAGS=mode=ProfileMode,device=cpu python program_name.py
Used the cpu
target values for D
prediction on D
Used the cpu
target values for D
prediction on D
ProfileMode.print_summary()
---------------------------
Time since import 12.586s
Theano compile time: 0.000s (0.0% since import)
Optimization time: 0.000s
Linker time: 0.000s
Theano fct call 5.147s (40.9% since import)
Theano Op time 3.595s 28.6%(since import) 69.8%(of fct call)
Theano function overhead in ProfileMode 1.552s 12.3%(since import) 30.2%(of fct call)
20002 Theano fct call, 0.000s per call
Rest of the time since import 7.440s 59.1%
Theano fct summary:
<% total fct time> <total time> <time per call> <nb call> <fct name>
49.9% 2.567s 2.57e-04s 10000 train
0.0% 0.000s 1.24e-04s 1 predict
0.0% 0.000s 1.26e-04s 1 predict
50.1% 2.579s 2.58e-04s 10000 train
Single Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply> <Op name>
59.3% 59.3% 2.133s 2.133s 5.33e-05s * 40002 1 6 <class 'theano.tensor.blas_c.CGemv'>
34.4% 93.8% 1.238s 3.371s 6.19e-06s * 200002 11 22 <class 'theano.tensor.elemwise.Elemwise'>
2.8% 96.6% 0.100s 3.471s 2.51e-06s * 40002 1 6 <class 'theano.tensor.basic.Alloc'>
2.1% 98.7% 0.075s 3.546s 1.26e-06s * 60002 2 8 <class 'theano.tensor.elemwise.DimShuffle'>
0.7% 99.3% 0.024s 3.571s 6.11e-07s * 40002 1 6 <class 'theano.tensor.opt.Shape_i'>
0.7% 100.0% 0.024s 3.595s 1.18e-06s * 20000 1 2 <class 'theano.tensor.elemwise.Sum'>
... (remaining 0 single Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op name>
59.3% 59.3% 2.133s 2.133s 5.33e-05s * 40002 6 CGemv{inplace}
18.1% 77.4% 0.650s 2.783s 3.25e-05s * 20000 2 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}
6.4% 83.9% 0.231s 3.014s 1.16e-05s * 20000 2 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)]
4.0% 87.8% 0.142s 3.157s 7.11e-06s * 20000 2 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)]
2.8% 90.6% 0.100s 3.257s 2.51e-06s * 40002 6 Alloc
1.4% 92.1% 0.052s 3.309s 1.30e-06s * 40002 6 InplaceDimShuffle{x}
1.1% 93.1% 0.038s 3.347s 1.92e-06s * 20000 2 Elemwise{Cast{float32}}
1.1% 94.2% 0.038s 3.386s 1.91e-06s * 20000 2 Elemwise{sub,no_inplace}
1.0% 95.2% 0.036s 3.421s 1.79e-06s * 20000 2 Elemwise{gt,no_inplace}
0.8% 96.0% 0.029s 3.450s 1.44e-06s * 20000 2 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
0.8% 96.8% 0.028s 3.479s 1.42e-06s * 20000 2 Elemwise{neg,no_inplace}
0.7% 97.5% 0.024s 3.503s 6.11e-07s * 40002 6 Shape_i{0}
0.7% 98.1% 0.024s 3.527s 1.18e-06s * 20000 2 Sum
0.6% 98.8% 0.023s 3.550s 1.16e-06s * 20000 2 InplaceDimShuffle{1,0}
0.6% 99.4% 0.023s 3.573s 1.15e-06s * 20000 2 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.6% 100.0% 0.022s 3.595s 1.08e-06s * 20000 2 Elemwise{inv,no_inplace}
0.0% 100.0% 0.000s 3.595s 1.19e-05s * 2 2 Elemwise{Composite{[Composite{[Composite{[Composite{[GT(scalar_sigmoid(i0), i1)]}(neg(i0), i1)]}(sub(i0, i1), i2)]}(neg(i0), i1, i2)]}}
... (remaining 0 Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Apply-wise summary:
<% of local_time spent at this position> <cumulative %%> <apply time> <cumulative seconds> <time per call> [*] <nb_call> <Apply position> <Apply Op name>
14.9% 14.9% 0.536s 0.536s 5.36e-05s * 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{1.0})
14.9% 29.8% 0.534s 1.070s 5.34e-05s * 10000 18 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
14.8% 44.6% 0.532s 1.602s 5.32e-05s * 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{1.0})
14.7% 59.3% 0.530s 2.132s 5.30e-05s * 10000 18 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
9.1% 68.4% 0.327s 2.460s 3.27e-05s * 10000 13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}(y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
9.0% 77.4% 0.323s 2.783s 3.23e-05s * 10000 13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}(y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
3.2% 80.6% 0.116s 2.899s 1.16e-05s * 10000 16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
3.2% 83.9% 0.116s 3.014s 1.16e-05s * 10000 16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
2.0% 85.8% 0.071s 3.086s 7.12e-06s * 10000 14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwise{neg,no_inplace}.0)
2.0% 87.8% 0.071s 3.156s 7.09e-06s * 10000 14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwise{neg,no_inplace}.0)
0.9% 88.8% 0.034s 3.190s 3.38e-06s * 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.9% 89.7% 0.034s 3.224s 3.37e-06s * 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.5% 90.2% 0.019s 3.243s 1.93e-06s * 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.5% 90.8% 0.019s 3.262s 1.92e-06s * 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.5% 91.3% 0.019s 3.282s 1.90e-06s * 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
... (remaining 35 Apply instances account for 8.71%(0.31s) of the runtime)
(*) Op is running a c implementation
Profile of Theano functions memory:
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
We skipped 4 theano function(s). Each of them used less then 1024B(theano flags ProfileMode.min_memory_size) of total intermediate memory size
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
# 2.2 Profiling output for GPU computations
$ THEANO_FLAGS=mode=ProfileMode,device=gpu python program_name.py
Using gpu device 0: GeForce GTX 580
Used the gpu
target values for D
prediction on D
Used the gpu
target values for D
prediction on D
ProfileMode.print_summary()
---------------------------
Time since import 25.682s
Theano compile time: 0.000s (0.0% since import)
Optimization time: 0.000s
Linker time: 0.000s
Theano fct call 17.052s (66.4% since import)
Theano Op time 14.548s 56.6%(since import) 85.3%(of fct call)
Theano function overhead in ProfileMode 2.505s 9.8%(since import) 14.7%(of fct call)
20002 Theano fct call, 0.001s per call
Rest of the time since import 8.630s 33.6%
Theano fct summary:
<% total fct time> <total time> <time per call> <nb call> <fct name>
50.0% 8.526s 8.53e-04s 10000 train
0.0% 0.001s 1.09e-03s 1 predict
50.0% 8.524s 8.52e-04s 10000 train
0.0% 0.001s 1.10e-03s 1 predict
Single Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply> <Op name>
54.8% 54.8% 7.968s 7.968s 1.33e-04s 60002 1 8 <class 'theano.sandbox.cuda.basic_ops.GpuFromHost'>
16.2% 71.0% 2.358s 10.325s 1.47e-05s * 160002 9 18 <class 'theano.sandbox.cuda.basic_ops.GpuElemwise'>
12.3% 83.3% 1.795s 12.120s 4.49e-05s * 40002 1 6 <class 'theano.sandbox.cuda.blas.GpuGemv'>
7.0% 90.4% 1.024s 13.144s 2.56e-05s 40002 1 6 <class 'theano.sandbox.cuda.basic_ops.HostFromGpu'>
5.0% 95.4% 0.728s 13.872s 1.82e-05s * 40002 1 6 <class 'theano.sandbox.cuda.basic_ops.GpuAlloc'>
2.1% 97.4% 0.300s 14.171s 1.50e-05s * 20000 1 2 <class 'theano.sandbox.cuda.basic_ops.GpuSum'>
1.3% 98.7% 0.189s 14.360s 3.15e-06s * 60002 3 8 <class 'theano.sandbox.cuda.basic_ops.GpuDimShuffle'>
0.6% 99.4% 0.094s 14.454s 2.35e-06s * 40002 2 6 <class 'theano.tensor.elemwise.Elemwise'>
0.3% 99.7% 0.048s 14.503s 1.21e-06s * 40002 1 6 <class 'theano.tensor.opt.Shape_i'>
0.3% 100.0% 0.045s 14.548s 2.25e-06s * 20000 1 2 <class 'theano.tensor.elemwise.DimShuffle'>
... (remaining 0 single Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op name>
54.8% 54.8% 7.968s 7.968s 1.33e-04s 60002 8 GpuFromHost
12.3% 67.1% 1.795s 9.763s 4.49e-05s * 40002 6 GpuGemv{inplace}
7.0% 74.1% 1.024s 10.786s 2.56e-05s 40002 6 HostFromGpu
5.0% 79.1% 0.728s 11.514s 1.82e-05s * 40002 6 GpuAlloc
2.3% 81.4% 0.334s 11.848s 1.67e-05s * 20000 2 GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)]
2.2% 83.6% 0.319s 12.167s 1.59e-05s * 20000 2 GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}
2.1% 85.7% 0.301s 12.468s 1.50e-05s * 20000 2 GpuElemwise{neg,no_inplace}
2.1% 87.8% 0.300s 12.768s 1.50e-05s * 20000 2 GpuSum{1}
2.0% 89.8% 0.292s 13.060s 1.46e-05s * 20000 2 GpuElemwise{inv,no_inplace}
1.9% 91.7% 0.283s 13.343s 1.42e-05s * 20000 2 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
1.9% 93.7% 0.281s 13.625s 1.41e-05s * 20000 2 GpuElemwise{sub,no_inplace}
1.9% 95.5% 0.273s 13.898s 1.37e-05s * 20000 2 GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)]
1.9% 97.4% 0.273s 14.171s 1.37e-05s * 20000 2 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
1.0% 98.4% 0.141s 14.313s 7.06e-06s * 20002 4 GpuDimShuffle{x}
0.4% 98.8% 0.057s 14.370s 2.87e-06s * 20002 4 Elemwise{gt,no_inplace}
0.3% 99.1% 0.048s 14.418s 1.21e-06s * 40002 6 Shape_i{0}
0.3% 99.4% 0.045s 14.463s 2.25e-06s * 20000 2 InplaceDimShuffle{x}
0.3% 99.7% 0.037s 14.500s 1.83e-06s * 20000 2 Elemwise{Cast{float32}}
0.2% 99.8% 0.025s 14.525s 1.24e-06s * 20000 2 GpuDimShuffle{0}
0.2% 100.0% 0.023s 14.548s 1.14e-06s * 20000 2 GpuDimShuffle{1,0}
... (remaining 1 Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Apply-wise summary:
<% of local_time spent at this position> <cumulative %%> <apply time> <cumulative seconds> <time per call> [*] <nb_call> <Apply position> <Apply Op name>
24.0% 24.0% 3.493s 3.493s 3.49e-04s 10000 1 GpuFromHost(x)
23.9% 47.9% 3.479s 6.972s 3.48e-04s 10000 1 GpuFromHost(x)
4.3% 52.3% 0.629s 7.602s 6.29e-05s * 10000 24 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
4.3% 56.6% 0.629s 8.231s 6.29e-05s * 10000 24 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
1.8% 58.4% 0.269s 8.499s 2.69e-05s * 10000 9 GpuGemv{inplace}(GpuAlloc.0, TensorConstant{1.0}, GpuFromHost.0, w, TensorConstant{1.0})
1.8% 60.3% 0.268s 8.767s 2.68e-05s * 10000 9 GpuGemv{inplace}(GpuAlloc.0, TensorConstant{1.0}, GpuFromHost.0, w, TensorConstant{1.0})
1.8% 62.1% 0.266s 9.033s 2.66e-05s 10000 18 HostFromGpu(GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}.0)
1.8% 63.9% 0.262s 9.296s 2.62e-05s 10000 18 HostFromGpu(GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}.0)
1.8% 65.7% 0.260s 9.555s 2.60e-05s 10000 3 GpuFromHost(y)
1.8% 67.5% 0.258s 9.813s 2.58e-05s 10000 3 GpuFromHost(y)
1.7% 69.2% 0.248s 10.061s 2.48e-05s 10000 20 HostFromGpu(GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0)
1.7% 70.9% 0.247s 10.309s 2.47e-05s 10000 20 HostFromGpu(GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0)
1.6% 72.5% 0.238s 10.547s 2.38e-05s 10000 12 GpuFromHost(Elemwise{Cast{float32}}.0)
1.6% 74.1% 0.237s 10.785s 2.37e-05s 10000 12 GpuFromHost(Elemwise{Cast{float32}}.0)
1.3% 75.4% 0.185s 10.969s 1.85e-05s * 10000 6 GpuAlloc(CudaNdarrayConstant{[ 1.58212732e-09]}, Shape_i{0}.0)
... (remaining 53 Apply instances account for 24.60%(3.58s) of the runtime)
(*) Op is running a c implementation
Some info useful for gpu:
Spent 1.211s(8.324%) in cpu Op, 13.337s(91.676%) in gpu Op and 0.000s(0.000%) transfert Op
Theano function input that are float64
<fct name> <input name> <input type> <str input>
List of apply that don't have float64 as input but have float64 in outputs
(Useful to know if we forgot some cast when using floatX=float32 or gpu code)
<Apply> <Apply position> <fct name> <inputs type> <outputs type>
Profile of Theano functions memory:
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
We skipped 4 theano function(s). Each of them used less then 1024B(theano flags ProfileMode.min_memory_size) of total intermediate memory size
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
# 3. Conclusions
Facts:
Examine and compare the 'Single Op-wise' summaries for the CPU and GPU runs. The transfer ops 'GpuFromHost' and 'HostFromGpu'
by themselves consume a large amount of extra time. Furthermore, notice that each of the GPU ops consumes more time than its
CPU counterpart. An additional experiment also confirms that adding an 'Out' instance (with borrow=True) in the GPU version
brings only a minor improvement in this situation.
Tentative conclusion:
The large number of external training steps (10000), each paying the host/GPU transfer cost, generates disproportionate GPU
overhead.
Tentative solution:
Include the training steps inside the definition of the Theano function.
Implement this solution and put it to the test.
"""
...@@ -30,7 +30,7 @@ _logger = logging.getLogger("theano.printing") ...@@ -30,7 +30,7 @@ _logger = logging.getLogger("theano.printing")
def debugprint(obj, depth=-1, print_type=False, def debugprint(obj, depth=-1, print_type=False,
file=None, ids='CHAR', stop_on_name=False): file=None, ids='CHAR', stop_on_name=False):
"""Print a computation graph to file """Print a computation graph as text to stdout or a file.
:type obj: Variable, Apply, or Function instance :type obj: Variable, Apply, or Function instance
:param obj: symbolic thing to print :param obj: symbolic thing to print
...@@ -56,12 +56,12 @@ def debugprint(obj, depth=-1, print_type=False, ...@@ -56,12 +56,12 @@ def debugprint(obj, depth=-1, print_type=False,
The first part of the text identifies whether it is an input The first part of the text identifies whether it is an input
(if a name or type is printed) or the output of some Apply (in which case (if a name or type is printed) or the output of some Apply (in which case
the Op is printed). the Op is printed).
The second part of the text is the memory location of the Variable. The second part of the text is an identifier of the Variable.
If print_type is True, we add a part containing the type of the Variable If print_type is True, we add a part containing the type of the Variable
If a Variable is encountered multiple times in the depth-first search, If a Variable is encountered multiple times in the depth-first search,
it is only printed recursively the first time. Later, just the Variable it is only printed recursively the first time. Later, just the Variable
and its memory location are printed. identifier is printed.
If an Apply has multiple outputs, then a '.N' suffix will be appended If an Apply has multiple outputs, then a '.N' suffix will be appended
to the Apply's identifier, to indicate which output a line corresponds to. to the Apply's identifier, to indicate which output a line corresponds to.
...@@ -461,7 +461,9 @@ pprint.assign(lambda pstate, r: hasattr(pstate, 'target') ...@@ -461,7 +461,9 @@ pprint.assign(lambda pstate, r: hasattr(pstate, 'target')
LeafPrinter()) LeafPrinter())
pp = pprint pp = pprint
"""
Print to the terminal a math-like expression.
"""
# colors not used: orange, amber#FFBF00, purple, pink, # colors not used: orange, amber#FFBF00, purple, pink,
# used by default: green, blue, grey, red # used by default: green, blue, grey, red
...@@ -530,7 +532,7 @@ def pydotprint(fct, outfile=None, ...@@ -530,7 +532,7 @@ def pydotprint(fct, outfile=None,
blue boxes are outputs variables of the graph blue boxes are outputs variables of the graph
grey boxes are variables that are not outputs and are not used grey boxes are variables that are not outputs and are not used
red ellipses are transfers from/to the gpu (ops with names GpuFromHost, red ellipses are transfers from/to the gpu (ops with names GpuFromHost,
HostFromGpu) HostFromGpu)
""" """
if colorCodes is None: if colorCodes is None:
......
...@@ -197,11 +197,12 @@ def scan(fn, ...@@ -197,11 +197,12 @@ def scan(fn,
* ``initial`` -- Theano variable that represents the initial * ``initial`` -- Theano variable that represents the initial
state of a given output. In case the output is not computed state of a given output. In case the output is not computed
recursively (think of a map) and does not require a initial recursively (think of a map) and does not require an initial
state this field can be skiped. Given that only the previous state this field can be skipped. Given that (only) the previous
time step of the output is used by ``fn`` the initial state time step of the output is used by ``fn``, the initial state
should have the same shape as the output. If multiple time **should have the same shape** as the output and **should not
taps are used, the initial state should have one extra involve a downcast** of the data type of the output. If multiple
time taps are used, the initial state should have one extra
dimension that should cover all the possible taps. For example dimension that should cover all the possible taps. For example
if we use ``-5``, ``-2`` and ``-1`` as past taps, at step 0, if we use ``-5``, ``-2`` and ``-1`` as past taps, at step 0,
``fn`` will require (by an abuse of notation) ``output[-5]``, ``fn`` will require (by an abuse of notation) ``output[-5]``,
......
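As a concrete illustration of the shape requirement described above (plain Python with hypothetical values, not Theano's actual indexing code): with past taps ``-5``, ``-2`` and ``-1``, the initial state needs a leading dimension of ``abs(min(taps)) == 5``, and at step 0 tap ``t`` reads entry ``5 + t`` of that state:

```python
taps = [-5, -2, -1]                 # past taps used by fn

# One extra leading dimension, large enough to cover the oldest tap:
n_init = abs(min(taps))             # == 5
# 5 entries, each with the shape (and dtype) of one output time step:
initial = [[0.0] * 3 for _ in range(n_init)]

# At step 0, tap t (the doc's "output[t]") reads initial[n_init + t]:
step0 = {t: initial[n_init + t] for t in taps}

assert step0[-5] is initial[0]      # oldest tap -> first entry
assert step0[-1] is initial[-1]     # most recent tap -> last entry
```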
...@@ -797,6 +797,7 @@ class T_using_gpu(unittest.TestCase): ...@@ -797,6 +797,7 @@ class T_using_gpu(unittest.TestCase):
rng = numpy.random.RandomState(22) rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX)) x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x)) f = function([], T.exp(x))
# print f.maker.fgraph.toposort()
t0 = time.time() t0 = time.time()
for i in xrange(iters): for i in xrange(iters):
r = f() r = f()
...@@ -813,7 +814,6 @@ class T_using_gpu(unittest.TestCase): ...@@ -813,7 +814,6 @@ class T_using_gpu(unittest.TestCase):
assert numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]) assert numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()])
def test_using_gpu_2(self): def test_using_gpu_2(self):
if theano.config.device.find('gpu') > -1: if theano.config.device.find('gpu') > -1:
...@@ -829,6 +829,7 @@ class T_using_gpu(unittest.TestCase): ...@@ -829,6 +829,7 @@ class T_using_gpu(unittest.TestCase):
rng = numpy.random.RandomState(22) rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX)) x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x))) f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
# print f.maker.fgraph.toposort()
t0 = time.time() t0 = time.time()
for i in xrange(iters): for i in xrange(iters):
r = f() r = f()
...@@ -844,9 +845,6 @@ class T_using_gpu(unittest.TestCase): ...@@ -844,9 +845,6 @@ class T_using_gpu(unittest.TestCase):
assert not numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]) assert not numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()])
def test_using_gpu_3(self): def test_using_gpu_3(self):
if theano.config.device.find('gpu') >-1: if theano.config.device.find('gpu') >-1:
...@@ -864,6 +862,7 @@ class T_using_gpu(unittest.TestCase): ...@@ -864,6 +862,7 @@ class T_using_gpu(unittest.TestCase):
f = function([], f = function([],
Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)), Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
borrow=True)) borrow=True))
# print f.maker.fgraph.toposort()
t0 = time.time() t0 = time.time()
for i in xrange(iters): for i in xrange(iters):
r = f() r = f()
......