Commit 3bffa49b authored by Eric Larsen, committed by Frederic


Correct Theano's tutorial: overhaul logic of exposition and integrate exercises and useful info found in CIFAR SC2011 accordingly
Parent a68ec1de
......@@ -13,7 +13,7 @@
Guide
=====
The config module contains many ``attributes`` that modify Theano's behavior. Many of these
attributes are consulted during the import of the ``theano`` module and many are assumed to be
read-only.
......
.. _adding:
====================
Baby steps - Algebra
====================
Adding two scalars
==================
......@@ -57,8 +56,6 @@ instruction. Behind the scenes, ``f`` was being compiled into C code.
type of both ``x`` and ``y`` is ``theano.tensor.ivector``.
-------------------------------------------
**Step 1**
>>> x = T.dscalar('x')
......@@ -91,8 +88,6 @@ given name. If you provide no argument, the symbol will be unnamed. Names
are not required, but they can help debugging.
-------------------------------------------
**Step 2**
The second step is to combine ``x`` and ``y`` into their sum ``z``:
......@@ -106,7 +101,6 @@ function to pretty-print out the computation associated to ``z``.
>>> print pp(z)
(x + y)
-------------------------------------------
**Step 3**
......@@ -174,3 +168,19 @@ with numpy arrays may be found :ref:`here <libdoc_tensor_creation>`.
You, the user---not the system architecture---have to choose whether your
program will use 32- or 64-bit integers (``i`` prefix vs. the ``l`` prefix)
and floats (``f`` prefix vs. the ``d`` prefix).
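These prefixes correspond to NumPy dtypes, which back Theano's tensor types. As a quick sanity check of the bit widths (a plain NumPy sketch, not Theano code):

```python
import numpy as np

# Theano's type-name prefixes correspond to NumPy dtypes:
# 'i' -> int32, 'l' -> int64, 'f' -> float32, 'd' -> float64.
for prefix, dtype in [('i', np.int32), ('l', np.int64),
                      ('f', np.float32), ('d', np.float64)]:
    print(prefix, np.dtype(dtype).itemsize * 8, 'bits')
```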
-------------------------------------------
**Exercise**
.. code-block:: python

    import theano
    a = theano.tensor.vector()     # declare variable
    out = a + a**10                # build symbolic expression
    f = theano.function([a], out)  # compile function
    print f([0,1,2])               # prints `array([0,2,1026])`
Modify and execute this code to compute this expression: a**2 + b**2 + 2*a*b.
-------------------------------------------
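A plain NumPy sketch of what the modified program should compute (the exercise itself asks for the symbolic Theano version with two declared vectors ``a`` and ``b``); note the algebraic identity a**2 + b**2 + 2*a*b == (a + b)**2, which holds elementwise:

```python
import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])
out = a**2 + b**2 + 2*a*b            # the target expression, elementwise
assert np.allclose(out, (a + b)**2)  # sanity check: it's just (a + b)**2
print(out)  # [25. 49. 81.]
```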
......@@ -4,7 +4,9 @@
Conditions
==========
IfElse vs switch
================
- Build the condition over symbolic variables.
- The IfElse Op takes a `boolean` condition and two variables as inputs.
......@@ -15,6 +17,7 @@ Conditions
**Example**
.. code-block:: python

    from theano import tensor as T
......@@ -49,7 +52,7 @@ Conditions
    f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
In this example, the IfElse Op spends less time (about half as much) than Switch
since it computes only one variable instead of both.
.. code-block:: python
......
......@@ -94,8 +94,6 @@ was reformatted for readability):
[ 1., 4.]])]
Setting a default value for an argument
=======================================
......@@ -368,3 +366,58 @@ Others Random Distributions
---------------------------
There are :ref:`other distributions implemented <libdoc_tensor_raw_random>`.
.. _logistic_regression:
A Real example: Logistic Regression
===================================
The preceding elements are put to work in this more realistic example. It will be used repeatedly.
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability that target = 1
    prediction = p_1 > 0.5                     # The prediction thresholded
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w**2).sum()   # The cost to minimize
    gw, gb = T.grad(cost, [w, b])              # Compute the gradient of the cost

    # Compile
    train = theano.function(
              inputs=[x, y],
              outputs=[prediction, xent],
              updates={w: w - 0.1*gw, b: b - 0.1*gb})
    predict = theano.function(inputs=[x], outputs=prediction)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
......@@ -310,8 +310,10 @@ You can also add this at the end of the test file:
t.setUp()
t.test_double_rop()
-------------------------------------------
**Exercise**
- Run the code in the file double_op.py.
- Modify and execute to compute: x * y
......
......@@ -7,23 +7,23 @@ PyCUDA/CUDAMat/Gnumpy compatibility
PyCUDA
======
Currently PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called CudaNdarray and supports
strides. It supports only the float32 dtype. PyCUDA's implementation
is called GPUArray and doesn't support strides. However, it can deal with all NumPy and CUDA dtypes.
We are currently working on having the same base object that will
mimic NumPy. Until this is ready, here is some information on how to
use both objects in the same script.
Transfer
--------
You can use the `theano.misc.pycuda_utils` module to convert GPUArray to and
from CudaNdarray. The functions `to_cudandarray(x, copyif=False)` and
`to_gpuarray(x)` return a new object that shares the same memory space
as the original; otherwise they raise a ValueError. Because GPUArray doesn't
support strides, a strided CudaNdarray must be copied to
obtain a non-strided version. The resulting GPUArray won't share the same
memory region. If you want this behavior, set `copyif=True` in
......@@ -32,29 +32,102 @@ memory region. If you want this behavior, set `copyif=True` in
Compiling with PyCUDA
---------------------
You can use PyCUDA to compile CUDA functions that work directly on
CudaNdarray. Here is an example from the file
`theano/misc/tests/test_pycuda_theano_simple.py`:
.. code-block:: python

    import sys

    import numpy

    import theano
    import theano.sandbox.cuda as cuda_ndarray
    import theano.misc.pycuda_init
    import pycuda
    import pycuda.driver as drv
    import pycuda.gpuarray

    def test_pycuda_theano():
        """Simple example with pycuda function and Theano CudaNdarray object."""
        from pycuda.compiler import SourceModule
        mod = SourceModule("""
        __global__ void multiply_them(float *dest, float *a, float *b)
        {
          const int i = threadIdx.x;
          dest[i] = a[i] * b[i];
        }
        """)

        multiply_them = mod.get_function("multiply_them")

        a = numpy.random.randn(100).astype(numpy.float32)
        b = numpy.random.randn(100).astype(numpy.float32)

        # Test with Theano objects: launch one thread per element.
        ga = cuda_ndarray.CudaNdarray(a)
        gb = cuda_ndarray.CudaNdarray(b)
        dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
        multiply_them(dest, ga, gb,
                      block=(100, 1, 1), grid=(1, 1))
        assert (numpy.asarray(dest) == a * b).all()
Theano op using PyCUDA function
-------------------------------
You can use a GPU function compiled with PyCUDA in a Theano op. Here is an example:
.. code-block:: python

    import numpy

    import theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda

    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)

        def __hash__(self):
            return hash(type(self))

        def __str__(self):
            return self.__class__.__name__

        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
                cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])

        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
            __global__ void my_fct(float * i0, float * o0, int size) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                if (i < size) {
                    o0[i] = i0[i] * 2;
                }
            }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [storage_map[v] for v in node.inputs]
            outputs = [storage_map[v] for v in node.outputs]

            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape != inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512, 1, 1), grid=grid)
            return thunk
CUDAMat
=======
There are functions for conversion between CUDAMat and Theano CudaNdarray objects.
They obey the same principles as PyCUDA's functions and can be found in
`theano.misc.cudamat_utils.py`.
WARNING: There is a strange problem associated with stride/shape with those converters.
To work, the test needs a transpose and reshape...
Gnumpy
======
There are conversion functions between gnumpy garray objects and Theano CudaNdarray.
They are also similar to PyCUDA's and can be found in `theano.misc.gnumpy_utils.py`.
......@@ -264,3 +264,19 @@ or, making use of the *R-operator*:
>>> f([4,4],[2,2])
array([ 4., 4.])
Final notes
===========
* ``T.grad`` works symbolically: it takes and returns Theano variables.
* It can be compared to a macro, since it can be applied multiple times.
* It handles scalar costs only.
* However, a simple recipe allows one to efficiently compute vector x Jacobian and vector x Hessian products.
* Work is in progress on the missing optimizations needed to efficiently compute the full
  Jacobian and Hessian matrices, as well as Jacobian times vector products.
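The vector x Jacobian recipe mentioned above can be checked numerically with plain NumPy (a sketch, not Theano's symbolic machinery): for an elementwise function the Jacobian is diagonal, so the product can be formed without ever building the matrix.

```python
import numpy as np

x = np.array([1., 2., 3.])
v = np.array([0.5, -1., 2.])

# For y = x**2 elementwise, the Jacobian is J = diag(2*x).
J = np.diag(2 * x)

full = v @ J          # explicit vector x Jacobian product (builds J)
cheap = v * (2 * x)   # the same product without materializing J
assert np.allclose(full, cheap)
print(cheap)  # [ 1. -4. 12.]
```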
......@@ -27,10 +27,11 @@ you out.
numpy
adding
examples
symbolic_graphs
printing_drawing
gradients
modes
loading_and_saving
aliasing
conditions
loop
......
......@@ -4,4 +4,85 @@
Loop
====
Scan
====
- A general form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of scan.
- You 'scan' a function along some input sequence, producing an output at each time-step.
- The function can see the **previous K time-steps** of your output.
- ``sum()`` could be computed by scanning the z + x(i) function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over for loops:

  - The number of iterations can be part of the symbolic graph.
  - It minimizes GPU transfers when a GPU is involved.
  - Gradients can be computed through the sequential steps.
  - It is slightly faster than using a for loop in Python with a compiled Theano function.
  - It can lower overall memory usage by detecting the actual amount of memory needed.
The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
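The sum-as-scan idea above can be sketched in plain Python (hedged: this only mimics what ``scan`` does, without Theano's symbolic graph): apply the recurrence z = z + x(i) along the sequence, starting from z = 0, and keep each step's output.

```python
def scan_like(fn, sequence, initial):
    """Apply fn(prev, elem) along the sequence, keeping every step's output."""
    outputs = []
    state = initial
    for elem in sequence:
        state = fn(state, elem)
        outputs.append(state)
    return outputs

# sum() as a scan: recurrence z = z + x(i), with initial state z = 0
partial_sums = scan_like(lambda z, x: z + x, [1, 2, 3, 4], 0)
print(partial_sums)      # [1, 3, 6, 10]
print(partial_sums[-1])  # 10, the sum of the sequence
```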
**Scan Example: Computing pow(A,k)**
.. code-block:: python

    import theano
    import theano.tensor as T

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A**1 through A**k.  Keep only the last
    # value.  Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [ 0.  1.  4.  9. 16. 25. 36. 49. 64. 81.]
**Scan Example: Calculating a Polynomial**
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    coefficients = T.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = T.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                         coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)
    polynomial = components.sum()
    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
-------------------------------------------
**Exercise**
- Run both examples.
- Modify and execute the polynomial example to have the reduction done by scan.
-------------------------------------------
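A plain Python sketch of the reduction that the exercise asks for (hedged: the actual exercise is to express this with ``theano.scan``; here the same accumulation is written as the recurrence acc = acc + coeff * x**power):

```python
coefficients = [1.0, 0.0, 2.0]
x = 3.0

# Accumulate coeff * x**power step by step instead of summing at the end;
# this is the recurrence a scan-based reduction would carry.
acc = 0.0
for power, coeff in enumerate(coefficients):
    acc = acc + coeff * x**power
print(acc)  # 19.0  (1 + 0*3 + 2*3**2)
```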
.. _using_modes:
==========================================
Configuration settings and Compiling modes
==========================================
Configuration
=============
The config module contains many ``attributes`` that modify Theano's behavior. Many of these
attributes are consulted during the import of the ``theano`` module and many are assumed to be
read-only.
*As a rule, the attributes in this module should not be modified by user code.*
Theano's code comes with default values for these attributes, but you can
override them from your .theanorc file, and override those values in turn by
the :envvar:`THEANO_FLAGS` environment variable.
The order of precedence is:
1. an assignment to theano.config.<property>
2. an assignment in :envvar:`THEANO_FLAGS`
3. an assignment in the .theanorc file (or the file indicated in :envvar:`THEANORC`)
You can print out the current/effective configuration at any time by printing
theano.config. For example, to see a list of all active configuration
variables, type this from the command-line:
.. code-block:: bash

    python -c 'import theano; print theano.config' | less
-------------------------------------------
**Exercise**
Consider once again the logistic regression:
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction that is done: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w**2).sum()   # The cost to optimize
    gw, gb = T.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
              inputs=[x, y],
              outputs=[prediction, xent],
              updates={w: w - 0.01*gw, b: b - 0.01*gb},
              name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])
Modify and execute this example to run on CPU (the default) with floatX=float32 and
time with ``time python file.py``.
You will need to use ``theano.config.floatX`` and ``ndarray.astype(...)``.
.. Note::

    * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
    * Cast inputs before storing them into a shared variable.
    * Circumvent the automatic cast of int32 with float32 to float64:

      * Insert manual casts in your code or use [u]int{8,16}.
      * Insert manual casts around the mean operator (this involves division by the length, which is an int64).
      * A new casting mechanism is being developed.
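The int32-with-float32 pitfall in the note can be seen directly with NumPy's promotion rules (a plain NumPy sketch; Theano inherits the same convention):

```python
import numpy as np

i32 = np.arange(3, dtype=np.int32)
f32 = np.ones(3, dtype=np.float32)

# Mixing int32 with float32 silently promotes the result to float64 ...
print((i32 * f32).dtype)  # float64

# ... while a manual cast, or a small integer type, keeps float32.
print((i32.astype(np.float32) * f32).dtype)  # float32
print((i32.astype(np.int16) * f32).dtype)    # float32
```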
-------------------------------------------
Mode
====
......
......@@ -5,6 +5,10 @@
Graph Structures
================
Theano Graphs
=============
Debugging or profiling code written in Theano is not that simple if you
do not know what goes on under the hood. This chapter is meant to
introduce you to a required minimum of the inner workings of Theano,
......@@ -136,3 +140,23 @@ twice or reformulate parts of the graph to a GPU specific version.
For example, one (simple) optimization that Theano uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
**Example**
Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a**10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0,1,2]) # prints `array([0,2,1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming involves a paradigm shift: people need to use it to understand it.
......@@ -191,7 +191,7 @@ mistake by failing to account for the resulting memory aliasing.
What can be accelerated on the GPU?
-----------------------------------
The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
......@@ -217,7 +217,7 @@ what to expect right now:
Tips for improving performance on GPU
-------------------------------------
* Consider
adding ``floatX = float32`` to your .theanorc file if you plan to do a lot of
......@@ -251,3 +251,261 @@ Changing the value of shared variables
To change the value of a shared variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
-------------------------------------------
**Exercise**
Consider the logistic regression:
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction that is done: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w**2).sum()   # The cost to optimize
    gw, gb = T.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
              inputs=[x, y],
              outputs=[prediction, xent],
              updates={w: w - 0.01*gw, b: b - 0.01*gb},
              name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])
* Modify and execute this example to run on GPU with floatX=float32 and
  time it with ``time python file.py``.
* Is there an increase in speed from CPU to GPU?
* Where does it come from? (Use ProfileMode)
* What can be done to further increase the speed of the GPU version?
.. Note::

    * Only 32-bit floats are supported (support for other dtypes is being worked on).
    * Only 1 GPU per process.
    * Use the Theano flag ``device=gpu`` to request the GPU device.
    * Use ``device=gpu{0, 1, ...}`` to specify which GPU to use, if you have more than one.
    * Shared variables with float32 dtype are by default moved to the GPU memory space.
    * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
    * Cast inputs before storing them into a shared variable.
    * Circumvent the automatic cast of int32 with float32 to float64:

      * Insert manual casts in your code or use [u]int{8,16}.
      * Insert manual casts around the mean operator (this involves a division by the length, which is an int64).
      * A new casting mechanism is being developed.
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Theano is a meta-programmer, so it doesn't really count.
* CUDA: C extension by NVIDIA

  * Vendor-specific.
  * Numeric libraries (BLAS, RNG, FFT) are maturing.

* OpenCL: multi-vendor version of CUDA

  * More general, standardized.
  * Fewer libraries, less adoption.
* PyCUDA: Python bindings to the CUDA driver interface

  * Python interface to CUDA: access Nvidia's CUDA parallel computation API from Python.
  * Convenience: makes it easy to do GPU meta-programming from within Python
    (i.e. abstractions to compile low-level CUDA code from Python:
    ``pycuda.driver.SourceModule``). Helpful documentation.
  * Completeness: binding to all of CUDA's driver API.
  * Automatic error checking: all CUDA errors are automatically translated into Python exceptions.
  * Speed: PyCUDA's base layer is written in C++.

* Memory management of GPU objects:

  * GPU memory buffer: ``pycuda.gpuarray.GPUArray``.
  * Object cleanup is tied to the lifetime of the objects (RAII, Resource Acquisition Is
    Initialization), which makes it much easier to write correct, leak- and crash-free code.
  * PyCUDA knows about dependencies (e.g. it won't detach from a context
    before all memory allocated in it is also freed).
* PyOpenCL: PyCUDA for OpenCL
Example: PyCUDA
---------------
.. code-block:: python

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy

    from pycuda.compiler import SourceModule
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)

    dest = numpy.zeros_like(a)
    multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

    assert numpy.allclose(dest, a * b)
    print dest
-------------------------------------------
**Exercise**
- Run the preceding example.
- Modify and execute it to work for a matrix of 20 x 10.
-------------------------------------------
.. _pyCUDA_theano:
Example: Theano + PyCUDA
------------------------
.. code-block:: python

    import numpy

    import theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda

    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)

        def __hash__(self):
            return hash(type(self))

        def __str__(self):
            return self.__class__.__name__

        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
                cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])

        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
            __global__ void my_fct(float * i0, float * o0, int size) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                if (i < size) {
                    o0[i] = i0[i] * 2;
                }
            }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [storage_map[v] for v in node.inputs]
            outputs = [storage_map[v] for v in node.outputs]

            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape != inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512, 1, 1), grid=grid)
            return thunk
Test it:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv=numpy.ones((4,5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
-------------------------------------------
**Exercise**
- Run the preceding example.
- Modify and execute the example to multiply two matrices: x * y.
- Modify and execute the example to return two outputs: x + y and x - y.

  - Our current elemwise fusion generates computation with only one output.

- Modify and execute the example to support strides (i.e. to avoid constraining the input to be C contiguous).
-------------------------------------------