First version of the CREI tutorial.

2f506f14 · Frederic Bastien · d5baff99
--- a/doc/crei2013/advanced_theano.txt
+++ b/doc/crei2013/advanced_theano.txt
+.. _advanced_theano:
+***************
+Advanced Theano
+***************
+Conditions
+----------
+**IfElse**
+- Build conditions over symbolic variables.
+- The IfElse Op takes a boolean condition and two variables as input.
+- While the Switch Op evaluates both 'output' variables, the IfElse Op is lazy
+  and evaluates only the one selected by the condition.
+**IfElse Example: Comparison with Switch**
+.. code-block:: python
+  from theano import tensor as T
+  from theano.ifelse import ifelse
+  import theano, time, numpy
+  a,b = T.scalars('a','b')
+  x,y = T.matrices('x','y')
+  z_switch = T.switch(T.lt(a,b), T.mean(x), T.mean(y))
+  z_lazy = ifelse(T.lt(a,b), T.mean(x), T.mean(y))
+  f_switch = theano.function([a,b,x,y], z_switch, 
+                      mode=theano.Mode(linker='vm'))
+  f_lazyifelse = theano.function([a,b,x,y], z_lazy,
+                      mode=theano.Mode(linker='vm'))
+  val1 = 0.
+  val2 = 1.
+  big_mat1 = numpy.ones((10000,1000))
+  big_mat2 = numpy.ones((10000,1000))
+  n_times = 10
+  tic = time.clock()
+  for i in xrange(n_times):
+      f_switch(val1, val2, big_mat1, big_mat2)
+  print 'time spent evaluating both values %f sec'%(time.clock()-tic)
+  tic = time.clock()
+  for i in xrange(n_times):
+      f_lazyifelse(val1, val2, big_mat1, big_mat2)
+  print 'time spent evaluating one value %f sec'%(time.clock()-tic)
+The IfElse Op spends less time (about half) than Switch, since it computes only
+one variable instead of both.
+>>> python ifelse_switch.py
+time spent evaluating both values 0.6700 sec
+time spent evaluating one value 0.3500 sec
+Note that the IfElse condition is a boolean while the Switch condition is a
+tensor, so Switch is more general.
+It is important to use ``linker='vm'`` or ``linker='cvm'``; otherwise IfElse
+will compute both variables and take the same computation time as the Switch
+Op. The linker is not currently set to 'cvm' by default, but it will be in the
+near future.
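The difference between eager and lazy branch evaluation can be sketched in plain Python. This is only an analogy, not Theano code: ``expensive_mean``, ``eager_switch`` and ``lazy_ifelse`` are hypothetical helpers standing in for ``T.mean``, ``T.switch`` and ``ifelse`` under a lazy linker.

```python
# Plain-Python analogy (not Theano): count how many branches get computed.
calls = {"count": 0}


def expensive_mean(values):
    """Stand-in for T.mean(x); records every evaluation."""
    calls["count"] += 1
    return sum(values) / float(len(values))


def eager_switch(cond, if_true, if_false):
    # Like Switch: both arguments were already computed by the caller.
    return if_true if cond else if_false


def lazy_ifelse(cond, true_thunk, false_thunk):
    # Like IfElse with a lazy linker: only the selected thunk is called.
    return true_thunk() if cond else false_thunk()


x = [1.0] * 1000
y = [2.0] * 1000

calls["count"] = 0
z_switch = eager_switch(0.0 < 1.0, expensive_mean(x), expensive_mean(y))
switch_evaluations = calls["count"]

calls["count"] = 0
z_lazy = lazy_ifelse(0.0 < 1.0,
                     lambda: expensive_mean(x),
                     lambda: expensive_mean(y))
lazy_evaluations = calls["count"]

print(z_switch, switch_evaluations)  # 1.0 2
print(z_lazy, lazy_evaluations)      # 1.0 1
```

Both calls return the same value, but the eager version pays for two means and the lazy version for one, which matches the roughly 2x timing gap reported above.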
+Loops
+-----
+**Scan**
+- General form of **recurrence**, which can be used for looping.
+- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan.
+- You 'scan' a function along some input sequence, producing an output at each time-step.
+- The function can see the **previous K time-steps** of its own output.
+- ``sum()`` could be computed by scanning the ``z + x(i)`` function over a list, given an initial state of ``z=0``.
+- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
+- Advantages of using ``scan`` over for loops:
+  - Allows the number of iterations to be part of the symbolic graph
+  - Minimizes GPU transfers if a GPU is involved
+  - Computes gradients through sequential steps
+  - Slightly faster than using a for loop in Python with a compiled Theano function
+  - Can lower the overall memory usage by detecting the actual amount of memory needed
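The "sum as a scan" idea above can be sketched without Theano. This is a minimal sketch using only the standard library; ``scan_sketch`` is a hypothetical helper, not part of Theano, and it handles only the simplest case of one sequence and one state.

```python
from itertools import accumulate


def scan_sketch(fn, sequence, initial_state):
    """Apply fn(state, x) along `sequence`; return the state after every step."""
    # accumulate() yields the initial state first; drop it so that there is
    # exactly one output per time-step.
    states = list(accumulate(sequence, fn, initial=initial_state))
    return states[1:]


xs = [1, 2, 3, 4]
partial_sums = scan_sketch(lambda z, x: z + x, xs, 0)

print(partial_sums)      # [1, 3, 6, 10]
print(partial_sums[-1])  # 10, the same as sum(xs)
```

Keeping only the last output turns the scan into a reduction, which is the pattern the ``pow(A,k)`` example below uses with ``result[-1]``.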
+**Scan Example: Computing pow(A,k)**
+.. code-block:: python
+  import theano
+  import theano.tensor as T
+  k = T.iscalar("k"); A = T.vector("A")
+  def inner_fct(prior_result, A): return prior_result * A
+  # Symbolic description of the result
+  result, updates = theano.scan(fn=inner_fct,
+                              outputs_info=T.ones_like(A),
+                              non_sequences=A, n_steps=k)
+  # Scan has provided us with A**1 through A**k.  Keep only the last
+  # value. Scan notices this and does not waste memory saving them.
+  final_result = result[-1]
+  power = theano.function(inputs=[A,k], outputs=final_result,
+                        updates=updates)
+  print power(range(10),2)
+  #[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]
+**Scan Example: Calculating a Polynomial**
+.. code-block:: python
+  import numpy
+  import theano
+  import theano.tensor as T
+  coefficients = theano.tensor.vector("coefficients")
+  x = T.scalar("x"); max_coefficients_supported = 10000
+  # Generate the components of the polynomial
+  full_range=theano.tensor.arange(max_coefficients_supported)
+  components, updates = theano.scan(fn=lambda coeff, power, free_var:
+                                     coeff * (free_var ** power),
+                                  outputs_info=None,
+                                  sequences=[coefficients, full_range],
+                                  non_sequences=x)
+  polynomial = components.sum()
+  calculate_polynomial = theano.function(inputs=[coefficients, x],
+                                       outputs=polynomial)
+  test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
+  print calculate_polynomial(test_coeff, 3)
+  # 19.0
+Exercise 4
+-----------
+- Run both examples 
+- Modify and execute the polynomial example to have the reduction done by scan
+Compilation pipeline
+--------------------
+.. image:: ../hpcs2011_tutorial/pics/pipeline.png
+   :width: 400 px
+Inplace optimization
+--------------------
+- Two types of inplace operations:
+  - An op that returns a view on its inputs (e.g. reshape, inplace transpose)
+  - An op that writes its output into the memory space of its inputs
+- This allows some memory optimizations
+- An Op must tell Theano if it works inplace
+- Inplace Ops add constraints to the order of execution
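The two kinds of inplace behaviour listed above have direct NumPy counterparts, which may make them easier to picture. The following is a NumPy sketch of the idea only; it does not show how Theano Ops declare views or destructive outputs.

```python
import numpy as np

a = np.arange(6, dtype="float64")

# View: reshaping a contiguous array returns a new array object that
# shares memory with `a` instead of copying it.
v = a.reshape(2, 3)
shares = np.shares_memory(a, v)

# In-place: the result of a + 1 is written into a's own buffer.
np.add(a, 1.0, out=a)

# Because v is a view, it observes the in-place update.
first_row = v[0].tolist()

# Ordering constraint: anything that needs the old values must run
# before the in-place write overwrites them.
b = np.arange(3, dtype="float64")
old_sum = b.sum()            # has to happen first
np.multiply(b, 2.0, out=b)   # destroys the old contents of b
new_sum = b.sum()

print(shares, first_row)  # True [1.0, 2.0, 3.0]
print(old_sum, new_sum)   # 3.0 6.0
```

The last part illustrates why inplace Ops constrain execution order: reading ``b`` after the destructive multiply gives a different answer than reading it before.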
+Profiling
+---------
+- To profile, replace the default mode with ProfileMode using the Theano flag ``mode=ProfileMode``
+- To also enable memory profiling, use the flag ``ProfileMode.profile_memory=True``
+Theano output:
+.. code-block:: python
+    """
+    Time since import 33.456s
+    Theano compile time: 1.023s (3.1% since import)
+      Optimization time: 0.789s
+      Linker time: 0.221s
+    Theano fct call 30.878s (92.3% since import)
+     Theano Op time 29.411s 87.9%(since import) 95.3%(of fct call)
+     Theano function overhead in ProfileMode 1.466s 4.4%(since import)
+                                                  4.7%(of fct call)
+    10001 Theano fct call, 0.003s per call
+    Rest of the time since import 1.555s 4.6%
+    Theano fct summary:
+    <% total fct time> <total time> <time per call> <nb call> <fct name>
+     100.0% 30.877s 3.09e-03s 10000 train
+      0.0% 0.000s 4.06e-04s 1 predict
+    Single Op-wise summary:
+    <% of local_time spent on this kind of Op> <cumulative %>
+        <self seconds> <cumulative seconds> <time per call> <nb_call>
+        <nb_op> <nb_apply> <Op name>
+       87.3%   87.3%  25.672s  25.672s  2.57e-03s   10000  1  1 <Gemv>
+        9.7%   97.0%  2.843s  28.515s  2.84e-04s   10001  1  2 <Dot>
+        2.4%   99.3%  0.691s  29.206s  7.68e-06s * 90001 10 10 <Elemwise>
+        0.4%   99.7%  0.127s  29.334s  1.27e-05s   10000  1  1 <Alloc>
+        0.2%   99.9%  0.053s  29.386s  1.75e-06s * 30001  2  4 <DimShuffle>
+        0.0%  100.0%  0.014s  29.400s  1.40e-06s * 10000  1  1 <Sum>
+        0.0%  100.0%  0.011s  29.411s  1.10e-06s * 10000  1  1 <Shape_i>
+    (*) Op is running a c implementation
+    Op-wise summary:
+    <% of local_time spent on this kind of Op> <cumulative %>
+        <self seconds> <cumulative seconds> <time per call>
+        <nb_call> <nb apply> <Op name>
+       87.3%   87.3%  25.672s  25.672s  2.57e-03s   10000  1 Gemv{inplace}
+        9.7%   97.0%  2.843s  28.515s  2.84e-04s   10001  2 dot
+        1.3%   98.2%  0.378s  28.893s  3.78e-05s * 10000  1 Elemwise{Composite{scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}
+        0.4%   98.7%  0.127s  29.021s  1.27e-05s   10000  1 Alloc
+        0.3%   99.0%  0.092s  29.112s  9.16e-06s * 10000  1 Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)]
+        0.1%   99.3%  0.033s  29.265s  1.66e-06s * 20001  3 InplaceDimShuffle{x}
+       ... (remaining 11 Apply account for 0.7%(0.00s) of the runtime)
+    (*) Op is running a c implementation
+    Apply-wise summary:
+    <% of local_time spent at this position> <cumulative %%>
+        <apply time> <cumulative seconds> <time per call>
+        <nb_call> <Apply position> <Apply Op name>
+       87.3%   87.3%  25.672s  25.672s 2.57e-03s  10000  15 Gemv{inplace}(w, TensorConstant{-0.01}, InplaceDimShuffle{1,0}.0, Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)].0, TensorConstant{0.9998})
+        9.7%   97.0%  2.843s  28.515s 2.84e-04s  10000   1 dot(x, w)
+        1.3%   98.2%  0.378s  28.893s 3.78e-05s  10000   9 Elemwise{Composite{scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}(y, Elemwise{Composite{neg,sub}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
+        0.4%   98.7%  0.127s  29.020s 1.27e-05s  10000  10 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
+        0.3%   99.0%  0.092s  29.112s 9.16e-06s  10000  13 Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0,0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}, _op_use_c_code=True}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{neg,sub}}[(0,0)].0, Elemwise{sub,no_inplace}.0, InplaceDimShuffle{x}.0)
+        0.3%   99.3%  0.080s  29.192s 7.99e-06s  10000  11 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}, _op_use_c_code=True}}[(0, 0)](Elemwise{neg,no_inplace}.0)
+       ... (remaining 14 Apply instances account for
+           0.7%(0.00s) of the runtime)
+    Profile of Theano functions memory:
+    (This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
+    Theano fct: train
+        Max without gc, inplace and view (KB) 2481
+        Max FAST_RUN_NO_GC (KB) 16
+        Max FAST_RUN (KB) 16
+        Memory saved by view (KB) 2450
+        Memory saved by inplace (KB) 15
+        Memory saved by GC (KB) 0
+        <Sum apply outputs (bytes)> <Apply outputs memory size(bytes)>
+            <created/inplace/view> <Apply node>
+        <created/inplace/view> is taked from the op declaration, not ...
+             2508800B  [2508800] v InplaceDimShuffle{1,0}(x)
+                6272B  [6272] i Gemv{inplace}(w, ...)
+                3200B  [3200] c Elemwise{Composite{...}}(y, ...)
+    Here are tips to potentially make your code run faster (if you think of new ones, suggest them on the mailing list).
+    Test them first, as they are not guaranteed to always provide a speedup.
+      - Try the Theano flag floatX=float32
+    """
+Exercise 5
+-----------
+- In the last exercises, do you see a speed up with the GPU?
+- Where does it come from? (Use ProfileMode)
+- Is there something we can do to speed up the GPU version?
+Printing/Drawing Theano graphs
+------------------------------
+- Pretty Printing
+``theano.printing.pprint(variable)``
+>>> theano.printing.pprint(prediction)
+gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),TensorConstant{0.5})
+- Debug Print
+``theano.printing.debugprint({fct, variable, list of variables})``
+>>> theano.printing.debugprint(prediction)
+Elemwise{gt,no_inplace} [@181772236] ''
+ |Elemwise{true_div,no_inplace} [@181746668] ''
+ | |InplaceDimShuffle{x} [@181746412] ''
+ | | |TensorConstant{1} [@181745836]
+ | |Elemwise{add,no_inplace} [@181745644] ''
+ | | |InplaceDimShuffle{x} [@181745420] ''
+ | | | |TensorConstant{1} [@181744844]
+ | | |Elemwise{exp,no_inplace} [@181744652] ''
+ | | | |Elemwise{sub,no_inplace} [@181744012] ''
+ | | | | |Elemwise{neg,no_inplace} [@181730764] ''
+ | | | | | |dot [@181729676] ''
+ | | | | | | |x [@181563948]
+ | | | | | | |w [@181729964]
+ | | | | |InplaceDimShuffle{x} [@181743788] ''
+ | | | | | |b [@181730156]
+ |InplaceDimShuffle{x} [@181771788] ''
+ | |TensorConstant{0.5} [@181771148]
+>>> theano.printing.debugprint(predict)
+Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] ''   2
+ |dot [@183018796] ''   1
+ | |x [@183000780]
+ | |w [@183000812]
+ |InplaceDimShuffle{x} [@183133580] ''   0
+ | |b [@183000876]
+ |TensorConstant{[ 0.5]} [@183084108]
+- Picture Printing of Graphs
+>>> theano.printing.pydotprint_variables(prediction)
+.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_prediction.png
+   :width: 800 px
+All pydotprint* functions require graphviz and pydot.
+>>> theano.printing.pydotprint(predict)
+.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_predic.png
+   :width: 800 px
+>>> theano.printing.pydotprint(train) # This is a small train example!
+.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_train.png
+   :width: 1500 px
+Debugging
+---------
+- Run with the Theano flag ``compute_test_value = {'off', 'ignore', 'warn', 'raise'}``
+  - Runs the code as the graph is created
+  - Allows you to find bugs earlier (e.g. shape mismatch)
+  - Makes it easier to identify where the problem is in *your* code
+  - Uses the value of constants and shared variables directly
+  - For purely symbolic variables, set ``x.tag.test_value = numpy.random.rand(5,10)``
+- Run with the flag ``mode=FAST_COMPILE``
+  - Few optimizations
+  - Runs Python code (better error messages and can be debugged interactively in the Python debugger)
+- Run with the flag ``mode=DebugMode``
+  - 100-1000x slower
+  - Tests all optimization steps from the original graph to the final graph
+  - Checks many things that an Op should/shouldn't do
+  - Executes both the Python and C code versions
+Known limitations
+-----------------
+- Compilation phase distinct from execution phase
+  - Use ``a_tensor_variable.eval()`` to make this less visible
+- Compilation time can be significant
+  - Amortize it with functions over big inputs, or reuse functions
+- Execution overhead
+  - We have worked on this, but more work is needed
+  - A function needs a certain number of operations to be worthwhile
+- Compilation time is superlinear in the size of the graph
+  - A few hundred nodes is fine
+  - Disabling a few optimizations can speed up compilation
+  - Too many nodes usually indicates a problem with the graph
--- a/doc/crei2013/index.txt
+++ b/doc/crei2013/index.txt
+.. _crei2013_index:
+===========================
+Theano Tutorial @ CREI 2013
+===========================
+July 19, 2013, Sherbrooke, Québec, Canada.
+Theano is Python software for evaluating complicated array expressions.
+What does it do?
+ * aggressive expression optimizations,
+ * automatic GPU use,
+ * symbolic differentiation and R op.
+It complements the Python numeric/scientific software stack (e.g. NumPy, SciPy,
+scikits, matplotlib, PIL).
+Its design and feature set have been driven by machine learning research
+at the University of
+Montreal (groups of Yoshua Bengio, Pascal Vincent, Aaron Courville and Roland Memisevic).
+The result is a very good library for doing research in deep
+learning and neural network training, and a flexible framework for
+many other models and algorithms in machine learning more generally.
+It has proven to be useful for implementing:
+ - linear and nonlinear neural network classifiers
+ - convolutional models
+ - Energy models: RBM, DBN, GRBM, ssRBM, AIS
+ - Auto-encoders: DAE, CAE
+ - GP regression
+ - sparse coding
+ - recurrent neural networks, echo state, (HMM?)
+ - online and batch learning and optimization
+ - Even SVM!
+As people's needs change this list will grow, but Theano is built
+around vector, matrix, and tensor expressions; there is little reason
+to use it for calculations on other data structures. There is
+also some sparse matrix support.
+Contents
+--------
+The structured part of these lab sessions will be a walk-through of the following
+material. Interleaved with this structured part will be blocks of time for
+individual or group work.  The idea is that you can try out Theano and get help
+from gurus on hand if you get stuck.
+.. toctree::
+    introduction
+    theano
+    advanced_theano
+    /tutorial/extending_theano
+    pyCUDA
+    gpundarray
--- a/doc/crei2013/introduction.txt
+++ b/doc/crei2013/introduction.txt
+.. _cifarSS2011_Introduction:
+************
+Introduction
+************
+Background Questionnaire
+------------------------
+* Who has used Theano before?
+ * What did you do with it?
+* Who has used Python? NumPy? SciPy? matplotlib?
+* Who has used IPython?
+ * Who has used it as a distributed computing engine?
+* Who has done C/C++ programming?
+* Who has organized computation around a particular physical memory layout?
+* Who has used a multidimensional array of >2 dimensions?
+* Who has written a Python module in C before?
+ * Who has written a program to *generate* Python modules in C?
+* Who has used a templating engine?
+* Who has programmed a GPU before?
+ * Using OpenGL / shaders ?
+ * Using CUDA (runtime? / driver?)
+ * Using PyCUDA ?
+ * Using OpenCL / PyOpenCL ?
+ * Using cudamat / gnumpy ?
+ * Other?
+* Who has used Cython?
+Python in one slide
+-------------------
+* General-purpose high-level OO interpreted language
+* Emphasizes code readability
+* Comprehensive standard library
+* Dynamic type and memory management
+* Built-in types: int, float, str, list, dict, tuple, object
+* Slow execution
+* Popular in web-dev and scientific communities
+.. code-block:: python
+    #######################
+    # PYTHON SYNTAX EXAMPLE
+    #######################
+    a = 1                     # no type declaration required!
+    b = (1, 2, 3)             # tuple of three int literals
+    c = [1, 2, 3]             # list of three int literals
+    d = {'a': 5, b: None}     # dictionary of two elements
+                              # N.B. string literal, None
+    print d['a']              # square brackets index
+    # -> 5
+    print d[(1, 2, 3)]        # new tuple == b, retrieves None
+    # -> None
+    print d[6]
+    # raises KeyError Exception
+    x, y, z = 10, 100, 100    # multiple assignment from tuple
+    x, y, z = b               # unpacking a sequence
+    b_squared = [b_i**2 for b_i in b]  # list comprehension
+    def foo(b, c=3):          # function w default param c
+        return a + b + c      # note scoping, indentation
+    foo(5)                    # calling a function
+    # -> 1 + 5 + 3 == 9       # N.B. scoping
+    foo(b=6, c=2)             # calling with named args
+    # -> 1 + 6 + 2 == 9
+    print b[1:3]              # slicing syntax
+    class Foo(object):        # Defining a class
+        def __init__(self):
+            self.a = 5
+        def hello(self):
+            return self.a
+    f = Foo()                 # Creating a class instance
+    print f.hello()           # Calling methods of objects
+    # -> 5 
+    class Bar(Foo):           # Defining a subclass
+        def __init__(self, a):
+            self.a = a
+    print Bar(99).hello()     # Creating an instance of Bar
+    # -> 99
+NumPy in one slide
+------------------
+* Python floats are full-fledged objects on the heap
+ * Not suitable for high-performance computing!
+* NumPy provides an N-dimensional numeric array in Python
+ * Perfect for high-performance computing.
+ * Slices return views (no copy)
+* NumPy provides
+ * elementwise computations
+ * linear algebra, Fourier transforms
+ * pseudorandom numbers from many distributions
+* SciPy provides lots more, including
+ * more linear algebra
+ * solvers and optimization algorithms
+ * matlab-compatible I/O
+ * I/O and signal processing for images and audio
+.. code-block:: python
+    ##############################
+    # Properties of NumPy arrays
+    # that you really need to know
+    ##############################
+    import numpy as np          # import can rename
+    a = np.random.rand(3, 4, 5) # random generators
+    a32 = a.astype('float32')   # arrays are strongly typed
+    a.ndim                      # int: 3
+    a.shape                     # tuple: (3, 4, 5)
+    a.size                      # int: 60
+    a.dtype                     # np.dtype object: 'float64'
+    a32.dtype                   # np.dtype object: 'float32'
+    assert a[1, 1, 1] != 10     # rand() draws values in [0, 1)
+    a[1, 1, 1] = 10             # indexed assignment modifies
+    assert a[1, 1, 1] == 10     # the original array in place
+Arrays can be combined with numeric operators and standard mathematical
+functions. NumPy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
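As a small illustration of combining arrays with operators and mathematical functions, here is a sketch with made-up values (the arrays ``a`` and ``b`` are not from the tutorial data):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([10.0, 20.0])

total = a + b           # broadcasting: b is added to every row of a
scaled = 2.0 * a ** 2   # arithmetic operators apply elementwise
squashed = np.tanh(a)   # so do math functions; the shape is preserved

print(total.tolist())   # [[11.0, 22.0], [13.0, 24.0]]
print(scaled.tolist())  # [[2.0, 8.0], [18.0, 32.0]]
print(squashed.shape)   # (2, 2)
```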
+Training an MNIST-ready classification neural network in pure NumPy might look like this:
+.. code-block:: python
+    #########################
+    # NumPy for Training a
+    # Neural Network on MNIST
+    #########################
+    x = np.load('data_x.npy')
+    y = np.load('data_y.npy')
+    w = np.random.normal(
+        loc=0,
+        scale=.1,
+        size=(784, 500))
+    b = np.zeros((500,))
+    v = np.zeros((500, 10))
+    c = np.zeros((10,))
+    batchsize = 100
+    lr = 0.01  # learning rate (assumed value; the original leaves lr undefined)
+    for i in xrange(1000):
+        x_i = x[i * batchsize: (i + 1) * batchsize]
+        y_i = y[i * batchsize: (i + 1) * batchsize]
+        hidin = np.dot(x_i, w) + b
+        hidout = np.tanh(hidin)
+        outin = np.dot(hidout, v) + c
+        outout = (np.tanh(outin) + 1) / 2.0
+        g_outout = outout - y_i
+        err = 0.5 * np.sum(g_outout ** 2)
+        g_outin = g_outout * outout * (1.0 - outout)
+        g_hidout = np.dot(g_outin, v.T)
+        g_hidin = g_hidout * (1 - hidout ** 2)
+        b -= lr * np.sum(g_hidin, axis=0)
+        c -= lr * np.sum(g_outin, axis=0)
+        w -= lr * np.dot(x_i.T, g_hidin)
+        v -= lr * np.dot(hidout.T, g_outin)
+What's missing?
+---------------
+* Non-lazy evaluation (required by Python) hurts performance
+* NumPy is bound to the CPU
+* NumPy lacks symbolic or automatic differentiation
+Now let's have a look at the same algorithm in Theano, which runs 15 times faster if
+you have a GPU (I'm skipping some dtype-details which we'll come back to).
+.. code-block:: python
+    #########################
+    # Theano for Training a
+    # Neural Network on MNIST
+    #########################
+    import numpy as np
+    import theano
+    import theano.tensor as tensor
+    x = np.load('data_x.npy')
+    y = np.load('data_y.npy')
+    # symbol declarations
+    sx = tensor.matrix()
+    sy = tensor.matrix()
+    w = theano.shared(np.random.normal(loc=0, scale=.1,
+                                       size=(784, 500)))
+    b = theano.shared(np.zeros(500))
+    v = theano.shared(np.zeros((500, 10)))
+    c = theano.shared(np.zeros(10))
+    lr = 0.01  # learning rate (assumed value; the original leaves lr undefined)
+    # symbolic expression-building
+    hid = tensor.tanh(tensor.dot(sx, w) + b)
+    out = tensor.tanh(tensor.dot(hid, v) + c)
+    err = 0.5 * tensor.sum((out - sy) ** 2)
+    gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])
+    # compile a fast training function
+    train = theano.function([sx, sy], err,
+        updates={
+            w: w - lr * gw,
+            b: b - lr * gb,
+            v: v - lr * gv,
+            c: c - lr * gc})
+    # now do the computations
+    batchsize = 100
+    for i in xrange(1000):
+        x_i = x[i * batchsize: (i + 1) * batchsize]
+        y_i = y[i * batchsize: (i + 1) * batchsize]
+        err_i = train(x_i, y_i)
+Theano in one slide
+-------------------
+* High-level domain-specific language tailored to numeric computation
+* Compiles most common expressions to C for CPU and GPU.
+* Limited expressivity means lots of opportunities for expression-level optimizations
+ * No function call -> global optimization
+ * Strongly typed -> compiles to machine instructions
+ * Array oriented -> parallelizable across cores
+ * Support for looping and branching in expressions
+* Expression substitution optimizations automatically draw
+  on many backend technologies for best performance.
+ * FFTW, MKL, ATLAS, SciPy, Cython, CUDA
+ * Slower fallbacks always available
+* Automatic differentiation and R op
+* Sparse matrices
+Project status
+--------------
+* Mature: Theano has been developed and used since January 2008 (5.5 yrs old)
+* Has driven over 87 research papers
+* Good user documentation
+* Active mailing list with participants from outside our lab
+* Core technology for a funded Silicon-Valley startup
+* Many contributors (some from outside our lab)
+* Used to teach IFT6266 for many years
+* Used for research at Google and Yahoo.
+* Downloads (January 2011 - June 8 2011):
+ * PyPI (16 July 2013): 60k total, 159 last day, 823 last week
+ * GitHub (`bleeding edge` repository): unknown
+TODO: Do I keep the GPU section?
+Why scripting for GPUs?
+-----------------------
+They *Complement each other*:
+* GPUs are everything that scripting/high level languages are not
+ * Highly parallel
+ * Very architecture-sensitive
+ * Built for maximum FP/memory throughput
+ * So hard to program that meta-programming is easier.
+* CPU: largely restricted to control
+ * Optimized for sequential code and low latency (rather than high throughput)
+ * Tasks (1000/sec)
+ * Scripting fast enough
+Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
+How Fast are GPUs?
+------------------
+* Theory
+ * Intel Core i7 980 XE (107Gf/s float64) 6 cores
+ * NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32) 480 cores
+ * NVIDIA GTX580 (1.5Tf/s float32) 512 cores
+ * GPUs are faster, cheaper, more power-efficient
+* Practice (our experience)
+ * Depends on algorithm and implementation!
+ * Reported speed improvements over CPU in lit. vary *widely* (.01x to 1000x)
+ * Matrix-matrix multiply speedup: usually about 10-20x.
+ * Convolution speedup: usually about 15x.
+ * Elemwise speedup: slower or up to 100x (depending on operation and layout)
+ * Sum: can be faster or slower depending on layout.
+* Benchmarking is delicate work...
+ * How to control quality of implementation?
+  * How much time was spent optimizing CPU vs GPU code?
+ * Theano can be up to 100x faster on GPU, partly because the CPU baseline uses only one core
+ * Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
+* If you see speedup > 100x, the benchmark is probably not fair.
+Software for Directly Programming a GPU
+---------------------------------------
+Theano is a meta-programmer, so it doesn't really count.
+* CUDA: C extension by NVIDIA 
+ * Vendor-specific
+ * Numeric libraries (BLAS, RNG, FFT) maturing.
+* OpenCL: multi-vendor version of CUDA
+ * More general, standardized
+ * Fewer libraries, less adoption.
+* PyCUDA: python bindings to CUDA driver interface
+ * Python interface to CUDA
+ * Memory management of GPU objects
+ * Compilation of code for the low-level driver
+ * Makes it easy to do GPU meta-programming from within Python
+* PyOpenCL: PyCUDA for OpenCL
--- a/doc/crei2013/theano.txt
+++ b/doc/crei2013/theano.txt
+.. _theano:
+******
+Theano
+******
+Pointers
+--------
+* http://deeplearning.net/software/theano/
+* Announcements mailing list: http://groups.google.com/group/theano-announce
+* User mailing list: http://groups.google.com/group/theano-users
+* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
+* Installation: https://deeplearning.net/software/theano/install.html
+Description
+-----------
+* Mathematical symbolic expression compiler
+* Dynamic C/CUDA code generation
+* Efficient symbolic differentiation
+  * Theano computes derivatives of functions with one or many inputs.
+* Speed and stability optimizations
+  * Gives the right answer for ``log(1+x)`` even if x is really tiny.
+* Works on Linux, Mac and Windows
+* Transparent use of a GPU
+  * float32 only for now (working on other data types)
+  * Still in experimental state on Windows
+  * On GPU data-intensive calculations are typically between 6.5x and 44x faster. We've seen speedups up to 140x
+* Extensive unit-testing and self-verification
+  * Detects and diagnoses many types of errors
+* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives
+  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab
+* Expressions mimic NumPy's syntax & semantics
+* Statically typed and purely functional
+* Some sparse operations (CPU only)
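The ``log(1+x)`` stability point above can be reproduced with the standard library alone. This sketch only demonstrates the numerical issue; it assumes nothing about how Theano rewrites the expression internally.

```python
import math

x = 1e-20

# 1.0 + 1e-20 rounds to exactly 1.0 in float64, so the information
# carried by x is lost before the logarithm is even taken.
naive = math.log(1.0 + x)

# log1p computes log(1 + x) without forming the sum 1 + x first.
stable = math.log1p(x)

print(naive)   # 0.0
print(stable)  # approximately 1e-20
```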
+Simple example
+--------------
+>>> import theano
+>>> a = theano.tensor.vector("a")      # declare symbolic variable
+>>> b = a + a ** 10                    # build symbolic expression
+>>> f = theano.function([a], b)        # compile function
+>>> print f([0, 1, 2])                 # prints `array([0, 2, 1026])`
+======================================================  =====================================================
+        Unoptimized graph                                    Optimized graph
+======================================================  =====================================================
+.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png   .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
+======================================================  =====================================================
+Symbolic programming = *Paradigm shift*: people need to use it to understand it.
+Exercise 1
+-----------
+.. code-block:: python
+  import theano
+  a = theano.tensor.vector()      # declare variable
+  out = a + a ** 10               # build symbolic expression
+  f = theano.function([a], out)   # compile function
+  print f([0, 1, 2])
+  # prints `array([0, 2, 1026])`
+  theano.printing.pydotprint_variables(out, outfile="f_unoptimized.png", var_with_name_simple=True)
+  theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)
+Modify and execute the example to compute this expression: ``a ** 2 + b ** 2 + 2 * a * b``
+Real example
+------------
+**Logistic Regression**
+* GPU-ready
+* Symbolic differentiation
+* Speed optimizations
+* Stability optimizations
+.. code-block:: python
+  import numpy
+  import theano
+  import theano.tensor as tt
+  rng = numpy.random
+  N = 400
+  feats = 784
+  D = (rng.randn(N, feats), rng.randint(size=N,low=0, high=2))
+  training_steps = 10000
+  # Declare Theano symbolic variables
+  x = tt.matrix("x")
+  y = tt.vector("y")
+  w = theano.shared(rng.randn(feats), name="w")
+  b = theano.shared(0., name="b")
+  print "Initial model:"
+  print w.get_value(), b.get_value()
+  # Construct Theano expression graph
+  p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))   # Probability that target = 1
+  prediction = p_1 > 0.5                    # The prediction thresholded
+  xent = -y*tt.log(p_1) - (1-y)*tt.log(1-p_1) # Cross-entropy loss function
+  cost = xent.mean() + 0.01 * (w**2).sum()  # The cost to minimize
+  gw,gb = tt.grad(cost, [w, b])
+  # Compile
+  train = theano.function(
+            inputs=[x,y],
+            outputs=[prediction, xent],
+            updates={w: w - 0.1 * gw,
+                     b: b - 0.1 * gb})
+  predict = theano.function(inputs=[x], outputs=prediction)
+  # Train
+  for i in range(training_steps):
+      pred, err = train(D[0], D[1])
+  print "Final model:"
+  print w.get_value(), b.get_value()
+  print "target values for D:", D[1]
+  print "prediction on D:", predict(D[0])
+**Optimizations:**
+Where are those optimizations applied?
+* ``log(1+exp(x))``
+* ``1 / (1 + tt.exp(var))`` (sigmoid)
+* ``log(1-sigmoid(var))`` (softplus, stabilisation)
+* GEMV (matrix-vector multiply from BLAS)
+* Loop fusion
+.. code-block:: python
+  p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
+  # 1 / (1 + tt.exp(var)) -> sigmoid(-var)
+  xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
+  # log(1 - sigmoid(var)) -> -softplus(var)
+  prediction = p_1 > 0.5
+  cost = xent.mean() + 0.01 * (w**2).sum()
+  gw,gb = tt.grad(cost, [w, b])
+  train = theano.function(
+            inputs=[x,y],
+            outputs=[prediction, xent],
+            # w - 0.1 * gw: GEMV with the dot in the grad
+            updates={w: w - 0.1 * gw,
+                     b: b - 0.1 * gb})
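To see why the sigmoid/softplus rewrites listed above matter numerically, here is a standard-library sketch. The function names are hypothetical and the stable formula is one common choice; it is not claimed to be the exact form Theano generates.

```python
import math


def sigmoid(x):
    # Direct formula; only used for a moderate x below.
    return 1.0 / (1.0 + math.exp(-x))


def softplus_naive(x):
    # log(1 + exp(x)) written literally; exp(x) overflows for large x.
    return math.log(1.0 + math.exp(x))


def softplus_stable(x):
    # Algebraically equal to log(1 + exp(x)), but exp() only ever
    # receives a non-positive argument, so it cannot overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


x = 2.0
agree = abs(softplus_naive(x) - softplus_stable(x)) < 1e-12

# The identity behind the rewrite: log(1 - sigmoid(x)) == -softplus(x).
identity_gap = abs(math.log(1.0 - sigmoid(x)) + softplus_stable(x))

try:
    softplus_naive(1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

large = softplus_stable(1000.0)

print(agree, identity_gap < 1e-12)  # True True
print(naive_overflowed, large)      # True 1000.0
```

For moderate inputs both forms agree; for large inputs only the rewritten form returns a finite value, which is the kind of failure the stability optimizations are meant to avoid in the cross-entropy term.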
+Theano flags
+------------
+Theano can be configured with flags. They can be defined in two ways:
+* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
+* With a configuration file that defaults to ``~/.theanorc``
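The two ways of setting flags carry the same information. As a rough sketch, the helper below (hypothetical, not part of Theano) converts a ``THEANO_FLAGS`` string into configuration-file text, assuming the usual convention that un-dotted flags go in a ``[global]`` section and a dotted flag ``section.option`` goes in ``[section]``.

```python
import configparser
import io


def flags_to_theanorc(flags):
    """Convert a THEANO_FLAGS-style string into .theanorc-style text."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep the case of option names
    for item in flags.split(","):
        key, value = item.split("=", 1)
        # Split on the last dot; flags without a dot belong to [global].
        section, _, option = key.rpartition(".")
        section = section or "global"
        if not config.has_section(section):
            config.add_section(section)
        config.set(section, option, value)
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


text = flags_to_theanorc("mode=ProfileMode,ProfileMode.profile_memory=True")
print(text)
```

The printed text contains a ``[global]`` section with ``mode = ProfileMode`` and a ``[ProfileMode]`` section with ``profile_memory = True``, which is the file-based equivalent of the environment variable shown above.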
+Exercise 2
+-----------
+.. code-block:: python
+    import numpy
+    import theano
+    import theano.tensor as tt
+    rng = numpy.random
+    N = 400
+    feats = 784
+    D = (rng.randn(N, feats).astype(theano.config.floatX),
+         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
+    training_steps = 10000
+    # Declare Theano symbolic variables
+    x = tt.matrix("x")
+    y = tt.vector("y")
+    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
+    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
+    x.tag.test_value = D[0]
+    y.tag.test_value = D[1]
+    #print "Initial model:"
+    #print w.get_value(), b.get_value()
+    # Construct Theano expression graph
+    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
+    prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
+    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
+    cost = xent.mean() + 0.01 * (w**2).sum()  # The cost to optimize
+    gw,gb = tt.grad(cost, [w, b])
+    # Compile expressions to functions
+    train = theano.function(
+                inputs=[x, y],
+                outputs=[prediction, xent],
+                updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
+                name="train")
+    predict = theano.function(inputs=[x], outputs=prediction,
+                              name="predict")
+    # Loop over ``node`` so the symbolic input ``x`` is not rebound
+    if any([node.op.__class__.__name__=='Gemv' for node in
+            train.maker.fgraph.toposort()]):
+        print 'Used the cpu'
+    elif any([node.op.__class__.__name__=='GpuGemm' for node in
+              train.maker.fgraph.toposort()]):
+        print 'Used the gpu'
+    else:
+        print 'ERROR, not able to tell if theano used the cpu or the gpu'
+        print train.maker.fgraph.toposort()
+    for i in range(training_steps):
+        pred, err = train(D[0], D[1])
+    #print "Final model:"
+    #print w.get_value(), b.get_value()
+    print "target values for D"
+    print D[1]
+    print "prediction on D"
+    print predict(D[0])
+    # Print the graph used in the slides
+    theano.printing.pydotprint(predict,
+                               outfile="pics/logreg_pydotprint_predic.png",
+                               var_with_name_simple=True)
+    theano.printing.pydotprint_variables(prediction,
+                               outfile="pics/logreg_pydotprint_prediction.png",
+                               var_with_name_simple=True)
+    theano.printing.pydotprint(train,
+                               outfile="pics/logreg_pydotprint_train.png",
+                               var_with_name_simple=True)
+Modify and execute the example to run on CPU with floatX=float32
+* You will need to use: ``theano.config.floatX`` and ``ndarray.astype("str")``
+GPU
+---
+* Only 32-bit floats are supported (other dtypes are being worked on)
+* Only 1 GPU per process. See the wiki page on using multiple processes for multiple GPUs
+* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device
+ * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
+ * Shared variables with float32 dtype are by default moved to the GPU memory space
+* Use the Theano flag ``floatX=float32``
+ * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
+ * Cast inputs before putting them into a shared variable
+ * Cast "problem": combining int32 with float32 upcasts to float64
+  * Insert a manual cast in your code or use [u]int{8,16}
+  * Work is under way to make the mean operator keep its output in float32.
+* Use the Theano flag ``force_device=True``, to exit if Theano isn't able to use a GPU.
+  * In Theano 0.6rc4, the combination of ``force_device=True``
+    and ``device=cpu`` will disable the GPU.
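+The cast "problem" can be reproduced with plain NumPy arrays, whose promotion rules are assumed here to match what Theano does for tensors. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

ints32 = np.arange(3, dtype="int32")
ints16 = np.arange(3, dtype="int16")
floats32 = np.ones(3, dtype="float32")

# int32 cannot be represented exactly in float32, so the result is float64.
mixed32 = ints32 * floats32

# int16 fits in float32, so the result stays in float32.
mixed16 = ints16 * floats32

# A manual cast before the operation also keeps the result in float32.
manual = ints32.astype("float32") * floats32

print("%s %s %s" % (mixed32.dtype, mixed16.dtype, manual.dtype))
```

+A float64 result is what forces the computation off the GPU, hence the advice to cast by hand or to use the smaller integer types.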
+Exercise 3
+-----------
+* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
+* Time with: ``time python file.py``
+Symbolic variables
+------------------
+* Number of dimensions
+ * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4
+* Dtype
+ * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
+ * tt.vector uses the floatX dtype
+ * floatX: configurable dtype that can be float32 or float64.
+* Custom variable
+ * All are shortcuts to: ``tt.tensor(dtype, broadcastable=[False]*nd)``
+ * Other dtype: uint[8,16,32,64], floatX
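+The bracket notation ``tt.[fdczbwil]vector`` expands to one constructor name per letter. A small sketch that spells out the mapping listed above (``constructor_names`` is a hypothetical helper and does not call Theano):

```python
# One-letter prefixes and the dtypes they select, in the order given above.
PREFIX_TO_DTYPE = dict(zip(
    "fdczbwil",
    ["float32", "float64", "complex64", "complex128",
     "int8", "int16", "int32", "int64"]))


def constructor_names(kind):
    """Map each prefixed constructor name (e.g. 'fvector') to its dtype."""
    return dict((prefix + kind, dtype)
                for prefix, dtype in PREFIX_TO_DTYPE.items())


names = constructor_names("vector")
for name in sorted(names):
    print("tt.%s -> %s" % (name, names[name]))
```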
+Creating symbolic variables: Broadcastability
+* Remember what I said about broadcasting?
+* How to add a row to all rows of a matrix?
+* How to add a column to all columns of a matrix?
+Details regarding symbolic broadcasting...
+* Broadcastability must be specified when creating the variable
+* The only shortcuts with broadcastable dimensions are: **tt.row** and **tt.col**
+* For all others: ``tt.tensor(dtype, broadcastable=([False or True])*nd)``
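+The two questions above have the same answer in NumPy, which can serve as a mental model. In this sketch the shapes ``(1, 3)`` and ``(2, 1)`` play the role of ``tt.row`` and ``tt.col``; the difference is that NumPy decides broadcasting from the runtime shape, while Theano needs it declared when the variable is created.

```python
import numpy as np

matrix = np.zeros((2, 3))

# Shape (1, 3): the first dimension is broadcast, as with tt.row.
row = np.array([[1.0, 2.0, 3.0]])

# Shape (2, 1): the second dimension is broadcast, as with tt.col.
col = np.array([[10.0], [20.0]])

plus_row = matrix + row   # row added to every row of the matrix
plus_col = matrix + col   # column added to every column of the matrix

print(plus_row)
print(plus_col)
```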
+Differentiation details
+-----------------------
+>>> gw,gb = tt.grad(cost, [w,b])
+* tt.grad works symbolically: takes and returns a Theano variable
+* tt.grad can be compared to a macro: it can be applied multiple times
+* tt.grad takes scalar costs only
+* A simple recipe allows efficient computation of vector x Jacobian and vector x Hessian
+* We are working on the missing optimizations to be able to efficiently compute the full Jacobian and Hessian and Jacobian x vector
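+One way to see why the vector x Jacobian and vector x Hessian products are cheap even though tt.grad takes scalar costs only is the following identity. This is a sketch assuming it is the recipe meant above; the symbols :math:`f`, :math:`c`, :math:`v`, :math:`J` and :math:`H` are introduced here for illustration, with :math:`v` treated as a constant.

```latex
\begin{aligned}
J_{ij} &= \frac{\partial f_i}{\partial x_j},
\qquad
H_{ij} = \frac{\partial^2 c}{\partial x_i \, \partial x_j}, \\
\frac{\partial}{\partial x_j} \Big( \sum_i v_i \, f_i(x) \Big)
  &= \sum_i v_i \, \frac{\partial f_i}{\partial x_j}
   = (v^\top J)_j, \\
\frac{\partial}{\partial x_j} \Big( \sum_i v_i \, \frac{\partial c}{\partial x_i} \Big)
  &= \sum_i v_i \, \frac{\partial^2 c}{\partial x_i \, \partial x_j}
   = (v^\top H)_j .
\end{aligned}
```

+In both cases the quantity being differentiated is a scalar, so a single gradient call yields the product without forming the full Jacobian or Hessian.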
+TODO: update the benchmark
+Benchmarks
+----------
+Example:
+* Multi-layer perceptron
+* Convolutional Neural Networks
+* Misc Elemwise operations
+Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
+* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
+* numexpr: similar to Theano, 'virtual machine' for elemwise expressions
+**Multi-Layer Perceptron**:
+60x784 matrix times 784x500 matrix, tanh, times 500x10 matrix, elemwise, then all in reverse for backpropagation
+.. image:: ../hpcs2011_tutorial/pics/mlp.png
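+For orientation, the forward half of this workload has the following shapes. This is a NumPy sketch with random data and illustrative names, not the benchmark code; the final elementwise step and the backward pass are left out because the description does not specify them.

```python
import numpy as np

rng = np.random.RandomState(0)

batch = rng.randn(60, 784)             # 60 examples, 784 features each
w_hidden = rng.randn(784, 500) * 0.01  # input-to-hidden weights
w_out = rng.randn(500, 10) * 0.01      # hidden-to-output weights

hidden = np.tanh(np.dot(batch, w_hidden))  # 60x784 times 784x500, then tanh
scores = np.dot(hidden, w_out)             # 60x500 times 500x10

print("%s %s" % (hidden.shape, scores.shape))
```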
+**Convolutional Network**: 
+256x256 images convolved with 6 7x7 filters,
+downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filters, elementwise
+tanh, matrix multiply, softmax elementwise, then in reverse
+.. image:: ../hpcs2011_tutorial/pics/conv.png
+**Elemwise**
+* All on CPU
+* Solid blue: Theano
+* Dashed Red: numexpr (without MKL)
+.. image:: ../hpcs2011_tutorial/pics/multiple_graph.png