Commit 4fd68401 (testgroup / pytensor), authored Aug 02, 2011 by James Bergstra

cifarSC2011 editing by Fred and James

Parent: 215328da
Showing 3 changed files with 273 additions and 243 deletions
doc/cifarSC2011/advanced_theano.txt (+141, -143)
doc/cifarSC2011/introduction.txt (+20, -7)
doc/cifarSC2011/theano.txt (+112, -93)
doc/cifarSC2011/advanced_theano.txt
.. _advanced_theano:
***************
Advanced Theano
***************
Conditions
----------
**IfElse**
- Build condition over symbolic variables.
- IfElse Op takes a boolean condition and two variables to compute as input.
- While the Switch Op evaluates both 'output' variables, the IfElse Op is lazy and only
  evaluates one variable, depending on the condition.
**IfElse Example: Comparison with Switch**
.. code-block:: python

    from theano import tensor as T
    from theano.lazycond import ifelse
    import theano, time, numpy

    a, b = T.scalars('a', 'b')
    x, y = T.matrices('x', 'y')

    z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
    z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

    f_switch = theano.function([a, b, x, y], z_switch,
                               mode=theano.Mode(linker='vm'))
    f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    val1 = 0.
    val2 = 1.
    big_mat1 = numpy.ones((10000, 1000))
    big_mat2 = numpy.ones((10000, 1000))

    n_times = 10

    tic = time.clock()
    for i in xrange(n_times):
        f_switch(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating both values %f sec' % (time.clock() - tic)

    tic = time.clock()
    for i in xrange(n_times):
        f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
The IfElse Op spends less time (about half) than Switch, since it computes only
one variable instead of both.

>>> python ifelse_switch.py
time spent evaluating both values 0.6700 sec
time spent evaluating one value 0.3500 sec

Note that the IfElse condition is a boolean while the Switch condition is a tensor, so
Switch is more general.

It is actually important to use ``linker='vm'`` or ``linker='cvm'``;
otherwise IfElse will compute both variables and take the same computation
time as the Switch Op. The linker is not currently set to 'cvm' by default, but
it will be in the near future.
Loops
-----
**Scan**

- General form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan
- You 'scan' a function along some input sequence, producing an output at each time-step
- The function can see the **previous K time-steps** of your function
- ``sum()`` could be computed by scanning the ``z + x(i)`` function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over for loops:

  - The number of iterations can be part of the symbolic graph
  - Minimizes GPU transfers if a GPU is involved
  - Computes gradients through sequential steps
  - Slightly faster than using a for loop in Python with a compiled Theano function
  - Can lower the overall memory usage by detecting the actual amount of memory needed
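The contract described above can be mimicked in plain Python; the sketch below (pure Python/NumPy, not Theano's actual implementation) shows how a reduction such as ``sum()`` is exactly the "scan ``z + x(i)`` over a list, starting from ``z=0``" pattern:

```python
import numpy

def py_scan(fn, sequence, outputs_info):
    """Minimal scan-like loop: apply fn to each element and the previous result."""
    result = outputs_info
    history = []
    for x in sequence:
        result = fn(x, result)
        history.append(result)
    return history

xs = numpy.arange(1, 5)                              # [1, 2, 3, 4]
partial_sums = py_scan(lambda x, z: z + x, xs, 0)    # [1, 3, 6, 10]
```

The last element of the history equals ``sum(xs)``; Theano's ``scan`` similarly returns all time-steps and lets you keep only the last.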
**Scan Example: Computing pow(A, k)**

.. code-block:: python

    import theano
    import theano.tensor as T

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A**1 through A**k.  Keep only the last
    # value.  Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
**Scan Example: Calculating a Polynomial**

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    coefficients = T.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = T.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                          coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)
    polynomial = components.sum()
    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
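The expected value is easy to verify with plain NumPy, as an independent check of the example above:

```python
import numpy

coeffs = numpy.asarray([1, 0, 2], dtype=numpy.float32)
x = 3.0

# Evaluate sum_i coeffs[i] * x**i directly.
value = (coeffs * x ** numpy.arange(len(coeffs))).sum()
# 1*1 + 0*3 + 2*9 = 19.0
```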
Exercise 4
-----------
- Run both examples
- Modify and execute the polynomial example to have the reduction done by scan
Compilation pipeline
--------------------
...
...
@@ -113,7 +252,7 @@ Theano output:
- Try the Theano flag floatX=float32
"""
Exercise 5
-----------
- In the last exercises, do you see a speed up with the GPU?
...
...
@@ -206,145 +345,6 @@ Debugging
- Few optimizations
- Run Python code (better error messages and can be debugged interactively in the Python debugger)
Known limitations
-----------------
...
...
@@ -364,5 +364,3 @@ Known limitations
- Disabling a few optimizations can speed up compilation
- Usually too many nodes indicates a problem with the graph
- Lazy evaluation in a branch (We will try to merge this summer)
doc/cifarSC2011/introduction.txt
...
...
@@ -41,6 +41,8 @@ Background Questionaire
* Using OpenCL / PyOpenCL ?
* Using cudamat / gnumpy ?
* Other?
* Who has used Cython?
...
...
@@ -98,17 +100,21 @@ Python in one slide
print b[1:3]             # slicing syntax

class Foo(object):       # Defining a class
    a = 1
    def __init__(self):
        self.a = 5
    def hello(self):
        return self.a

f = Foo()                # Creating a class instance
print f.hello()          # Calling methods of objects
# -> 5

class Bar(Foo):          # Defining a subclass
    def __init__(self, a):
        self.a = a

print Bar(99).hello()    # Creating an instance of Bar
# -> 99
Numpy in one slide
------------------
...
...
@@ -308,7 +314,14 @@ Project status
* Unofficial RPMs for Mandriva
* Downloads (January 2011 - June 8 2011):

  * Pypi 780
  * MLOSS: 483
  * Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
...
...
doc/cifarSC2011/theano.txt
...
...
@@ -8,46 +8,46 @@ Theano
Pointers
--------

* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html

Description
-----------

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Doesn't work on Windows for now
  * On GPU, data-intensive calculations are typically between 6.5x and 44x faster. We've seen speedups up to 140x

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * Including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Some sparse operations (CPU only)
* The project was started by James Bergstra and Olivier Breuleux
* For the past 1-2 years, I have replaced Olivier as lead contributor
Simple example
--------------
...
...
@@ -59,15 +59,13 @@ Simple example
>>> print f([0,1,2]) # prints `array([0,2,1026])`
======================================================  ====================================================
Unoptimized graph                                       Optimized graph
======================================================  ====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png  .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
======================================================  ====================================================

Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
-----------
...
...
@@ -91,10 +89,10 @@ Real example
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. code-block:: python
...
...
@@ -142,6 +140,19 @@ Real example
**Optimizations:**
Where are those optimizations applied?
* ``log(1+exp(x))``
* ``1 / (1 + T.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilisation)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion
.. code-block:: python

    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
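The ``log(1+exp(x))`` stabilisation listed above matters numerically; a plain NumPy illustration (independent of Theano) of why the naive form fails for large inputs:

```python
import numpy

x = numpy.array([1000.0])

# Naive softplus overflows: exp(1000) is inf, so log(1 + exp(x)) is inf.
naive = numpy.log(1 + numpy.exp(x))

# A stabilized form stays finite; for large x, log(1+exp(x)) is essentially x.
stable = numpy.logaddexp(0.0, x)
```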
...
...
@@ -159,22 +170,14 @@ Real example
updates={w:w-0.1*gw, b:b-0.1*gb})
Theano flags
------------
Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``
Exercise 2
...
...
@@ -261,57 +264,69 @@ Modify and execute the example to run on CPU with floatX=float32
GPU
---
* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
* Shared variables with float32 dtype are by default moved to the GPU memory space
* Use the Theano flag ``floatX=float32``
* Be sure to use ``floatX`` (``theano.config.floatX``) in your code
* Cast inputs before putting them into a shared variable
* Cast "problem": int32 combined with float32 gives float64
* A new casting mechanism is being developed
* Insert manual casts in your code or use [u]int{8,16}
* Insert manual casts around the mean operator (which involves a division by the length, which is an int64!)
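The cast "problem" is easy to reproduce in plain NumPy, whose promotion rules produce the same float64 upcast; a minimal sketch:

```python
import numpy

a = numpy.arange(3, dtype=numpy.int32)
b = numpy.ones(3, dtype=numpy.float32)

# int32 combined with float32 is promoted to float64,
# which silently leaves the float32-only GPU path.
c = a + b

# An explicit cast keeps everything in float32.
d = a.astype(numpy.float32) + b
```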
Exercise 3
-----------
* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
* Time with: ``time python file.py``
Symbolic variables
------------------
* Number of dimensions

  * T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4

* Dtype

  * T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * T.vector defaults to the floatX dtype
  * floatX: configurable dtype that can be float32 or float64.

* Custom variables

  * All are shortcuts to: ``T.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX

Creating symbolic variables: Broadcastability

* Remember what I said about broadcasting?
* How to add a row to all rows of a matrix?
* How to add a column to all columns of a matrix?

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are: **T.row** and **T.col**
* For all others: ``T.tensor(dtype, broadcastable=([False or True])*nd)``
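The two broadcasting questions above have a direct NumPy answer, which T.row and T.col mirror symbolically; a minimal NumPy sketch:

```python
import numpy

m = numpy.zeros((3, 4))

# Add a row to all rows: shape (1, 4) broadcasts along the first axis.
row = numpy.arange(4).reshape(1, 4)
with_row = m + row          # shape (3, 4)

# Add a column to all columns: shape (3, 1) broadcasts along the second axis.
col = numpy.arange(3).reshape(3, 1)
with_col = m + col          # shape (3, 4)
```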
Differentiation details
...
...
@@ -319,11 +334,15 @@ Differentiation details
>>> gw,gb = T.grad(cost, [w,b])
* T.grad works symbolically: it takes and returns a Theano variable
* T.grad can be compared to a macro: it can be applied multiple times
* T.grad takes scalar costs only
* A simple recipe allows computing vector x Jacobian and vector x Hessian products efficiently
* We are working on the missing optimizations to be able to compute the full Jacobian and Hessian, and Jacobian x vector products, efficiently
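Symbolic gradients such as those from T.grad are commonly sanity-checked against numerical ones; a minimal central-difference checker in plain NumPy (a generic sketch, independent of Theano):

```python
import numpy

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at point x."""
    x = numpy.asarray(x, dtype=float)
    g = numpy.zeros_like(x)
    for i in range(x.size):
        step = numpy.zeros_like(x)
        step.flat[i] = eps
        g.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

# Check against the analytic gradient of cost(w) = sum(w**2), which is 2*w.
w = numpy.array([1.0, -2.0, 3.0])
g = numeric_grad(lambda v: (v ** 2).sum(), w)
```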
...
...
@@ -332,20 +351,20 @@ Benchmarks
Example:
* Multi-layer perceptron
* Convolutional Neural Networks
* Misc Elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
* numexpr: similar to Theano, a 'virtual machine' for elemwise expressions
**Multi-Layer Perceptron**:
60x784 matrix times 784x500 matrix, tanh, times 500x10 matrix, elemwise, then all in reverse for backpropagation
.. image:: ../hpcs2011_tutorial/pics/mlp.png
**Convolutional Network**:
...
...
@@ -353,12 +372,12 @@ Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filter, elementwise
tanh, matrix multiply, softmax elementwise, then in reverse
.. image:: ../hpcs2011_tutorial/pics/conv.png
**Elemwise**
* All on CPU
* Solid blue: Theano
* Dashed Red: numexpr (without MKL)

.. image:: ../hpcs2011_tutorial/pics/multiple_graph.png