Commit 54bc197e authored by James Bergstra

revising cifar10SC intro

Parent: add20871
Day 1
-----
* Show of hands - what is your background?
* Overview/Motivation
* Python & Numpy in a nutshell
* Theano basics
* Quick tour through Deep Learning Tutorials (think about projects)
.. :
   day 1:
   I think that I could cover those 2 pages:

   * http://deeplearning.net/software/theano/hpcs2011_tutorial/introduction.html
   * http://deeplearning.net/software/theano/hpcs2011_tutorial/theano.html

   That includes:
   simple example
   linear regression example with shared var
   theano flags
   grad detail
   Symbolic variables
   gpu
   benchmark
Day 2
-----

* Theano beginning
* Loop/Condition in Theano (10-20m)
* Example with recent ML models (DLT)
* Propose/discuss projects
* Form groups and start projects!

Day 3
-----

* Advanced Theano (30 minutes)
* Debugging, profiling, compilation pipeline
* Projects / General hacking / code-sprinting.
Day 4
-----

* *You choose* (we can split the group)

  * Extending Theano
  * How to write an Op
  * How to use pycuda code in Theano

* Projects / General hacking / code-sprinting.
Note - the schedule here is a guideline.
We can adapt it in response to developments in the hands-on work.
The point is for you to learn something about the practice of machine
learning.
Python in one slide
-------------------
Features:

* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management

Language things:

* builtin types: int, float, str, list, dict, tuple, object
* Indentation for block delimiters
* Dictionary: ``d = {'var1': 'value1', 'var2': 42, ...}``
Syntax sample:

.. code-block:: python

    a = {'a': 5, 'b': None}  # dictionary of two elements
    b = [1, 2, 3]            # list of three int literals

    def f(a, b, c):
        return a + b + c     # note scoping, indentation
* List comprehension: ``[i+3 for i in range(10)]``
Numpy in one slide
------------------

* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* Numpy provides a N-dimensional numeric array in Python

  * Perfect for high-performance computing.

* Numpy provides:

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* Scipy provides lots more, including:

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio
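A quick taste of the features listed above, as a minimal sketch (the toy inputs are chosen so the results are easy to check):

```python
import numpy as np

# elementwise computation
x = np.arange(5.0)
y = np.exp(x)                 # exp applied to each element

# linear algebra: solve A b = rhs
A = np.eye(3) * 2.0
rhs = np.ones(3)
b = np.linalg.solve(A, rhs)   # each entry is 0.5

# Fourier transform of a constant signal: all energy in the DC bin
f = np.fft.fft(np.ones(4))    # [4, 0, 0, 0]

# pseudorandom numbers
r = np.random.rand(2, 3)      # uniform samples in [0, 1)
```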
Here are the properties of numpy arrays that you really need to know.
.. code-block:: python

    import numpy as np
    a = np.random.rand(3, 4, 5)
    a32 = a.astype('float32')

    a.ndim      # int: 3
    a.shape     # tuple: (3, 4, 5)
    a.size      # int: 60
    a.dtype     # np.dtype object: 'float64'
    a32.dtype   # np.dtype object: 'float32'
These arrays can be combined with numeric operators and standard mathematical
functions. Numpy has great documentation.
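For example, operators and math functions apply elementwise, and broadcasting stretches a size-1 axis to match the other operand (a small sketch):

```python
import numpy as np

a = np.random.rand(4, 5)
b = np.random.rand(1, 5)

c = a * b             # elementwise product; b's single row is broadcast over a's 4 rows
d = np.tanh(a) + 1.0  # math functions also apply elementwise

assert c.shape == (4, 5)
assert d.shape == (4, 5)
```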
Training an MNIST-ready classification neural network in pure numpy might look like this:

.. code-block:: python

    import numpy as np

    lr = 0.01        # learning rate
    batchsize = 100

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    w = np.random.normal(loc=0, scale=.1, size=(784, 500))
    b = np.zeros(500)
    v = np.zeros((500, 10))
    c = np.zeros(10)

    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]

        # forward pass
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)
        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin) + 1) / 2.0

        # backward pass (hand-derived gradients)
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout**2)
        g_outin = g_outout * outout * (1.0 - outout)
        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout**2)

        # gradient-descent updates
        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)
What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance
* Numpy is bound to the CPU
* Numpy lacks symbolic or automatic differentiation
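To make "symbolic or automatic differentiation" concrete, here is a toy forward-mode sketch in plain Python. This is only an illustration of the idea, not how Theano works (Theano builds a symbolic expression graph and differentiates that); the ``Dual`` class below is hypothetical, not part of any library:

```python
class Dual(object):
    """A value paired with its derivative: forward-mode autodiff."""
    def __init__(self, val, dot):
        self.val = val   # value of the expression
        self.dot = dot   # derivative w.r.t. the chosen input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

# d/dx (x*x + x) at x = 3 is 2*3 + 1 = 7
x = Dual(3.0, 1.0)
y = x * x + x
print(y.val, y.dot)  # 12.0 7.0
```

Derivative rules propagate mechanically through each operator, so no error-prone manual differentiation is needed.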
Here's how the algorithm above looks in Theano; it runs about 15 times faster
if you have a GPU (I'm skipping some dtype details which we'll come back to):

.. code-block:: python

    import numpy as np
    import theano as T
    import theano.tensor as TT

    lr = 0.01
    batchsize = 100

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = TT.matrix()
    sy = TT.matrix()
    w = T.shared(np.random.normal(loc=0, scale=.1, size=(784, 500)))
    b = T.shared(np.zeros(500))
    v = T.shared(np.zeros((500, 10)))
    c = T.shared(np.zeros(10))

    # symbolic expression-building
    hidout = TT.tanh(TT.dot(sx, w) + b)
    outout = (TT.tanh(TT.dot(hidout, v) + c) + 1) / 2.0
    err = 0.5 * TT.sum((outout - sy)**2)
    gw, gb, gv, gc = TT.grad(err, [w, b, v, c])

    # compile a fast training function
    train = T.function([sx, sy], err,
            updates={
                w: w - lr * gw,
                b: b - lr * gb,
                v: v - lr * gv,
                c: c - lr * gc})

    # now do the computations
    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations
* No function call -> global optimization
* Strongly typed -> compiles to machine instructions
* Array oriented -> parallelizable across cores
* Expression substitution optimizations automatically draw
  on many backend technologies for best performance.

  * FFTW, MKL, ATLAS, Scipy, Cython, CUDA

* Slower fallbacks always available
* It used to have no/poor support for internal looping and conditional
  expressions, but these are now quite usable.
Project status
--------------

* Mature: Theano has been developed and used since January 2008 (3.5 yrs old)
* Driven over 40 research papers in the last few years
* Core technology for a funded Silicon-Valley startup
* Good user documentation
* Active mailing list with participants from outside our lab
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for two years
* Used for research at Google and Yahoo.
* Unofficial RPMs for Mandriva
* Downloads (on June 8 2011, since last January): PyPI 780, MLOSS: 483, Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
-----------------------
They *Complement each other*:
- Highly parallel
- Very architecture-sensitive
- Built for maximum FP/memory throughput
- So hard to program that meta-programming is easier.
- CPU: largely restricted to control
- Optimized for sequential code and low latency (rather than high throughput)
- Tasks (1000/sec)
- Scripting fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
How Fast are GPUs?
------------------

- Theory:

  - Intel Core i7 980 XE (107 Gf/s float64) 6 cores
  - NVIDIA C2050 (515 Gf/s float64, 1 Tf/s float32) 480 cores
  - NVIDIA GTX580 (1.5 Tf/s float32) 512 cores
  - GPUs are faster, cheaper, more power-efficient

- Practice:

  - Depends on algorithm and implementation!
  - Reported speed improvements over CPU in the literature vary *widely* (.01x to 1000x)
  - Matrix-matrix multiply speedup: usually about 10-20x
  - Convolution speedup: usually about 15x
  - Elemwise speedup: slower, or up to 100x (depending on operation and layout)
  - Sum: can be faster or slower depending on layout

- Benchmarking is delicate work...

  - How to control quality of implementation?
  - How much time was spent optimizing CPU vs GPU code?
  - Theano goes up to 100x faster on GPU partly because the CPU baseline uses only one core
  - Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
  - If you see a speedup > 100x, the benchmark is probably not fair.
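The quality-of-implementation point matters even without a GPU: on the CPU alone, the same matrix multiply differs by orders of magnitude between a naive Python loop and BLAS-backed numpy. A small sketch (the exact ratio is machine-dependent, so no number is claimed):

```python
import time
import numpy as np

n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# naive triple loop in pure Python
t0 = time.time()
c_slow = [[sum(a[i, k] * b[k, j] for k in range(n)) for j in range(n)]
          for i in range(n)]
t_slow = time.time() - t0

# BLAS-backed numpy dot
t0 = time.time()
c_fast = np.dot(a, b)
t_fast = time.time() - t0

# same result, wildly different cost
assert np.allclose(c_fast, np.array(c_slow))
```

Comparing either of these against a GPU kernel gives a very different "speedup", which is exactly why published GPU-vs-CPU numbers vary so widely.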
Software for Directly Programming a GPU
---------------------------------------

(Theano is a meta-programmer, so it doesn't really count.)

- CUDA: C extension by NVIDIA

  - Vendor-specific
  - Numeric libraries (BLAS, RNG, FFT) maturing.

- OpenCL: multi-vendor version of CUDA

  - More general, standardized
  - Fewer libraries, less adoption.

- PyCUDA: Python bindings to the CUDA driver interface

  - Memory management of GPU objects
  - Compilation of code for the low-level driver
  - Makes it easy to do GPU meta-programming from within Python

- PyOpenCL: PyCUDA for OpenCL