testgroup / pytensor · Commits

Commit 54bc197e
Authored July 29, 2011 by James Bergstra
Parent: add20871

revising cifar10SC intro

Showing 2 changed files with 212 additions and 106 deletions:

* doc/cifarSC2011/boot_camp_overview.txt (+25 -18)
* doc/cifarSC2011/introduction.txt (+187 -88)
doc/cifarSC2011/boot_camp_overview.txt
@@ -15,19 +15,18 @@ Day 1

* Show of hands - what is your background?
* Overview/Motivation
* python/numpy crash course
* Beginning Theano
* Example with recent ML models (DLT)

.. :
    day 1:
    I think that I could cover these 2 pages:

    * http://deeplearning.net/software/theano/hpcs2011_tutorial/introduction.html
    * http://deeplearning.net/software/theano/hpcs2011_tutorial/theano.html

    That includes:

    simple example
    linear regression example with shared var
    theano flags
@@ -39,27 +38,35 @@ That includes:

Day 2
-----

* Loop/Condition in Theano (10-20 minutes)
* Propose/discuss projects
* Form groups and start projects!

Day 3
-----

* Advanced Theano (30 minutes)
* Debugging, profiling, compilation pipeline
* Projects / General hacking / code-sprinting.

Day 4
-----

* *You choose* (we can split the group)
* Extending Theano

  * How to write an Op
  * How to use pycuda code in Theano

* Projects / General hacking / code-sprinting.

Note - the schedule here is a guideline.
We can adapt it in response to developments in the hands-on work.
The point is for you to learn something about the practice of machine
learning.
doc/cifarSC2011/introduction.txt
@@ -51,18 +51,19 @@ Python in one slide

Features:

* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management
* Builtin types: int, float, str, list, dict, tuple, object

Syntax sample:

.. code-block:: python

    a = {'a': 5, 'b': None} # dictionary of two elements
    b = [1,2,3] # list of three int literals
@@ -71,112 +72,183 @@ Language things:

        return a + b + c # note scoping, indentation

* List comprehension: ``[i+3 for i in range(10)]``
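A few more one-liners covering the bullets above (an illustrative sketch added
for this revision, not part of the original slides):

.. code-block:: python

    point = (1.5, 2.5)                       # tuple
    name = 'Theano'                          # str
    squares = [i*i for i in range(10)]       # list comprehension
    lookup = {'var1': 'value1', 'var2': 42}  # dict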
Numpy in one slide
------------------

* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* Numpy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing.

* Numpy provides:

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* Scipy provides lots more, including:

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio

Here are the properties of numpy arrays that you really need to know.

.. code-block:: python

    import numpy as np
    a = np.random.rand(3, 4, 5)
    a32 = a.astype('float32')
    a.ndim    # int: 3
    a.shape   # tuple: (3, 4, 5)
    a.size    # int: 60
    a.dtype   # np.dtype object: 'float64'
    a32.dtype # np.dtype object: 'float32'

These arrays can be combined with numeric operators and standard mathematical
functions. Numpy has XXX great documentation XXX.
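As a minimal sketch of what "elementwise computations" and broadcasting look
like in practice (array names here are illustrative, not from the slides):

.. code-block:: python

    x = np.random.rand(4, 5)  # 4x5 matrix
    y = np.random.rand(1, 5)  # row vector

    z = np.tanh(x * y + 1)    # elementwise; y is broadcast across the rows
    z.shape                   # (4, 5)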
Training an MNIST-ready classification neural network in pure numpy might look like this:

.. code-block:: python

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    batchsize = 100  # minibatch size (value assumed)
    lr = 0.01        # learning rate (value assumed)
    w = np.random.normal(loc=0, scale=.1, size=(784, 500))
    b = np.zeros(500)
    v = np.zeros((500, 10))
    c = np.zeros(10)

    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]

        # forward pass
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)
        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin) + 1) / 2.0

        # backward pass: gradients by hand
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout**2)
        g_outin = g_outout * outout * (1.0 - outout)
        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout**2)

        # gradient-descent updates
        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)
What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance
* Numpy is bound to the CPU
* Numpy lacks symbolic or automatic differentiation

Here's how the algorithm above looks in Theano, and it runs 15 times faster if
you have a GPU (I'm skipping some dtype details which we'll come back to):

.. code-block:: python

    import theano as T
    import theano.tensor as TT
    from theano import function

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = TT.matrix()
    sy = TT.matrix()
    w = T.shared(np.random.normal(loc=0, scale=.1, size=(784, 500)))
    b = T.shared(np.zeros(500))
    v = T.shared(np.zeros((500, 10)))
    c = T.shared(np.zeros(10))

    # symbolic expression-building
    outout = TT.tanh(TT.dot(TT.tanh(TT.dot(sx, w) + b), v) + c)
    err = 0.5 * TT.sum((outout - sy)**2)
    gw, gb, gv, gc = TT.grad(err, [w, b, v, c])

    # compile a fast training function
    train = function([sx, sy], err,
        updates={
            w: w - lr * gw,
            b: b - lr * gb,
            v: v - lr * gv,
            c: c - lr * gc})

    # now do the computations
    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function call -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array oriented -> parallelizable across cores

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance

  * FFTW, MKL, ATLAS, Scipy, Cython, CUDA
  * Slower fallbacks always available

* It used to have no/poor support for internal looping and conditional
  expressions, but these are now quite usable; see the sketch below.
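As a small sketch of those looping and conditional constructs (written for
this revision, not taken from the slides; assumes a Theano recent enough to
have ``scan``):

.. code-block:: python

    import theano
    import theano.tensor as TT

    v = TT.vector()

    # elementwise conditional: clip negative entries to zero
    clip = theano.function([v], TT.switch(v < 0, 0, v))

    # symbolic loop: running sum over the elements of v
    sums, updates = theano.scan(
        fn=lambda x_t, acc: acc + x_t,  # one step: add element to accumulator
        sequences=v,
        outputs_info=TT.zeros_like(v[0]))
    running_sum = theano.function([v], sums)

    clip([-1., 2., -3.])       # [ 0., 2., 0.]
    running_sum([1., 2., 3.])  # [ 1., 3., 6.]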
Project status
--------------

* Mature: Theano has been developed and used since January 2008 (3.5 yrs old)
* Driven over 40 research papers in the last few years
* Core technology for a funded Silicon-Valley startup
* Good user documentation
* Active mailing list with participants from outside our lab
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for two years
* Used for research at Google and Yahoo.
* Unofficial RPMs for Mandriva
* Downloads (on June 8, 2011, since last January): PyPI 780, MLOSS: 483, Assembla (`bleeding edge` repository): unknown

Why scripting for GPUs?
-----------------------

They *complement each other*:
@@ -185,31 +257,58 @@ They *Complement each other*:
- Highly parallel
- Very architecture-sensitive
- Built for maximum FP/memory throughput
- So hard to program that meta-programming is easier.

- CPU: largely restricted to control

  - Optimized for sequential code and low latency (rather than high throughput)
  - Tasks (1000/sec)
  - Scripting fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.

How Fast are GPUs?
------------------

- Theory:

  - Intel Core i7 980 XE (107 Gf/s float64) 6 cores
  - NVIDIA C2050 (515 Gf/s float64, 1 Tf/s float32) 480 cores
  - NVIDIA GTX580 (1.5 Tf/s float32) 512 cores
  - GPUs are faster, cheaper, more power-efficient

- Practice:

  - Depends on algorithm and implementation!
  - Reported speed improvements over CPU in the literature vary *widely* (.01x to 1000x)
  - Matrix-matrix multiply speedup: usually about 10-20x
  - Convolution speedup: usually about 15x
  - Elemwise speedup: slower or up to 100x (depending on operation and layout)
  - Sum: can be faster or slower depending on layout

- Benchmarking is delicate work...

  - How to control quality of implementation?
  - How much time was spent optimizing CPU vs GPU code?
  - Theano goes up to 100x faster on GPU because it uses only one CPU core

    - Theano can be linked with multi-core capable BLAS (GEMM and GEMV)

  - If you see speedup > 100x, the benchmark is probably not fair.

Software for Directly Programming a GPU
---------------------------------------

Theano is a meta-programmer, so it doesn't really count.

- CUDA: C extension by NVIDIA

  - Vendor-specific
  - Numeric libraries (BLAS, RNG, FFT) maturing

- OpenCL: multi-vendor version of CUDA

  - More general, standardized
  - Fewer libraries, less adoption

- PyCUDA: python bindings to the CUDA driver interface

  - Python interface to CUDA
  - Memory management of GPU objects
  - Compilation of code for the low-level driver
  - Makes it easy to do GPU meta-programming from within Python

- PyOpenCL: PyCUDA for OpenCL
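For a feel of what "scripted CPU invokes JIT-compiled kernels on GPU" looks
like, here is a minimal sketch in the spirit of PyCUDA's introductory example
(added for this revision; the kernel and names are illustrative):

.. code-block:: python

    import numpy as np
    import pycuda.autoinit            # initializes the CUDA driver
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # compile a small CUDA kernel at runtime, from Python
    mod = SourceModule("""
    __global__ void scale(float *a, float k)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        a[i] *= k;
    }
    """)
    scale = mod.get_function("scale")

    a = np.random.randn(256).astype(np.float32)
    # drv.InOut copies the array to the GPU and back after the launch
    scale(drv.InOut(a), np.float32(2.0), block=(256, 1, 1), grid=(1, 1))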