testgroup / pytensor / Commits / 1192e645

Commit 1192e645 authored Jul 31, 2011 by James Bergstra

merge - conflict in cifarSC2011/introduction.txt

Parents: a60e4df2, 489b14c4

Showing 1 changed file with 120 additions and 81 deletions:
doc/cifarSC2011/introduction.txt (+120, −81)
@@ -9,29 +9,29 @@ Introduction

Background Questionnaire
------------------------
* Who has used Theano before?

  * What did you do with it?

* Who has used Python? numpy? scipy? matplotlib?
* Who has used IPython?

  * Who has used it as a distributed computing engine?

* Who has done C/C++ programming?
* Who has organized computation around a particular physical memory layout?
* Who has used a multidimensional array of >2 dimensions?
* Who has written a Python module in C before?
* Who has written a program to *generate* Python modules in C?
* Who has used a templating engine?
* Who has programmed a GPU before?

  * Using OpenGL / shaders?
@@ -43,7 +43,7 @@ Background Questionnaire

* Other?
* Who has used Cython?
Python in one slide
-------------------

@@ -51,15 +51,15 @@ Python in one slide

Features:

* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management
* builtin types: int, float, str, list, dict, tuple, object

Syntax sample:
@@ -71,22 +71,22 @@ Syntax sample:

    def foo(b, c=3):        # function w default param c
        return a + b + c    # note scoping, indentation

    print b[1:3]            # slicing syntax

* List comprehension: ``[i+3 for i in range(10)]``
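The fragments above can be made self-contained; a minimal runnable sketch (Python 3 print syntax; ``a = 10`` is an illustrative value standing in for the part of the slide elided by the diff):

```python
a = 10                               # module-level name, visible inside foo

def foo(b, c=3):                     # c has a default value
    return a + b + c                 # a is found in the enclosing scope

b = [1, 2, 3, 4]
print(b[1:3])                        # slicing -> [2, 3]
b_squared = [b_i ** 2 for b_i in b]  # list comprehension
print(b_squared)                     # [1, 4, 9, 16]
print(foo(5))                        # 10 + 5 + 3 = 18
```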
Numpy in one slide
------------------

* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* Numpy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing.

* Numpy provides:

  * elementwise computations
@@ -94,7 +94,7 @@ Numpy in one slide

* pseudorandom numbers from many distributions
* Scipy provides lots more, including:

  * more linear algebra
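The elementwise and pseudorandom facilities mentioned above fit in a few lines; a minimal sketch (array shape and seed are arbitrary, assuming numpy is installed):

```python
import numpy as np

rng = np.random.RandomState(0)           # seeded pseudorandom generator
x = rng.uniform(-1.0, 1.0, size=(3, 4))  # draws from a uniform distribution

y = np.tanh(x) + x ** 2                  # elementwise: applied entry-by-entry
print(y.shape)                           # (3, 4)
```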
@@ -161,11 +161,11 @@ Training an MNIST-ready classification neural network in pure numpy might look l
What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance
* Numpy is bound to the CPU
* Numpy lacks symbolic or automatic differentiation

Here's how the algorithm above looks in Theano, and it runs 15 times faster if
you have a GPU (I'm skipping some dtype details, which we'll come back to):
@@ -210,43 +210,52 @@ you have GPU (I'm skipping some dtype-details which we'll come back to):
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function call -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array oriented -> parallelizable across cores

* Support for looping and branching in expressions
* Expression substitution optimizations automatically draw
  on many backend technologies for best performance.

  * FFTW, MKL, ATLAS, Scipy, Cython, CUDA
  * Slower fallbacks always available

* Automatic differentiation
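Automatic differentiation is the point worth dwelling on: in Theano, ``T.grad`` derives gradients symbolically. As a plain-numpy illustration of what that replaces (a hand-written gradient checked against finite differences; the function and names here are illustrative, not Theano's API):

```python
import numpy as np

def f(w):                     # scalar loss: sum of tanh(w_i)^2
    return np.sum(np.tanh(w) ** 2)

def grad_f(w):                # hand-derived gradient: 2*tanh(w)*(1 - tanh(w)^2)
    t = np.tanh(w)
    return 2.0 * t * (1.0 - t ** 2)

w = np.array([0.1, -0.5, 2.0])
eps = 1e-6
# central finite differences along each coordinate
numeric = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])
print(np.allclose(grad_f(w), numeric, atol=1e-5))  # True
```

With symbolic differentiation, the ``grad_f`` step (and the chance to get it wrong) disappears.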
Project status
--------------

* Mature: theano has been developed and used since January 2008 (3.5 yrs old)
* Driven over 40 research papers in the last few years
* Core technology for a funded Silicon-Valley startup
* Good user documentation
* Active mailing list with participants from outside our lab
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for two years
* Used for research at Google and Yahoo.
* Unofficial RPMs for Mandriva
* Downloads (on June 8 2011, since last January): PyPI 780, MLOSS: 483, Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?

@@ -254,18 +263,23 @@ Why scripting for GPUs ?
They *Complement each other*:

* GPUs are everything that scripting/high level languages are not

  * Highly parallel
  * Very architecture-sensitive
  * Built for maximum FP/memory throughput
  * So hard to program that meta-programming is easier.

* CPU: largely restricted to control

  * Optimized for sequential code and low latency (rather than high throughput)
  * Tasks (1000/sec)
  * Scripting fast enough

Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
@@ -273,28 +287,41 @@ Best of both: scripted CPU invokes JIT-compiled kernels on GPU.

How Fast are GPUs?
------------------
* Theory

  * Intel Core i7 980 XE (107 Gf/s float64) 6 cores
  * NVIDIA C2050 (515 Gf/s float64, 1 Tf/s float32) 480 cores
  * NVIDIA GTX580 (1.5 Tf/s float32) 512 cores
  * GPUs are faster, cheaper, more power-efficient

* Practice (our experience)

  * Depends on algorithm and implementation!
  * Reported speed improvements over CPU in lit. vary *widely* (.01x to 1000x)
  * Matrix-matrix multiply speedup: usually about 10-20x.
  * Convolution speedup: usually about 15x.
  * Elemwise speedup: slower or up to 100x (depending on operation and layout)
  * Sum: can be faster or slower depending on layout.
  * Benchmarking is delicate work...

    * How to control quality of implementation?
    * How much time was spent optimizing CPU vs GPU code?

  * Theano goes up to 100x faster on GPU, partly because it uses only one CPU core

    * Theano can be linked with multi-core capable BLAS (GEMM and GEMV)

  * If you see speedup > 100x, the benchmark is probably not fair.
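The layout point is easy to see even on the CPU; a minimal sketch with numpy and ``timeit`` (sizes and repeat counts are arbitrary, and absolute timings vary by machine, which is exactly why benchmarking is delicate):

```python
import numpy as np
import timeit

n = 1000
a_c = np.ones((n, n), order='C')   # row-major (C) layout
a_f = np.asfortranarray(a_c)       # same values, column-major (Fortran) layout

# Summing down columns strides through memory differently in each layout,
# so the same logical operation can run at noticeably different speeds.
t_c = timeit.timeit(lambda: a_c.sum(axis=0), number=50)
t_f = timeit.timeit(lambda: a_f.sum(axis=0), number=50)

print(np.allclose(a_c.sum(axis=0), a_f.sum(axis=0)))  # True: identical results
print("C order: %.4fs  F order: %.4fs" % (t_c, t_f))
```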
Software for Directly Programming a GPU

@@ -302,15 +329,27 @@ Software for Directly Programming a GPU

Theano is a meta-programmer, so it doesn't really count.
* CUDA: C extension by NVIDIA

  * Vendor-specific
  * Numeric libraries (BLAS, RNG, FFT) maturing.

* OpenCL: multi-vendor version of CUDA

  * More general, standardized
  * Fewer libraries, less adoption.

* PyCUDA: Python bindings to the CUDA driver interface

  * Python interface to CUDA
  * Memory management of GPU objects
  * Compilation of code for the low-level driver
  * Makes it easy to do GPU meta-programming from within Python

* PyOpenCL: PyCUDA for OpenCL