Commit 54bc197e authored by James Bergstra

revising cifar10SC intro

Parent: add20871
Day 1
-----
* Show of hands - what is your background?
* Overview/Motivation
* Python & Numpy in a nutshell
* Theano basics
* Quick tour through Deep Learning Tutorials (think about projects)
.. :
   day 1:
   I think that I could cover those 2 pages:

   * http://deeplearning.net/software/theano/hpcs2011_tutorial/introduction.html
   * http://deeplearning.net/software/theano/hpcs2011_tutorial/theano.html

   That includes:
   simple example
   linear regression example with shared var
   theano flags
   grad detail
   Symbolic variables
   gpu
   benchmark
Day 2
-----

* Theano beginning
* Loop/Condition in Theano (10-20m)
* Example with recent ML models (DLT)
* Propose/discuss projects
* Form groups and start projects!

Day 3
-----

* Advanced Theano (30 minutes)
* Debugging, profiling, compilation pipeline
* Projects / General hacking / code-sprinting.
Day 4
-----

* *You choose* (we can split the group)

  * Extending Theano
  * How to write an Op
  * How to use pycuda code in Theano

* Projects / General hacking / code-sprinting.
Note - the schedule here is a guideline.
We can adapt it in response to developments in the hands-on work.
The point is for you to learn something about the practice of machine
learning.
Python in one slide
-------------------
Features:

* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management

Language things:

* builtin types: int, float, str, list, dict, tuple, object
* Indentation for block delimiters
* Dictionary: ``d = {'var1': 'value1', 'var2': 42, ...}``
Syntax sample:

.. code-block:: python

    a = {'a': 5, 'b': None}  # dictionary of two elements
    b = [1, 2, 3]            # list of three int literals

    def f(a, b, c):
        return a + b + c     # note scoping, indentation
* List comprehension: ``[i+3 for i in range(10)]``
Numpy in one slide
------------------

* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* Numpy provides a N-dimensional numeric array in Python

  * Perfect for high-performance computing.

* Numpy provides:

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* Scipy provides lots more, including:

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio
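A quick taste of the features listed above, as a minimal sketch (the toy inputs are chosen so the results are easy to check):

```python
import numpy as np

# elementwise computation
x = np.arange(5.0)
y = np.exp(x)                 # exp applied to each element

# linear algebra: solve A b = rhs
A = np.eye(3) * 2.0
rhs = np.ones(3)
b = np.linalg.solve(A, rhs)   # each entry is 0.5

# Fourier transform of a constant signal: all energy in the DC bin
f = np.fft.fft(np.ones(4))    # [4, 0, 0, 0]

# pseudorandom numbers
r = np.random.rand(2, 3)      # uniform samples in [0, 1)
```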
Here are the properties of numpy arrays that you really need to know.
.. code-block:: python

    import numpy as np
    a = np.random.rand(3, 4, 5)
    a32 = a.astype('float32')

    a.ndim      # int: 3
    a.shape     # tuple: (3, 4, 5)
    a.size      # int: 60
    a.dtype     # np.dtype object: 'float64'
    a32.dtype   # np.dtype object: 'float32'
These arrays can be combined with numeric operators and standard mathematical
functions. Numpy has great documentation.
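For example, operators and math functions apply elementwise, and broadcasting stretches a size-1 axis to match the other operand (a small sketch):

```python
import numpy as np

a = np.random.rand(4, 5)
b = np.random.rand(1, 5)

c = a * b             # elementwise product; b's single row is broadcast over a's 4 rows
d = np.tanh(a) + 1.0  # math functions also apply elementwise

assert c.shape == (4, 5)
assert d.shape == (4, 5)
```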
Training an MNIST-ready classification neural network in pure numpy might look like this:

.. code-block:: python

    import numpy as np

    lr = 0.01        # learning rate
    batchsize = 100

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    w = np.random.normal(loc=0, scale=.1, size=(784, 500))
    b = np.zeros(500)
    v = np.zeros((500, 10))
    c = np.zeros(10)

    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]

        # forward pass
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)
        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin) + 1) / 2.0

        # backward pass (hand-derived gradients)
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout**2)
        g_outin = g_outout * outout * (1.0 - outout)
        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout**2)

        # gradient-descent updates
        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)
What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance
* Numpy is bound to the CPU
* Numpy lacks symbolic or automatic differentiation
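To make "symbolic or automatic differentiation" concrete, here is a toy forward-mode sketch in plain Python. This is only an illustration of the idea, not how Theano works (Theano builds a symbolic expression graph and differentiates that); the ``Dual`` class below is hypothetical, not part of any library:

```python
class Dual(object):
    """A value paired with its derivative: forward-mode autodiff."""
    def __init__(self, val, dot):
        self.val = val   # value of the expression
        self.dot = dot   # derivative w.r.t. the chosen input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

# d/dx (x*x + x) at x = 3 is 2*3 + 1 = 7
x = Dual(3.0, 1.0)
y = x * x + x
print(y.val, y.dot)  # 12.0 7.0
```

Derivative rules propagate mechanically through each operator, so no error-prone manual differentiation is needed.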
Here's how the algorithm above looks in Theano; it runs about 15 times faster
if you have a GPU (I'm skipping some dtype details which we'll come back to):

.. code-block:: python

    import numpy as np
    import theano as T
    import theano.tensor as TT

    lr = 0.01
    batchsize = 100

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = TT.matrix()
    sy = TT.matrix()
    w = T.shared(np.random.normal(loc=0, scale=.1, size=(784, 500)))
    b = T.shared(np.zeros(500))
    v = T.shared(np.zeros((500, 10)))
    c = T.shared(np.zeros(10))

    # symbolic expression-building
    hidout = TT.tanh(TT.dot(sx, w) + b)
    outout = (TT.tanh(TT.dot(hidout, v) + c) + 1) / 2.0
    err = 0.5 * TT.sum((outout - sy)**2)
    gw, gb, gv, gc = TT.grad(err, [w, b, v, c])

    # compile a fast training function
    train = T.function([sx, sy], err,
            updates={
                w: w - lr * gw,
                b: b - lr * gb,
                v: v - lr * gv,
                c: c - lr * gc})

    # now do the computations
    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations
* No function call -> global optimization
* Strongly typed -> compiles to machine instructions
* Array oriented -> parallelizable across cores
* Expression substitution optimizations automatically draw
  on many backend technologies for best performance.

  * FFTW, MKL, ATLAS, Scipy, Cython, CUDA

* Slower fallbacks always available
* It used to have no/poor support for internal looping and conditional
  expressions, but these are now quite usable.
Project status
--------------

* Mature: Theano has been developed and used since January 2008 (3.5 yrs old)
* Driven over 40 research papers in the last few years
* Core technology for a funded Silicon-Valley startup
* Good user documentation
* Active mailing list with participants from outside our lab
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for two years
* Used for research at Google and Yahoo.
* Unofficial RPMs for Mandriva
* Downloads (on June 8 2011, since last January): PyPI 780, MLOSS: 483, Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
-----------------------
They *Complement each other*:
- Highly parallel
- Very architecture-sensitive
- Built for maximum FP/memory throughput
- So hard to program that meta-programming is easier.
- CPU: largely restricted to control
- Optimized for sequential code and low latency (rather than high throughput)
- Tasks (1000/sec)
- Scripting fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
How Fast are GPUs?
------------------

- Theory:

  - Intel Core i7 980 XE (107 Gf/s float64) 6 cores
  - NVIDIA C2050 (515 Gf/s float64, 1 Tf/s float32) 480 cores
  - NVIDIA GTX580 (1.5 Tf/s float32) 512 cores
  - GPUs are faster, cheaper, more power-efficient

- Practice:

  - Depends on algorithm and implementation!
  - Reported speed improvements over CPU in the literature vary *widely* (.01x to 1000x)
  - Matrix-matrix multiply speedup: usually about 10-20x
  - Convolution speedup: usually about 15x
  - Elemwise speedup: slower, or up to 100x (depending on operation and layout)
  - Sum: can be faster or slower depending on layout

- Benchmarking is delicate work...

  - How to control quality of implementation?
  - How much time was spent optimizing CPU vs GPU code?
  - Theano goes up to 100x faster on GPU partly because the CPU baseline uses only one core
  - Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
  - If you see a speedup > 100x, the benchmark is probably not fair.
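The quality-of-implementation point matters even without a GPU: on the CPU alone, the same matrix multiply differs by orders of magnitude between a naive Python loop and BLAS-backed numpy. A small sketch (the exact ratio is machine-dependent, so no number is claimed):

```python
import time
import numpy as np

n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# naive triple loop in pure Python
t0 = time.time()
c_slow = [[sum(a[i, k] * b[k, j] for k in range(n)) for j in range(n)]
          for i in range(n)]
t_slow = time.time() - t0

# BLAS-backed numpy dot
t0 = time.time()
c_fast = np.dot(a, b)
t_fast = time.time() - t0

# same result, wildly different cost
assert np.allclose(c_fast, np.array(c_slow))
```

Comparing either of these against a GPU kernel gives a very different "speedup", which is exactly why published GPU-vs-CPU numbers vary so widely.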
Software for Directly Programming a GPU
---------------------------------------

(Theano is a meta-programmer, so it doesn't really count.)

- CUDA: C extension by NVIDIA

  - Vendor-specific
  - Numeric libraries (BLAS, RNG, FFT) maturing.

- OpenCL: multi-vendor version of CUDA

  - More general, standardized
  - Fewer libraries, less adoption.

- PyCUDA: Python bindings to the CUDA driver interface

  - Memory management of GPU objects
  - Compilation of code for the low-level driver
  - Makes it easy to do GPU meta-programming from within Python

- PyOpenCL: PyCUDA for OpenCL