Commit 4fd68401, authored by James Bergstra

cifarSC2011 editing by Fred and James

Parent 215328da
.. _advanced_theano:

***************
Advanced Theano
***************
Conditions
----------

**IfElse**

- Build a condition over symbolic variables.
- The IfElse Op takes a boolean condition and two variables to compute as input.
- While the Switch Op evaluates both 'output' variables, the IfElse Op is lazy
  and evaluates only one variable, depending on the condition.
**IfElse Example: Comparison with Switch**

.. code-block:: python

    import time

    import numpy
    import theano
    from theano import tensor as T
    from theano.lazycond import ifelse

    a, b = T.scalars('a', 'b')
    x, y = T.matrices('x', 'y')

    z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
    z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

    f_switch = theano.function([a, b, x, y], z_switch,
                               mode=theano.Mode(linker='vm'))
    f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    val1 = 0.
    val2 = 1.
    big_mat1 = numpy.ones((10000, 1000))
    big_mat2 = numpy.ones((10000, 1000))

    n_times = 10

    tic = time.clock()
    for i in xrange(n_times):
        f_switch(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating both values %f sec' % (time.clock() - tic)

    tic = time.clock()
    for i in xrange(n_times):
        f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
The IfElse Op spends less time (about half) than Switch, since it computes
only one variable instead of both::

    $ python ifelse_switch.py
    time spent evaluating both values 0.6700 sec
    time spent evaluating one value 0.3500 sec

Note that the IfElse condition is a boolean while the Switch condition is a
tensor, so Switch is more general.

It is important to use ``linker='vm'`` or ``linker='cvm'``; otherwise IfElse
will compute both variables and take the same computation time as the Switch
Op. The linker is not currently set to 'cvm' by default, but it will be in
the near future.
Loops
-----

**Scan**

- General form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan.
- You 'scan' a function along some input sequence, producing an output at each time-step.
- The function can see the **previous K time-steps** of your function.
- ``sum()`` could be computed by scanning the ``z + x(i)`` function over a list, given an initial state of ``z=0``.
- Often a for loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over for loops:

  - The number of iterations can be part of the symbolic graph.
  - It minimizes GPU transfers, if a GPU is involved.
  - It computes gradients through sequential steps.
  - It is slightly faster than using a for loop in Python with a compiled Theano function.
  - It can lower the overall memory usage by detecting the actual amount of memory needed.
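The recurrence that ``scan`` implements can be sketched in plain Python (illustration only; the real ``theano.scan`` builds a symbolic graph rather than looping eagerly, and its API differs). Map and reduction fall out as special cases:

```python
# Minimal sketch of the scan recurrence in plain Python (not the real
# theano.scan API).

def scan_sketch(fn, sequence, outputs_info=None):
    """Apply fn along `sequence`, threading the previous output through.

    If outputs_info is None, fn sees only the current element (a map);
    otherwise fn(prev, x) also sees the previous output (a recurrence).
    """
    outputs = []
    prev = outputs_info
    for x in sequence:
        prev = fn(x) if outputs_info is None else fn(prev, x)
        outputs.append(prev)
    return outputs

# Map: square every element.
squares = scan_sketch(lambda x: x * x, [1, 2, 3, 4])        # [1, 4, 9, 16]

# Reduction: sum() as a scan of z + x over the list, initial state z=0;
# the last output is the reduced value.
partial_sums = scan_sketch(lambda z, x: z + x, [1, 2, 3, 4], outputs_info=0)
total = partial_sums[-1]                                    # 10
```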
**Scan Example: Computing pow(A, k)**

.. code-block:: python

    import theano
    import theano.tensor as T

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A**1 through A**k.  Keep only the last
    # value.  Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]
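What scan computes here can be checked against plain NumPy (a sketch independent of Theano): ``k`` repeated elementwise multiplications starting from ones.

```python
import numpy

def power_sketch(A, k):
    # Same recurrence as inner_fct above: result starts at ones and is
    # multiplied elementwise by A at each of the k steps.
    result = numpy.ones_like(A)
    for _ in range(k):
        result = result * A
    return result

out = power_sketch(numpy.arange(10, dtype='float64'), 2)
# elementwise squares: [ 0.  1.  4.  9. 16. 25. 36. 49. 64. 81.]
```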
**Scan Example: Calculating a Polynomial**

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    coefficients = T.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = T.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                          coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)
    polynomial = components.sum()

    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
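The result can be checked by hand: with coefficients [1, 0, 2] and x = 3, the polynomial is 1·3^0 + 0·3^1 + 2·3^2 = 19. A plain-Python sketch of the same sum:

```python
def poly_sketch(coefficients, x):
    # Same computation as the scan: sum of coeff * x**power over the
    # coefficient list, with power running over 0, 1, 2, ...
    return sum(c * x ** p for p, c in enumerate(coefficients))

value = poly_sketch([1, 0, 2], 3)  # 1*1 + 0*3 + 2*9 = 19
```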
Exercise 4
----------

- Run both examples.
- Modify and execute the polynomial example to have the reduction done by scan.
Compilation pipeline
--------------------
- Try the Theano flag floatX=float32

"""

Exercise 5
----------

- In the last exercises, do you see a speed up with the GPU?
Debugging
---------
- Few optimizations
- Run Python code (better error messages and can be debugged interactively in the Python debugger)
Known limitations
-----------------
- Disabling a few optimizations can speed up compilation
- Usually too many nodes indicates a problem with the graph
Background Questionnaire
------------------------
* Using OpenCL / PyOpenCL?
* Using cudamat / gnumpy?
* Other?

* Who has used Cython?
Python in one slide
-------------------
print b[1:3]              # slicing syntax

class Foo(object):        # Defining a class
    def __init__(self):
        self.a = 5

    def hello(self):
        return self.a

f = Foo()                 # Creating a class instance
print f.hello()           # Calling methods of objects
# -> 5

class Bar(Foo):           # Defining a subclass
    def __init__(self, a):
        self.a = a

print Bar(99).hello()     # Creating an instance of Bar
# -> 99
Numpy in one slide
------------------
Project status
--------------
* Unofficial RPMs for Mandriva
* Downloads (January 2011 - June 8 2011):

  * Pypi 780
  * MLOSS: 483
  * Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
.......................
Pointers
--------
* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html
Description
-----------
* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Doesn't work on Windows for now
  * On GPU data-intensive calculations are typically between 6.5x and 44x faster.  We've seen speedups up to 140x.

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Some sparse operations (CPU only)
* The project was started by James Bergstra and Olivier Breuleux

  * For the past 1-2 years, I have replaced Olivier as lead contributor
Simple example
--------------
>>> print f([0,1,2])  # prints `array([0,2,1026])`
====================================================== =====================================================
Unoptimized graph                                      Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
----------
Real example
------------
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. code-block:: python
**Optimizations:**

Where are those optimizations applied?

* ``log(1+exp(x))``
* ``1 / (1 + T.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilisation)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion
.. code-block:: python

    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
    updates={w: w - 0.1*gw, b: b - 0.1*gb})
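The payoff of the ``log(1+exp(x))``-style stability rewrites can be seen directly in NumPy (a sketch of the numerical issue, not of Theano's optimizer): the naive formula loses the tiny term, while the stabilized form keeps it.

```python
import numpy

x = 1e-20
naive = numpy.log(1 + x)   # 1 + 1e-20 rounds to 1.0, so this is exactly 0.0
stable = numpy.log1p(x)    # ~1e-20, the correct first-order answer

# The same pattern motivates rewriting log(1 - sigmoid(v)) as a softplus:
# it avoids computing a probability that rounds to exactly 1.0.
```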
Theano flags
------------

Theano can be configured with flags.  They can be defined in two ways:
* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``
Exercise 2
----------
GPU
---
* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to the GPU memory space
* Use the Theano flag ``floatX=float32``

  * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  * Cast inputs before putting them into a shared variable
  * Cast "problem": int32 combined with float32 gives float64

    * A new casting mechanism is being developed
    * Insert a manual cast in your code or use [u]int{8,16}
    * Insert a manual cast around the mean operator (which involves a division by the length, which is an int64!)
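The cast "problem" above is easy to reproduce in NumPy (shown as a NumPy sketch; the promotion rules are the same ones Theano follows): combining int32 with float32 promotes to float64, which cannot live on a float32-only GPU.

```python
import numpy

i32 = numpy.ones(3, dtype='int32')
f32 = numpy.ones(3, dtype='float32')

promoted = (i32 + f32).dtype   # float64: float32 cannot hold every int32

# Manual fixes mirror the bullets above: either cast up front...
kept_f32 = (i32.astype('float32') + f32).dtype        # float32
# ...or use a small integer type that fits in float32 exactly.
small = (numpy.ones(3, dtype='int16') + f32).dtype    # float32
```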
Exercise 3
----------
* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
* Time with: ``time python file.py``
Symbolic variables
------------------
* Number of dimensions

  * T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4

* Dtype

  * T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * T.vector maps to the floatX dtype

    * floatX: configurable dtype that can be float32 or float64.

* Custom variables

  * All are shortcuts to: ``T.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX
Creating symbolic variables: Broadcastability

* Remember what I said about broadcasting?

  * How do you add a row to all rows of a matrix?
  * How do you add a column to all columns of a matrix?

Details regarding symbolic broadcasting:

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **T.row** and **T.col**
* For all others: ``T.tensor(dtype, broadcastable=([False or True])*nd)``
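The two broadcasting questions above have one-line answers in NumPy (a NumPy sketch of the same behaviour that T.row and T.col encode symbolically): a (1, n) row or an (n, 1) column broadcasts across the matching dimension.

```python
import numpy

m = numpy.zeros((3, 4))
row = numpy.arange(4).reshape(1, 4)   # shape (1, 4), like T.row
col = numpy.arange(3).reshape(3, 1)   # shape (3, 1), like T.col

# The length-1 dimension is broadcast: the row is added to every row,
# the column to every column.
plus_row = m + row                    # each of the 3 rows becomes [0, 1, 2, 3]
plus_col = m + col                    # each of the 4 columns becomes [0, 1, 2]
```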
Differentiation details
-----------------------
>>> gw, gb = T.grad(cost, [w, b])
* T.grad works symbolically: it takes and returns a Theano variable
* T.grad can be compared to a macro: it can be applied multiple times
* T.grad takes scalar costs only

  * A simple recipe allows computing vector x Jacobian and vector x Hessian products efficiently
  * We are working on the missing optimizations to be able to compute the full Jacobian and Hessian, and Jacobian x vector, efficiently
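One way to see what symbolic differentiation delivers is to compare it with the manual alternative (a NumPy sketch, using a hypothetical quadratic cost): centred finite differences require one cost evaluation pair per parameter, and only work for scalar costs, just like T.grad.

```python
import numpy

def cost(w):
    # Hypothetical scalar cost: 0.5 * ||w||^2, whose analytic gradient is w.
    return 0.5 * numpy.sum(w ** 2)

def numeric_grad(f, w, eps=1e-6):
    # Centred finite differences, one coordinate at a time.  Symbolic
    # differentiation gives the same answer in a single graph, without
    # the 2*n extra cost evaluations.
    g = numpy.zeros_like(w)
    for i in range(w.size):
        step = numpy.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

w = numpy.array([1.0, -2.0, 3.0])
g = numeric_grad(cost, w)   # close to the analytic gradient, w itself
```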
Benchmarks
----------
Example:
* Multi-layer perceptron
* Convolutional Neural Networks
* Misc Elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
* numexpr: similar to Theano, a 'virtual machine' for elemwise expressions
**Multi-Layer Perceptron**:

60x784 matrix times 784x500 matrix, tanh, times 500x10 matrix, elemwise, then all in reverse for backpropagation
.. image:: ../hpcs2011_tutorial/pics/mlp.png
**Convolutional Network**:
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filters, elementwise
tanh, matrix multiply, softmax elementwise, then in reverse
.. image:: ../hpcs2011_tutorial/pics/conv.png
**Elemwise**
* All on CPU
* Solid blue: Theano
* Dashed red: numexpr (without MKL)
.. image:: ../hpcs2011_tutorial/pics/multiple_graph.png