Commit 2f506f14, authored by Frederic Bastien

first version of crei tutorial.

Parent: d5baff99

.. _advanced_theano:
***************
Advanced Theano
***************
Conditions
----------
**IfElse**
- Builds a condition over symbolic variables.
- The IfElse Op takes a boolean condition and two variables as input.
- While the Switch Op evaluates both 'output' variables, the IfElse Op is lazy and
  evaluates only the one selected by the condition.
**IfElse Example: Comparison with Switch**
.. code-block:: python

    from theano import tensor as T
    from theano.ifelse import ifelse
    import theano, time, numpy

    a, b = T.scalars('a', 'b')
    x, y = T.matrices('x', 'y')

    # switch evaluates both branches; ifelse evaluates only the selected one
    z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
    z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

    f_switch = theano.function([a, b, x, y], z_switch,
                               mode=theano.Mode(linker='vm'))
    f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    val1 = 0.
    val2 = 1.
    big_mat1 = numpy.ones((10000, 1000))
    big_mat2 = numpy.ones((10000, 1000))

    n_times = 10

    tic = time.clock()
    for i in xrange(n_times):
        f_switch(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating both values %f sec' % (time.clock() - tic)

    tic = time.clock()
    for i in xrange(n_times):
        f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)

The IfElse Op takes about half the time of the Switch Op, since it computes only
one variable instead of both::

    $ python ifelse_switch.py
    time spent evaluating both values 0.6700 sec
    time spent evaluating one value 0.3500 sec

Note that the IfElse condition is a boolean while the Switch condition is a tensor, so
Switch is more general.

It is important to use ``linker='vm'`` or ``linker='cvm'``;
otherwise IfElse will compute both variables and take the same computation
time as the Switch Op. The linker is not currently set to 'cvm' by default, but
it will be in the near future.
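
The difference between eager and lazy branch evaluation can be sketched in plain Python, without Theano. This is only an illustration: ``expensive``, ``eager_switch`` and ``lazy_ifelse`` are hypothetical helpers, not Theano functions, and the thunks (zero-argument lambdas) stand in for the work a lazy linker defers.

.. code-block:: python

    calls = []

    def expensive(name, value):
        """Stand-in for a costly branch; records that it was evaluated."""
        calls.append(name)
        return value

    def eager_switch(cond, then_value, else_value):
        # Both arguments were already computed by the caller before we get here.
        return then_value if cond else else_value

    def lazy_ifelse(cond, then_thunk, else_thunk):
        # Only the selected thunk is ever called.
        return then_thunk() if cond else else_thunk()

    eager_result = eager_switch(True, expensive("x", 1.0), expensive("y", 2.0))
    eager_calls = list(calls)

    del calls[:]
    lazy_result = lazy_ifelse(True,
                              lambda: expensive("x", 1.0),
                              lambda: expensive("y", 2.0))
    lazy_calls = list(calls)

    print(eager_calls)  # ['x', 'y']
    print(lazy_calls)   # ['x']

Both calls return the same value; only the amount of work differs.
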
Loops
-----
**Scan**
- General form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan.
- You 'scan' a function along some input sequence, producing an output at each time-step.
- The function can see the **previous K time-steps** of your function.
- ``sum()`` could be computed by scanning the ``z + x(i)`` function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- The advantages of using ``scan`` over for loops:

  - It allows the number of iterations to be part of the symbolic graph.
  - It minimizes GPU transfers if a GPU is involved.
  - It computes gradients through sequential steps.
  - It is slightly faster than using a for loop in Python with a compiled Theano function.
  - It can lower the overall memory usage by detecting the actual amount of memory needed.

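
As a rough mental model, the recurrence that Scan expresses can be written as an ordinary Python loop. The ``scan`` helper below is a hypothetical plain-Python stand-in, not ``theano.scan``: it is neither symbolic nor differentiable, and it only mirrors the idea of feeding each step the current element and the previous output.

.. code-block:: python

    def scan(fn, sequence, initial):
        """Apply fn(element, previous_output) along sequence; return all outputs."""
        outputs = []
        state = initial
        for element in sequence:
            state = fn(element, state)
            outputs.append(state)
        return outputs

    xs = [1, 2, 3, 4]

    # sum(): scan z + x(i) with initial state z = 0, then keep the last output.
    running_sums = scan(lambda x_i, z: z + x_i, xs, 0)
    total = running_sums[-1]

    # map: a step function that ignores the previous output.
    squares = scan(lambda x_i, _prev: x_i ** 2, xs, None)

    # Elementwise A ** k: k steps of multiplying the previous result by A.
    A = [0, 1, 2, 3]
    k = 3
    powers = scan(lambda _step, prev: [p * a for p, a in zip(prev, A)],
                  range(k), [1] * len(A))
    A_pow_k = powers[-1]

    print(running_sums)  # [1, 3, 6, 10]
    print(total)         # 10
    print(squares)       # [1, 4, 9, 16]
    print(A_pow_k)       # [0, 1, 8, 27]

Reduction keeps only the last output, map ignores the carried state, and the power example carries a state without reading the sequence values.
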
**Scan Example: Computing pow(A,k)**
.. code-block:: python
import theano
import theano.tensor as T
k = T.iscalar("k"); A = T.vector("A")
def inner_fct(prior_result, A): return prior_result * A
# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
outputs_info=T.ones_like(A),
non_sequences=A, n_steps=k)
# Scan has provided us with A**1 through A**k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]
power = theano.function(inputs=[A,k], outputs=final_result,
updates=updates)
print power(range(10),2)
#[ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
**Scan Example: Calculating a Polynomial**
.. code-block:: python
import numpy
import theano
import theano.tensor as T
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x"); max_coefficients_supported = 10000
# Generate the components of the polynomial
full_range=theano.tensor.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
coeff * (free_var ** power),
outputs_info=None,
sequences=[coefficients, full_range],
non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function(inputs=[coefficients, x],
outputs=polynomial)
test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print calculate_polynomial(test_coeff, 3)
# 19.0
Exercise 4
-----------
- Run both examples
- Modify and execute the polynomial example to have the reduction done by scan
Compilation pipeline
--------------------
.. image:: ../hpcs2011_tutorial/pics/pipeline.png
:width: 400 px
Inplace optimization
--------------------
- Two types of inplace operations:

  - An Op that returns a view on its inputs (e.g. reshape, inplace transpose)
  - An Op that writes its output in the memory space of its inputs

- This allows some memory optimizations
- An Op must tell Theano if it works inplace
- Inplace Ops add constraints on the order of execution

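
The two kinds of inplace behaviour, and the ordering constraint they create, can be illustrated with the standard library alone. This sketch uses ``array`` and ``memoryview`` as stand-ins for tensors; ``scale_inplace`` is a hypothetical helper, and nothing here reflects how Theano tracks views or destroyed inputs internally.

.. code-block:: python

    from array import array

    data = array('d', [1.0, 2.0, 3.0, 4.0])

    # A "view" shares memory with its input: no copy is made.
    view = memoryview(data)[1:3]
    view[0] = 20.0            # writing through the view changes `data`
    assert data[1] == 20.0

    # An operation that works in place overwrites its input buffer.
    def scale_inplace(buf, factor):
        for i in range(len(buf)):
            buf[i] *= factor
        return buf

    # Order of execution now matters: anything that still needs the old values
    # of `data` has to run before the in-place operation.
    total_before = sum(data)          # must run first
    scale_inplace(data, 2.0)
    total_after = sum(data)

    print(total_before)  # 28.0
    print(total_after)   # 56.0
    print(list(view))    # [40.0, 6.0]
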
Profiling
---------
- To profile, replace the default mode with ProfileMode by using the Theano flag ``mode=ProfileMode``
- To enable memory profiling, use the flag ``ProfileMode.profile_memory=True``
Theano output:
.. code-block:: python
"""
Time since import 33.456s
Theano compile time: 1.023s (3.1% since import)
Optimization time: 0.789s
Linker time: 0.221s
Theano fct call 30.878s (92.3% since import)
Theano Op time 29.411s 87.9%(since import) 95.3%(of fct call)
Theano function overhead in ProfileMode 1.466s 4.4%(since import)
4.7%(of fct call)
10001 Theano fct call, 0.003s per call
Rest of the time since import 1.555s 4.6%
Theano fct summary:
<% total fct time> <total time> <time per call> <nb call> <fct name>
100.0% 30.877s 3.09e-03s 10000 train
0.0% 0.000s 4.06e-04s 1 predict
Single Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %>
<self seconds> <cumulative seconds> <time per call> <nb_call>
<nb_op> <nb_apply> <Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 1 1 <Gemv>
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10001 1 2 <Dot>
2.4% 99.3% 0.691s 29.206s 7.68e-06s * 90001 10 10 <Elemwise>
0.4% 99.7% 0.127s 29.334s 1.27e-05s 10000 1 1 <Alloc>
0.2% 99.9% 0.053s 29.386s 1.75e-06s * 30001 2 4 <DimShuffle>
0.0% 100.0% 0.014s 29.400s 1.40e-06s * 10000 1 1 <Sum>
0.0% 100.0% 0.011s 29.411s 1.10e-06s * 10000 1 1 <Shape_i>
(*) Op is running a c implementation
Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %>
<self seconds> <cumulative seconds> <time per call>
<nb_call> <nb apply> <Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 1 Gemv{inplace}
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10001 2 dot
1.3% 98.2% 0.378s 28.893s 3.78e-05s * 10000 1 Elemwise{Composite{scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}
0.4% 98.7% 0.127s 29.021s 1.27e-05s 10000 1 Alloc
0.3% 99.0% 0.092s 29.112s 9.16e-06s * 10000 1 Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)]
0.1% 99.3% 0.033s 29.265s 1.66e-06s * 20001 3 InplaceDimShuffle{x}
... (remaining 11 Apply account for 0.7%(0.00s) of the runtime)
(*) Op is running a c implementation
Apply-wise summary:
<% of local_time spent at this position> <cumulative %%>
<apply time> <cumulative seconds> <time per call>
<nb_call> <Apply position> <Apply Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 15 Gemv{inplace}(w, TensorConstant{-0.01}, InplaceDimShuffle{1,0}.0, Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)].0, TensorConstant{0.9998})
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10000 1 dot(x, w)
1.3% 98.2% 0.378s 28.893s 3.78e-05s 10000 9 Elemwise{Composite{scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}(y, Elemwise{Composite{neg,sub}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
0.4% 98.7% 0.127s 29.020s 1.27e-05s 10000 10 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.3% 99.0% 0.092s 29.112s 9.16e-06s 10000 13 Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0,0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}, _op_use_c_code=True}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{neg,sub}}[(0,0)].0, Elemwise{sub,no_inplace}.0, InplaceDimShuffle{x}.0)
0.3% 99.3% 0.080s 29.192s 7.99e-06s 10000 11 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}, _op_use_c_code=True}}[(0, 0)](Elemwise{neg,no_inplace}.0)
... (remaining 14 Apply instances account for
0.7%(0.00s) of the runtime)
Profile of Theano functions memory:
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
Theano fct: train
Max without gc, inplace and view (KB) 2481
Max FAST_RUN_NO_GC (KB) 16
Max FAST_RUN (KB) 16
Memory saved by view (KB) 2450
Memory saved by inplace (KB) 15
Memory saved by GC (KB) 0
<Sum apply outputs (bytes)> <Apply outputs memory size(bytes)>
<created/inplace/view> <Apply node>
<created/inplace/view> is taked from the op declaration, not ...
2508800B [2508800] v InplaceDimShuffle{1,0}(x)
6272B [6272] i Gemv{inplace}(w, ...)
3200B [3200] c Elemwise{Composite{...}}(y, ...)
Here are tips to potentially make your code run faster (if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
- Try the Theano flag floatX=float32
"""
Exercise 5
-----------
- In the last exercises, do you see a speed up with the GPU?
- Where does it come from? (Use ProfileMode)
- Is there something we can do to speed up the GPU version?
Printing/Drawing Theano graphs
------------------------------
- Pretty Printing
``theano.printing.pprint(variable)``
>>> theano.printing.pprint(prediction)
gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),TensorConstant{0.5})
- Debug Print
``theano.printing.debugprint({fct, variable, list of variables})``
>>> theano.printing.debugprint(prediction)
Elemwise{gt,no_inplace} [@181772236] ''
|Elemwise{true_div,no_inplace} [@181746668] ''
| |InplaceDimShuffle{x} [@181746412] ''
| | |TensorConstant{1} [@181745836]
| |Elemwise{add,no_inplace} [@181745644] ''
| | |InplaceDimShuffle{x} [@181745420] ''
| | | |TensorConstant{1} [@181744844]
| | |Elemwise{exp,no_inplace} [@181744652] ''
| | | |Elemwise{sub,no_inplace} [@181744012] ''
| | | | |Elemwise{neg,no_inplace} [@181730764] ''
| | | | | |dot [@181729676] ''
| | | | | | |x [@181563948]
| | | | | | |w [@181729964]
| | | | |InplaceDimShuffle{x} [@181743788] ''
| | | | | |b [@181730156]
|InplaceDimShuffle{x} [@181771788] ''
| |TensorConstant{0.5} [@181771148]
>>> theano.printing.debugprint(predict)
Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] '' 2
|dot [@183018796] '' 1
| |x [@183000780]
| |w [@183000812]
|InplaceDimShuffle{x} [@183133580] '' 0
| |b [@183000876]
|TensorConstant{[ 0.5]} [@183084108]
- Picture Printing of Graphs
>>> theano.printing.pydotprint_variables(prediction)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_prediction.png
:width: 800 px
All pydotprint* functions require Graphviz and pydot.
>>> theano.printing.pydotprint(predict)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_predic.png
:width: 800 px
>>> theano.printing.pydotprint(train) # This is a small train example!
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_train.png
:width: 1500 px
Debugging
---------
- Run with the Theano flag ``compute_test_value = {'off', 'ignore', 'warn', 'raise'}``

  - Runs the code as the graph is created
  - Allows you to find bugs earlier (e.g. shape mismatch)
  - Makes it easier to identify where the problem is in *your* code
  - Uses the value of constants and shared variables directly
  - For pure symbolic variables, use ``x.tag.test_value = numpy.random.rand(5, 10)``

- Run with the flag ``mode=FAST_COMPILE``

  - Few optimizations
  - Runs Python code (better error messages, and it can be debugged interactively in the Python debugger)

- Run with the flag ``mode=DebugMode``

  - 100-1000x slower
  - Tests all optimization steps from the original graph to the final graph
  - Checks many things that an Op should/shouldn't do
  - Executes both the Python and C code versions

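
To see why running test values while the graph is built helps, here is a toy model in plain Python. ``Var``, ``shape`` and ``dot`` are hypothetical and much simpler than Theano's machinery; the only point is that the error is raised on the line that builds the bad expression, not later when a compiled function is called.

.. code-block:: python

    class Var(object):
        """Hypothetical symbolic variable that carries an optional test value."""
        def __init__(self, name, test_value=None):
            self.name = name
            self.test_value = test_value   # nested list standing in for an array

    def shape(value):
        return (len(value), len(value[0]))

    def dot(a, b):
        """Build a 'dot' node and immediately run it on the test values."""
        (n, k), (k2, m) = shape(a.test_value), shape(b.test_value)
        if k != k2:
            # The error points at the line that *built* the bad expression.
            raise ValueError("shape mismatch in dot(%s, %s): %s vs %s"
                             % (a.name, b.name, (n, k), (k2, m)))
        result = [[sum(a.test_value[i][p] * b.test_value[p][j] for p in range(k))
                   for j in range(m)] for i in range(n)]
        return Var("dot(%s, %s)" % (a.name, b.name), result)

    x = Var("x", [[1.0, 2.0, 3.0]] * 2)        # shape (2, 3)
    w_ok = Var("w_ok", [[1.0], [1.0], [1.0]])  # shape (3, 1)
    w_bad = Var("w_bad", [[1.0], [1.0]])       # shape (2, 1)

    good = dot(x, w_ok)
    print(good.test_value)  # [[6.0], [6.0]]

    try:
        dot(x, w_bad)
        message = None
    except ValueError as err:
        message = str(err)
    print(message)
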
Known limitations
-----------------
- Compilation phase distinct from execution phase

  - Use ``a_tensor_variable.eval()`` to make this less visible

- Compilation time can be significant

  - Amortize it with functions over big inputs or by reusing functions

- Execution overhead

  - We have worked on this, but more work is needed
  - So a function needs a certain number of operations to be useful

- Compilation time is superlinear in the size of the graph

  - A few hundred nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually, too many nodes indicates a problem with the graph

.. _crei2013_index:
===========================
Theano Tutorial @ CREI 2013
===========================
July 19, 2013, Sherbrooke, Québec, Canada.

Theano is Python software for evaluating complicated array expressions.
What does it do?
* aggressive expression optimizations,
* automatic GPU use,
* symbolic differentiation and R op.
It complements the Python numeric/scientific software stack (e.g. NumPy, SciPy,
scikits, matplotlib, PIL.)
Its design and feature set have been driven by machine learning research
at the University of Montreal (groups of Yoshua Bengio, Pascal Vincent,
Aaron Courville and Roland Memisevic).

The result is a very good library for doing research in deep
learning and neural network training, and a flexible framework for
many other models and algorithms in machine learning more generally.
It has proven to be useful for implementing:
- linear and nonlinear neural network classifiers
- convolutional models
- Energy models: RBM, DBN, GRBM, ssRBM, AIS
- Auto-encoders: DAE, CAE
- GP regression
- sparse coding
- recurrent neural networks, echo state, (HMM?)
- online and batch learning and optimization
- Even SVM!
As people's needs change this list will grow, but Theano is built
around vector, matrix, and tensor expressions; there is little reason
to use it for calculations on other data structures. There is
also some sparse matrix support.
Contents
--------
The structured part of these lab sessions will be a walk-through of the following
material. Interleaved with this structured part will be blocks of time for
individual or group work. The idea is that you can try out Theano and get help
from gurus on hand if you get stuck.
.. toctree::

    introduction
    theano
    advanced_theano
    /tutorial/extending_theano
    pyCUDA
    gpundarray

.. _cifarSS2011_Introduction:
************
Introduction
************
Background Questionnaire
------------------------
* Who has used Theano before?

  * What did you do with it?

* Who has used Python? NumPy? SciPy? matplotlib?
* Who has used IPython?

  * Who has used it as a distributed computing engine?

* Who has done C/C++ programming?
* Who has organized computation around a particular physical memory layout?
* Who has used a multidimensional array of >2 dimensions?
* Who has written a Python module in C before?

  * Who has written a program to *generate* Python modules in C?

* Who has used a templating engine?
* Who has programmed a GPU before?

  * Using OpenGL / shaders?
  * Using CUDA (runtime? / driver?)
  * Using PyCUDA?
  * Using OpenCL / PyOpenCL?
  * Using cudamat / gnumpy?
  * Other?

* Who has used Cython?

Python in one slide
-------------------
* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management
* Built-in types: int, float, str, list, dict, tuple, object
* Slow execution
* Popular in web-dev and scientific communities
.. code-block:: python

    #######################
    # PYTHON SYNTAX EXAMPLE
    #######################
    a = 1                       # no type declaration required!
    b = (1, 2, 3)               # tuple of three int literals
    c = [1, 2, 3]               # list of three int literals
    d = {'a': 5, b: None}       # dictionary of two elements
                                # N.B. string literal, None

    print d['a']                # square brackets index
    # -> 5
    print d[(1, 2, 3)]          # new tuple == b, retrieves None
    # -> None
    print d[6]
    # raises KeyError Exception

    x, y, z = 10, 100, 100      # multiple assignment from tuple
    x, y, z = b                 # unpacking a sequence

    b_squared = [b_i**2 for b_i in b]  # list comprehension

    def foo(b, c=3):            # function with default param c
        return a + b + c        # note scoping, indentation

    foo(5)                      # calling a function
    # -> 1 + 5 + 3 == 9         # N.B. scoping
    foo(b=6, c=2)               # calling with named args
    # -> 1 + 6 + 2 == 9

    print b[1:3]                # slicing syntax

    class Foo(object):          # Defining a class
        def __init__(self):
            self.a = 5
        def hello(self):
            return self.a

    f = Foo()                   # Creating a class instance
    print f.hello()             # Calling methods of objects
    # -> 5

    class Bar(Foo):             # Defining a subclass
        def __init__(self, a):
            self.a = a

    print Bar(99).hello()       # Creating an instance of Bar
    # -> 99

NumPy in one slide
------------------
* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* NumPy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing.
  * Slices return views (no copy)

* NumPy provides

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* SciPy provides lots more, including

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio

.. code-block:: python
##############################
# Properties of NumPy arrays
# that you really need to know
##############################
import numpy as np # import can rename
a = np.random.rand(3, 4, 5) # random generators
a32 = a.astype('float32') # arrays are strongly typed
a.ndim # int: 3
a.shape # tuple: (3, 4, 5)
a.size # int: 60
a.dtype # np.dtype object: 'float64'
a32.dtype # np.dtype object: 'float32'
assert a[1, 1, 1] != 10 # a[1, 1, 1] is a view
a[1, 1, 1] = 10 # So assignment to it changes the
assert a[1, 1, 1] == 10 # original array
Arrays can be combined with numeric operators and standard mathematical
functions. NumPy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
Training an MNIST-ready classification neural network in pure NumPy might look like this:
.. code-block:: python

    #########################
    # NumPy for Training a
    # Neural Network on MNIST
    #########################
    import numpy as np

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    w = np.random.normal(
        loc=0,
        scale=.1,
        size=(784, 500))
    b = np.zeros((500,))
    v = np.zeros((500, 10))
    c = np.zeros((10,))

    batchsize = 100
    lr = 0.01                   # learning rate (value chosen for illustration)
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]

        # forward pass
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)

        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin) + 1) / 2.0

        # squared error and backward pass
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout ** 2)

        # d/dz of (tanh(z) + 1) / 2 is 2 * out * (1 - out)
        g_outin = g_outout * 2.0 * outout * (1.0 - outout)

        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout ** 2)

        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)

What's missing?
---------------
* Non-lazy evaluation (required by Python) hurts performance
* NumPy is bound to the CPU
* NumPy lacks symbolic or automatic differentiation
Now let's have a look at the same algorithm in Theano, which runs 15 times faster if
you have a GPU (I'm skipping some dtype details, which we'll come back to).
.. code-block:: python

    #########################
    # Theano for Training a
    # Neural Network on MNIST
    #########################
    import numpy as np
    import theano
    import theano.tensor as tensor

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = tensor.matrix()
    sy = tensor.matrix()
    w = theano.shared(np.random.normal(loc=0, scale=.1,
                                       size=(784, 500)))
    b = theano.shared(np.zeros(500))
    v = theano.shared(np.zeros((500, 10)))
    c = theano.shared(np.zeros(10))

    # symbolic expression-building
    hid = tensor.tanh(tensor.dot(sx, w) + b)
    out = tensor.tanh(tensor.dot(hid, v) + c)
    err = 0.5 * tensor.sum((out - sy) ** 2)
    gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])

    # compile a fast training function
    lr = 0.01                   # learning rate (value chosen for illustration)
    train = theano.function([sx, sy], err,
        updates={
            w: w - lr * gw,
            b: b - lr * gb,
            v: v - lr * gv,
            c: c - lr * gc})

    # now do the computations
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        err_i = train(x_i, y_i)

Theano in one slide
-------------------
* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations
* No function call -> global optimization
* Strongly typed -> compiles to machine instructions
* Array oriented -> parallelizable across cores
* Support for looping and branching in expressions
* Expression substitution optimizations automatically draw
on many backend technologies for best performance.
* FFTW, MKL, ATLAS, SciPy, Cython, CUDA
* Slower fallbacks always available
* Automatic differentiation and R op
* Sparse matrices
Project status
--------------
* Mature: Theano has been developed and used since January 2008 (5.5 yrs old)
* Has driven over 87 research papers
* Good user documentation
* Active mailing list with participants from outside our lab
* Core technology for a funded Silicon-Valley startup
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for many years
* Used for research at Google and Yahoo.
* Downloads (January 2011 - June 8 2011):
* PyPI (16 July 2013): 60k total, 159 last day, 823 last week
* GitHub (`bleeding edge` repository): unknown
TODO: Do I keep the GPU section?
Why scripting for GPUs?
-----------------------
They *Complement each other*:
* GPUs are everything that scripting/high level languages are not
* Highly parallel
* Very architecture-sensitive
* Built for maximum FP/memory throughput
* So hard to program that meta-programming is easier.
* CPU: largely restricted to control
* Optimized for sequential code and low latency (rather than high throughput)
* Tasks (1000/sec)
* Scripting fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
How Fast are GPUs?
------------------
* Theory
* Intel Core i7 980 XE (107Gf/s float64) 6 cores
* NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32) 480 cores
* NVIDIA GTX580 (1.5Tf/s float32) 512 cores
* GPUs are faster, cheaper, more power-efficient
* Practice (our experience)
* Depends on algorithm and implementation!
* Reported speed improvements over CPU in lit. vary *widely* (.01x to 1000x)
* Matrix-matrix multiply speedup: usually about 10-20x.
* Convolution speedup: usually about 15x.
* Elemwise speedup: slower or up to 100x (depending on operation and layout)
* Sum: can be faster or slower depending on layout.
* Benchmarking is delicate work...
* How to control quality of implementation?
* How much time was spent optimizing CPU vs GPU code?
* Theano goes up to 100x faster on GPU because it uses only one CPU core
* Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
* If you see speedup > 100x, the benchmark is probably not fair.
Software for Directly Programming a GPU
---------------------------------------
Theano is a meta-programmer, doesn't really count.
* CUDA: C extension by NVIDIA
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) maturing.
* OpenCL: multi-vendor version of CUDA
* More general, standardized
* Fewer libraries, less adoption.
* PyCUDA: python bindings to CUDA driver interface
* Python interface to CUDA
* Memory management of GPU objects
* Compilation of code for the low-level driver
* Makes it easy to do GPU meta-programming from within Python
* PyOpenCL: PyCUDA for OpenCL
.. _theano:
******
Theano
******
Pointers
--------
* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html
Description
-----------
* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation
* Theano computes derivatives of functions with one or many inputs.
* Speed and stability optimizations
* Gives the right answer for ``log(1+x)`` even if x is really tiny.
* Works on Linux, Mac and Windows
* Transparent use of a GPU
* float32 only for now (working on other data types)
* Still in experimental state on Windows
* On GPU data-intensive calculations are typically between 6.5x and 44x faster. We've seen speedups up to 140x
* Extensive unit-testing and self-verification
* Detects and diagnoses many types of errors
* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives
* including specialized implementations in C/C++, NumPy, SciPy, and Matlab
* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Some sparse operations (CPU only)
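
The ``log(1+x)`` stability point can be reproduced with the standard library alone. This sketch uses ``math.log1p`` only to show the kind of numerical problem involved; it makes no claim about which rewrite Theano applies internally.

.. code-block:: python

    import math

    x = 1e-20

    naive = math.log(1 + x)     # 1 + x rounds to exactly 1.0, so the result is 0.0
    stable = math.log1p(x)      # computes log(1 + x) without forming 1 + x

    print(naive)    # 0.0
    print(stable)   # 1e-20

The naive form loses the input entirely, while the stable form keeps full relative precision.
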
Simple example
--------------
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0, 1, 2]) # prints `array([0, 2, 1026])`
====================================================== =====================================================
Unoptimized graph                                      Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
-----------
.. code-block:: python
import theano
a = theano.tensor.vector() # declare variable
out = a + a ** 10 # build symbolic expression
f = theano.function([a], out) # compile function
print f([0, 1, 2])
# prints `array([0, 2, 1026])`
theano.printing.pydotprint_variables(out, outfile="f_unoptimized.png", var_with_name_simple=True)
theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)
Modify and execute the example to do this expression: ``a ** 2 + b ** 2 + 2 * a * b``
Real example
------------
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt
    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))      # Probability that target = 1
    prediction = p_1 > 0.5                         # The prediction thresholded
    xent = -y*tt.log(p_1) - (1-y)*tt.log(1-p_1)    # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w**2).sum()       # The cost to minimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.1 * gw,
                 b: b - 0.1 * gb})
    predict = theano.function(inputs=[x], outputs=prediction)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])

**Optimizations:**
Where are these optimizations applied?

* ``log(1+exp(x))`` (softplus)
* ``1 / (1 + tt.exp(-var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilisation)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion

.. code-block:: python
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
# 1 / (1 + tt.exp(-var)) -> sigmoid(var)
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
# log(1 - sigmoid(var)) -> -softplus(var)
prediction = p_1 > 0.5
cost = xent.mean() + 0.01 * (w**2).sum()
gw,gb = tt.grad(cost, [w, b])
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
# w - 0.1 * gw: GEMV with the dot in the grad
updates={w: w - 0.1 * gw,
b: b - 0.1 * gb})
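
The stabilisation of ``log(1-sigmoid(var))`` can be checked numerically in plain Python. The ``sigmoid`` and ``softplus`` helpers below are written for this illustration and are not Theano's implementations; the sketch only shows why the rewritten form is safer for large inputs.

.. code-block:: python

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def softplus(x):
        """log(1 + exp(x)), arranged so that exp() never overflows."""
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    x = 40.0

    # Direct form: sigmoid(40) rounds to exactly 1.0, so log(1 - 1.0) fails.
    try:
        naive = math.log(1.0 - sigmoid(x))
    except ValueError:
        naive = None

    # Rewritten form: log(1 - sigmoid(x)) == -softplus(x).
    stable = -softplus(x)

    print(naive)    # None
    print(stable)   # close to -40.0

For moderate inputs both forms agree, so the rewrite changes the result only where the direct form breaks down.
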
Theano flags
------------
Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``

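
To make the two forms concrete, the sketch below expresses the same pair of settings as a ``THEANO_FLAGS``-style string and as an INI-style file with a ``[global]`` section, which is the layout ``~/.theanorc`` is assumed to use here. ``parse_flags`` is a hypothetical helper, and the parsing relies on Python's ``configparser``, not on Theano's own configuration code.

.. code-block:: python

    import configparser

    # What the THEANO_FLAGS environment variable would hold.
    flags_string = "floatX=float32,device=gpu"

    # The same settings written as a configuration file.
    rc_text = """
    [global]
    floatX = float32
    device = gpu
    """

    def parse_flags(text):
        """Split 'key=value,key=value' into a dict."""
        return dict(item.split("=", 1) for item in text.split(",") if item)

    parser = configparser.RawConfigParser()
    parser.optionxform = str          # keep 'floatX' case-sensitive
    # Strip the indentation used for display before parsing.
    parser.read_string("\n".join(line.strip() for line in rc_text.splitlines()))

    from_env = parse_flags(flags_string)
    from_file = dict(parser.items("global"))

    print(from_env == from_file)  # True
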
Exercise 2
-----------
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w**2).sum()  # The cost to optimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
        name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    # The loop variable is named `node` so the symbolic `x` is not overwritten
    if any([node.op.__class__.__name__ == 'Gemv' for node in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([node.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for node in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]

    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                                         outfile="pics/logreg_pydotprint_prediction.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)

Modify and execute the example to run on CPU with floatX=float32
* You will need to use: ``theano.config.floatX`` and ``ndarray.astype("str")``
GPU
---
* Only 32-bit floats are supported (other types are being worked on)
* Only 1 GPU per process. There is a wiki page on using multiple processes for multiple GPUs
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
  * Shared variables with float32 dtype are by default moved to the GPU memory space

* Use the Theano flag ``floatX=float32``

  * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  * Cast inputs before putting them into a shared variable
  * Cast "problem": int32 with float32 gives float64

    * Insert a manual cast in your code or use [u]int{8,16}
    * The mean operator is being worked on to make the output stay in float32.

* Use the Theano flag ``force_device=True`` to exit if Theano isn't able to use a GPU.

  * Theano 0.6rc4 will have the combination of ``force_device=True``
    and ``device=cpu`` disable the GPU.

Exercise 3
-----------
* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
* Time with: ``time python file.py``
Symbolic variables
------------------
* Number of dimensions

  * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype

  * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * tt.vector defaults to the floatX dtype
  * floatX: configurable dtype that can be float32 or float64.

* Custom variable

  * All are shortcuts to: ``tt.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtype: uint[8,16,32,64], floatX

Creating symbolic variables: Broadcastability
* Remember what I said about broadcasting?
* How to add a row to all rows of a matrix?
* How to add a column to all columns of a matrix?
Details regarding symbolic broadcasting...
* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are: **tt.row** and **tt.col**
* For all others: ``tt.tensor(dtype, broadcastable=([False or True])*nd)``
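
The two questions above (adding a row to all rows, adding a column to all columns) can be answered with a small plain-Python model of broadcasting. ``broadcast_add`` is a hypothetical helper working on nested lists; it only mimics the rule that a dimension of length 1 is repeated, and it is not how Theano or NumPy implement it.

.. code-block:: python

    matrix = [[1, 2, 3],
              [4, 5, 6]]          # shape (2, 3)
    row = [[10, 20, 30]]          # shape (1, 3): first dimension is broadcastable
    col = [[100], [200]]          # shape (2, 1): second dimension is broadcastable

    def broadcast_add(a, b):
        """Elementwise a + b where b may have length 1 along either dimension."""
        n_rows, n_cols = len(a), len(a[0])
        b_rows, b_cols = len(b), len(b[0])
        if b_rows not in (1, n_rows) or b_cols not in (1, n_cols):
            raise ValueError("shapes do not broadcast")
        return [[a[i][j] + b[i if b_rows > 1 else 0][j if b_cols > 1 else 0]
                 for j in range(n_cols)]
                for i in range(n_rows)]

    print(broadcast_add(matrix, row))  # [[11, 22, 33], [14, 25, 36]]
    print(broadcast_add(matrix, col))  # [[101, 102, 103], [204, 205, 206]]

In this picture, ``row`` plays the role of a **tt.row** variable and ``col`` the role of a **tt.col** variable.
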
Differentiation details
-----------------------
>>> gw,gb = tt.grad(cost, [w,b])
* tt.grad works symbolically: takes and returns a Theano variable
* tt.grad can be compared to a macro: it can be applied multiple times
* tt.grad takes scalar costs only
* A simple recipe allows efficient computation of vector x Jacobian and vector x Hessian
* We are working on the missing optimizations needed to efficiently compute the full Jacobian and Hessian, and Jacobian x vector
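
The macro-like behaviour of a symbolic gradient (it takes an expression and returns a new expression, so it can be applied again) can be shown with a very small expression tree. Everything below (``Expr``, ``Const``, ``Var``, ``Add``, ``Mul``, ``wrap``, ``grad``) is a hypothetical toy written for this illustration; it supports only addition and multiplication and is not how ``tt.grad`` is implemented.

.. code-block:: python

    class Expr(object):
        def __add__(self, other): return Add(self, wrap(other))
        __radd__ = __add__
        def __mul__(self, other): return Mul(self, wrap(other))
        __rmul__ = __mul__

    class Const(Expr):
        def __init__(self, value): self.value = value
        def eval(self, env): return self.value
        def grad(self, wrt): return Const(0.0)

    class Var(Expr):
        def __init__(self, name): self.name = name
        def eval(self, env): return env[self.name]
        def grad(self, wrt): return Const(1.0 if self is wrt else 0.0)

    class Add(Expr):
        def __init__(self, a, b): self.a, self.b = a, b
        def eval(self, env): return self.a.eval(env) + self.b.eval(env)
        def grad(self, wrt): return self.a.grad(wrt) + self.b.grad(wrt)

    class Mul(Expr):
        def __init__(self, a, b): self.a, self.b = a, b
        def eval(self, env): return self.a.eval(env) * self.b.eval(env)
        def grad(self, wrt):                      # product rule
            return self.a.grad(wrt) * self.b + self.a * self.b.grad(wrt)

    def wrap(value):
        return value if isinstance(value, Expr) else Const(float(value))

    def grad(cost, wrt):
        """Takes an expression and returns a new expression."""
        return cost.grad(wrt)

    x = Var("x")
    cost = x * x * x + 2 * x          # x**3 + 2x
    d1 = grad(cost, x)                # 3x**2 + 2
    d2 = grad(d1, x)                  # 6x (grad applied to its own output)

    print(cost.eval({"x": 3.0}))  # 33.0
    print(d1.eval({"x": 3.0}))    # 29.0
    print(d2.eval({"x": 3.0}))    # 18.0
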
TODO: update the benchmark
Benchmarks
----------
Example:
* Multi-layer perceptron
* Convolutional Neural Networks
* Misc Elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
* numexpr: similar to Theano, 'virtual machine' for elemwise expressions
**Multi-Layer Perceptron**:
60x784 matrix times 784x500 matrix, tanh, times 500x10 matrix, elemwise, then all in reverse for backpropagation
.. image:: ../hpcs2011_tutorial/pics/mlp.png
**Convolutional Network**:
256x256 images convolved with 6 7x7 filters,
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filter, elementwise
tanh, matrix multiply, softmax elementwise, then in reverse
.. image:: ../hpcs2011_tutorial/pics/conv.png
**Elemwise**
* All on CPU
* Solid blue: Theano
* Dashed Red: numexpr (without MKL)
.. image:: ../hpcs2011_tutorial/pics/multiple_graph.png