Commit 24bcbf36, authored by Frederic Bastien

First version of Open Machine Learning workshop presentation.

Parent commit: 93be9cb8
.. _omlw2014_libgpundarray:

*************
libGpuNdArray
*************
Why a common GPU ndarray?
-------------------------

- Currently there are at least 4 different GPU array data structures in use by Python packages:
  CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat), GPUArray (PyOpenCL), ...
- There are even more if we include other languages
- All of them implement a subset of the functionality of ``numpy.ndarray`` on the GPU
- Lots of duplicated effort: GPU code is harder/slower to get **correct** and **fast** than CPU/Python code
- Lack of a common array API makes it harder to port/reuse code, and harder to find/distribute code
- Divides development work
Design Goals
------------

- Make it VERY similar to ``numpy.ndarray``
- Be compatible with both CUDA and OpenCL
- Make the base object accessible from C to allow collaboration with more projects, across high-level languages
- We want people using C, C++, Lua, Ruby, R, ... to all use the same base GPU N-dimensional array
Final Note
----------

TODO: update

- Usable, but under development
- Is the next GPU array container for Theano
- Mailing list: http://lists.tiker.net/listinfo/gpundarray
.. _omlw2014_index:

===========================
Theano Tutorial @ OMLW 2014
===========================

August 22, 2014, New York University, US.

This presentation covers the Theano and Pylearn2 software stack for
machine learning.
It complements the Python numeric/scientific software stack (e.g. NumPy, SciPy,
scikits, matplotlib, PIL).
Theano
======

Theano is software for evaluating and manipulating complicated array
expressions.

What does it do?

* aggressive expression optimizations,
* automatic GPU use,
* automatic symbolic differentiation, Jacobian and Hessian computation,
  and the R/L operators (for Hessian-free optimization).

Its design and feature set have been driven by machine learning research
at the University of Montreal (the groups of Yoshua Bengio, Pascal Vincent,
Aaron Courville and Roland Memisevic).

The result is a very good library for doing research in deep
learning and neural network training, and a flexible framework for
many other models and algorithms in machine learning more generally.
# TODO UPDATE
It has proven to be useful for implementing:

- linear and nonlinear neural network classifiers
- convolutional models
- energy models: RBM, DBN, GRBM, ssRBM, AIS
- auto-encoders: DAE, CAE
- GP regression
- sparse coding
- recurrent neural networks, echo state networks (HMM?)
- online and batch learning and optimization
- even SVMs!

As people's needs change this list will grow, but Theano is built
around vector, matrix, and tensor expressions; there is little reason
to use it for calculations on other data structures. There is
also sparse matrix support.
Pylearn2
========

Pylearn2 is still undergoing rapid development. Don't expect a clean
road without bumps! It is made for machine learning
practitioners/researchers first.

Pylearn2 is a machine learning library. Most of its functionality is
built on top of Theano. This means you can write Pylearn2 plugins (new
models, algorithms, etc.) using mathematical expressions, and Theano
will optimize and stabilize those expressions for you, and compile
them to a backend of your choice (CPU or GPU).
Pylearn2 Vision
---------------

* Researchers add features as they need them. We avoid getting bogged down by
  too much top-down planning in advance.
* A machine learning toolbox for easy scientific experimentation.
* All models/algorithms published by the LISA lab should have reference
  implementations in Pylearn2.
* Pylearn2 may wrap other libraries such as scikits.learn when this is practical.
* Pylearn2 differs from scikits.learn in that Pylearn2 aims to provide great
  flexibility and make it possible for a researcher to do almost anything,
  while scikits.learn aims to work as a "black box" that can produce good
  results even if the user does not understand the implementation.
* Dataset interface for vectors, images, video, ...
* Small framework providing everything needed for typical
  MLP/RBM/SDA/convolution experiments.
* *Easy reuse* of Pylearn2 sub-components.
* Using one sub-component of the library does not force you to use or learn
  all of the other sub-components.
* Support for cross-platform serialization of learned models.
* Remains approachable enough to be used in the classroom.
Contents
========

The structured part of these lab sessions will be a walk-through of the following
material. Interleaved with this structured part will be blocks of time for
individual or group work. The idea is that you can try out Theano and get help
from gurus on hand if you get stuck.

.. toctree::

   introduction
   theano
   pylearn2
   gpundarray
.. _omlw2014_Introduction:

************
Introduction
************

Python in one slide
-------------------

* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic typing and memory management
* Built-in types: int, float, str, list, dict, tuple, object
* Slow execution
* Popular in *web-dev* and *scientific communities*
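A tiny illustration of the built-in types and dynamic typing in action:

.. code-block:: python

    # Built-in types and dynamic typing
    n = 42                      # int
    pi = 3.14                   # float
    s = "hello"                 # str
    xs = [n, pi, s]             # a list can mix types
    d = {"answer": n}           # dict
    t = (1, 2)                  # tuple
    x = 1                       # names are dynamically typed:
    x = "now a string"          # rebinding to another type is fine
    assert isinstance(xs[2], str) and d["answer"] == 42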
NumPy in one slide
------------------

* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* NumPy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing
  * Slices return views (no copy)

* NumPy provides

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* SciPy provides lots more, including

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio
.. code-block:: python

    ##############################
    # Properties of NumPy arrays
    # that you really need to know
    ##############################
    import numpy as np           # import can rename
    a = np.random.rand(3, 4, 5)  # random generators
    a32 = a.astype('float32')    # arrays are strongly typed
    a.ndim     # int: 3
    a.shape    # tuple: (3, 4, 5)
    a.size     # int: 60
    a.dtype    # np.dtype object: 'float64'
    a32.dtype  # np.dtype object: 'float32'
    assert a[1, 1, 1] != 10  # almost surely true for random a
    a[1, 1, 1] = 10          # assignment writes into
    assert a[1, 1, 1] == 10  # the original array
Arrays can be combined with numeric operators and standard mathematical
functions. NumPy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
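As a quick illustration of the two points above (elementwise operators, and slices being views rather than copies):

.. code-block:: python

    import numpy as np

    a = np.arange(6.0).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
    b = a * 2 + 1                     # elementwise: [[1, 3, 5], [7, 9, 11]]
    assert b[1, 2] == 11

    row = a[0]               # a view, not a copy
    row[0] = 100.0           # writing through the view...
    assert a[0, 0] == 100.0  # ...changes the original array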
Training an MNIST-ready classification neural network in pure NumPy might look like this:
.. code-block:: python

    #########################
    # NumPy for Training a
    # Neural Network on MNIST
    #########################
    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    w = np.random.normal(loc=0, scale=.1, size=(784, 500))
    b = np.zeros((500,))
    v = np.zeros((500, 10))
    c = np.zeros((10,))
    lr = 0.01  # learning rate (value assumed; not given in the original)
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)
        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin) + 1) / 2.0
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout ** 2)
        g_outin = g_outout * outout * (1.0 - outout)
        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout ** 2)
        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)
What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance
* NumPy is bound to the CPU
* NumPy lacks symbolic or automatic differentiation

Now let's have a look at the same algorithm in Theano, which runs 15 times faster if
you have a GPU (I'm skipping some dtype details which we'll come back to).
.. code-block:: python

    #########################
    # Theano for Training a
    # Neural Network on MNIST
    #########################
    import numpy as np
    import theano
    import theano.tensor as tensor

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    lr = 0.01  # learning rate (value assumed; not given in the original)

    # symbol declarations
    sx = tensor.matrix()
    sy = tensor.matrix()
    w = theano.shared(np.random.normal(loc=0, scale=.1,
                                       size=(784, 500)))
    b = theano.shared(np.zeros(500))
    v = theano.shared(np.zeros((500, 10)))
    c = theano.shared(np.zeros(10))

    # symbolic expression-building
    hid = tensor.tanh(tensor.dot(sx, w) + b)
    out = tensor.tanh(tensor.dot(hid, v) + c)
    err = 0.5 * tensor.sum((out - sy) ** 2)
    gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])

    # compile a fast training function
    train = theano.function([sx, sy], err,
                            updates={
                                w: w - lr * gw,
                                b: b - lr * gb,
                                v: v - lr * gv,
                                c: c - lr * gc})

    # now do the computations
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function calls -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array-oriented -> parallelizable across cores
  * Support for looping and branching in expressions

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance

  * FFTW, MKL, ATLAS, SciPy, Cython, CUDA
  * Slower fallbacks always available

* Automatic differentiation and R op
* Sparse matrices
Project status
--------------

* Mature: Theano has been developed and used since January 2008 (6.5 years old)
* Has driven over 100 research papers
* Good user documentation
* Active mailing list with participants from outside our lab
* Core technology for a few funded Silicon Valley startups
* Many contributors (some from outside our lab)
* Used to teach many university classes
* Used for research at Google and Yahoo
* Downloads

  * PyPI (August 18th 2014, the last release): 255 last day, 2140 last week, 9145 last month
  * GitHub (the `bleeding edge` repository, the one recommended): unknown
  * GitHub stats?????
Why scripting for GPUs?
-----------------------

They *complement each other*:

* GPUs are everything that scripting/high-level languages are not

  * Highly parallel
  * Very architecture-sensitive
  * Built for maximum FP/memory throughput
  * So hard to program that meta-programming is easier

* CPU: largely restricted to control

  * Optimized for sequential code and low latency (rather than high throughput)
  * Tasks (1000/sec)
  * Scripting is fast enough

Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability that target = 1
    prediction = p_1 > 0.5                     # The prediction thresholded
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy loss
    cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to minimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates=[(w, w - 0.1 * gw),
                 (b, b - 0.1 * gb)],
        name='train')
    predict = theano.function(inputs=[x], outputs=prediction,
                              name='predict')

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
.. _omlw2014_pylearn2:

********
Pylearn2
********

Pointers
--------

TODO:

* http://deeplearning.net/software/pylearn2/
* User mailing list: http://groups.google.com/group/pylearn-users
* Dev mailing list: http://groups.google.com/group/pylearn-dev
* Installation: http://deeplearning.net/software/pylearn2/index.html#download-and-installation
Description
-----------

TODO:

* ...
Simple example
--------------

(logistic regression?) TODO

Real example
------------

(maxout?) TODO

Known limitations
-----------------

TODO

* It is getting stabilized, but is still heavily modified.
.. _omlw2014_theano:

******
Theano
******

Pointers
--------

* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html

Description
-----------

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.
  * Also supports computation of the Jacobian, Hessian, and R and L operators.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if ``x`` is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Still in an experimental state on Windows

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* (TODO REMOVE?) On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Is used with other technologies to generate fast code: C/C++, CUDA, OpenCL, PyCUDA, Cython, Numba, ...
* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Sparse operations (CPU only)
Simple example
--------------

>>> import theano
>>> a = theano.tensor.vector("a")  # declare symbolic variable
>>> b = a + a ** 10                # build symbolic expression
>>> f = theano.function([a], b)    # compile function
>>> print f([0, 1, 2])             # prints `array([0, 2, 1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming = *paradigm shift*: people need to use it to understand it.

Exercise 1
----------

.. code-block:: python

    import theano
    a = theano.tensor.vector()     # declare variable
    out = a + a ** 10              # build symbolic expression
    f = theano.function([a], out)  # compile function
    print f([0, 1, 2])
    # prints `array([0, 2, 1026])`

    theano.printing.pydotprint_variables(out, outfile="f_unoptimized.png", var_with_name_simple=True)
    theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)

Modify and execute the example to compute the expression ``a ** 2 + b ** 2 + 2 * a * b``.
Real example
------------

**Logistic Regression**

* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations

.. literalinclude:: logreg.py

**Optimizations:**

Where are those optimizations applied?

* ``log(1+exp(x))``
* ``1 / (1 + tt.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilization)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion

.. code-block:: python

    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
    # 1 / (1 + tt.exp(var)) -> sigmoid(var)
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
    # log(1 - sigmoid(var)) -> -softplus(var)
    prediction = p_1 > 0.5
    cost = xent.mean() + 0.01 * (w ** 2).sum()
    gw, gb = tt.grad(cost, [w, b])
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        # w - 0.1 * gw: GEMV with the dot in the grad
        updates=[(w, w - 0.1 * gw),
                 (b, b - 0.1 * gb)])
Theano flags
------------

Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="floatX=float32,profile=True"``
* With a configuration file that defaults to ``~/.theanorc``
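For instance, a ``~/.theanorc`` equivalent to the flags shown above might look like this (a sketch; see the Theano configuration documentation for the full set of options):

.. code-block:: ini

    [global]
    floatX = float32
    profile = True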
Exercise 2
----------

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction: 0 or 1
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to optimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
        name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()
    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                                         outfile="pics/logreg_pydotprint_prediction.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)

Modify and execute the example to run on CPU with ``floatX=float32``.

* You will need to use ``theano.config.floatX`` and ``ndarray.astype("str")``
GPU
---

* Only 32-bit floats are supported (being worked on)
* Only 1 GPU per process

  * There is a wiki page on using multiple processes for multiple GPUs

* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to GPU memory
* Use the Theano flag ``floatX=float32``

  * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  * Cast inputs before putting them into a shared variable
  * Cast "problem": int32 combined with float32 gives float64

    * Insert manual casts in your code or use [u]int{8,16}
    * The mean operator is being worked on so that its output stays in float32

* Use the Theano flag ``force_device=True`` to exit if Theano isn't able to use a GPU

  * In Theano 0.6rc4, the combination of ``force_device=True``
    and ``device=cpu`` disables the GPU
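The dtype-promotion pitfall above can be seen directly in NumPy, whose casting rules Theano follows (a minimal illustration):

.. code-block:: python

    import numpy as np

    i = np.zeros(3, dtype='int32')
    f = np.zeros(3, dtype='float32')
    assert (i + f).dtype == np.float64  # silently upcast to float64!
    # A manual cast keeps the computation in float32:
    assert (i.astype('float32') + f).dtype == np.float32
    # Small integer types do not trigger the upcast:
    assert (np.zeros(3, dtype='int16') + f).dtype == np.float32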
Symbolic variables
------------------

* Number of dimensions

  * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype

  * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * tt.vector defaults to the floatX dtype

    * floatX: a configurable dtype that can be float32 or float64

* Custom variables

  * All of the above are shortcuts for ``tt.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX

Creating symbolic variables: broadcastability

* Remember what I said about broadcasting?
* How do you add a row to all rows of a matrix?
* How do you add a column to all columns of a matrix?

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **tt.row** and **tt.col**
* For all others: ``tt.tensor(dtype, broadcastable=([False or True])*nd)``
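The row/column questions above have a one-line answer via broadcasting. Here is the NumPy version, which mirrors what ``tt.row`` and ``tt.col`` enable symbolically:

.. code-block:: python

    import numpy as np

    m = np.zeros((3, 4))
    row = np.arange(4).reshape(1, 4)  # shape (1, 4): a "row", like tt.row
    col = np.arange(3).reshape(3, 1)  # shape (3, 1): a "column", like tt.col
    assert (m + row).shape == (3, 4)  # row is broadcast over every row
    assert (m + col).shape == (3, 4)  # col is broadcast over every column
    assert (m + row)[2, 3] == 3       # each row received a copy of `row`
    assert (m + col)[2, 3] == 2       # each column received a copy of `col`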
Differentiation details
-----------------------

>>> gw, gb = tt.grad(cost, [w, b])

* tt.grad works symbolically: it takes and returns Theano variables
* tt.grad can be compared to a macro: it can be applied multiple times
* tt.grad takes scalar costs only
* A simple recipe allows efficient computation of vector x Jacobian and vector x Hessian products
* TODO update: We are working on the missing optimizations to efficiently compute the full Jacobian and Hessian, and Jacobian x vector products
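To illustrate the vector x Jacobian recipe, here is a plain NumPy sketch (illustrative names, not Theano API): for f(x) = tanh(Wx), the product vᵀJ can be formed without ever materializing the Jacobian J, which is exactly what makes such products cheap.

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(0)
    W = rng.randn(3, 4)
    x = rng.randn(4)
    v = rng.randn(3)

    # f(x) = tanh(W x); its Jacobian J = diag(1 - y**2) W has shape (3, 4)
    y = np.tanh(W.dot(x))
    J = (1 - y ** 2)[:, None] * W    # full Jacobian, built only for checking
    vjp = ((1 - y ** 2) * v).dot(W)  # v^T J without materializing J
    assert np.allclose(vjp, v.dot(J))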
New Benchmarks (REMOVE???)
--------------------------

`Example <http://arxiv.org/pdf/1211.5590v1.pdf>`_ (pages 7 and 9):

* Logistic regression, MLPs with 1 and 3 layers
* Recurrent neural networks

Competitors: Torch7, RNNLM

* Torch7, RNNLM: specialized libraries written by practitioners specifically for these tasks
OLD advanced presentation

- compilation pipeline
- inplace optimization
- conditions
- loops/rnn
- debugging support
- profiling support
Known limitations
-----------------

- Compilation phase distinct from execution phase

  - Use ``a_tensor_variable.eval()`` to make this less visible

- Compilation time can be significant

  - Amortize it with functions over big inputs, or reuse functions

- Execution overhead

  - We have worked on this, but more work is needed
  - So a function needs to do a certain amount of work per call to be useful

- Compilation time is superlinear in the size of the graph

  - Hundreds of nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually too many nodes indicates a problem with the graph