.. _omlw2014_libgpuarray:
***********
libgpuarray
***********
Why a common GPU ndarray?
-------------------------
- Currently there are at least 4 different GPU array data structures in use
  by Python packages

  - CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat),
    GPUArray (PyOpenCL), ...
  - There are even more if we include other languages

- All of them are a subset of the functionality of ``numpy.ndarray`` on the GPU
- Lots of duplicated effort

  - GPU code is harder/slower to write **correctly** and **fast** than
    CPU/Python code

- Lack of a common array API makes it harder to port/reuse code

  - Also harder to find/distribute code

- Divides development work
Design Goals
------------
- Make it VERY similar to ``numpy.ndarray``
- Be compatible with both CUDA and OpenCL
- Have the base object accessible from C to allow collaboration with more
  projects, across high-level languages

  - We want people from C, C++, Lua, Ruby, R, ... to all use the same base
    GPU N-dimensional array
Final Note
----------
- Usable directly, but not all functionality is implemented yet
- It is the next GPU array container for Theano and is already working
  (though not every operation is available yet)
- Mailing list: http://lists.tiker.net/listinfo/gpundarray
.. _omlw2014_index:
======================================================
Theano, Pylearn2, libgpuarray Presentation @ OMLW 2014
======================================================
August 22, 2014, New York University, US.
By Frédéric Bastien and Bart van Merriënboer. University of Montréal, Canada.
Theano, Pylearn2 and libgpuarray form a software stack for machine learning.
It complements the Python numeric/scientific software stack (e.g. NumPy, SciPy,
scikits, matplotlib, PIL).
Theano
======
Theano is software for evaluating and manipulating complicated array
expressions.
What does it do?
* aggressive expression optimizations,
* automatic GPU use,
* automatic symbolic differentiation, Jacobian and Hessian computation,
  and the R/L operators (for Hessian-free optimization).
The design and feature set have been driven by machine learning research
at the University of Montreal (the groups of Yoshua Bengio, Pascal Vincent,
Aaron Courville and Roland Memisevic).
The result is a very good library for doing research in deep
learning and neural network training, and a flexible framework for
many other models and algorithms in machine learning more generally.
It has proven to be useful for implementing:
- linear and nonlinear neural network classifiers

  - including Maxout and Dropout

- convolutional models
- Energy models: RBM, DBN, GRBM, ssRBM, AIS
- Auto-encoders: DAE, CAE
- GP regression
- sparse coding
- recurrent neural networks, echo state networks
- online and batch learning and optimization
- Even SVMs!
As people's needs change this list will grow, but Theano is built
around vector, matrix, and tensor expressions. It also supports sparse matrices.
Pylearn2
========
Pylearn2 is undergoing rapid development. Don't expect a clean
road without bumps! It is made for machine learning
practitioners and researchers first.
Pylearn2 is a machine learning library. Most of its functionality is
built on top of Theano. This means you can write Pylearn2 plugins (new
models, algorithms, etc) using mathematical expressions, and Theano
will optimize and stabilize those expressions for you, and compile
them to a backend of your choice (CPU or GPU).
Pylearn2 Vision
---------------
* Researchers **add features as they need them**. We avoid getting bogged
  down by too much top-down planning in advance.
* A machine learning toolbox for **easy scientific experimentation**.
* All models/algorithms published by the LISA lab should have reference
  implementations in Pylearn2.
* Pylearn2 **may wrap other libraries** such as scikits.learn when this is practical
* Pylearn2 **differs from scikits.learn** in that Pylearn2 aims to provide great
  flexibility and make it possible for a researcher to do almost anything,
  while **scikits.learn aims to work as a "black box"**.
* **Dataset interface** for vectors, images, ...
* Small framework providing all that is needed for typical
  MLP/RBM/SDA/convolution experiments.
* **Easy reuse of sub-components** of Pylearn2.
* Using one sub-component of the library does not force you to use or learn
  all of the other sub-components.
* Aims to support cross-platform serialization of learned models.
* Remain approachable enough to be used in the classroom
libgpuarray
===========
A common GPU ndarray (vector, matrix, or n-dimensional array) that can be
reused by all projects. It supports CUDA and OpenCL.
Motivation
----------
* Currently there are at least 6 different GPU arrays in Python

  * CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat),
    GPUArray (PyOpenCL), Clyther, Copperhead, ...
  * There are even more if we include other languages.

* They are incompatible

  * None have the same properties and interface.

* All of them are a subset of ``numpy.ndarray`` on the GPU!
Design Goals
------------
* Have the base object in C to allow collaboration with more projects.

  * We want people from C, C++, Ruby, R, ... to all use the same base GPU ndarray.

* Be compatible with CUDA and OpenCL.
* Not too simple (not restricted to matrices).
* But still easy to develop new code that supports only a few memory layouts;
  this eases the development of new code.
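
As a taste, here is a minimal sketch of what using the Python bindings
(pygpu) could look like. This assumes the pygpu API (``pygpu.init``,
``pygpu.array``); names may differ between versions, so check the
documentation:

.. code-block:: python

    import numpy as np
    import pygpu

    ctx = pygpu.init('cuda0')  # or 'opencl0:0' for an OpenCL device
    # upload a NumPy array to the GPU
    a = pygpu.array(np.random.rand(3, 4).astype('float32'), context=ctx)
    b = (a + a) * 2            # elementwise operations run on the GPU
    print np.asarray(b)        # copy the result back to a NumPy array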
Contents
========
.. toctree::

   introduction
   theano
   pylearn2
   gpundarray
   sharing
.. _omlw2014_Introduction:
************
Introduction
************
Python in one slide
-------------------
* General-purpose high-level **OO interpreted language**
* Emphasizes **code readability**
* Comprehensive standard library
* Dynamic type and memory management
* Built-in types: int, float, str, list, dict, tuple, object
* Slow execution
* Popular in **web-dev** and **scientific communities**
NumPy in one slide
------------------
* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* NumPy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing.
  * Slices return views (no copy)

* NumPy provides

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* SciPy provides lots more, including

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio
.. code-block:: python

    ##############################
    # Properties of NumPy arrays
    # that you really need to know
    ##############################
    import numpy as np          # import can rename
    a = np.random.rand(3, 4, 5) # random generators
    a32 = a.astype('float32')   # arrays are strongly typed
    a.ndim                      # int: 3
    a.shape                     # tuple: (3, 4, 5)
    a.size                      # int: 60
    a.dtype                     # np.dtype object: 'float64'
    a32.dtype                   # np.dtype object: 'float32'
    b = a[1]                    # slices are views, not copies
    b[1, 1] = 10                # so assigning through a view
    assert a[1, 1, 1] == 10     # changes the original array
Arrays can be combined with numeric operators and standard mathematical
functions. NumPy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
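
Broadcasting, in particular, lets operators combine arrays of different
shapes; a minimal sketch (we will come back to broadcasting when discussing
symbolic variables):

.. code-block:: python

    import numpy as np

    m = np.ones((3, 4))            # matrix
    r = np.arange(4)               # shape (4,): broadcast over rows
    c = np.arange(3).reshape(3, 1) # shape (3, 1): broadcast over columns
    m + r                          # r is added to every row of m
    m + c                          # c is added to every column of m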
What's missing?
---------------
* Non-lazy evaluation (required by Python) hurts performance
* NumPy is bound to the CPU
* NumPy lacks symbolic or automatic differentiation
A quick look at a small example:
.. code-block:: python

    #########################
    # Theano for Training a
    # Neural Network on MNIST
    #########################
    import numpy as np
    import theano
    import theano.tensor as tensor

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = tensor.matrix()
    sy = tensor.matrix()
    w = theano.shared(np.random.normal(loc=0, scale=.1,
                                       size=(784, 500)))
    b = theano.shared(np.zeros(500))
    v = theano.shared(np.zeros((500, 10)))
    c = theano.shared(np.zeros(10))

    # symbolic expression-building
    hid = tensor.tanh(tensor.dot(sx, w) + b)
    out = tensor.tanh(tensor.dot(hid, v) + c)
    err = 0.5 * tensor.sum((out - sy) ** 2)
    gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])

    # compile a fast training function
    lr = 0.01  # learning rate
    train = theano.function([sx, sy], err,
                            updates={
                                w: w - lr * gw,
                                b: b - lr * gb,
                                v: v - lr * gv,
                                c: c - lr * gc})

    # now do the computations
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------
* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function calls -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array oriented -> easy parallelism
  * Support for looping and branching in expressions

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance

  * BLAS, SciPy, Cython, CUDA
  * Slower fallbacks always available
* Automatic differentiation and R op
* Sparse matrices
Project status
--------------
* Mature: Theano has been developed and used since January 2008 (6.5 years old)
* Has driven over 100 research papers
* Good user documentation
* Active mailing list with participants from outside our lab
* Core technology for a few Silicon Valley startups
* Many contributors (some from outside our lab)
* Used to teach many university classes
* Used for research at Google and Yahoo.
Pylearn2 in one slide
---------------------
Pylearn2 is a machine learning library built on top of Theano: models and
algorithms are written as mathematical expressions, and Theano optimizes,
stabilizes and compiles them for CPU or GPU.
Other global information
------------------------
Theano uses small basic operations, not layers, as its building blocks:

* Easy reuse
* No need to reimplement the gradient for each layer variant

This could cause slowness (more small operations), but the optimizer fixes
that; see the sketch below.

Pylearn2 wraps the small operations into layers, like other projects do:

* There is no overhead to this extra layer, thanks to the compilation of the
  function by Theano.
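
A minimal sketch of a "layer" built directly from Theano's small operations;
its gradient needs no extra code, whatever the variation:

.. code-block:: python

    import numpy as np
    import theano
    import theano.tensor as tt

    w = theano.shared(np.zeros((784, 500)), name='w')
    b = theano.shared(np.zeros(500), name='b')
    x = tt.matrix('x')

    # a "layer" is just a composition of small operations
    h = tt.tanh(tt.dot(x, w) + b)
    # the gradient comes for free via symbolic differentiation
    g = tt.grad(h.sum(), w)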
Why scripting for GPUs?
-----------------------
They *complement each other*:

* GPUs are everything that scripting/high-level languages are not

  * Highly parallel
  * Very architecture-sensitive
  * Built for maximum FP/memory throughput
  * So hard to program that meta-programming is easier

* CPU: largely restricted to control

  * Optimized for sequential code and low latency (rather than high throughput)
  * Tasks (1000/sec)
  * Scripting is fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
.. _omlw2014_pylearn2:
********
Pylearn2
********
Pointers
--------
* http://deeplearning.net/software/pylearn2/
* User mailing list: http://groups.google.com/group/pylearn-users
* Dev mailing list: http://groups.google.com/group/pylearn-dev
* Installation: http://deeplearning.net/software/pylearn2/index.html#download-and-installation
Description
-----------
Pylearn2 is a machine learning library built on top of Theano. You can write
Pylearn2 plugins (new models, algorithms, etc.) using mathematical
expressions, and Theano will optimize and stabilize those expressions and
compile them to a backend of your choice (CPU or GPU).
Simple example
--------------
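A minimal sketch of a softmax (logistic) regression experiment using the
Pylearn2 Python interface. The class names and parameters below are
assumptions based on the 2014-era library (the canonical interface is
YAML-based), so check the documentation for the exact API:

.. code-block:: python

    from pylearn2.datasets.mnist import MNIST
    from pylearn2.models.softmax_regression import SoftmaxRegression
    from pylearn2.training_algorithms.sgd import SGD
    from pylearn2.termination_criteria import EpochCounter
    from pylearn2.train import Train

    # model, training algorithm and dataset are independent plugins
    model = SoftmaxRegression(nvis=784, n_classes=10, irange=0.)
    algorithm = SGD(batch_size=100, learning_rate=1e-3,
                    termination_criterion=EpochCounter(max_epochs=10))
    Train(dataset=MNIST(which_set='train'),
          model=model, algorithm=algorithm).main_loop()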
Real example
------------
(Maxout?) TODO
Known limitations
-----------------
* It is stabilizing, but the library is still heavily modified.
.. _omlw2014_sharing:
************
Sharing code
************
* License (BSD 3-clause suggested; don't forget to add the license info in the code)
* Common base object? libgpuarray.
* If not, do you have an important implementation that uses raw pointers/shapes?
  Document that interface.
* Important: add an *acknowledgement section on the web site* (citation-like)
  AND *in papers* for the software we reuse (and use)!
*************
Theano future
*************
.. _omlw2014_theano:
******
Theano
******
Pointers
--------
* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html
Description
-----------
* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.
  * Also supports computation of the Jacobian, Hessian, and the R and L operators.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Still in an experimental state on Windows

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than
  competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Is used with other technologies to generate fast code: C/C++, CUDA, OpenCL,
  PyCUDA, Cython, Numba, ...
* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Sparse operations (CPU only)
Simple example
--------------
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0, 1, 2]) # prints `array([0, 2, 1026])`
====================================================== ====================================================
Unoptimized graph                                      Optimized graph
====================================================== ====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== ====================================================
Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
-----------
.. code-block:: python

    import theano
    a = theano.tensor.vector()      # declare variable
    out = a + a ** 10               # build symbolic expression
    f = theano.function([a], out)   # compile function
    print f([0, 1, 2])
    # prints `array([0, 2, 1026])`

    theano.printing.pydotprint_variables(out, outfile="f_unoptimized.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(f, outfile="f_optimized.png",
                               var_with_name_simple=True)
Modify and execute the example to do this expression: ``a ** 2 + b ** 2 + 2 * a * b``
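
One possible solution, as a sketch:

.. code-block:: python

    import theano
    import theano.tensor as tt

    a = tt.vector()                    # declare variables
    b = tt.vector()
    out = a ** 2 + b ** 2 + 2 * a * b  # build symbolic expression
    f = theano.function([a, b], out)   # compile function
    print f([1, 2], [3, 4])            # prints `array([ 16.,  36.])`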
Real example
------------
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. literalinclude:: logreg.py
**Optimizations:**
Where are those optimizations applied?
* ``log(1+exp(x))``
* ``1 / (1 + tt.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilization)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion
.. code-block:: python

    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
    # 1 / (1 + tt.exp(-var)) -> sigmoid(var)
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
    # log(1 - sigmoid(var)) -> -softplus(var) (numerically stable)
    prediction = p_1 > 0.5
    cost = xent.mean() + 0.01 * (w ** 2).sum()
    gw, gb = tt.grad(cost, [w, b])
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        # w - 0.1 * gw: GEMV with the dot in the grad
        updates=[(w, w - 0.1 * gw),
                 (b, b - 0.1 * gb)])
Theano flags
------------
Theano can be configured with flags, which can be defined in two ways:
* With an environment variable: ``THEANO_FLAGS="floatX=float32,profile=True"``
* With a configuration file that defaults to ``~/.theanorc``
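
For example, the same flags set via the configuration file:

.. code-block:: ini

    [global]
    floatX = float32
    profile = True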
Exercise 2
-----------
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction that is made: 0 or 1
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to optimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
        name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()
    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                                         outfile="pics/logreg_pydotprint_prediction.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)
Modify and execute the example to run on CPU with ``floatX=float32``.

* You will need to use ``theano.config.floatX`` and ``ndarray.astype(...)``
GPU
---
* Only 32-bit floats are supported (other dtypes are being worked on)
* Only 1 GPU per process (see the wiki page on using multiple processes for
  multiple GPUs)
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to the GPU memory space
* Use the Theano flag ``floatX=float32``

  * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  * Cast inputs before putting them into a shared variable
  * Cast "problem": int32 combined with float32 upcasts to float64
    (see the sketch after this list)

    * Insert manual casts in your code or use [u]int{8,16}
    * Work is ongoing to make the mean operator keep its output in float32

* Use the Theano flag ``force_device=True`` to exit if Theano isn't able to
  use a GPU

  * In Theano 0.6rc4, the combination of ``force_device=True`` and
    ``device=cpu`` disables the GPU
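
A minimal sketch of the upcast "problem" and a manual cast:

.. code-block:: python

    import theano.tensor as tt

    i = tt.ivector()                         # int32
    f = tt.fvector()                         # float32
    print (i + f).dtype                      # 'float64': silently off the GPU
    print (tt.cast(i, 'float32') + f).dtype  # 'float32' after a manual cast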
Symbolic variables
------------------
* Number of dimensions

  * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype

  * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * tt.vector defaults to the floatX dtype

    * floatX: configurable dtype that can be float32 or float64

* Custom variables

  * All of the above are shortcuts for: ``tt.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX

Creating symbolic variables: broadcastability

* Remember what I said about broadcasting?

  * How to add a row to all rows of a matrix?
  * How to add a column to all columns of a matrix?

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **tt.row** and **tt.col**
  (see the sketch below)
* For all others: ``tt.tensor(dtype, broadcastable=(...))`` with a length-nd
  tuple of ``True``/``False``
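
A minimal sketch of adding a row to all rows of a matrix with **tt.row**:

.. code-block:: python

    import theano
    import theano.tensor as tt

    m = tt.matrix()            # broadcastable pattern: (False, False)
    r = tt.row()               # broadcastable pattern: (True, False)
    f = theano.function([m, r], m + r)
    # the single row of r is added to every row of m
    print f([[1, 2], [3, 4]], [[10, 20]])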
Differentiation details
-----------------------
>>> gw, gb = tt.grad(cost, [w, b])

* tt.grad works symbolically: it takes and returns Theano variables
* tt.grad can be compared to a macro: it can be applied multiple times
* tt.grad takes scalar costs only
* A simple recipe allows efficient computation of vector x Jacobian and
  vector x Hessian products; see the sketch below
* We are working on the missing optimizations to efficiently compute the full
  Jacobian and Hessian, and Jacobian x vector products
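
A minimal sketch of the vector x Jacobian (L-operator) and Jacobian x vector
(R-operator) products:

.. code-block:: python

    import theano
    import theano.tensor as tt

    x = tt.vector('x')
    v = tt.vector('v')
    y = x ** 2                 # elementwise, so the Jacobian is diag(2 * x)

    vJ = theano.gradient.Lop(y, x, v)  # vector x Jacobian
    Jv = theano.gradient.Rop(y, x, v)  # Jacobian x vector
    f = theano.function([x, v], [vJ, Jv])
    print f([1, 2, 3], [1, 1, 1])      # both give [2., 4., 6.] here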
Benchmarks
----------
`Example <http://arxiv.org/pdf/1211.5590v1.pdf>`_ (pages 7 and 9):
* Logistic regression, MLP with 1 and 3 layers
* Recurrent neural networks
Competitors: Torch7, RNNLM
* Torch7, RNNLM: specialized libraries written by practitioners specifically for these tasks
Topics covered in an older, more advanced presentation:

- compilation pipeline
- inplace optimization
- conditions
- loops/rnn
- debugging support
- profiling support
Known limitations
-----------------
- Compilation phase distinct from execution phase

  - Use ``a_tensor_variable.eval()`` to make this less visible
    (see the sketch after this list)

- Compilation time can be significant

  - Amortize it by applying functions to big inputs or by reusing functions

- Execution overhead

  - We have worked on this, but more work is needed
  - So a certain amount of computation per call is needed for Theano to be useful

- Compilation time is superlinear in the size of the graph

  - Hundreds of nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually too many nodes indicates a problem with the graph
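
For example, ``eval()`` hides the compile/run split for quick checks:

.. code-block:: python

    import theano.tensor as tt

    a = tt.vector('a')
    b = a + a ** 10
    # compiles (and caches) a function behind the scenes, then runs it
    print b.eval({a: [0., 1., 2.]})  # -> [0., 2., 1026.]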