Commit 24bcbf36, authored by Frederic Bastien

First version of Open Machine Learning workshop presentation.

Parent commit: 93be9cb8
.. _omlw2014_libgpundarray:

*************
libGpuNdArray
*************
Why a common GPU ndarray?
-------------------------

- Currently there are at least 4 different GPU array data structures in use by Python packages:
  CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat), GPUArray (PyOpenCL), ...
- There are even more if we include other languages
- All of them implement a subset of the functionality of ``numpy.ndarray`` on the GPU
- Lots of duplicated effort: GPU code is harder/slower to get **correct** and **fast** than CPU/Python code
- Lack of a common array API makes it harder to port/reuse code, and harder to find/distribute code
- Divides development work
Design Goals
------------

- Make it VERY similar to ``numpy.ndarray``
- Be compatible with both CUDA and OpenCL
- Make the base object accessible from C to allow collaboration with more projects, across high-level languages
- We want people using C, C++, Lua, Ruby, R, ... to all use the same base GPU N-dimensional array
Final Note
----------

TODO: update

- Usable, but under development
- Is the next GPU array container for Theano
- Mailing list: http://lists.tiker.net/listinfo/gpundarray
.. _omlw2014_index:

===========================
Theano Tutorial @ OMLW 2014
===========================

August 22, 2014, New York University, US.

This presentation covers the Theano and Pylearn2 software stack for
machine learning.
It complements the Python numeric/scientific software stack (e.g. NumPy, SciPy,
scikits, matplotlib, PIL).
Theano
======

Theano is software for evaluating and manipulating complicated array
expressions.

What does it do?

* aggressive expression optimizations,
* automatic GPU use,
* automatic symbolic differentiation, Jacobian and Hessian computation,
  and the R/L operators (for Hessian-free optimization).

Its design and feature set have been driven by machine learning research
at the University of Montreal (the groups of Yoshua Bengio, Pascal Vincent,
Aaron Courville and Roland Memisevic).

The result is a very good library for doing research in deep
learning and neural network training, and a flexible framework for
many other models and algorithms in machine learning more generally.
# TODO UPDATE
It has proven to be useful for implementing:

- linear and nonlinear neural network classifiers
- convolutional models
- energy models: RBM, DBN, GRBM, ssRBM, AIS
- auto-encoders: DAE, CAE
- GP regression
- sparse coding
- recurrent neural networks, echo state networks (HMM?)
- online and batch learning and optimization
- even SVMs!

As people's needs change this list will grow, but Theano is built
around vector, matrix, and tensor expressions; there is little reason
to use it for calculations on other data structures. There is
also sparse matrix support.
Pylearn2
========

Pylearn2 is still undergoing rapid development. Don't expect a clean
road without bumps! It is made for machine learning
practitioners/researchers first.

Pylearn2 is a machine learning library. Most of its functionality is
built on top of Theano. This means you can write Pylearn2 plugins (new
models, algorithms, etc.) using mathematical expressions, and Theano
will optimize and stabilize those expressions for you, and compile
them to a backend of your choice (CPU or GPU).
Pylearn2 Vision
---------------

* Researchers add features as they need them. We avoid getting bogged down by
  too much top-down planning in advance.
* A machine learning toolbox for easy scientific experimentation.
* All models/algorithms published by the LISA lab should have reference
  implementations in Pylearn2.
* Pylearn2 may wrap other libraries such as scikits.learn when this is practical.
* Pylearn2 differs from scikits.learn in that Pylearn2 aims to provide great
  flexibility and make it possible for a researcher to do almost anything,
  while scikits.learn aims to work as a "black box" that can produce good
  results even if the user does not understand the implementation.
* Dataset interface for vectors, images, video, ...
* Small framework providing everything needed for typical
  MLP/RBM/SDA/convolution experiments.
* *Easy reuse* of Pylearn2 sub-components.
* Using one sub-component of the library does not force you to use or learn
  all of the other sub-components.
* Support for cross-platform serialization of learned models.
* Remains approachable enough to be used in the classroom.
Contents
========

The structured part of these lab sessions will be a walk-through of the following
material. Interleaved with this structured part will be blocks of time for
individual or group work. The idea is that you can try out Theano and get help
from gurus on hand if you get stuck.

.. toctree::

   introduction
   theano
   pylearn2
   gpundarray
.. _omlw2014_Introduction:

************
Introduction
************

Python in one slide
-------------------

* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic typing and memory management
* Built-in types: int, float, str, list, dict, tuple, object
* Slow execution
* Popular in *web-dev* and *scientific communities*
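A tiny illustration of the built-in types and dynamic typing in action:

.. code-block:: python

    # Built-in types and dynamic typing
    n = 42                      # int
    pi = 3.14                   # float
    s = "hello"                 # str
    xs = [n, pi, s]             # a list can mix types
    d = {"answer": n}           # dict
    t = (1, 2)                  # tuple
    x = 1                       # names are dynamically typed:
    x = "now a string"          # rebinding to another type is fine
    assert isinstance(xs[2], str) and d["answer"] == 42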
NumPy in one slide
------------------

* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* NumPy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing
  * Slices return views (no copy)

* NumPy provides

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* SciPy provides lots more, including

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio
.. code-block:: python

    ##############################
    # Properties of NumPy arrays
    # that you really need to know
    ##############################
    import numpy as np           # import can rename
    a = np.random.rand(3, 4, 5)  # random generators
    a32 = a.astype('float32')    # arrays are strongly typed
    a.ndim     # int: 3
    a.shape    # tuple: (3, 4, 5)
    a.size     # int: 60
    a.dtype    # np.dtype object: 'float64'
    a32.dtype  # np.dtype object: 'float32'
    assert a[1, 1, 1] != 10  # almost surely true for random a
    a[1, 1, 1] = 10          # assignment writes into
    assert a[1, 1, 1] == 10  # the original array
Arrays can be combined with numeric operators and standard mathematical
functions. NumPy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
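As a quick illustration of the two points above (elementwise operators, and slices being views rather than copies):

.. code-block:: python

    import numpy as np

    a = np.arange(6.0).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
    b = a * 2 + 1                     # elementwise: [[1, 3, 5], [7, 9, 11]]
    assert b[1, 2] == 11

    row = a[0]               # a view, not a copy
    row[0] = 100.0           # writing through the view...
    assert a[0, 0] == 100.0  # ...changes the original array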
Training an MNIST-ready classification neural network in pure NumPy might look like this:
.. code-block:: python

    #########################
    # NumPy for Training a
    # Neural Network on MNIST
    #########################
    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    w = np.random.normal(loc=0, scale=.1, size=(784, 500))
    b = np.zeros((500,))
    v = np.zeros((500, 10))
    c = np.zeros((10,))
    lr = 0.01  # learning rate (value assumed; not given in the original)
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        hidin = np.dot(x_i, w) + b
        hidout = np.tanh(hidin)
        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin) + 1) / 2.0
        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout ** 2)
        g_outin = g_outout * outout * (1.0 - outout)
        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout ** 2)
        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)
What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance
* NumPy is bound to the CPU
* NumPy lacks symbolic or automatic differentiation

Now let's have a look at the same algorithm in Theano, which runs 15 times faster if
you have a GPU (I'm skipping some dtype details which we'll come back to).
.. code-block:: python

    #########################
    # Theano for Training a
    # Neural Network on MNIST
    #########################
    import numpy as np
    import theano
    import theano.tensor as tensor

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    lr = 0.01  # learning rate (value assumed; not given in the original)

    # symbol declarations
    sx = tensor.matrix()
    sy = tensor.matrix()
    w = theano.shared(np.random.normal(loc=0, scale=.1,
                                       size=(784, 500)))
    b = theano.shared(np.zeros(500))
    v = theano.shared(np.zeros((500, 10)))
    c = theano.shared(np.zeros(10))

    # symbolic expression-building
    hid = tensor.tanh(tensor.dot(sx, w) + b)
    out = tensor.tanh(tensor.dot(hid, v) + c)
    err = 0.5 * tensor.sum((out - sy) ** 2)
    gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])

    # compile a fast training function
    train = theano.function([sx, sy], err,
                            updates={
                                w: w - lr * gw,
                                b: b - lr * gb,
                                v: v - lr * gv,
                                c: c - lr * gc})

    # now do the computations
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function calls -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array-oriented -> parallelizable across cores
  * Support for looping and branching in expressions

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance

  * FFTW, MKL, ATLAS, SciPy, Cython, CUDA
  * Slower fallbacks always available

* Automatic differentiation and R op
* Sparse matrices
Project status
--------------

* Mature: Theano has been developed and used since January 2008 (6.5 years old)
* Has driven over 100 research papers
* Good user documentation
* Active mailing list with participants from outside our lab
* Core technology for a few funded Silicon Valley startups
* Many contributors (some from outside our lab)
* Used to teach many university classes
* Used for research at Google and Yahoo
* Downloads

  * PyPI (August 18th 2014, the last release): 255 last day, 2140 last week, 9145 last month
  * GitHub (the `bleeding edge` repository, the one recommended): unknown
  * GitHub stats?????
Why scripting for GPUs?
-----------------------

They *complement each other*:

* GPUs are everything that scripting/high-level languages are not

  * Highly parallel
  * Very architecture-sensitive
  * Built for maximum FP/memory throughput
  * So hard to program that meta-programming is easier

* CPU: largely restricted to control

  * Optimized for sequential code and low latency (rather than high throughput)
  * Tasks (1000/sec)
  * Scripting is fast enough

Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability that target = 1
    prediction = p_1 > 0.5                     # The prediction thresholded
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy loss
    cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to minimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates=[(w, w - 0.1 * gw),
                 (b, b - 0.1 * gb)],
        name='train')
    predict = theano.function(inputs=[x], outputs=prediction,
                              name='predict')

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
.. _omlw2014_pylearn2:

********
Pylearn2
********

Pointers
--------

TODO:

* http://deeplearning.net/software/pylearn2/
* User mailing list: http://groups.google.com/group/pylearn-users
* Dev mailing list: http://groups.google.com/group/pylearn-dev
* Installation: http://deeplearning.net/software/pylearn2/index.html#download-and-installation
Description
-----------

TODO:

* ...
Simple example
--------------

(logistic regression?) TODO

Real example
------------

(maxout?) TODO

Known limitations
-----------------

TODO

* It is getting stabilized, but is still heavily modified.
.. _omlw2014_theano:

******
Theano
******

Pointers
--------

* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html

Description
-----------

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.
  * Also supports computation of the Jacobian, Hessian, and R and L operators.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if ``x`` is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Still in an experimental state on Windows

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* (TODO REMOVE?) On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Is used with other technologies to generate fast code: C/C++, CUDA, OpenCL, PyCUDA, Cython, Numba, ...
* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Sparse operations (CPU only)
Simple example
--------------

>>> import theano
>>> a = theano.tensor.vector("a")  # declare symbolic variable
>>> b = a + a ** 10                # build symbolic expression
>>> f = theano.function([a], b)    # compile function
>>> print f([0, 1, 2])             # prints `array([0, 2, 1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming = *paradigm shift*: people need to use it to understand it.

Exercise 1
----------

.. code-block:: python

    import theano
    a = theano.tensor.vector()     # declare variable
    out = a + a ** 10              # build symbolic expression
    f = theano.function([a], out)  # compile function
    print f([0, 1, 2])
    # prints `array([0, 2, 1026])`

    theano.printing.pydotprint_variables(out, outfile="f_unoptimized.png", var_with_name_simple=True)
    theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)

Modify and execute the example to compute the expression ``a ** 2 + b ** 2 + 2 * a * b``.
Real example
------------

**Logistic Regression**

* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations

.. literalinclude:: logreg.py

**Optimizations:**

Where are those optimizations applied?

* ``log(1+exp(x))``
* ``1 / (1 + tt.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilization)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion

.. code-block:: python

    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
    # 1 / (1 + tt.exp(var)) -> sigmoid(var)
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
    # log(1 - sigmoid(var)) -> -softplus(var)
    prediction = p_1 > 0.5
    cost = xent.mean() + 0.01 * (w ** 2).sum()
    gw, gb = tt.grad(cost, [w, b])
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        # w - 0.1 * gw: GEMV with the dot in the grad
        updates=[(w, w - 0.1 * gw),
                 (b, b - 0.1 * gb)])
Theano flags
------------

Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="floatX=float32,profile=True"``
* With a configuration file that defaults to ``~/.theanorc``
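For instance, a ``~/.theanorc`` equivalent to the flags shown above might look like this (a sketch; see the Theano configuration documentation for the full set of options):

.. code-block:: ini

    [global]
    floatX = float32
    profile = True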
Exercise 2
----------

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction: 0 or 1
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to optimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
        name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()
    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                                         outfile="pics/logreg_pydotprint_prediction.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)

Modify and execute the example to run on CPU with ``floatX=float32``.

* You will need to use ``theano.config.floatX`` and ``ndarray.astype("str")``
GPU
---

* Only 32-bit floats are supported (being worked on)
* Only 1 GPU per process

  * There is a wiki page on using multiple processes for multiple GPUs

* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to GPU memory
* Use the Theano flag ``floatX=float32``

  * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  * Cast inputs before putting them into a shared variable
  * Cast "problem": int32 combined with float32 gives float64

    * Insert manual casts in your code or use [u]int{8,16}
    * The mean operator is being worked on so that its output stays in float32

* Use the Theano flag ``force_device=True`` to exit if Theano isn't able to use a GPU

  * In Theano 0.6rc4, the combination of ``force_device=True``
    and ``device=cpu`` disables the GPU
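The dtype-promotion pitfall above can be seen directly in NumPy, whose casting rules Theano follows (a minimal illustration):

.. code-block:: python

    import numpy as np

    i = np.zeros(3, dtype='int32')
    f = np.zeros(3, dtype='float32')
    assert (i + f).dtype == np.float64  # silently upcast to float64!
    # A manual cast keeps the computation in float32:
    assert (i.astype('float32') + f).dtype == np.float32
    # Small integer types do not trigger the upcast:
    assert (np.zeros(3, dtype='int16') + f).dtype == np.float32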
Symbolic variables
------------------

* Number of dimensions

  * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype

  * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * tt.vector defaults to the floatX dtype

    * floatX: a configurable dtype that can be float32 or float64

* Custom variables

  * All of the above are shortcuts for ``tt.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX

Creating symbolic variables: broadcastability

* Remember what I said about broadcasting?
* How do you add a row to all rows of a matrix?
* How do you add a column to all columns of a matrix?

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **tt.row** and **tt.col**
* For all others: ``tt.tensor(dtype, broadcastable=([False or True])*nd)``
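The row/column questions above have a one-line answer via broadcasting. Here is the NumPy version, which mirrors what ``tt.row`` and ``tt.col`` enable symbolically:

.. code-block:: python

    import numpy as np

    m = np.zeros((3, 4))
    row = np.arange(4).reshape(1, 4)  # shape (1, 4): a "row", like tt.row
    col = np.arange(3).reshape(3, 1)  # shape (3, 1): a "column", like tt.col
    assert (m + row).shape == (3, 4)  # row is broadcast over every row
    assert (m + col).shape == (3, 4)  # col is broadcast over every column
    assert (m + row)[2, 3] == 3       # each row received a copy of `row`
    assert (m + col)[2, 3] == 2       # each column received a copy of `col`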
Differentiation details
-----------------------

>>> gw, gb = tt.grad(cost, [w, b])

* tt.grad works symbolically: it takes and returns Theano variables
* tt.grad can be compared to a macro: it can be applied multiple times
* tt.grad takes scalar costs only
* A simple recipe allows efficient computation of vector x Jacobian and vector x Hessian products
* TODO update: We are working on the missing optimizations to efficiently compute the full Jacobian and Hessian, and Jacobian x vector products
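To illustrate the vector x Jacobian recipe, here is a plain NumPy sketch (illustrative names, not Theano API): for f(x) = tanh(Wx), the product vᵀJ can be formed without ever materializing the Jacobian J, which is exactly what makes such products cheap.

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(0)
    W = rng.randn(3, 4)
    x = rng.randn(4)
    v = rng.randn(3)

    # f(x) = tanh(W x); its Jacobian J = diag(1 - y**2) W has shape (3, 4)
    y = np.tanh(W.dot(x))
    J = (1 - y ** 2)[:, None] * W    # full Jacobian, built only for checking
    vjp = ((1 - y ** 2) * v).dot(W)  # v^T J without materializing J
    assert np.allclose(vjp, v.dot(J))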
New Benchmarks (REMOVE???)
--------------------------

`Example <http://arxiv.org/pdf/1211.5590v1.pdf>`_ (pages 7 and 9):

* Logistic regression, MLPs with 1 and 3 layers
* Recurrent neural networks

Competitors: Torch7, RNNLM

* Torch7, RNNLM: specialized libraries written by practitioners specifically for these tasks
OLD advanced presentation

- compilation pipeline
- inplace optimization
- conditions
- loops/rnn
- debugging support
- profiling support
Known limitations
-----------------

- Compilation phase distinct from execution phase

  - Use ``a_tensor_variable.eval()`` to make this less visible

- Compilation time can be significant

  - Amortize it with functions over big inputs, or reuse functions

- Execution overhead

  - We have worked on this, but more work is needed
  - So a function needs to do a certain amount of work per call to be useful

- Compilation time is superlinear in the size of the graph

  - Hundreds of nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually too many nodes indicates a problem with the graph