.. _omlw2014_libgpuarray:
***********
libgpuarray
***********
Why a common GPU ndarray?
-------------------------
- Currently there are at least 4 different GPU array data structures in use
  by Python packages

  - CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat),
    GPUArray (PyOpenCL), ...
  - There are even more if we include other languages

- All of them are a subset of the functionality of ``numpy.ndarray`` on the GPU
- Lots of duplicated effort

  - GPU code is harder/slower to write **correctly** and **fast** than
    CPU/Python code

- Lack of a common array API makes it harder to port/reuse code

  - Also harder to find/distribute code

- Divides development work
Design Goals
------------
- Make it VERY similar to ``numpy.ndarray``
- Be compatible with both CUDA and OpenCL
- Have the base object accessible from C to allow collaboration with more
  projects, across high-level languages

  - We want people from C, C++, Lua, Ruby, R, ... to all use the same base
    GPU N-dimensional array
Final Note
----------
- Usable directly, but not all functionality is implemented yet
- It is the next GPU array container for Theano and is already working
  (though not every operation is available yet)
- Mailing list: http://lists.tiker.net/listinfo/gpundarray
.. _omlw2014_index:
======================================================
Theano, Pylearn2, libgpuarray Presentation @ OMLW 2014
======================================================
August 22, 2014, New York University, US.
By Frédéric Bastien and Bart van Merriënboer. University of Montréal, Canada.
Theano, Pylearn2 and libgpuarray form a software stack for machine learning.
It complements the Python numeric/scientific software stack (e.g. NumPy, SciPy,
scikits, matplotlib, PIL).
Theano
======
Theano is software for evaluating and manipulating complicated array
expressions.
What does it do?
* aggressive expression optimizations,
* automatic GPU use,
* automatic symbolic differentiation, Jacobian and Hessian computation,
  and the R/L operators (for Hessian-free optimization).
The design and feature set have been driven by machine learning research
at the University of Montreal (the groups of Yoshua Bengio, Pascal Vincent,
Aaron Courville and Roland Memisevic).
The result is a very good library for doing research in deep
learning and neural network training, and a flexible framework for
many other models and algorithms in machine learning more generally.
It has proven to be useful for implementing:
- linear and nonlinear neural network classifiers

  - including Maxout and Dropout

- convolutional models
- Energy models: RBM, DBN, GRBM, ssRBM, AIS
- Auto-encoders: DAE, CAE
- GP regression
- sparse coding
- recurrent neural networks, echo state networks
- online and batch learning and optimization
- Even SVMs!
As people's needs change this list will grow, but Theano is built
around vector, matrix, and tensor expressions. It also supports sparse matrices.
Pylearn2
========
Pylearn2 is undergoing rapid development. Don't expect a clean
road without bumps! It is made for machine learning
practitioners and researchers first.
Pylearn2 is a machine learning library. Most of its functionality is
built on top of Theano. This means you can write Pylearn2 plugins (new
models, algorithms, etc) using mathematical expressions, and Theano
will optimize and stabilize those expressions for you, and compile
them to a backend of your choice (CPU or GPU).
Pylearn2 Vision
---------------
* Researchers **add features as they need them**. We avoid getting bogged
  down by too much top-down planning in advance.
* A machine learning toolbox for **easy scientific experimentation**.
* All models/algorithms published by the LISA lab should have reference
  implementations in Pylearn2.
* Pylearn2 **may wrap other libraries** such as scikits.learn when this is practical
* Pylearn2 **differs from scikits.learn** in that Pylearn2 aims to provide great
  flexibility and make it possible for a researcher to do almost anything,
  while **scikits.learn aims to work as a "black box"**.
* **Dataset interface** for vectors, images, ...
* Small framework providing all that is needed for typical
  MLP/RBM/SDA/convolution experiments.
* **Easy reuse of sub-components** of Pylearn2.
* Using one sub-component of the library does not force you to use or learn
  all of the other sub-components.
* Aims to support cross-platform serialization of learned models.
* Remain approachable enough to be used in the classroom
libgpuarray
===========
A common GPU ndarray (vector, matrix, or n-dimensional array) that can be
reused by all projects. It supports CUDA and OpenCL.
Motivation
----------
* Currently there are at least 6 different GPU arrays in Python

  * CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat),
    GPUArray (PyOpenCL), Clyther, Copperhead, ...
  * There are even more if we include other languages.

* They are incompatible

  * None have the same properties and interface.

* All of them are a subset of ``numpy.ndarray`` on the GPU!
Design Goals
------------
* Have the base object in C to allow collaboration with more projects.

  * We want people from C, C++, Ruby, R, ... to all use the same base GPU ndarray.

* Be compatible with CUDA and OpenCL.
* Not too simple (not restricted to matrices).
* But still easy to develop new code that supports only a few memory layouts;
  this eases the development of new code.
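
As a taste, here is a minimal sketch of what using the Python bindings
(pygpu) could look like. This assumes the pygpu API (``pygpu.init``,
``pygpu.array``); names may differ between versions, so check the
documentation:

.. code-block:: python

    import numpy as np
    import pygpu

    ctx = pygpu.init('cuda0')  # or 'opencl0:0' for an OpenCL device
    # upload a NumPy array to the GPU
    a = pygpu.array(np.random.rand(3, 4).astype('float32'), context=ctx)
    b = (a + a) * 2            # elementwise operations run on the GPU
    print np.asarray(b)        # copy the result back to a NumPy array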
Contents
========
.. toctree::

   introduction
   theano
   pylearn2
   gpundarray
   sharing
.. _omlw2014_Introduction:
************
Introduction
************
Python in one slide
-------------------
* General-purpose high-level **OO interpreted language**
* Emphasizes **code readability**
* Comprehensive standard library
* Dynamic type and memory management
* Built-in types: int, float, str, list, dict, tuple, object
* Slow execution
* Popular in **web-dev** and **scientific communities**
NumPy in one slide
------------------
* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* NumPy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing.
  * Slices return views (no copy)

* NumPy provides

  * elementwise computations
  * linear algebra, Fourier transforms
  * pseudorandom numbers from many distributions

* SciPy provides lots more, including

  * more linear algebra
  * solvers and optimization algorithms
  * matlab-compatible I/O
  * I/O and signal processing for images and audio
.. code-block:: python

    ##############################
    # Properties of NumPy arrays
    # that you really need to know
    ##############################
    import numpy as np          # import can rename
    a = np.random.rand(3, 4, 5) # random generators
    a32 = a.astype('float32')   # arrays are strongly typed
    a.ndim                      # int: 3
    a.shape                     # tuple: (3, 4, 5)
    a.size                      # int: 60
    a.dtype                     # np.dtype object: 'float64'
    a32.dtype                   # np.dtype object: 'float32'
    b = a[1]                    # slices are views, not copies
    b[1, 1] = 10                # so assigning through a view
    assert a[1, 1, 1] == 10     # changes the original array
Arrays can be combined with numeric operators and standard mathematical
functions. NumPy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
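
Broadcasting, in particular, lets operators combine arrays of different
shapes; a minimal sketch (we will come back to broadcasting when discussing
symbolic variables):

.. code-block:: python

    import numpy as np

    m = np.ones((3, 4))            # matrix
    r = np.arange(4)               # shape (4,): broadcast over rows
    c = np.arange(3).reshape(3, 1) # shape (3, 1): broadcast over columns
    m + r                          # r is added to every row of m
    m + c                          # c is added to every column of m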
What's missing?
---------------
* Non-lazy evaluation (required by Python) hurts performance
* NumPy is bound to the CPU
* NumPy lacks symbolic or automatic differentiation
A quick look at a small example:
.. code-block:: python

    #########################
    # Theano for Training a
    # Neural Network on MNIST
    #########################
    import numpy as np
    import theano
    import theano.tensor as tensor

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = tensor.matrix()
    sy = tensor.matrix()
    w = theano.shared(np.random.normal(loc=0, scale=.1,
                                       size=(784, 500)))
    b = theano.shared(np.zeros(500))
    v = theano.shared(np.zeros((500, 10)))
    c = theano.shared(np.zeros(10))

    # symbolic expression-building
    hid = tensor.tanh(tensor.dot(sx, w) + b)
    out = tensor.tanh(tensor.dot(hid, v) + c)
    err = 0.5 * tensor.sum((out - sy) ** 2)
    gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])

    # compile a fast training function
    lr = 0.01  # learning rate
    train = theano.function([sx, sy], err,
                            updates={
                                w: w - lr * gw,
                                b: b - lr * gb,
                                v: v - lr * gv,
                                c: c - lr * gc})

    # now do the computations
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i * batchsize: (i + 1) * batchsize]
        y_i = y[i * batchsize: (i + 1) * batchsize]
        err_i = train(x_i, y_i)
Theano in one slide
-------------------
* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function calls -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array oriented -> easy parallelism
  * Support for looping and branching in expressions

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance

  * BLAS, SciPy, Cython, CUDA
  * Slower fallbacks always available
* Automatic differentiation and R op
* Sparse matrices
Project status
--------------
* Mature: Theano has been developed and used since January 2008 (6.5 years old)
* Has driven over 100 research papers
* Good user documentation
* Active mailing list with participants from outside our lab
* Core technology for a few Silicon Valley startups
* Many contributors (some from outside our lab)
* Used to teach many university classes
* Used for research at Google and Yahoo.
Pylearn2 in one slide
---------------------
Pylearn2 is a machine learning library built on top of Theano: models and
algorithms are written as mathematical expressions, and Theano optimizes,
stabilizes and compiles them for CPU or GPU.
Other global information
------------------------
Theano uses small basic operations, not layers, as its building blocks:

* Easy reuse
* No need to reimplement the gradient for each layer variant

This could cause slowness (more small operations), but the optimizer fixes
that; see the sketch below.

Pylearn2 wraps the small operations into layers, like other projects do:

* There is no overhead to this extra layer, thanks to the compilation of the
  function by Theano.
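
A minimal sketch of a "layer" built directly from Theano's small operations;
its gradient needs no extra code, whatever the variation:

.. code-block:: python

    import numpy as np
    import theano
    import theano.tensor as tt

    w = theano.shared(np.zeros((784, 500)), name='w')
    b = theano.shared(np.zeros(500), name='b')
    x = tt.matrix('x')

    # a "layer" is just a composition of small operations
    h = tt.tanh(tt.dot(x, w) + b)
    # the gradient comes for free via symbolic differentiation
    g = tt.grad(h.sum(), w)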
Why scripting for GPUs?
-----------------------
They *complement each other*:

* GPUs are everything that scripting/high-level languages are not

  * Highly parallel
  * Very architecture-sensitive
  * Built for maximum FP/memory throughput
  * So hard to program that meta-programming is easier

* CPU: largely restricted to control

  * Optimized for sequential code and low latency (rather than high throughput)
  * Tasks (1000/sec)
  * Scripting is fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
.. _omlw2014_pylearn2:
********
Pylearn2
********
Pointers
--------
* http://deeplearning.net/software/pylearn2/
* User mailing list: http://groups.google.com/group/pylearn-users
* Dev mailing list: http://groups.google.com/group/pylearn-dev
* Installation: http://deeplearning.net/software/pylearn2/index.html#download-and-installation
Description
-----------
Pylearn2 is a machine learning library built on top of Theano. You can write
Pylearn2 plugins (new models, algorithms, etc.) using mathematical
expressions, and Theano will optimize and stabilize those expressions and
compile them to a backend of your choice (CPU or GPU).
Simple example
--------------
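A minimal sketch of a softmax (logistic) regression experiment using the
Pylearn2 Python interface. The class names and parameters below are
assumptions based on the 2014-era library (the canonical interface is
YAML-based), so check the documentation for the exact API:

.. code-block:: python

    from pylearn2.datasets.mnist import MNIST
    from pylearn2.models.softmax_regression import SoftmaxRegression
    from pylearn2.training_algorithms.sgd import SGD
    from pylearn2.termination_criteria import EpochCounter
    from pylearn2.train import Train

    # model, training algorithm and dataset are independent plugins
    model = SoftmaxRegression(nvis=784, n_classes=10, irange=0.)
    algorithm = SGD(batch_size=100, learning_rate=1e-3,
                    termination_criterion=EpochCounter(max_epochs=10))
    Train(dataset=MNIST(which_set='train'),
          model=model, algorithm=algorithm).main_loop()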
Real example
------------
(Maxout?) TODO
Known limitations
-----------------
* It is stabilizing, but the library is still heavily modified.
.. _omlw2014_sharing:
************
Sharing code
************
* License (BSD 3-clause suggested; don't forget to add the license info in the code)
* Common base object? libgpuarray.
* If not, do you have an important implementation that uses raw pointers/shapes?
  Document that interface.
* Important: add an *acknowledgement section on the web site* (citation-like)
  AND *in papers* for the software we reuse (and use)!
*************
Theano future
*************
.. _omlw2014_theano:
******
Theano
******
Pointers
--------
* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html
Description
-----------
* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.
  * Also supports computation of the Jacobian, Hessian, and the R and L operators.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Still in an experimental state on Windows

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than
  competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Is used with other technologies to generate fast code: C/C++, CUDA, OpenCL,
  PyCUDA, Cython, Numba, ...
* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Sparse operations (CPU only)
Simple example
--------------
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0, 1, 2]) # prints `array([0, 2, 1026])`
====================================================== ====================================================
Unoptimized graph                                      Optimized graph
====================================================== ====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== ====================================================
Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
-----------
.. code-block:: python

    import theano
    a = theano.tensor.vector()      # declare variable
    out = a + a ** 10               # build symbolic expression
    f = theano.function([a], out)   # compile function
    print f([0, 1, 2])
    # prints `array([0, 2, 1026])`

    theano.printing.pydotprint_variables(out, outfile="f_unoptimized.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(f, outfile="f_optimized.png",
                               var_with_name_simple=True)
Modify and execute the example to do this expression: ``a ** 2 + b ** 2 + 2 * a * b``
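
One possible solution, as a sketch:

.. code-block:: python

    import theano
    import theano.tensor as tt

    a = tt.vector()                    # declare variables
    b = tt.vector()
    out = a ** 2 + b ** 2 + 2 * a * b  # build symbolic expression
    f = theano.function([a, b], out)   # compile function
    print f([1, 2], [3, 4])            # prints `array([ 16.,  36.])`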
Real example
------------
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. literalinclude:: logreg.py
**Optimizations:**
Where are those optimizations applied?
* ``log(1+exp(x))``
* ``1 / (1 + tt.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilization)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion
.. code-block:: python

    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
    # 1 / (1 + tt.exp(-var)) -> sigmoid(var)
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
    # log(1 - sigmoid(var)) -> -softplus(var) (numerically stable)
    prediction = p_1 > 0.5
    cost = xent.mean() + 0.01 * (w ** 2).sum()
    gw, gb = tt.grad(cost, [w, b])
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        # w - 0.1 * gw: GEMV with the dot in the grad
        updates=[(w, w - 0.1 * gw),
                 (b, b - 0.1 * gb)])
Theano flags
------------
Theano can be configured with flags, which can be defined in two ways:
* With an environment variable: ``THEANO_FLAGS="floatX=float32,profile=True"``
* With a configuration file that defaults to ``~/.theanorc``
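
For example, the same flags set via the configuration file:

.. code-block:: ini

    [global]
    floatX = float32
    profile = True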
Exercise 2
-----------
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction that is made: 0 or 1
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to optimize
    gw, gb = tt.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
        name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()
    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                                         outfile="pics/logreg_pydotprint_prediction.png",
                                         var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)
Modify and execute the example to run on CPU with ``floatX=float32``.

* You will need to use ``theano.config.floatX`` and ``ndarray.astype(...)``
GPU
---
* Only 32-bit floats are supported (other dtypes are being worked on)
* Only 1 GPU per process (see the wiki page on using multiple processes for
  multiple GPUs)
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to the GPU memory space
* Use the Theano flag ``floatX=float32``

  * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
  * Cast inputs before putting them into a shared variable
  * Cast "problem": int32 combined with float32 upcasts to float64
    (see the sketch after this list)

    * Insert manual casts in your code or use [u]int{8,16}
    * Work is ongoing to make the mean operator keep its output in float32

* Use the Theano flag ``force_device=True`` to exit if Theano isn't able to
  use a GPU

  * In Theano 0.6rc4, the combination of ``force_device=True`` and
    ``device=cpu`` disables the GPU
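
A minimal sketch of the upcast "problem" and a manual cast:

.. code-block:: python

    import theano.tensor as tt

    i = tt.ivector()                         # int32
    f = tt.fvector()                         # float32
    print (i + f).dtype                      # 'float64': silently off the GPU
    print (tt.cast(i, 'float32') + f).dtype  # 'float32' after a manual cast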
Symbolic variables
------------------
* Number of dimensions

  * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype

  * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * tt.vector defaults to the floatX dtype

    * floatX: configurable dtype that can be float32 or float64

* Custom variables

  * All of the above are shortcuts for: ``tt.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX

Creating symbolic variables: broadcastability

* Remember what I said about broadcasting?

  * How to add a row to all rows of a matrix?
  * How to add a column to all columns of a matrix?

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **tt.row** and **tt.col**
  (see the sketch below)
* For all others: ``tt.tensor(dtype, broadcastable=(...))`` with a length-nd
  tuple of ``True``/``False``
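
A minimal sketch of adding a row to all rows of a matrix with **tt.row**:

.. code-block:: python

    import theano
    import theano.tensor as tt

    m = tt.matrix()            # broadcastable pattern: (False, False)
    r = tt.row()               # broadcastable pattern: (True, False)
    f = theano.function([m, r], m + r)
    # the single row of r is added to every row of m
    print f([[1, 2], [3, 4]], [[10, 20]])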
Differentiation details
-----------------------
>>> gw, gb = tt.grad(cost, [w, b])

* tt.grad works symbolically: it takes and returns Theano variables
* tt.grad can be compared to a macro: it can be applied multiple times
* tt.grad takes scalar costs only
* A simple recipe allows efficient computation of vector x Jacobian and
  vector x Hessian products; see the sketch below
* We are working on the missing optimizations to efficiently compute the full
  Jacobian and Hessian, and Jacobian x vector products
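
A minimal sketch of the vector x Jacobian (L-operator) and Jacobian x vector
(R-operator) products:

.. code-block:: python

    import theano
    import theano.tensor as tt

    x = tt.vector('x')
    v = tt.vector('v')
    y = x ** 2                 # elementwise, so the Jacobian is diag(2 * x)

    vJ = theano.gradient.Lop(y, x, v)  # vector x Jacobian
    Jv = theano.gradient.Rop(y, x, v)  # Jacobian x vector
    f = theano.function([x, v], [vJ, Jv])
    print f([1, 2, 3], [1, 1, 1])      # both give [2., 4., 6.] here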
Benchmarks
----------
`Example <http://arxiv.org/pdf/1211.5590v1.pdf>`_ (pages 7 and 9):
* Logistic regression, MLP with 1 and 3 layers
* Recurrent neural networks
Competitors: Torch7, RNNLM
* Torch7, RNNLM: specialized libraries written by practitioners specifically for these tasks
Topics covered in an older, more advanced presentation:

- compilation pipeline
- inplace optimization
- conditions
- loops/rnn
- debugging support
- profiling support
Known limitations
-----------------
- Compilation phase distinct from execution phase

  - Use ``a_tensor_variable.eval()`` to make this less visible
    (see the sketch after this list)

- Compilation time can be significant

  - Amortize it by applying functions to big inputs or by reusing functions

- Execution overhead

  - We have worked on this, but more work is needed
  - So a certain amount of computation per call is needed for Theano to be useful

- Compilation time is superlinear in the size of the graph

  - Hundreds of nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually too many nodes indicates a problem with the graph
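
For example, ``eval()`` hides the compile/run split for quick checks:

.. code-block:: python

    import theano.tensor as tt

    a = tt.vector('a')
    b = a + a ** 10
    # compiles (and caches) a function behind the scenes, then runs it
    print b.eval({a: [0., 1., 2.]})  # -> [0., 2., 1026.]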