Commit 54e96754 authored by Pascal Lamblin

Merge pull request #3559 from abergeron/multi_gpu_doc

Multi gpu doc
@@ -132,7 +132,7 @@ Roughly in order of what you'll want to check out:
* :ref:`extending` -- Learn to add a Type, Op, or graph optimization.
* :ref:`dev_start_guide` -- How to contribute code to Theano.
* :ref:`developer` -- Primarily of interest to developers of Theano
* :ref:`internal` -- How to maintain Theano and more...
* :ref:`release` -- How our release should work.
* :ref:`acknowledgement` -- What we took from other projects.
* `Related Projects`_ -- link to other projects that implement new functionalities on top of Theano
...
@@ -5,16 +5,11 @@
Internal Documentation
======================
If you're feeling ambitious, go fix some `pylint
<http://lgcm.iro.umontreal.ca/auto_theano_pylint/pylint_global.html>` errors!
.. toctree::
:maxdepth: 2
release
dev_start_guide
lisa_labo
mammouth
metadocumentation
python
how_to_release
.. _lisa_labo:
===============================
LISA Labo specific instructions
===============================
Tips for running at LISA
------------------------
Shell configuration files ``/opt/lisa/os/.local.{bash,csh}rc`` should define
:envvar:`THEANORC` to include ``/opt/lisa/os/.local.theanorc`` as a
configuration file.
``/opt/lisa/os/.local.theanorc`` should include the right default values for
the lab, in particular, ``blas.ldflags`` should contain '-lgoto'.
Tips for running on a cluster
-----------------------------
:ref:`mammouth`
For instructions on running Theano on the mammouth cluster.
.. _mammouth:
===========================
Running Theano on Mammouth
===========================
To run Theano on the Mammouth cluster, follow these simple steps:
* Make sure to source Fred's .local.bashrc file. It contains all
the goodies for using the latest and greatest (optimized) libraries
(numpy, scipy, etc.)
.. code-block:: sh
source /home/bastienf/.local.bashrc
Perhaps even put this in your ``.bashrc``
* set ``config.blas.ldflags`` to ``'-lmkl -lguide -fopenmp'``
(see :mod:`config` to know how)
Note: the -lguide flag works, however the fix should probably be considered temporary.
Intel has deprecated libguide.so in favor of the newer library libiomp5.so. However,
both libraries are mutually exclusive and one component (theano, numpy or scipy?) already
seems to be using libguide.so (hence -liomp5 causes a linking error when compiling thunks)
@@ -110,9 +110,6 @@ pylint output is not autogenerated anymore.
Pylint documentation is generated using pylintrc file: ``Theano/doc/pylintrc``
You can see a list of all `pylint messages
<http://www.logilab.org/card/pylintfeatures>`__.
.. _metadocumentation_nightly_build:
...
.. _libdoc_gpuarray_dnn:
===========================================
:mod:`theano.sandbox.gpuarray.dnn` -- cuDNN
===========================================
.. moduleauthor:: LISA
`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library
with functionality used by deep neural networks. It provides optimized
versions of some operations like the convolution. cuDNN is not
currently installed with CUDA. You must download and install it
yourself.
To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:
- The easiest is to include them in your CUDA installation. Copy the
``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
on Linux).
- Alternatively, on Linux, you can set the environment variables
``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
extracted from the download. If needed, separate multiple directories
with ``:`` as in the ``PATH`` environment variable.
For example::
export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH
- And as a third way, also on Linux, you can copy the ``*.h`` files
to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.
By default, Theano will detect if it can use cuDNN. If so, it will use
it. If not, Theano optimizations will not introduce cuDNN ops. So
Theano will still work if the user did not introduce them manually.
To get an error if Theano cannot use cuDNN, use this Theano flag:
``optimizer_including=cudnn``.
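For instance, to fail loudly instead of silently falling back to non-cuDNN ops, the flag can be passed on the command line (``my_script.py`` is a placeholder for your own program):

.. code-block:: sh

   THEANO_FLAGS="optimizer_including=cudnn" python my_script.py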
.. note::
CuDNN v3 has now been released. CuDNN v2 remains supported but CuDNN v3 is
faster and offers many more options. We recommend that everybody update to
v3.
.. note::
Starting in CuDNN v3, multiple convolution implementations are offered and
it is possible to use heuristics to automatically choose a convolution
implementation well suited to the parameters of the convolution.
The Theano flag ``dnn.conv.algo_fwd`` lets you specify the CuDNN
convolution implementation that Theano should use for forward convolutions.
Possible values include:
* ``small`` (default) : use a convolution implementation with small memory
usage
* ``none`` : use a slower implementation with minimal memory usage
* ``large`` : use a sometimes faster implementation with large memory usage
* ``fft`` : use the Fast Fourier Transform implementation of convolution
(very high memory usage)
* ``guess_once`` : the first time a convolution is executed, the
implementation to use is chosen according to CuDNN's heuristics and reused
for every subsequent execution of the convolution.
* ``guess_on_shape_change`` : like ``guess_once`` but a new convolution
implementation is selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
* ``time_once`` : the first time a convolution is executed, every convolution
implementation offered by CuDNN is executed and timed. The fastest is
reused for every subsequent execution of the convolution.
* ``time_on_shape_change`` : like ``time_once`` but a new convolution
implementation is selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
The Theano flag ``dnn.conv.algo_bwd`` lets you specify the CuDNN
convolution implementation that Theano should use for gradient convolutions.
Possible values include:
* ``none`` (default) : use the default non-deterministic convolution
implementation
* ``deterministic`` : use a slower but deterministic implementation
* ``fft`` : use the Fast Fourier Transform implementation of convolution
(very high memory usage)
* ``guess_once`` : the first time a convolution is executed, the
implementation to use is chosen according to CuDNN's heuristics and reused
for every subsequent execution of the convolution.
* ``guess_on_shape_change`` : like ``guess_once`` but a new convolution
implementation is selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
* ``time_once`` : the first time a convolution is executed, every convolution
implementation offered by CuDNN is executed and timed. The fastest is
reused for every subsequent execution of the convolution.
* ``time_on_shape_change`` : like ``time_once`` but a new convolution
implementation is selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
``guess_*`` and ``time_*`` flag values take into account the amount of
available memory when selecting an implementation. This means that slower
implementations might be selected if not enough memory is available for the
faster implementations.
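As a sketch of how these flags can be combined (the values chosen here are illustrative, not recommendations), both directions can be set at once on the command line:

.. code-block:: sh

   THEANO_FLAGS="dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd=deterministic" python my_script.py

The same settings could also live in your ``.theanorc`` so that every run picks them up.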
.. note::
Normally you should not call GPU Ops directly, but the CPU interface
currently does not allow all options supported by cuDNN ops. So it is
possible that you will need to call them manually.
.. note::
The cuDNN documentation states that, for the following two operations,
reproducibility is not guaranteed with the default implementation:
`cudnnConvolutionBackwardFilter` and `cudnnConvolutionBackwardData`.
Those correspond to the gradient wrt the weights and the gradient wrt the
input of the convolution. They are also used sometimes in the forward
pass, when they give a speed up.
The Theano flag ``dnn.conv.algo_bwd`` can be used to force the use of a
slower but deterministic convolution implementation.
.. note::
There is a problem we do not understand yet when the cuDNN paths
involve symbolic links, so avoid using them.
.. note::
cudnn.so* must be readable and executable by everybody.
cudnn.h must be readable by everybody.
Functions
=========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: dnn_conv, dnn_pool
Convolution Ops
===============
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnConvDesc, GpuDnnConv, GpuDnnConvGradW, GpuDnnConvGradI
Pooling Ops
===========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnPoolDesc, GpuDnnPool, GpuDnnPoolGrad
Softmax Ops
===========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnSoftmax, GpuDnnSoftmaxGrad
.. _libdoc_gpuarray_extra:
=================
Utility functions
=================
Optimisation
------------
.. automodule:: theano.sandbox.gpuarray.opt_util
:members:
Kernel generation
-----------------
.. automodule:: theano.sandbox.gpuarray.kernel_codegen
:members:
.. _libdoc_gpuarray:
=======================================================
:mod:`theano.sandbox.gpuarray` -- The (new) GPU backend
=======================================================
.. module:: theano.sandbox.gpuarray
:platform: Unix, Windows
:synopsis: Code for GPU programming (new)
.. moduleauthor:: MILA
.. toctree::
:maxdepth: 1
op
dnn
type
extra
.. _libdoc_gpuarray_op:
================================
List of gpuarray Ops implemented
================================
.. moduleauthor:: LISA
Normally you should not call these Ops directly! Theano should
automatically transform CPU ops to their GPU equivalents. This list
is just here to let people know what is implemented on the GPU.
Basic Op
========
.. automodule:: theano.sandbox.gpuarray.basic_ops
:members:
Blas Op
=======
.. automodule:: theano.sandbox.gpuarray.blas
:members:
.. automodule:: theano.sandbox.gpuarray.nerv
:members:
Elemwise Op
===========
.. automodule:: theano.sandbox.gpuarray.elemwise
:members:
Subtensor Op
============
.. automodule:: theano.sandbox.gpuarray.subtensor
:members:
Nnet Op
=======
.. automodule:: theano.sandbox.gpuarray.nnet
:members:
.. automodule:: theano.sandbox.gpuarray.neighbours
:members:
.. _libdoc_gpuarray_type:
===================================================
:mod:`theano.sandbox.gpuarray.type` -- Type classes
===================================================
.. automodule:: theano.sandbox.gpuarray.type
:members:
@@ -14,6 +14,8 @@
:maxdepth: 1
cuda/index
gpuarray/index
linalg
neighbours
rng_mrg
blocksparse
@@ -37,6 +37,7 @@ you out.
loop
sparse
using_gpu
using_multi_gpu
gpu_data_convert
aliasing
shape_info
...
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Theano has a feature to allow the use of multiple GPUs at the same
time in one function. The multi-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that works correctly.
In order to keep a reasonably high level of abstraction you do not
refer to device names directly for multiple-gpu use. You instead
refer to what we call context names. These are then mapped to a
device using the theano configuration. This allows portability of
models between machines.
.. warning::
The code is rather new and is still considered experimental at this
point. It has been tested and seems to perform correctly in all
cases observed, but make sure to double-check your results before
publishing a paper or anything of the sort.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::
dev0->cuda0;dev1->cuda1
Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Theano. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects like 'cuda0' or 'opencl0:0'.
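The parsing rules above can be sketched with a small hypothetical helper (this is an illustration of the format, not Theano's actual parser):

```python
def parse_contexts(spec):
    """Parse a context map like 'dev0->cuda0;dev1->cuda1' into a dict."""
    mapping = {}
    # Mappings are separated by ';', so a context name can never contain one.
    for entry in spec.split(';'):
        name, arrow, device = entry.partition('->')
        # Each mapping must contain exactly one '->' separator.
        if arrow != '->' or '->' in device:
            raise ValueError("invalid mapping: %r" % entry)
        # Context names starting with 'cuda' or 'opencl' are disallowed.
        if name.startswith(('cuda', 'opencl')):
            raise ValueError("context name %r is reserved" % name)
        mapping[name] = device
    return mapping

print(parse_contexts("dev0->cuda0;dev1->cuda1"))
# {'dev0': 'cuda0', 'dev1': 'cuda1'}
```

Note that a device name like 'opencl0:0' passes through unchanged; only the context-name side of each mapping is restricted.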
.. note::
Since there are a bunch of shell special characters in the syntax,
defining this on the command-line will require proper quoting, like this:
.. code-block:: shell
$ THEANO_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Theano will print the mappings as they are
defined. This will look like this:
.. code-block:: bash
$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X
Mapped name dev1 to device cuda1: GeForce GTX TITAN X
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea, since Theano often
assumes that different context names will be on different devices and
will optimize accordingly. So you may get faster performance for a
single name and a single device.
.. note::
It is often the case that multi-gpu operation requires or assumes
that all the GPUs involved are equivalent. This is not the case
for this implementation. Since the user has the task of
distributing the jobs across the different devices, a model can be
built on the assumption that one of the GPUs is slower or has
smaller memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
that performs two dot products on two different GPUs.
.. code-block:: python
import numpy
import theano
v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev0')
v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev0')
v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev1')
v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev1')
f = theano.function([], [theano.tensor.dot(v01, v02),
theano.tensor.dot(v11, v12)])
f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
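Assuming the program above is saved as ``two_dots.py`` (a hypothetical file name) on a machine with two devices, it could be run with:

.. code-block:: sh

   THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python two_dots.py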
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input which is not on the right device.
However you may want some explicit control over where and how these
transfers are done at some points. This is done by using the new
:meth:`transfer` method that is present on variables. It works for
moving data between GPUs and also between the host and the GPUs. Here
is an example.
.. code-block:: python
import theano
v = theano.tensor.fmatrix()
# Move to the device associated with 'gpudev'
gv = v.transfer('gpudev')
# Move back to the cpu
cv = gv.transfer('cpu')
Of course you can mix transfers and operations in any order you
choose. However you should try to minimize transfer operations
because they will introduce overhead and may reduce performance.
@@ -12,7 +12,6 @@ import numpy
import theano
from theano.sandbox.gpuarray import init_dev
from theano.sandbox.gpuarray.type import gpuarray_shared_constructor as shared
from theano.sandbox.gpuarray.blas import gpu_dot22
@@ -22,13 +21,13 @@ def main(dev1, dev2):
size = 1024 * 16
data = numpy.random.randn(size, size).astype('float32')
val1a = theano.shared(data, target='ctx1')
val1b = theano.shared(data, target='ctx1')
val1c = theano.shared(data, target='ctx1')
val1d = theano.shared(data, target='ctx1')
val2a = theano.shared(data, target='ctx2')
val2b = theano.shared(data, target='ctx2')
f1 = theano.function([], [gpu_dot22(val1a, val1b),
gpu_dot22(val1c, val1d)])
...
@@ -27,6 +27,20 @@ from .fp16_help import write_w
def as_gpuarray_variable(x, context_name):
"""
This will attempt to convert `x` into a variable on the GPU.
It can take either a value or another variable. If `x` is already
suitable, it will be returned as-is.
Parameters
----------
x
Object to convert
context_name : str or None
target context name for the result
"""
# If this is already some form of variable, try to avoid an extra transfer
if isinstance(x, Variable):
while True:
@@ -174,6 +188,13 @@ class Kernel(object):
class GpuKernelBase(object):
"""
Base class for operations that need to compile kernels.
It is not mandatory to use this class, but it helps with a lot of
the small things that you have to pay attention to.
"""
params_type = gpu_context_type
def gpu_kernels(self, node, name):
@@ -274,10 +295,25 @@ class GpuKernelBase(object):
return (self.c_code_cache_version(), self.kernel_version(node))
def kernel_version(self, node):
"""
If you override :meth:`c_code_cache_version_apply`, call this
method to obtain the version of the kernel support code and
device.
Parameters
----------
node : apply node
The node that we need the cache version for.
"""
return (3, self.get_params(node).bin_id)
class HostFromGpu(Op):
"""
Transfer data to CPU.
"""
__props__ = ()
_f16_ok = True
@@ -356,6 +392,10 @@ host_from_gpu = HostFromGpu()
class GpuFromHost(Op):
"""
Transfer data to GPU.
"""
__props__ = ('context_name',)
_f16_ok = True
params_type = gpu_context_type
@@ -443,6 +483,10 @@ class GpuFromHost(Op):
class GpuToGpu(Op):
"""
Transfer data between GPUs.
"""
__props__ = ('context_name',)
_f16_ok = True
params_type = gpu_context_type
@@ -494,6 +538,7 @@ class GpuToGpu(Op):
class GpuAlloc(HideC, Alloc):
"""
Allocate initialized memory on the GPU.
Parameters
----------
@@ -654,6 +699,10 @@ class GpuAlloc(HideC, Alloc):
class GpuAllocEmpty(HideC, Alloc):
"""
Allocate uninitialized memory on the GPU.
"""
__props__ = ('dtype', 'context_name')
_f16_ok = True
params_type = gpu_context_type
@@ -732,8 +781,10 @@ def empty_like(var):
class GpuContiguous(Op):
""" """
Always return a c contiguous output. Copy the input only if it is Return a C contiguous version of the input.
not already c contiguous.
This may either pass the object as-is (if already C contiguous) or
make a copy.
""" """
__props__ = ()
@@ -793,7 +844,7 @@ gpu_contiguous = GpuContiguous()
class GpuReshape(HideC, tensor.Reshape):
"""
Reshape for GPU variables.
"""
@@ -914,6 +965,10 @@ class GpuReshape(HideC, tensor.Reshape):
class GpuJoin(HideC, Join):
"""
Join for GPU.
"""
_f16_ok = True
params_type = gpu_context_type
@@ -991,6 +1046,10 @@ gpu_join = GpuJoin()
class GpuSplit(HideC, Split):
"""
Split for GPU.
"""
def make_node(self, x, axis, splits):
node = Split.make_node(self, x, axis, splits)
x = as_gpuarray_variable(x, infer_context_name(x))
@@ -1002,6 +1061,10 @@ class GpuSplit(HideC, Split):
class GpuEye(GpuKernelBase, Op):
"""
Eye for GPU.
"""
__props__ = ('dtype', 'context_name')
_f16_ok = True
...
@@ -31,6 +31,10 @@ class BlasOp(Op):
class GpuGemv(BlasOp):
"""
Gemv on the GPU.
"""
__props__ = ('inplace',)
def __init__(self, inplace=False):
@@ -107,6 +111,10 @@ gpugemv_inplace = GpuGemv(inplace=True)
class GpuGemm(BlasOp):
"""
Gemm on the GPU.
"""
__props__ = ('inplace',)
_f16_ok = True
@@ -184,6 +192,10 @@ gpugemm_inplace = GpuGemm(inplace=True)
class GpuGer(BlasOp):
"""
Ger on the GPU.
"""
__props__ = ('inplace',)
def __init__(self, inplace=False):
@@ -256,6 +268,10 @@ gpuger_inplace = GpuGer(inplace=True)
class GpuDot22(BlasOp):
"""
Dot22 on the GPU.
"""
__props__ = ()
def make_node(self, x, y):
...
@@ -57,6 +57,10 @@ def as_C_string_const(s):
class GpuElemwise(GpuKernelBase, HideC, Elemwise):
"""
Elemwise on the GPU.
"""
nin = property(lambda self: self.scalar_op.nin)
nout = property(lambda self: self.scalar_op.nout)
_f16_ok = True
@@ -445,6 +449,10 @@ class SupportCodeError(Exception):
class GpuDimShuffle(HideC, DimShuffle):
"""
DimShuffle on the GPU.
"""
_f16_ok = True
def make_node(self, input):
@@ -548,7 +556,7 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
Parameters
----------
reduce_mask
The dimensions along which to reduce. The `reduce_mask` is a tuple of
booleans (actually integers 0 or 1) that specify for each input
dimension, whether to reduce it (1) or not (0).
@@ -1279,14 +1287,6 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
""" % locals()
def c_code_reduce_ccontig(self, sio, node, name, x, z, fail):
"""
WRITEME
IG: I believe, based on how this is called in c_code, that it
is for the case where we are reducing on all axes and x is
C contiguous.
"""
in_dtype = "npy_" + node.inputs[0].dtype
out_dtype = "npy_" + node.outputs[0].dtype
if getattr(self.scalar_op, 'identity', None) == 0:
@@ -2666,8 +2666,6 @@ class GpuCAReduceCPY(GpuKernelBase, HideC, CAReduceDtype):
"""
CAReduce that reuses the python code from gpuarray.
Too slow for now as it only have a python interface.
""" """
def __init__(self, scalar_op, axis=None, dtype=None, acc_dtype=None): def __init__(self, scalar_op, axis=None, dtype=None, acc_dtype=None):
if not hasattr(scalar_op, 'identity'): if not hasattr(scalar_op, 'identity'):
......
@@ -71,17 +71,19 @@ def inline_reduce(N, buf, pos, count, manner_fn):
count
Number of executing threads.
manner_fn
A function that accepts strings of arguments a and b, and
returns c code for their reduction.
Example: return "%(a)s + %(b)s" for a sum reduction.
Notes
-----
`buf` should be in gpu shared memory, we access it many times.
This function leaves the answer in position 0 of the buffer. The
rest of the buffer is trashed by this function.
"""
loop_line = manner_fn("%s[%s]" % (buf, pos), "%s[i]" % (buf))
@@ -149,6 +151,13 @@ def inline_reduce_prod(N, buf, pos, count):
inline_reduce_sum.code_version)
def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
"""
Generate code for a softmax.
On entry, `buf` and `buf2` must contain two identical copies of
the input to softmax.
After the code returns, `buf` contains the softmax and `buf2` contains
the un-normalized softmax.
Parameters
----------
@@ -161,14 +170,10 @@ def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
dtype
Dtype of the softmax's output.
:Precondition: buf and buf2 contain two identical copies of the input
to softmax
:Postcondition: buf contains the softmax, buf2 contains un-normalized
softmax
Notes
-----
`buf` and `buf2` should be in gpu shared memory, we access it many times.
We use __i as an int variable in a loop.
@@ -205,6 +210,9 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
"""
Return C++ code for a function that reduces a contiguous buffer.
This function leaves the answer in position 0 of the buffer. The
rest of the buffer is trashed by this function.
Parameters
----------
N
@@ -230,20 +238,19 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
dtype
Optional, the dtype of the output.
manner_fn
A function that accepts strings of arguments a and b, and
returns c code for their reduction.
Example: return "%(a)s + %(b)s" for a sum reduction.
manner_init
A function that accepts strings of arguments a and returns c
code for its initialization.
Notes
-----
`buf` should be in gpu shared memory, we access it many times.
""" """
if b: if b:
...@@ -320,6 +327,13 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x, ...@@ -320,6 +327,13 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
b='', stride_b='', load_b='', b='', stride_b='', load_b='',
dtype="float32"): dtype="float32"):
""" """
Generate code to perform softmax with a fixed amount of shared
memory.
On entry, `buf` is assumed to be empty.
On exit, `buf[0]` contains the softmax, `buf2` contains
un-normalized softmax.
Parameters
----------
@@ -352,13 +366,9 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
dtype
Optional, the dtype of the softmax's output if not float32.
:Precondition: buf is empty
:Postcondition: buf[0] contains the softmax, buf2 contains un-normalized
softmax
Notes
-----
`buf` should be in gpu shared memory, we access it many times.
We use tx as an int variable in a loop.
...
@@ -17,6 +17,10 @@ from .type import GpuArrayType
class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
"""
Images2Neibs for the GPU.
"""
def __init__(self, mode='valid'):
    if mode not in ['valid', 'ignore_borders', 'wrap_centered']:
        raise NotImplementedError("Only the mode valid, ignore_borders"
@@ -41,6 +41,9 @@ def ensure_float(val, name):

class Gemm16(COp):
"""
Gemm for float16 using the nervena kernels.
"""
__props__ = ('relu', 'inplace')
_f16_ok = True
params_type = gpu_context_type
@@ -22,7 +22,7 @@ def grab_cpu_scalar(v, nd):
Parameters
----------
v
    Theano variable to extract the constant value from.
nd : int
    Expected number of dimensions for the variable (for
@@ -55,7 +55,7 @@ def find_node(v, cls, ignore_clients=False):
Parameters
----------
v
    The variable to dig through
cls : Op class
    The type of the node we are looking for
@@ -84,9 +84,9 @@ def is_equal(var, val):
Parameters
----------
var
    Variable to compare
val
    Python value
"""
@@ -101,11 +101,11 @@ def alpha_merge(cls, alpha_in, beta_in):
"""
Decorator to merge multiplication by a scalar on the output.

This will find a pattern of `scal * <yourop>(some, params, alpha,
beta)` and update it so that the scalar multiplication happens as
part of your op.

The op needs to accept an alpha and a beta scalar which act this way::

    out = Op() * alpha + out_like * beta
@@ -113,7 +113,7 @@ def alpha_merge(cls, alpha_in, beta_in):
and gets added to the "real" output of the operation. An example
of an operation that respects this pattern is GEMM from blas.

The decorated function must have this signature::

    maker(node, *inputs)
@@ -122,7 +122,7 @@ def alpha_merge(cls, alpha_in, beta_in):
for your op so that the new version performs the same computation.
The `*inputs` parameter contains the new inputs for your op. You
MUST use those inputs instead of the ones on `node`. Note that
this function can be as simple as::

    def maker(node, *inputs):
        return node.op(*inputs)
@@ -138,8 +139,9 @@ def alpha_merge(cls, alpha_in, beta_in):
Returns
-------
local optimizer
    an unregistered local optimizer that has the same name as the
    decorated function.

Notes
-----
@@ -191,11 +192,11 @@ def output_merge(cls, alpha_in, beta_in, out_in):
"""
Decorator to merge addition by a value on the output.

This will find a pattern of `val + <yourop>(some, params, alpha,
beta, out_like)` and update it so that the addition happens as
part of your op.

The op needs to accept an alpha and a beta scalar which act this way::

    out = Op() * alpha + out_like * beta
@@ -203,7 +204,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
and gets added to the "real" output of the operation. An example
of an operation that respects this pattern is GEMM from blas.

The decorated function must have this signature::

    maker(node, *inputs)
@@ -212,7 +213,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
for your op so that the new version performs the same computation.
The `*inputs` parameter contains the new inputs for your op. You
MUST use those inputs instead of the ones on `node`. Note that
this function can be as simple as::

    def maker(node, *inputs):
        return node.op(*inputs)
@@ -230,8 +231,9 @@ def output_merge(cls, alpha_in, beta_in, out_in):
Returns
-------
local optimizer
    an unregistered local optimizer that has the same name as the
    decorated function.

Notes
-----
@@ -281,7 +283,7 @@ def inplace_allocempty(op, idx):
This will duplicate the alloc input if it has more than one client
to allow the op to work on it inplace.

The decorated function must have this signature::

    maker(node, inputs)
@@ -291,7 +293,7 @@ def inplace_allocempty(op, idx):
You should also switch the op to work inplace. The `*inputs`
parameter contains the new inputs for your op. You MUST use
those inputs instead of the ones on `node`. Note that this
function can be as simple as::

    def maker(node, inputs):
        return [node.op.__class__(inplace=True)(*inputs)]
@@ -305,8 +307,9 @@ def inplace_allocempty(op, idx):
Returns
-------
local optimizer
    an unregistered inplace local optimizer that has the same name
    as the decorated function.
"""
def wrapper(maker):
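The alpha/beta contract that `alpha_merge` and `output_merge` build on (`out = Op() * alpha + out_like * beta`) can be checked numerically. A pure-Python sketch under the assumption that the op is a plain function — `gemm_like` is an illustrative stand-in, not a Theano op:

```python
# Toy model of an op honouring the alpha/beta contract described above.
def gemm_like(op_result, alpha, out_like, beta):
    return [o * alpha + l * beta for o, l in zip(op_result, out_like)]

raw = [1.0, 2.0, 3.0]          # the "real" output of the op
out_like = [10.0, 10.0, 10.0]  # pre-existing output buffer

# The pattern alpha_merge matches: scal * <yourop>(..., alpha=1, beta=0)
scal = 2.0
unfused = [scal * v for v in gemm_like(raw, 1.0, out_like, 0.0)]

# After the rewrite, the scalar is folded into alpha instead.
fused = gemm_like(raw, scal, out_like, 0.0)

assert unfused == fused == [2.0, 4.0, 6.0]
```

Folding the multiplication into `alpha` saves one elemwise pass over the output, which is the point of the optimization.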
@@ -24,6 +24,9 @@ from .elemwise import GpuElemwise

class GpuSubtensor(HideC, Subtensor):
"""
Subtensor on the GPU.
"""
_f16_ok = True

def make_node(self, x, *inputs):
@@ -173,8 +176,8 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
The optimization to make this inplace is in tensor/opt.
The same optimization handles IncSubtensor and GpuIncSubtensor.
This Op has c_code too; it inherits tensor.IncSubtensor's c_code.
The helper methods like :meth:`do_type_checking`,
:meth:`copy_of_x`, etc. specialize the c_code for this Op.
"""
@@ -405,6 +408,9 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):

class GpuAdvancedSubtensor1(HideC, tensor.AdvancedSubtensor1):
"""
AdvancedSubrensor1 on the GPU.
"""
def make_node(self, x, ilist):
    ctx_name = infer_context_name(x, ilist)
    x_ = as_gpuarray_variable(x, ctx_name)
@@ -580,8 +586,10 @@ class GpuAdvancedIncSubtensor1_dev20(GpuKernelBase, GpuAdvancedIncSubtensor1):
_f16_ok = True

def make_node(self, x, y, ilist):
"""It defer from GpuAdvancedIncSubtensor1 in that it make sure """
the index are of type long. It differs from GpuAdvancedIncSubtensor1 in that it makes sure
the indexes are of type long.
""" """
ctx_name = infer_context_name(x, y, ilist)
x_ = as_gpuarray_variable(x, ctx_name)
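For reference, `GpuAdvancedIncSubtensor1` implements the semantics of `inc_subtensor(x[ilist], y)` along the first dimension, where repeated indices accumulate. A pure-Python sketch of those semantics for 1-d `x` (an illustration, not the actual GPU code):

```python
# Reference semantics for advanced inc_subtensor along axis 0.
# Repeated indices accumulate into the same output cell, which is
# why the GPU kernels must combine concurrent updates carefully.
def advanced_inc_subtensor1(x, y, ilist):
    out = list(x)
    for value, idx in zip(y, ilist):
        out[idx] += value
    return out

x = [0.0, 0.0, 0.0]
y = [1.0, 2.0, 4.0]
print(advanced_inc_subtensor1(x, y, [0, 2, 2]))  # [1.0, 0.0, 6.0]
```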
@@ -67,6 +67,7 @@ def get_context(name):

def list_contexts():
    """
    Return an iterable of all the registered context names.
    """
    return _context_reg.keys()
@@ -85,6 +86,54 @@ def _unreg_context(name):

class GpuArrayType(Type):
"""
The type that represents an array on a gpu.
The `dtype` indicates what scalar data type the elements of
variables of this type will be.
`broadcastable` indicates whether each dimension is broadcastable
or not (to be broadcastable a dimension must always be of length
1).
The `context_name` is the name of the context on will values of
variables of this type will be stored.
Parameters
----------
dtype : str
The name of a numpy dtype
broadcastable : tuple of bools
A tuple that indicates both the number of dimensions (by its
length) and whether those dimensions are broadcastable or not
(by the boolean values).
context_name : str
The name of the context the that this type is attached to
(default: None, which is the context specified by
config.device).
name : string, optional
A name for the type that will be used in printouts.
Attributes
----------
dtype : str
Data type used for scalar elements of variables.
broadcastable : tuple of bools
Indicates whether the dimensions are broadcastable or not.
ndim : int
The number of dimensions
context_name : str
The name of a gpu context on which variables will have their values.
name : str
A string used to print the type if given.
typecode : int
The gpuarray typecode for `dtype`
See Also
--------
theano.gof.type.PureType
"""
def __init__(self, dtype, broadcastable, context_name=None, name=None):
    # In case this was not provided and no global value is available
    self.dtype = str(dtype)
@@ -111,6 +160,11 @@ class GpuArrayType(Type):
# This is a property to keep the type pickleable
@property
def context(self):
"""
The context object mapped to the type's :attr:`context_name`.
This is a property.
"""
    return get_context(self.context_name)

def __repr__(self):
@@ -306,8 +360,6 @@ class GpuArrayType(Type):
This function is used internally as part of C code generation.
"""
try:
    return {
        'float16': (float, 'npy_float16', 'NPY_FLOAT16'),
@@ -321,8 +373,8 @@ class GpuArrayType(Type):
        'int32': (int, 'npy_int32', 'NPY_INT32'),
        'uint64': (int, 'npy_uint64', 'NPY_UINT64'),
        'int64': (int, 'npy_int64', 'NPY_INT64'),
        # 'complex128': (complex, 'theano_complex128', 'NPY_COMPLEX128'),
        # 'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
    }[self.dtype]
except KeyError:
    raise TypeError("Unsupported dtype for %s: %s" %
@@ -420,10 +472,21 @@ class _operators(_tensor_py_operators):

class GpuArrayVariable(_operators, Variable):
"""
A variable representing a computation on a certain GPU.
This supports all the operations that :class:`TensorType`
supports.
See Also
--------
Variable
"""
# override the default
def __repr_test_value__(self):
    return repr(numpy.array(theano.gof.op.get_test_value(self)))

GpuArrayType.Variable = GpuArrayVariable
@@ -436,6 +499,17 @@ class GpuArraySignature(tensor.TensorConstantSignature):

class GpuArrayConstant(_operators, Constant):
"""
A constant representing a value on a certain GPU.
This supports all the operations that :class:`TensorType`
supports.
See Also
--------
Constant
"""
def signature(self):
    return GpuArraySignature((self.type, numpy.asarray(self.data)))
@@ -453,6 +527,17 @@ GpuArrayType.Constant = GpuArrayConstant

class GpuArraySharedVariable(_operators, SharedVariable):
"""
A variable representing a shared value on a certain GPU.
This supports all the operations that :class:`TensorType`
supports.
See Also
--------
SharedVariable
"""
def get_value(self, borrow=False, return_internal_type=False):
    if return_internal_type:
        if borrow:
@@ -481,6 +566,8 @@ def gpuarray_shared_constructor(value, name=None, strict=False,
"""
SharedVariable constructor for GpuArrayType.
See :func:`theano.shared`.
""" """
if target == 'gpu' or target == 'cpu': if target == 'gpu' or target == 'cpu':
raise TypeError('not for me') raise TypeError('not for me')
@@ -596,6 +683,13 @@ theano.compile.register_specify_shape_c_code(

class GpuContextType(Type):
"""
Minimal type used for passing contexts to nodes.
This Type is not a complete type and should never be used for
regular graph operations.
"""
def filter(self, data, strict=False, allow_downcast=None):
    if not isinstance(data, gpuarray.GpuContext):
        raise TypeError('context is not a GpuContext')
@@ -652,4 +746,8 @@ Py_INCREF(%(name)s);

# Variable, Constant, ... not declared
"""
Instance of :class:`GpuContextType` to use for the context_type
declaration of an operation.
"""
gpu_context_type = GpuContextType()