Commit 30617ff5 authored by Arnaud Bergeron

Rest of libdoc for gpuarray.

Parent 3bf6f4cb
.. _libdoc_gpuarray_dnn:

===========================================
:mod:`theano.sandbox.gpuarray.dnn` -- cuDNN
===========================================

.. moduleauthor:: LISA

`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library
with functionality used by deep neural networks. It provides optimized
versions of some operations, like the convolution. cuDNN is not
currently installed with CUDA. You must download and install it
yourself.
To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:

- The easiest is to include them in your CUDA installation. Copy the
  ``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
  ``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
  on Linux).

- Alternatively, on Linux, you can set the environment variables
  ``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
  extracted from the download. If needed, separate multiple directories
  with ``:``, as in the ``PATH`` environment variable.

  Example::

      export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
      export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
      export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH

- As a third option, also on Linux, you can copy the ``*.h`` files
  to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.
By default, Theano will detect if it can use cuDNN. If so, it will use
it. If not, Theano optimizations will not introduce cuDNN ops, so
Theano will still work if the user did not introduce them manually.
To get an error if Theano cannot use cuDNN, use this Theano flag:
``optimizer_including=cudnn``.
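If you prefer not to pass flags on the command line, the same flag can go in your ``.theanorc`` file. A minimal sketch, assuming the standard mapping of flag names to config-file sections:

```ini
[global]
# Fail with an error instead of silently falling back
# when cuDNN cannot be used.
optimizer_including = cudnn
```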
.. note::

   cuDNN v3 has now been released. cuDNN v2 remains supported, but
   cuDNN v3 is faster and offers many more options, so we recommend
   that everybody update to v3.
.. note::

   Starting in cuDNN v3, multiple convolution implementations are
   offered, and it is possible to use heuristics to automatically
   choose a convolution implementation well suited to the parameters
   of the convolution.

   The Theano flag ``dnn.conv.algo_fwd`` allows specifying which cuDNN
   convolution implementation Theano should use for forward
   convolutions. Possible values include:

   * ``small`` (default): use a convolution implementation with small
     memory usage.
   * ``none``: use a slower implementation with minimal memory usage.
   * ``large``: use a sometimes faster implementation with large
     memory usage.
   * ``fft``: use the Fast Fourier Transform implementation of
     convolution (very high memory usage).
   * ``guess_once``: the first time a convolution is executed, the
     implementation to use is chosen according to cuDNN's heuristics
     and reused for every subsequent execution of the convolution.
   * ``guess_on_shape_change``: like ``guess_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.
   * ``time_once``: the first time a convolution is executed, every
     convolution implementation offered by cuDNN is executed and
     timed. The fastest is reused for every subsequent execution of
     the convolution.
   * ``time_on_shape_change``: like ``time_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.

   The Theano flag ``dnn.conv.algo_bwd`` allows specifying which cuDNN
   convolution implementation Theano should use for gradient
   convolutions. Possible values include:

   * ``none`` (default): use the default non-deterministic convolution
     implementation.
   * ``deterministic``: use a slower but deterministic implementation.
   * ``fft``: use the Fast Fourier Transform implementation of
     convolution (very high memory usage).
   * ``guess_once``: the first time a convolution is executed, the
     implementation to use is chosen according to cuDNN's heuristics
     and reused for every subsequent execution of the convolution.
   * ``guess_on_shape_change``: like ``guess_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.
   * ``time_once``: the first time a convolution is executed, every
     convolution implementation offered by cuDNN is executed and
     timed. The fastest is reused for every subsequent execution of
     the convolution.
   * ``time_on_shape_change``: like ``time_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.

   The ``guess_*`` and ``time_*`` values take the amount of available
   memory into account when selecting an implementation, so slower
   implementations may be selected when not enough memory is available
   for the faster ones.
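To make the ``guess_once`` / ``guess_on_shape_change`` behaviour concrete, here is a minimal pure-Python sketch of the caching policy. This is not Theano or cuDNN code; ``pick_algo`` is a made-up stand-in for cuDNN's heuristic:

```python
class ConvAlgoCache:
    """Sketch of the guess_once / guess_on_shape_change policies."""

    def __init__(self, policy="guess_once"):
        self.policy = policy
        self.algo = None        # implementation chosen so far
        self.shapes = None      # shapes seen when it was chosen

    def pick_algo(self, img_shape, kern_shape):
        # Stand-in for cuDNN's heuristic: prefer FFT for large kernels.
        return "fft" if kern_shape[-1] >= 7 else "small"

    def get(self, img_shape, kern_shape):
        shapes = (img_shape, kern_shape)
        if self.algo is None or (
                self.policy == "guess_on_shape_change"
                and shapes != self.shapes):
            # Guess (again) only on first use, or on a shape change
            # when the policy asks for it.
            self.algo = self.pick_algo(img_shape, kern_shape)
            self.shapes = shapes
        return self.algo


once = ConvAlgoCache("guess_once")
once.get((8, 3, 32, 32), (16, 3, 3, 3))
print(once.get((8, 3, 64, 64), (16, 3, 9, 9)))  # still "small": never re-guessed

on_change = ConvAlgoCache("guess_on_shape_change")
on_change.get((8, 3, 32, 32), (16, 3, 3, 3))
print(on_change.get((8, 3, 64, 64), (16, 3, 9, 9)))  # re-guessed: "fft"
```

The ``time_*`` policies follow the same caching structure, only replacing the heuristic with actual timing of every implementation.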
.. note::

   Normally you should not call GPU Ops directly, but the CPU
   interface currently does not allow all options supported by cuDNN
   ops, so you may need to call them manually.
.. note::

   The cuDNN documentation states that reproducibility is not
   guaranteed with the default implementation for the following two
   operations: `cudnnConvolutionBackwardFilter` and
   `cudnnConvolutionBackwardData`. Those correspond to the gradient
   with respect to the weights and the gradient with respect to the
   input of the convolution. They are also sometimes used in the
   forward pass, when they give a speed up.

   The Theano flag ``dnn.conv.algo_bwd`` can be used to force the use
   of a slower but deterministic convolution implementation.
.. note::

   There is a problem, not yet understood, when the cuDNN paths
   contain symbolic links, so avoid using symbolic links in those
   paths.
.. note::

   ``cudnn.so*`` must be readable and executable by everybody.
   ``cudnn.h`` must be readable by everybody.
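As a sketch, the required permissions can be set with ``chmod``. The paths below are placeholders demonstrated on temporary files; substitute the real locations of your cuDNN files:

```shell
# Demonstrated on placeholder files in a temp directory;
# point these at the actual cuDNN files on your system.
dir=$(mktemp -d)
touch "$dir/libcudnn.so" "$dir/cudnn.h"
chmod a+rx "$dir"/libcudnn.so*   # cudnn.so*: readable and executable by everybody
chmod a+r  "$dir/cudnn.h"        # cudnn.h: readable by everybody
ls -l "$dir"
```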
Functions
=========

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: dnn_conv, dnn_pool


Convolution Ops
===============

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: GpuDnnConvDesc, GpuDnnConv, GpuDnnConvGradW, GpuDnnConvGradI


Pooling Ops
===========

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: GpuDnnPoolDesc, GpuDnnPool, GpuDnnPoolGrad


Softmax Ops
===========

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: GpuDnnSoftmax, GpuDnnSoftmaxGrad


.. _libdoc_gpuarray_extra:

=================
Utility functions
=================

Optimisation
------------

.. automodule:: theano.sandbox.gpuarray.opt_util
   :members:


Kernel generation
-----------------

.. automodule:: theano.sandbox.gpuarray.kernel_codegen
   :members:
@@ -14,4 +14,6 @@
   :maxdepth: 1

   op
   dnn
   type
   extra
@@ -71,17 +71,19 @@ def inline_reduce(N, buf, pos, count, manner_fn):
    count
        Number of executing threads.
    manner_fn
        A function that accepts strings of arguments a and b, and
        returns c code for their reduction.

            return "%(a)s + %(b)s"

        for a sum reduction.

    Notes
    -----
    `buf` should be in gpu shared memory, we access it many times.

    This function leaves the answer in position 0 of the buffer. The
    rest of the buffer is trashed by this function.

    """
    loop_line = manner_fn("%s[%s]" % (buf, pos), "%s[i]" % (buf))
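The C code generated by `inline_reduce` performs a strided tree reduction across threads, leaving the result in position 0 and trashing the rest of the buffer. A pure-Python sketch of the same access pattern (not the generated CUDA, and assuming a power-of-two buffer length) is:

```python
def tree_reduce(buf, manner_fn):
    """Reduce buf into buf[0], trashing the rest of the buffer.

    Assumes len(buf) is a power of two, like the shared-memory
    buffers the generated kernel works on.
    """
    n = len(buf)
    stride = 1
    while stride < n:
        # Each "thread" at position pos combines its slot with the
        # slot stride elements away, halving the active slots per pass.
        for pos in range(0, n - stride, 2 * stride):
            buf[pos] = manner_fn(buf[pos], buf[pos + stride])
        stride *= 2
    return buf[0]


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(tree_reduce(data, lambda a, b: a + b))  # 36
```

Passing `max` instead of the sum lambda gives a max-reduction, matching how `manner_fn` parametrizes the generated code.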
@@ -149,6 +151,13 @@ def inline_reduce_prod(N, buf, pos, count):
                                 inline_reduce_sum.code_version)


def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
    """
    Generate code for a softmax.

    On entry, `buf` and `buf2` must contain two identical copies of
    the input to softmax.

    After the code returns `buf` contains the softmax, `buf2` contains
    un-normalized softmax.

    Parameters
    ----------
@@ -161,14 +170,10 @@ def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
    dtype
        Dtype of the softmax's output.

    Notes
    -----
    `buf` and `buf2` should be in gpu shared memory, we access it many
    times.

    We use __i as an int variable in a loop.
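The contract above (on exit `buf` holds the softmax and `buf2` the un-normalized values) can be mimicked in plain Python. This sketch uses the usual max-subtraction for numerical stability, which is an assumption about what the generated kernel does, not a transcription of it:

```python
import math

def softmax_buffers(buf, buf2):
    """buf and buf2 start as identical copies of the input.

    On exit: buf2 holds exp(x - max(x)) (un-normalized softmax),
    buf holds the normalized softmax.
    """
    m = max(buf)                      # first reduction: the maximum
    for i in range(len(buf)):
        buf2[i] = math.exp(buf2[i] - m)
    s = sum(buf2)                     # second reduction: the normalizer
    for i in range(len(buf)):
        buf[i] = buf2[i] / s
    return buf, buf2


x = [1.0, 2.0, 3.0]
p, u = softmax_buffers(list(x), list(x))
print(round(sum(p), 6))  # 1.0
```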
@@ -205,6 +210,9 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
    """
    Return C++ code for a function that reduces a contiguous buffer.

    This function leaves the answer in position 0 of the buffer. The
    rest of the buffer is trashed by this function.

    Parameters
    ----------
    N
@@ -230,20 +238,19 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
    dtype
        Optional, the dtype of the output.
    manner_fn
        A function that accepts strings of arguments a and b, and
        returns c code for their reduction.

            return "%(a)s + %(b)s"

        for a sum reduction.

    manner_init
        A function that accepts strings of arguments a and return c
        code for its initialization.

    Notes
    -----
    `buf` should be in gpu shared memory, we access it many times.

    """
    if b:
@@ -320,6 +327,10 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
                                b='', stride_b='', load_b='',
                                dtype="float32"):
    """
    Generate code to perform softmax with a fixed amount of shared
    memory.

    On entry, `buf` is assumed to be empty.

    Parameters
    ----------
@@ -352,13 +363,9 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
    dtype
        Optional, the dtype of the softmax's output if not float32.

    Notes
    -----
    `buf` should be in gpu shared memory, we access it many times.

    We use tx as an int variable in a loop.
@@ -22,7 +22,7 @@ def grab_cpu_scalar(v, nd):
    Parameters
    ----------
    v
        Theano variable to extract the constant value from.
    nd : int
        Expected number of dimensions for the variable (for
@@ -55,7 +55,7 @@ def find_node(v, cls, ignore_clients=False):
    Parameters
    ----------
    v
        The variable to dig through
    cls : Op class
        The type of the node we are looking for
@@ -84,9 +84,9 @@ def is_equal(var, val):
    Parameters
    ----------
    var
        Variable to compare
    val
        Python value
    """
@@ -101,11 +101,11 @@ def alpha_merge(cls, alpha_in, beta_in):
    """
    Decorator to merge multiplication by a scalar on the output.

    This will find a pattern of `scal * <yourop>(some, params, alpha,
    beta)` and update it so that the scalar multiplication happens as
    part of your op.

    The op needs to accept an alpha and a beta scalar which act this way::

        out = Op() * alpha + out_like * beta

@@ -113,7 +113,7 @@ def alpha_merge(cls, alpha_in, beta_in):
    and gets added to the "real" output of the operation. An example
    of an operation that respects this pattern is GEMM from blas.

    The decorated function must have this signature::

        maker(node, *inputs)

@@ -122,7 +122,7 @@ def alpha_merge(cls, alpha_in, beta_in):
    for your op so that the new version performs the same computation.
    The `*inputs` parameters contains the new inputs for your op. You
    MUST use those inputs instead of the ones on `node`. Note that
    this function can be as simple as::

        def maker(node, *inputs):
            return node.op(*inputs)

@@ -138,8 +138,9 @@ def alpha_merge(cls, alpha_in, beta_in):
    Returns
    -------
    local optimizer
        an unregistered local optimizer that has the same name as the
        decorated function.

    Notes
    -----
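The alpha/beta contract that `alpha_merge` relies on (``out = Op() * alpha + out_like * beta``) can be illustrated with a toy op in plain Python. This is not a Theano op, just the numeric convention; ``toy_gemm_like`` and its ``x * x`` core are made up for the example:

```python
def toy_gemm_like(x, alpha=1.0, beta=0.0, out_like=0.0):
    """Toy op obeying the GEMM-style convention:
    result = core(x) * alpha + out_like * beta.
    """
    core = x * x          # stand-in for the op's real computation
    return core * alpha + out_like * beta


# A pattern like `scal * op(x)` ...
scal = 2.0
unmerged = scal * toy_gemm_like(3.0)        # 2 * 9 = 18
# ... can be merged into the op by folding scal into alpha,
# which is exactly the rewrite the decorator performs.
merged = toy_gemm_like(3.0, alpha=scal)     # also 18
print(unmerged == merged)  # True
```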
@@ -191,11 +192,11 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    """
    Decorator to merge addition by a value on the output.

    This will find a pattern of `val * <yourop>(some, params, alpha,
    beta, out_like)` and update it so that the addition happens as
    part of your op.

    The op needs to accept an alpha and a beta scalar which act this way::

        out = Op() * alpha + out_like * beta

@@ -203,7 +204,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    and gets added to the "real" output of the operation. An example
    of an operation that respects this pattern is GEMM from blas.

    The decorated function must have this signature::

        maker(node, *inputs)

@@ -212,7 +213,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    for your op so that the new version performs the same computation.
    The `*inputs` parameters contains the new inputs for your op. You
    MUST use those inputs instead of the ones on `node`. Note that
    this function can be as simple as::

        def maker(node, *inputs):
            return node.op(*inputs)

@@ -230,8 +231,9 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    Returns
    -------
    local optimizer
        an unregistered local optimizer that has the same name as the
        decorated function.

    Notes
    -----
@@ -281,7 +283,7 @@ def inplace_allocempty(op, idx):
    This will duplicate the alloc input if it has more than one client
    to allow the op to work on it inplace.

    The decorated function must have this signature::

        maker(node, inputs)

@@ -291,7 +293,7 @@ def inplace_allocempty(op, idx):
    You should also switch the op to work inplace. The `*inputs`
    parameters contains the new inputs for your op. You MUST use
    those inputs instead of the ones on `node`. Note that this
    function can be as simple as::

        def maker(node, inputs):
            return [node.op.__class__(inplace=True)(*inputs)]

@@ -305,8 +307,9 @@ def inplace_allocempty(op, idx):
    Returns
    -------
    local optimizer
        an unregistered inplace local optimizer that has the same name
        as the decorated function.

    """
    def wrapper(maker):