Merge pull request #2033 from f0k/corrmm-faster-fullconv

Faster algorithms and gradients for GpuCorrMM

cfc493d1 · Frédéric Bastien · a81b5cdc · 372bab54
--- a/doc/library/tensor/nnet/conv.txt
+++ b/doc/library/tensor/nnet/conv.txt
@@ -22,23 +22,28 @@
 .. moduleauthor:: LISA
-TODO: Give examples for how to use these things! They are pretty complicated.
+TODO: Give examples on how to use these things! They are pretty complicated.
- Conv implemented
+- Convolution operators implemented:
-    - :func:`signal.conv2d <theano.tensor.signal.conv.conv2d>`.
+    - :func:`signal.conv2d <theano.tensor.signal.conv.conv2d>`. See note above.
    - :func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`.
+      This is the standard operator for convolutional neural networks working
+      with batches of multi-channel 2D images, available for CPU and GPU.
+      Most of the more efficient GPU implementations listed below can be used
+      as an automatic replacement for nnet.conv2d by enabling specific graph
+      optimizations.
    - :func:`conv2d_fft <theano.sandbox.cuda.fftconv.conv2d_fft>`
      This is a GPU-only version of nnet.conv2d that uses an FFT transform
-      to perform the work. conv2d_fft should not be used directly as it
+      to perform the work. conv2d_fft should not be called directly as it
-      does not implement a grad function. Instead, you should use
+      does not provide a gradient. Instead, use nnet.conv2d and allow
-       nnet.conv2d and enable the fft optimization by setting
+      Theano's graph optimizer to replace it by the FFT version by setting
-      'THEANO_FLAGS=optimizer_including=conv_fft_valid:conv_fft_full'
+      ``THEANO_FLAGS=optimizer_including=conv_fft_valid:conv_fft_full``
      in your environment.  This is not enabled by default because it
-      has some restrictions on input and uses more memory.  Also note
+      has some restrictions on input and uses a lot more memory.  Also note
      that it requires CUDA >= 5.0, scikits.cuda >= 0.5.0 and PyCUDA to run.
-      To desactivate the fft optimization on a specific nnet.conv2d
+      To deactivate the FFT optimization on a specific nnet.conv2d
-      while the optimization flags are active, you can set its parameters
+      while the optimization flags are active, you can set its ``version``
-      version to 'no_fft'. To enable for just one Theano function:
+      parameter to ``'no_fft'``. To enable it for just one Theano function:
      .. code-block:: python
@@ -47,17 +52,58 @@ TODO: Give examples for how to use these things! They are pretty complicated.
          f = theano.function(..., mode=mode)
+    - `cuda-convnet wrapper for 2d correlation <http://deeplearning.net/software/pylearn2/library/alex.html>`_
+      Wrapper for an open-source GPU-only implementation of conv2d by Alex
+      Krizhevsky, very fast, but with several restrictions on input and kernel
+      shapes, and with a different memory layout for the input.
+      This is in Pylearn2, where it is normally called from the `linear transform
+      <http://deeplearning.net/software/pylearn2/library/linear.html>`_
+      implementation, but it can also be used `directly from within Theano
+      <http://benanne.github.io/2014/04/03/faster-convolutions-in-theano.html>`_
+      as a manual replacement for nnet.conv2d.
+    - :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`
+      This is a GPU-only 2d correlation implementation taken from
+      `caffe <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_
+      and also used by Torch.
+      For each element in a batch, it first creates a
+      `Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
+      Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters
+      (hence the name: MM is for matrix multiplication).
+      It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
+      ``(no of channels * filter width * filter height, output width * output height)``.
+      As it provides a gradient, you can use it as a replacement for nnet.conv2d.
+      Alternatively, you can use nnet.conv2d and allow Theano's graph optimizer
+      to replace it by the GEMM version by setting
+      ``THEANO_FLAGS=optimizer_including=conv_gemm`` in your environment.
+      This is not enabled by default because it uses some extra memory, but the
+      overhead is small compared to conv2d_fft, there are no restrictions on
+      input or kernel shapes and it is sometimes still faster than cuda-convnet.
+      If using it, please see the warning about a bug in CUDA 5.0 to 6.0 below.
+      To enable it for just one Theano function:
+      .. code-block:: python
+          mode = theano.compile.get_default_mode()
+          mode = mode.including('conv_gemm')
+          f = theano.function(..., mode=mode)
    - :func:`conv3D <theano.tensor.nnet.Conv3D.conv3D>`
-      3D Convolution. Doesn't work on the GPU.
+      3D Convolution applying multi-channel 3D filters to batches of
+      multi-channel 3D images.
    - :func:`conv3d_fft <theano.sandbox.cuda.fftconv.conv3d_fft>`
      GPU-only version of conv3D using FFT transform. conv3d_fft should
-      not be call directly as it does not implement a grad function.
+      not be called directly as it does not provide a gradient.
-      You can enable it by setting THEANO_FLAGS to
+      Instead, use conv3D and allow Theano's graph optimizer to replace it by
-      'optimizer_including=conv3d_fft:convgrad3d_fft:convtransp3d_fft'
+      the FFT version by setting
-      It does not support strides.
+      ``THEANO_FLAGS=optimizer_including=conv3d_fft:convgrad3d_fft:convtransp3d_fft``
-      This is not enabled by default because it uses more memory.
+      in your environment. This is not enabled by default because it does not
-      Also note that it requires CUDA >= 5.0,
+      support strides and uses more memory. Also note that it requires
-      scikits.cuda >= 0.5.0 and PyCUDA to run.
+      CUDA >= 5.0, scikits.cuda >= 0.5.0 and PyCUDA to run.
      To enable for just one Theano function:
      .. code-block:: python
@@ -70,33 +116,10 @@ TODO: Give examples for how to use these things! They are pretty complicated.
    - :func:`conv3d2d <theano.tensor.nnet.conv3d2d.conv3d>`
      Another conv3d implementation that uses the conv2d with data reshaping.
      It is faster in some cases than conv3d, specifically on the GPU.
-    - `Faster conv2d <http://deeplearning.net/software/pylearn2/library/alex.html>`_
-      This is in Pylearn2, not very documented and uses a different
-      memory layout for the input. It is important to have the input
-      in the native memory layout, and not use dimshuffle on the
-      inputs, otherwise you lose most of the speed up. So this is not
-      a drop in replacement of conv2d.
-      Normally those are called from the `linear transform
-      <http://deeplearning.net/software/pylearn2/library/linear.html>`_
-      implementation.
-      Also, there is restrictions on which shape are supported.
-    - :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`
-      This is a GPU-only version of a correlation that computes correlations
-      as `caffe <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_.
-      For each element in a batch, it first creates a 
-      `Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a cuda kernel.
-      Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the kernel.
-      It need extra memory equal to the size of the Toeplitz matrix. Precisely, 
-      the dimensions of this 2D Toeplitz matrix is equal to
-      ``(no of channels * filter width * filter height, output width * output height)``.
-      You can enable it for call to conv2d 2d by setting ``THEANO_FLAGS=optimizer_including=conv_gemm``
-      in your environment. This is not enabled by default because it
-      uses some extra memory. MM mean matrix multiply.
 .. autofunction:: theano.tensor.nnet.conv.conv2d
+.. autofunction:: theano.sandbox.cuda.fftconv.conv2d_fft
+.. autofunction:: theano.sandbox.cuda.blas.GpuCorrMM
 .. autofunction:: theano.tensor.nnet.Conv3D.conv3D
+.. autofunction:: theano.sandbox.cuda.fftconv.conv3d_fft
 .. autofunction:: theano.tensor.nnet.conv3d2d.conv3d
-.. autofunction:: theano.sandbox.cuda.fftconv.conv2d_fft
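Note on the GpuCorrMM documentation above: the "Toeplitz matrix, then ``gemm``" idea can be sketched in plain Python for a single image. This is only an illustration under simplifying assumptions (valid mode, stride 1, nested lists instead of GPU arrays); the names ``im2col``, ``matmul`` and ``corr2d_mm`` are hypothetical and are not part of Theano or of this pull request.

```python
def im2col(image, kh, kw):
    """Unroll all kh x kw patches of a multi-channel image.

    image is a list of channels, each a list of rows.
    Returns a matrix with (channels * kh * kw) rows and
    (out_h * out_w) columns, plus the output height and width.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    out_h, out_w = height - kh + 1, width - kw + 1
    cols = []
    for c in range(channels):
        for dy in range(kh):
            for dx in range(kw):
                # One row per (channel, filter row, filter column) triple.
                cols.append([image[c][y + dy][x + dx]
                             for y in range(out_h)
                             for x in range(out_w)])
    return cols, out_h, out_w


def matmul(a, b):
    """Naive matrix product standing in for the gemm call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def corr2d_mm(image, filters):
    """Valid 2D correlation of one image with a bank of filters.

    filters is a list of filters, each indexed as [channel][row][column].
    Returns one output map (list of rows) per filter.
    """
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    cols, out_h, out_w = im2col(image, kh, kw)
    # Flatten each filter in the same (channel, row, column) order as im2col.
    weights = [[f[c][dy][dx]
                for c in range(len(f))
                for dy in range(kh)
                for dx in range(kw)]
               for f in filters]
    flat = matmul(weights, cols)
    # Reshape each flat output row back into out_h rows of out_w values.
    return [[row[y * out_w:(y + 1) * out_w] for y in range(out_h)]
            for row in flat]
```

For a 3x3 single-channel image and one 2x2 filter, the unrolled matrix has 1 * 2 * 2 = 4 rows and 2 * 2 = 4 columns, which matches the shape formula quoted in the documentation.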
--- a/theano/sandbox/cuda/blas.py
+++ b/theano/sandbox/cuda/blas.py
--- a/theano/sandbox/cuda/caffe_common.hpp
+++ b/theano/sandbox/cuda/caffe_common.hpp
-/*
-Copyright (c) 2014, The Regents of the University of California (Regents)
-All rights reserved.
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met: 
-1. Redistributions of source code must retain the above copyright notice, this
-   list of conditions and the following disclaimer. 
-2. Redistributions in binary form must reproduce the above copyright notice,
-   this list of conditions and the following disclaimer in the documentation
-   and/or other materials provided with the distribution. 
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
-ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-*/
-#ifndef CAFFE_COMMON_HPP_
-#define CAFFE_COMMON_HPP_
-#include <cublas_v2.h>
-#include <cuda.h>
-#include <driver_types.h>  // cuda driver types
-// CUDA: thread number configuration.
-// Use 1024 threads per block, which requires cuda sm_2x or above,
-// or fall back to attempt compatibility (best of luck to you).
-#if __CUDA_ARCH__ >= 200
-    const int CAFFE_CUDA_NUM_THREADS = 1024;
-#else
-    const int CAFFE_CUDA_NUM_THREADS = 512;
-#endif
-// CUDA: number of blocks for threads.
-inline int CAFFE_GET_BLOCKS(const int N) {
-  return (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
-}
-#endif  // CAFFE_COMMON_HPP_
--- a/theano/sandbox/cuda/conv_gemm.cu
+++ b/theano/sandbox/cuda/conv_gemm.cu
--- a/theano/sandbox/cuda/opt.py
+++ b/theano/sandbox/cuda/opt.py
@@ -25,7 +25,8 @@ from theano.sandbox.cuda.basic_ops import (
    GpuIncSubtensor, gpu_alloc, GpuAlloc, gpu_shape)
 from theano.sandbox.cuda.type import CudaNdarrayType
 from theano.sandbox.cuda.blas import (gpu_dot22, gpu_dot22scalar,
-        gpu_gemm_inplace, gpu_gemm_no_inplace, GpuConv, GpuCorrMM)
+        gpu_gemm_inplace, gpu_gemm_no_inplace, GpuConv,
+        GpuCorrMM, GpuCorrMM_gradInputs, GpuCorrMM_gradWeights)
 from theano.sandbox.cuda.blas import gpu_gemv_inplace
 from theano.sandbox.cuda.blas import gpu_gemv_no_inplace
 from theano.sandbox.cuda.blas import gpu_ger_inplace
@@ -1121,6 +1122,8 @@ def local_gpu_conv(node):
                    version=op.version,
                    verbose=op.verbose,
                    imshp=op.imshp,
+                    nkern=op.nkern,
+                    bsize=op.bsize,
                    fft_opt=op.fft_opt
                    )
        if op.imshp_logical is not None:
@@ -1206,15 +1209,25 @@ def _gpu_conv_to_fftconv(node):
        node.op.imshp[-1] is not None and
        node.op.imshp[-1] % 2 == 1):
        kwargs['pad_last_dim'] = True
-    # TODO: If the user supplied the full nonsymbolic image_shape and
+    # If the user supplied the full nonsymbolic image_shape and
-    # filter_shape in conv2d(), we could pass it on to conv2d_fft(). However,
+    # filter_shape in conv2d(), we can pass it on to conv2d_fft().
-    # information on batch size and channel counts is currently discarded
+    if ((node.op.imshp is not None) and
-    # when a ConvOp is replaced by a GpuConv, so this would need more changes.
+            (len(node.op.imshp) == 3) and
-    #if (node.op.imshp is not None) and (None not in node.op.imshp):
+            (None not in node.op.imshp) and
-    #    kwargs['image_shape'] = (bsize, inchannels) + node.op.imshp
+            (node.op.bsize is not None)):
-    #if (node.op.kshp is not None) and (None not in node.op.kshp):
+        kwargs['image_shape'] = (node.op.bsize,) + node.op.imshp
-    #    kwargs['filter_shape'] = (outchannels, inchannels) + node.op.kshp
+    if ((node.op.kshp is not None) and
-    return conv2d_fft(node.inputs[0], node.inputs[1], **kwargs)
+            (None not in node.op.kshp) and
+            (node.op.nkern is not None) and
+            (len(node.op.imshp) == 3) and
+            (node.op.imshp[0] is not None)):
+        kwargs['filter_shape'] = (node.op.nkern, node.op.imshp[0]) + node.op.kshp
+    rval = conv2d_fft(node.inputs[0], node.inputs[1], **kwargs)
+    if ('image_shape' in kwargs) or ('filter_shape' in kwargs):
+        # With given shape information, conv2d_fft may return a different
+        # broadcast pattern than GpuConv. This is forbidden, so we fix it.
+        rval = tensor.patternbroadcast(rval, node.outputs[0].type.broadcastable)
+    return rval
 @local_optimizer([GpuConv])
@@ -1351,10 +1364,55 @@ def local_conv_gemm(node):
    if (isinstance(node.op, GpuConv) and
        node.op.border_mode in ['full', 'valid']):
        img, kern = node.inputs
-        img = gpu_contiguous(img)
+        border_mode = node.op.border_mode
-        kern = kern[:, :, ::-1, ::-1]
+        subsample = node.op.subsample
-        kern = gpu_contiguous(kern)
+        pad = (0,0)
-        return [GpuCorrMM(node.op.border_mode, node.op.subsample)(img, kern)]
+        if (border_mode == 'full') and (subsample != (1,1)):
+            # need to simulate this via a padded valid convolution
+            pad = 'full'
+            border_mode = 'valid'
+        if (border_mode == 'valid'):
+            # need to flip the kernel for valid convolution
+            kern = kern[:, :, ::-1, ::-1]
+            # call GpuCorrMM or GpuCorrMM_gradWeights
+            # (the latter is faster if batchsize * kernelHeight * kernelWidth
+            # is larger than inputChannels * outputHeight * outputWidth.
+            # GpuConv does not always store information on the batchsize and
+            # channels, though, so we only use what information we have.)
+            if ((subsample == (1,1)) and
+                    (node.op.imshp is not None) and
+                    (None not in node.op.imshp[-2:]) and
+                    (node.op.kshp is not None) and
+                    (None not in node.op.kshp)):
+                # we know the kernel and output size
+                prod1 = node.op.kshp[0] * node.op.kshp[1]
+                prod2 = ((node.op.imshp[-2] - node.op.kshp[0] + 1) *
+                    (node.op.imshp[-1] - node.op.kshp[1] + 1))
+                if ((node.op.bsize is not None) and
+                        (len(node.op.imshp) == 3) and
+                        (node.op.imshp[0] is not None)):
+                    # we also know batchsize and input channels
+                    prod1 *= node.op.bsize
+                    prod2 *= node.op.imshp[0]
+                # compare to decide
+                if prod1 > prod2:
+                    # (we need to wrap the result in as_cuda_ndarray_variable,
+                    # because we are not allowed to replace a CudaNdarray with
+                    # a DimShuffle instance in a graph optimization)
+                    return [theano.sandbox.cuda.as_cuda_ndarray_variable(
+                            GpuCorrMM_gradWeights('valid', subsample, pad)(
+                            gpu_contiguous(img.dimshuffle(1, 0, 2, 3)),
+                            gpu_contiguous(kern.dimshuffle(1, 0, 2, 3))
+                            ).dimshuffle(1, 0, 2, 3))]
+            # use GpuCorrMM if we did not choose GpuCorrMM_gradWeights above
+            return [GpuCorrMM('valid', subsample, pad)(
+                    gpu_contiguous(img), gpu_contiguous(kern))]
+        elif (border_mode == 'full'):
+            # need to dimshuffle the kernel for full convolution
+            kern = kern.dimshuffle(1, 0, 2, 3)
+            # call GpuCorrMM_gradInputs
+            return [GpuCorrMM_gradInputs('valid', subsample, pad)(
+                    gpu_contiguous(kern), gpu_contiguous(img))]
 gpu_optimizer.register("conv_gemm", local_conv_gemm)

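Note on the "padded valid convolution" comment in local_conv_gemm above: a full convolution gives the same result as a valid correlation of the zero-padded input with the flipped kernel. The sketch below checks this identity for a single-channel image at stride 1, using nested lists. All function names are hypothetical and the code is independent of Theano.

```python
def conv2d_full(image, kernel):
    """Reference full convolution, straight from the definition."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for y in range(h):
        for x in range(w):
            for dy in range(kh):
                for dx in range(kw):
                    out[y + dy][x + dx] += image[y][x] * kernel[dy][dx]
    return out


def pad2d(image, pad_h, pad_w):
    """Zero-pad an image by pad_h rows and pad_w columns on each side."""
    width = len(image[0]) + 2 * pad_w
    top = [[0] * width for _ in range(pad_h)]
    bottom = [[0] * width for _ in range(pad_h)]
    middle = [[0] * pad_w + list(row) + [0] * pad_w for row in image]
    return top + middle + bottom


def corr2d_valid(image, kernel):
    """Valid correlation (no kernel flipping)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def conv2d_full_via_padding(image, kernel):
    """Full convolution as a padded valid correlation, flipped kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    return corr2d_valid(pad2d(image, kh - 1, kw - 1), flipped)
```

Padding by (kernel size - 1) on each side gives the valid correlation an output of size (input + kernel - 1), which is the output size of a full convolution.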
--- a/theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py
+++ b/theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py