merge

45f07e93 · James Bergstra · ea627155 · b60891b6 · 45f07e93 · 45f07e93
--- a/doc/extending/pipeline.txt
+++ b/doc/extending/pipeline.txt
@@ -101,14 +101,13 @@ case if ``borrow`` was True, the thunk would be allowed to reuse (or
    Compiled libraries are stored within a specific compilation directory,
    which by default is set to ``$HOME/.theano/compiledir_xxx``, where
    ``xxx`` identifies the platform. It may be manually set to a different
-    location either by setting the ``THEANO_COMPILEDIR`` environment variable,
-    the ``THEANO_BASE_COMPILEDIR`` environment variable
-    or by calling ``theano.gof.compiledir.set_compiledir(..)`` within your
-    Python script.
+    location by setting :attr:`config.compiledir` or
+    :attr:`config.base_compiledir`, either within your Python script or by
+    using one of the configuration mechanisms described in :mod:`config`.

    The compile cache is based upon the C++ code of the graph to be compiled.
-    So, if you change compilation environment variables, such as
-    ``THEANO_BLAS_LDFLAGS``, you will need to manually remove your compile cache,
+    So, if you change compilation configuration variables, such as
+    :attr:`config.blas.ldflags`, you will need to manually remove your compile cache,
    using ``Theano/bin/theano-compiledir clear``

    Theano also implements a lock mechanism that prevents

--- a/doc/extending/unittest.txt
+++ b/doc/extending/unittest.txt
@@ -260,21 +260,22 @@ Example:
 >>> m = theano.Module()
 >>> minstance = m.make(mode='DEBUG_MODE')

-Whenever possible, unit tests should omit this parameter. Leaving-out
-the mode will ensure that unit tests use the default mode (defined in
-compile.mode.default_mode). This default_mode is set to the
-THEANO_DEFAULT_MODE environment variable, if it is present. If not, it
-defaults to 'FAST_RUN'.
-
-This allows the user to easily switch the mode in which unittests are
+Whenever possible, unit tests should omit this parameter. Leaving
+out the mode will ensure that unit tests use the default mode
+(defined in compile.mode.default_mode). This default_mode is set to
+the configuration variable :attr:`config.mode`, which defaults to
+'FAST_RUN', and can be set by various mechanisms (see :mod:`config`).
+
+In particular, the environment variable :envvar:`THEANO_FLAGS`
+allows the user to easily switch the mode in which unittests are
 run. For example to run all tests in all modes from a BASH script,
 type this:

 .. code-block:: bash

-    THEANO_DEFAULT_MODE=FAST_COMPILE nosetests
-    THEANO_DEFAULT_MODE=FAST_RUN nosetests
-    THEANO_DEFAULT_MODE=DEBUG_MODE nosetests
+    THEANO_FLAGS='mode=FAST_COMPILE' nosetests
+    THEANO_FLAGS='mode=FAST_RUN' nosetests
+    THEANO_FLAGS='mode=DEBUG_MODE' nosetests

 Using Random Values in Test Cases
 ---------------------------------
@@ -299,14 +300,12 @@ do the following:

 The behaviour of seed_rng is as follows:

-* If an explicit seed is given, it will be used for seending numpy's rng.
-
-* If not, it will try to get a seed from the THEANO_UNITTEST_SEED variable.
+* If an explicit seed is given, it will be used for seeding numpy's rng.

-* If THEANO_UNITTEST_SEED is set to "random", it will seed the
-  rng. with None, which is equivalent to seeding with a random seed.
+* If not, it will use ``config.unittests.rseed`` (its default value is 666).

-* If THEANO_UNITTEST_SEED is not defined, it will use a default seed of 666.
+* If ``config.unittests.rseed`` is set to "random", it will seed the rng with
+  None, which is equivalent to seeding with a random seed.


 The main advantage of using unittest_tools.seed_rng is that it allows
@@ -317,7 +316,8 @@ a higher confidence that the variables are correct), while still
 making sure unittests are deterministic.

 Users who prefer their unittests to be random (when run on their local
-machine) can simply set THEANO_UNITTEST_SEED to 'random'.
+machine) can simply set ``config.unittests.rseed`` to 'random' (see
+:mod:`config`).

 Similarly, to provide a seed to numpy.random.RandomState, simply use:


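Note on the ``seed_rng`` bullets in the unittest.txt hunk above: the seed-selection order they describe can be sketched as follows. This is only an illustration under stated assumptions, not the code in ``unittest_tools``. The helper name ``resolve_seed`` and its ``configured`` parameter are hypothetical, and the standard-library ``random`` module stands in for numpy's rng so that the sketch runs without extra packages.

```python
import random


def resolve_seed(explicit_seed=None, configured="666"):
    """Return the value that would be handed to the RNG constructor.

    `configured` is a hypothetical stand-in for the configured seed value,
    which the documentation says defaults to 666.
    """
    if explicit_seed is not None:
        # An explicit seed always wins.
        return explicit_seed
    if configured == "random":
        # None asks the RNG to pick its own (non-deterministic) seed.
        return None
    return int(configured)


# Stand-in for seeding numpy's rng: deterministic unless "random" is configured.
rng = random.Random(resolve_seed())
```

With the default arguments the tests stay deterministic; passing ``configured="random"`` reproduces the opt-in random behaviour described for local runs.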
--- a/doc/index.txt
+++ b/doc/index.txt
@@ -40,7 +40,7 @@ Roughly in order of what you'll want to check out:
 * :ref:`internal` -- How to maintaining Theano, LISA-specific tips, and more...
 * `API <api/>`_ -- The automatically-generated API

-You can download the latest `PDF documentation <http://deeplearning.net/theanodoc/theano.pdf>`_, rather than reading it online.
+You can download the latest `PDF documentation <http://deeplearning.net/software/theano/theano.pdf>`_, rather than reading it online.

 Community
 =========

--- a/doc/install.txt
+++ b/doc/install.txt
@@ -339,9 +339,9 @@ Generating the documentation
 ----------------------------

 You can read the latest HTML documentation `here
-<http://deeplearning.net/theanodoc>`__.
+<http://deeplearning.net/software/theano>`__.
 You can download the latest PDF documentation `here
-<http://deeplearning.net/theanodoc/theano.pdf>`__.
+<http://deeplearning.net/software/theano/theano.pdf>`__.

 We recommend you look at the documentation on the website, since it
 will be more current than the documentation included with the package.

--- a/doc/internal/lisa_labo.txt
+++ b/doc/internal/lisa_labo.txt
@@ -8,8 +8,12 @@ LISA Labo specific instructions
 Tips for running at LISA
 ------------------------

-Use the fast BLAS library that Fred installed, by setting
-`THEANO_BLAS_LDFLAGS=-lgoto`.
+Shell configuration files ``/opt/lisa/os/.local.{bash,csh}rc`` should define
+:envvar:`THEANORC` to include ``/opt/lisa/os/.local.theanorc`` as a
+configuration file.
+
+``/opt/lisa/os/.local.theanorc`` should include the right default values for
+the lab, in particular, ``blas.ldflags`` should contain '-lgoto'.

 Tips for running on a cluster
 -----------------------------

--- a/doc/internal/mammouth.txt
+++ b/doc/internal/mammouth.txt
@@ -14,7 +14,8 @@ To run Theano on the Mammouth cluster, follow these simple steps:

      Perhaps even put this in your ``.bashrc``

-    * ``set THEANO_BLAS_LDFLAGS='-lmkl -lguide -fopenmp'``
+    * set ``config.blas.ldflags`` to ``'-lmkl -lguide -fopenmp'``
+      (see :mod:`config` for how to set it)

      Note: the -lguide flag works, however the fix should probably be considered temporary.
      Intel has deprecated libguide.so in favor of the newer library libiomp5.so. However, 

--- a/doc/internal/metadocumentation.txt
+++ b/doc/internal/metadocumentation.txt
@@ -145,7 +145,7 @@ Then it executes something like

 .. code-block:: bash

-    THEANO_UNITTEST_SEED=<SEED> THEANO_DEFAULT_MODE=DEBUG_MODE /usr/bin/nosetests --with-coverage --cover-package=theano --cover-package=pylearn
+    THEANO_FLAGS='unittests.rseed=<SEED>,mode=DEBUG_MODE' /usr/bin/nosetests --with-coverage --cover-package=theano --cover-package=pylearn

 in the updated ``theano`` directory.
 The output is emailed automatically to one of the developers.

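Note on the ``THEANO_FLAGS`` command line in the metadocumentation.txt hunk above: the value is a comma-separated list of ``key=value`` pairs. The sketch below shows one way such a string could be split. It is an assumption-laden illustration; ``parse_flags`` is a hypothetical helper, not Theano's own parser, and the seed ``1234`` is a made-up placeholder for ``<SEED>``.

```python
def parse_flags(flags):
    """Split a 'key=value,key=value' string into a dict (illustrative only)."""
    result = {}
    for item in flags.split(","):
        item = item.strip()
        if not item:
            # Tolerate empty input and stray commas.
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        result[key.strip()] = value.strip()
    return result


# Same shape as the buildbot invocation, with a placeholder seed.
settings = parse_flags("unittests.rseed=1234,mode=DEBUG_MODE")
```

Values are kept as strings here; converting them to typed settings is left out of the sketch.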
--- a/doc/introduction.txt
+++ b/doc/introduction.txt
@@ -130,7 +130,7 @@ Getting started


 A PDF version of the online documentation may be found `here
-<http://deeplearning.net/theanodoc/theano.pdf>`_.
+<http://deeplearning.net/software/theano/theano.pdf>`_.


 Contact us

--- a/doc/library/compile/debugmode.txt
+++ b/doc/library/compile/debugmode.txt
@@ -35,7 +35,7 @@ DebugMode can be used as follows:
    f(0) 
    f(7) 

-It can also be used by setting an environment variable ``THEANO_DEFAULT_MODE=DEBUG_MODE``.
+It can also be used by setting the configuration variable :attr:`config.mode`.
 It can also be used by passing a DebugMode instance as the mode, as in 

 >>> f = theano.function([x], 10*x, mode=DebugMode(check_c_code=False))
@@ -78,47 +78,47 @@ Reference
    Each of these exceptions inherits from the more generic `DebugModeError`.

    If there are no internal errors, this mode behaves like FAST_RUN or FAST_COMPILE, but takes
-    a little longer and uses more memory.  
+    a little longer and uses more memory.

    If there are internal errors, this mode will raise an `DebugModeError` exception.

-    .. attribute:: stability_patience = config.THEANO_DEBUGMODE_PATIENCE
+    .. attribute:: stability_patience = config.DebugMode.patience

        When checking for the stability of optimization, recompile the graph this many times.
        Default 10.

-    .. attribute:: check_c_code = config.THEANO_DEBUGMODE_CHECK_C
+    .. attribute:: check_c_code = config.DebugMode.check_c

        Should we evaluate (and check) the `c_code` implementations?

        ``True`` -> yes, ``False`` -> no.
-        
+
        Default yes.

-    .. attribute:: check_py_code = config.THEANO_DEBUGMODE_CHECK_PY
+    .. attribute:: check_py_code = config.DebugMode.check_py

    Should we evaluate (and check) the `perform` implementations?

        ``True`` -> yes, ``False`` -> no.
-        
+
        Default yes.

-    .. attribute:: check_isfinite = config.THEANO_DEBUGMODE_CHECK_FINITE
+    .. attribute:: check_isfinite = config.DebugMode.check_finite

        Should we check for (and complain about) ``NaN``/``Inf`` ndarray elements?

        ``True`` -> yes, ``False`` -> no.
-        
+
        Default yes.

-    .. attribute:: require_matching_strides = config.THEANO_DEBUGMODE_CHECK_STRIDES
+    .. attribute:: require_matching_strides = config.DebugMode.check_strides

        Check for (and complain about) Ops whose python and C
        outputs are ndarrays with different strides. (This can catch bugs, but
-        is generally overly strict.) 
-        
+        is generally overly strict.)
+
        0 -> no check, 1 -> warn, 2 -> err.
-        
+
        Default warn.

    .. method:: __init__(self, optimizer='fast_run', stability_patience=None, check_c_code=None, check_py_code=None, check_isfinite=None, require_matching_strides=None, linker=None)
@@ -128,7 +128,7 @@ Reference
        If any of these arguments (except optimizer) is not None, it overrides the class default.
        The linker arguments is not used. It is set their to allow Mode.requiring() and some other fct to work with DebugMode too.

-    
+

 The keyword version of DebugMode (which you get by using ``mode='DEBUG_MODE``)
 is quite strict, and can raise several different Exception types.

--- a/doc/library/compile/function.txt
+++ b/doc/library/compile/function.txt
@@ -134,7 +134,7 @@ Reference
    about how output variables should be returned.

    The default is typically 'FAST_RUN' but this can be changed in
-    :doc:`theano.config <../config>` or via :envvar:`THEANO_DEFAULT_MODE`.  The mode
+    :doc:`theano.config <../config>`.  The mode
    argument controls the sort of optimizations that will be applied to the
    graph, and the way the optimized graph will be evaluated.


--- a/doc/library/compile/mode.txt
+++ b/doc/library/compile/mode.txt
@@ -10,21 +10,21 @@
 Guide
 =====

-The ``mode`` parameter to :func:`theano.function`` controls how the
+The ``mode`` parameter to :func:`theano.function` controls how the
 inputs-to-outputs graph is transformed into a callable object.

 Theano defines the following modes by name:

- ``FAST_COMPILE``: Apply just a few optimizations, but use C op implementations where possible.
- ``FAST_RUN``: Apply all optimizations, and use C op implementations where possible.
- ``DEBUG_MODE``: Verify the correctness of all optimizations, and compare C and python 
+- ``'FAST_COMPILE'``: Apply just a few graph optimizations, but use C implementations where possible.
+- ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
+- ``'DEBUG_MODE'``: Verify the correctness of all optimizations, and compare C and python 
    implementations. This mode can take much longer than the other modes, 
    but can identify many kinds of problems.

-The default mode is typically 'FAST_RUN', but it can be controlled via the
-environment variable 'THEANO_DEFAULT_MODE', which can in turn be overridden by
+The default mode is typically ``FAST_RUN``, but it can be controlled via the
+configuration variable :attr:`config.mode`, which can in turn be overridden by
 setting ``theano.compile.mode.default_mode`` directly, which can in turn be
-overridden by passing the keyword argument to ``theano.function``.
+overridden by passing the keyword argument to :func:`theano.function`.

 .. TODO::


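Note on the override chain in the mode.txt hunk above (configuration variable, then ``default_mode``, then the keyword argument): the precedence can be sketched as a small function. This is a hypothetical illustration of the documented order only; ``resolve_mode`` and its parameter names are not part of Theano.

```python
def resolve_mode(config_mode="FAST_RUN", default_mode_override=None, mode_kwarg=None):
    """Pick the effective mode name, most specific setting first."""
    if mode_kwarg is not None:
        # A mode passed directly to the function call wins.
        return mode_kwarg
    if default_mode_override is not None:
        # Next comes a default_mode assigned directly in the program.
        return default_mode_override
    # Otherwise fall back to the configuration value.
    return config_mode


effective = resolve_mode(config_mode="FAST_COMPILE", mode_kwarg="DEBUG_MODE")
```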
--- a/doc/library/floatX.txt
+++ b/doc/library/floatX.txt
-
-.. _libdoc_floatX:
-
-=======================================================================
-:mod:`floatX` -- Switching Between 'float32' and 'float64'
-=======================================================================
-
-.. module:: floatX
-   :platform: Unix, Windows
-   :synopsis: easy switching between float32 and float64
-.. moduleauthor:: LISA
-
-
-Guide
-=====
-
-On the CPU, 'float32' computations are often twice as fast as 'float64'
-and are half the size.
-On GPUs the speed difference between 'float32'``  and 'float64' is much greater.
-Often we develop our code using double-precision expressions, and then wonder if
-we might get the same answer much more quickly with single-precision arithmetic.
-If we have used ``tensor.dmatrix`` and ``tensor.dvector`` and so on throughout
-our code, it could be tedious to switch to single-precision Variables.  To make
-switching precisions easier, Theano provides the ``floatX`` module.
-
->>> from theano.floatX import xmatrix, xvector, xtensor4
->>> import numpy
->>> a = xvector('a')
->>> b = xmatrix()
->>> c = xtensor4()
-
-These calls are identical to ``dvector``, ``dmatrix``, and ``dtensor4`` by default, but a
-single environment variable can switch them to ``fvector``, ``fmatrix`` and ``ftensor4``.
-
-You can set the floatX precision via ``floatX`` in the :envvar:`THEANO_FLAGS`.
-It defaults to ``'float64'``. To set it to ``'float32'`` in *bash* for example, type ``export THEANO_FLAGS=floatX=float64``.
-To set it from within your program call :func:`set_floatX`
-
-The current floatX precision is stored in ``theano.config.floatX`` as a string.
-Its value is either 'float32' or 'float64'. 
-So it is easy to allocate a numpy vector of the floatX dtype.
-
->>> import theano.config as config
->>> print config.floatX   # either 'float32' or 'float64'
->>> x = numpy.asarray([1,2,3], dtype=config.floatX)
-
-Reference
-==========
-
-.. function:: xscalar(name=None)
-
-    Alias for either :func:`dscalar` or :func:`fscalar`
-
-.. function:: xvector(name=None)
-
-    Alias for either :func:`dvector` or :func:`fvector`
-
-.. function:: xmatrix(name=None)
-
-    Alias for either :func:`dmatrix` or :func:`fmatrix`
-
-.. function:: xrow(name=None)
-
-    Alias for either :func:`drow` or :func:`frow`
-
-.. function:: xcol(name=None)
-
-    Alias for either :func:`dcol` or :func:`fcol`
-
-.. function:: xtensor3(name=None)
-
-    Alias for either :func:`dtensor3` or :func:`ftensor3`
-
-.. function:: xtensor4(name=None)
-
-    Alias for either :func:`dtensor4` or :func:`ftensor4`
-
-.. function:: set_floatX(dtype=config.floatX)
-
-    Reset the :func:`xscalar`, ... :func:`xtensor4` aliases to return Variables with given dtype.
-    This is called at import-time when setting floatX in :envvar:`THEANO_FLAGS`.
-
-
-
--- a/doc/library/tensor/basic.txt
+++ b/doc/library/tensor/basic.txt
@@ -263,7 +263,7 @@ them perfectly, but a dscalar otherwise.

 .. note::

-    When config.floatX==float32 (see :module:`config`), then Python floats
+    When config.floatX==float32 (see :mod:`config`), then Python floats
    are stored instead as single-precision floats.

    For fine control of this rounding policy, see

--- a/doc/tutorial/modes.txt
+++ b/doc/tutorial/modes.txt
@@ -22,7 +22,7 @@ Theano defines the following modes by name:
    but can identify many kinds of problems.

 The default mode is typically ``FAST_RUN``, but it can be controlled via
-the environment variable ``THEANO_DEFAULT_MODE``, which can in turn be
+the configuration variable :attr:`config.mode`, which can in turn be
 overridden by setting `theano.compile.mode.default_mode` directly,
 which can in turn be overridden by passing the keyword argument to
 :func:`theano.function <function.function>`.
@@ -30,7 +30,6 @@ which can in turn be overridden by passing the keyword argument to
 ================= =============================================================== ===============================================================================
 short name        Full constructor                                                What does it do?
 ================= =============================================================== ===============================================================================
-(default)         ``compile.mode.Mode(linker='py', optimizer=None)``              Python implementations with zero graph modifications.
 FAST_COMPILE      ``compile.mode.Mode(linker='c|py', optimizer='fast_compile')``  C implementations where available, quick and cheap graph transformations
 FAST_RUN          ``compile.mode.Mode(linker='c|py', optimizer='fast_run')``      C implementations where available, all available graph transformations.
 DEBUG_MODE        ``compile.debugmode.DebugMode()``                               Both implementations where available, all available graph transformations.

--- a/doc/tutorial/using_gpu.txt
+++ b/doc/tutorial/using_gpu.txt
@@ -16,7 +16,7 @@ Setting up CUDA

 The first thing you'll need for Theano to use your GPU is Nvidia's
 GPU-programming toolchain.  You should install at least the CUDA driver and the CUDA Toolkit, as 
-:ref:`described here <http://www.nvidia.com/object/cuda_get.html>`.  The CUDA
+`described here <http://www.nvidia.com/object/cuda_get.html>`_.  The CUDA
 Toolkit installs a folder on your computer with subfolders *bin*, *lib*,
 *include*, and some more too.  (Sanity check: The *bin* subfolder should contain an *nvcc*
 program which is the compiler for GPU code.)  This folder is called the *cuda

--- a/theano/compile/mode.py
+++ b/theano/compile/mode.py
@@ -221,12 +221,8 @@ predefined_modes = {'FAST_COMPILE': FAST_COMPILE,
                    'SANITY_CHECK': SANITY_CHECK}


-##
-# The default mode used by functions and modules is read from the environment
-# variable THEANO_DEFAULT_MODE. Unit tests will run using this value. If the env. var.
-# is not set, it will default to 'FAST_RUN'
+# The default mode used by functions and modules is read from the configuration.
 # keep default_mode.optimizer==default_optimizer and default_mode.linker==default_linker!
-##
 default_mode = config.mode

 def get_mode(string):

--- a/theano/compile/profilemode.py
+++ b/theano/compile/profilemode.py
@@ -319,10 +319,10 @@ class ProfileMode(Mode):
 register_mode('PROFILE_MODE',ProfileMode())

 def atexit_print_default_profile_mode():
-    """Print the summary of the predefied mode PROFILE_MODE if used.
+    """Print the summary of the predefined mode PROFILE_MODE if used.
    
-    This all to have the summary printed at exit when we do
-    THEANO_DEFAULT_MODE=PROFILE_MODE
+    This allows the summary to be printed at exit when
+    config.mode=PROFILE_MODE
    """
    prof_mode=predefined_modes["PROFILE_MODE"]
    if prof_mode.local_time[0]>0:

--- a/theano/sandbox/cuda/__init__.py
+++ b/theano/sandbox/cuda/__init__.py
@@ -18,16 +18,19 @@ def debug(*msg):

 # Compile cuda_ndarray.cu
 # This need that nvcc (part of cuda) is installed. If it is not, a warning is
-# printed and this module will not be working properly (we set `enable_cuda`
+# printed and this module will not be working properly (we set `cuda_available`
 # to False).

 # This variable is True by default, and set to False if something goes wrong
 # when trying to initialize cuda.
-enable_cuda = True
+cuda_available = True

 # Global variable to avoid displaying the same warning multiple times.
 cuda_warning_is_displayed = False

+# This variable is set to True when cuda is enabled (i.e. when use() is called).
+cuda_enabled = False
+
 # Code factorized within a function so that it may be called from multiple
 # places (which is not currently the case, but may be useful in the future).
 def set_cuda_disabled():
@@ -38,8 +41,8 @@ def set_cuda_disabled():
    Note that there is no point calling this function from outside of
    `cuda.__init__`, since it has no effect once the module is loaded.
    """
-    global enable_cuda, cuda_warning_is_displayed
-    enable_cuda = False
+    global cuda_available, cuda_warning_is_displayed
+    cuda_available = False
    if not cuda_warning_is_displayed:
        cuda_warning_is_displayed = True
        warning('Cuda is disabled, cuda-based code will thus not be '
@@ -70,7 +73,7 @@ try:
        if not nvcc_compiler.is_nvcc_available():
            set_cuda_disabled()

-        if enable_cuda:
+        if cuda_available:
            code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()

            if not os.path.exists(cuda_ndarray_loc):
@@ -84,7 +87,7 @@ except Exception, e:
    error( "Failed to compile cuda_ndarray.cu: %s" % str(e))
    set_cuda_disabled()

-if enable_cuda:
+if cuda_available:
    #check if their is an old cuda_ndarray that was loading instead of the one we compiled!
    import cuda_ndarray.cuda_ndarray
    if os.path.join(config.compiledir,'cuda_ndarray','cuda_ndarray.so')!=cuda_ndarray.cuda_ndarray.__file__:
@@ -104,7 +107,8 @@ if enable_cuda:
    import cuda_ndarray


-def use(device=config.device):
+def use(device):
+    global cuda_enabled
    if device.startswith('gpu'):
        device = int(device[3:])
    elif device == 'cpu':
@@ -122,8 +126,10 @@ def use(device=config.device):
            gpu_init(device)
            handle_shared_float32(True)
            use.device_number = device
+            cuda_enabled = True
        except RuntimeError, e:
            _logger.warning("ERROR: Not using GPU. Initialisation of device %i failed. %s" %(device, e))
+            cuda_enabled = False
    elif use.device_number != device:
        logging.getLogger('theano.sandbox.cuda').warning("WARNING: ignoring call to use(%s), GPU number %i is already in use." %(str(device), use.device_number))
    optdb.add_tags('gpu',
@@ -144,5 +150,6 @@ def handle_shared_float32(tf):
    else:
        raise NotImplementedError('removing our handler')

-if enable_cuda and config.device.startswith('gpu'):
-    use()
+if cuda_available and config.device.startswith('gpu'):
+    use(config.device)
+
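Note on the two flags introduced in cuda/__init__.py above: ``cuda_available`` records whether the cuda module could be set up at import time, while ``cuda_enabled`` only becomes true once ``use()`` has initialised a device. The sketch below mimics that split with a hypothetical ``CudaState`` class and an injectable ``gpu_init`` callable. Both are assumptions made so the example runs without a GPU, and treating a bare ``'gpu'`` as device 0 is also an assumption.

```python
class CudaState(object):
    """Hypothetical container for the two module-level flags."""

    def __init__(self, nvcc_found):
        self.cuda_available = nvcc_found  # could the module be compiled/loaded?
        self.cuda_enabled = False         # has a device been initialised?

    def use(self, device, gpu_init=lambda number: None):
        """Try to enable cuda on `device`; return the resulting enabled flag."""
        if not self.cuda_available or not device.startswith("gpu"):
            return self.cuda_enabled
        try:
            # 'gpu0' -> 0; a bare 'gpu' is treated as device 0 (assumption).
            gpu_init(int(device[3:] or 0))
            self.cuda_enabled = True
        except RuntimeError:
            self.cuda_enabled = False
        return self.cuda_enabled


def failing_init(number):
    """Stand-in for a device initialisation that fails."""
    raise RuntimeError("device %i unavailable" % number)


state = CudaState(nvcc_found=True)
state.use("gpu0")                 # initialisation succeeds
broken = CudaState(nvcc_found=True)
broken.use("gpu1", failing_init)  # initialisation fails
```

A test that only needs the module to import would check the first flag, while a test that actually runs on the GPU would check the second.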
--- a/theano/sandbox/cuda/opt.py
+++ b/theano/sandbox/cuda/opt.py
@@ -381,11 +381,11 @@ def local_gpu_conv(node):
            gpu_conv = GpuConvOp_from_ConvOp(node.op)
            return [host_from_gpu(gpu_conv(gpu_from_host(img), gpu_from_host(kern)))]

-import theano.sandbox.downsample
+import theano.tensor.signal.downsample as downsample
 @register_opt()
 @local_optimizer([])
 def local_gpu_downsample_factor_max(node):
-    if isinstance(node.op, theano.sandbox.downsample.DownsampleFactorMax):
+    if isinstance(node.op, downsample.DownsampleFactorMax):
        x, = node.inputs
        if (x.owner and x.owner.op == host_from_gpu):
            gpu_ds = GpuDownsampleFactorMax(node.op.ds, node.op.ignore_border)
@@ -394,7 +394,7 @@ def local_gpu_downsample_factor_max(node):
 @register_opt()
 @local_optimizer([])
 def local_gpu_downsample_factor_max_grad(node):
-    if isinstance(node.op, theano.sandbox.downsample.DownsampleFactorMaxGrad):
+    if isinstance(node.op, downsample.DownsampleFactorMaxGrad):
        x,z,gz = node.inputs
        if (x.owner and x.owner.op == host_from_gpu):
            gpu_ds_grad = GpuDownsampleFactorMaxGrad(node.op.ds, node.op.ignore_border)

--- a/theano/sandbox/cuda/tests/test_basic_ops.py
+++ b/theano/sandbox/cuda/tests/test_basic_ops.py
@@ -11,7 +11,7 @@ import theano.tensor as T
 # Skip test if cuda_ndarray is not available.
 from nose.plugins.skip import SkipTest
 import theano.sandbox.cuda as cuda_ndarray
-if cuda_ndarray.enable_cuda == False:
+if cuda_ndarray.cuda_enabled == False:
    raise SkipTest('Optional package cuda disabled')

 import theano.sandbox.cuda as tcn

--- a/theano/sandbox/cuda/tests/test_bench_loopfusion.py
+++ b/theano/sandbox/cuda/tests/test_bench_loopfusion.py
@@ -270,7 +270,7 @@ def test_bench_elemwise(n_iter=1000, **kwargs):
        # Skip test if cuda_ndarray is not available.
        from nose.plugins.skip import SkipTest
        import theano.sandbox.cuda as cuda_ndarray
-        if cuda_ndarray.enable_cuda == False:
+        if cuda_ndarray.cuda_enabled == False:
            raise SkipTest('Optional package cuda disabled')
        import theano.sandbox.cuda
        theano.sandbox.cuda.use()

--- a/theano/sandbox/cuda/tests/test_blas.py
+++ b/theano/sandbox/cuda/tests/test_blas.py
@@ -8,12 +8,12 @@ import numpy
 # Skip test if cuda_ndarray is not available.
 from nose.plugins.skip import SkipTest
 import theano.sandbox.cuda as cuda_ndarray
-if cuda_ndarray.enable_cuda == False:
+if cuda_ndarray.cuda_enabled == False:
    raise SkipTest('Optional package cuda disabled')

 import theano.sandbox.cuda as tcn

-from theano.sandbox.downsample import DownsampleFactorMax
+from theano.tensor.signal.downsample import DownsampleFactorMax

 import theano.compile.mode


--- a/theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py
+++ b/theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py
@@ -5,7 +5,7 @@ import theano
 # Skip test if cuda_ndarray is not available.
 from nose.plugins.skip import SkipTest
 import theano.sandbox.cuda as cuda_ndarray
-if cuda_ndarray.enable_cuda == False:
+if cuda_ndarray.cuda_enabled == False:
    raise SkipTest('Optional package cuda disabled')
    
 cuda_tensor4 = cuda_ndarray.CudaNdarrayType([False]*4)

--- a/theano/sandbox/cuda/tests/test_cuda_ndarray.py
+++ b/theano/sandbox/cuda/tests/test_cuda_ndarray.py
@@ -3,7 +3,7 @@ import theano
 import theano.sandbox.cuda as cuda_ndarray
 # Skip test if cuda_ndarray is not available.
 from nose.plugins.skip import SkipTest
-if cuda_ndarray.enable_cuda == False:
+if cuda_ndarray.cuda_enabled == False:
        raise SkipTest('Optional package cuda disabled')
 import numpy


--- a/theano/sandbox/cuda/tests/test_nnet.py
+++ b/theano/sandbox/cuda/tests/test_nnet.py
@@ -7,14 +7,14 @@ from theano import tensor
 import theano.tensor.nnet

 import theano.sandbox.conv
-import theano.sandbox.downsample
+import theano.tensor.signal.downsample as downsample

 import numpy

 # Skip test if cuda_ndarray is not available.
 from nose.plugins.skip import SkipTest
 import theano.sandbox.cuda as cuda_ndarray
-if cuda_ndarray.enable_cuda == False:
+if cuda_ndarray.cuda_enabled == False:
    raise SkipTest('Optional package cuda disabled')

 import theano.sandbox.cuda as tcn
@@ -307,7 +307,7 @@ def run_conv_nnet2_classif(use_gpu, isize, ksize, n_batch, n_iter,
    conv_op.set_flops()
    conv_op1.set_flops()

-    ds_op = theano.sandbox.downsample.DownsampleFactorMax((2,2), ignore_border=False)
+    ds_op = downsample.DownsampleFactorMax((2,2), ignore_border=False)
    if downsample_ops:
        hid = tensor.tanh(ds_op(conv_op(x, w0)+b0.dimshuffle((0,'x','x'))))
    else:

--- a/theano/sandbox/cuda/tests/test_opt.py
+++ b/theano/sandbox/cuda/tests/test_opt.py
@@ -8,7 +8,7 @@ import numpy
 # Skip test if cuda_ndarray is not available.
 from nose.plugins.skip import SkipTest
 import theano.sandbox.cuda as cuda_ndarray
-if cuda_ndarray.enable_cuda == False:
+if cuda_ndarray.cuda_available == False:
    raise SkipTest('Optional package cuda disabled')

 import theano.compile.mode

--- a/theano/sandbox/downsample.py
+++ b/theano/sandbox/downsample.py
-""" Ops for downsampling images.
+import sys
+print >> sys.stderr, "DEPRECATION: theano.sandbox.downsample is deprecated. Use theano.tensor.signal.downsample instead."

-Planned: 
-DownsampleFactorMax, DownsampleAvg, DownsampleSoftmax.
-
-"""
-#This file should move along with conv.py
-
-from theano import gof, Op, tensor, Variable, Apply
-from theano.printing import Print
-import numpy, theano
-import __builtin__
-
-class DownsampleFactorMaxGrad(Op):
-
-    def __init__(self, ds, ignore_border):
-        self.ds = tuple(ds)
-        self.ignore_border = ignore_border
-
-    def __eq__(self, other):
-        return type(self) == type(other) and self.ds == other.ds and self.ignore_border == other.ignore_border
-
-    def __hash__(self):
-        return hash(type(self)) ^ hash(self.ds) ^ hash(self.ignore_border)
-
-    def __str__(self):
-        return '%s{%s,%s}' % (self.__class__.__name__, self.ds, self.ignore_border)
-
-    def make_node(self, x, maxout, gz):
-        # make_node should only be called by the grad function of DownsampleFactorMax, 
-        # so these asserts should not fail.
-        assert isinstance(x, Variable) and x.ndim==4
-        assert isinstance(maxout, Variable) and maxout.ndim==4
-        assert isinstance(gz, Variable) and gz.ndim==4
-
-        return Apply(self, [x, maxout, gz], [x.type()])
-
-    def perform(self, node, (x, maxout, gz), (gx_stg,)):
-        gx = numpy.zeros_like(x)
-
-        ds0, ds1 = self.ds
-        shape2 = (x.shape[2] / ds0 * ds0)
-        if not self.ignore_border: shape2 = x.shape[2]
-        shape3 = (x.shape[3] / ds1 * ds1)
-        if not self.ignore_border: shape3 = x.shape[3]
-        for n in xrange(x.shape[0]):
-            for k in xrange(x.shape[1]):
-                for i in xrange(shape2):
-                    zi = i / ds0
-                    for j in xrange(shape3):
-                        zj = j / ds1
-                        if (maxout[n,k,zi,zj] == x[n,k,i,j]):
-                            gx[n,k,i,j] = gz[n,k,zi,zj]
-                        else: gx[n,k,i,j] = 0
-        gx_stg[0] = gx
-
-    def c_code(self, node, name, (x, z, gz), (gx,), sub):
-        fail = sub['fail']
-        ignore_border = int(self.ignore_border)
-        ds0, ds1 = self.ds
-        return """
-        int x_typenum = PyArray_ObjectType((PyObject*)%(x)s, 0);
-        int z_typenum = PyArray_ObjectType((PyObject*)%(z)s, 0);
-        int gz_typenum = PyArray_ObjectType((PyObject*)%(gz)s, 0);
-        int x_shp0_usable;
-        int x_shp1_usable;
-        int z_shp0, z_shp1;
-        if ((x_typenum != z_typenum) || (x_typenum != gz_typenum))
-        {
-            PyErr_SetString(PyExc_ValueError, "input types must all match");
-            %(fail)s;
-        }
-        if(%(x)s->nd!=4) 
-        {
-            PyErr_SetString(PyExc_ValueError, "x must be a 4d ndarray");
-            %(fail)s;
-        }
-        if(%(z)s->nd!=4) 
-        {
-            PyErr_SetString(PyExc_ValueError, "z must be a 4d ndarray");
-            %(fail)s;
-        }
-        if(%(gz)s->nd!=4) 
-        {
-            PyErr_SetString(PyExc_ValueError, "gz must be a 4d ndarray");
-            %(fail)s;
-        }
-        z_shp0 = %(z)s->dimensions[2];
-        z_shp1 = %(z)s->dimensions[3];
-        if (%(ignore_border)s)
-        {
-            x_shp0_usable = z_shp0 * %(ds0)s;
-            x_shp1_usable = z_shp1 * %(ds1)s;
-        }
-        else
-        {
-            x_shp0_usable = %(x)s->dimensions[2];
-            x_shp1_usable = %(x)s->dimensions[3];
-        }
-        if ((!%(gx)s)
-          || *PyArray_DIMS(%(gx)s)!=4
-          ||(%(gx)s->dimensions[0] != %(x)s->dimensions[0])
-          ||(%(gx)s->dimensions[1] != %(x)s->dimensions[1])
-          ||(%(gx)s->dimensions[2] != %(x)s->dimensions[2])
-          ||(%(gx)s->dimensions[3] != %(x)s->dimensions[3])
-          )
-        {
-          Py_XDECREF(%(gx)s);
-          %(gx)s = (PyArrayObject*) PyArray_ZEROS(4, %(x)s->dimensions, x_typenum,0);
-        }
-
-        for(int b=0;b<%(x)s->dimensions[0];b++){
-          for(int k=0;k<%(x)s->dimensions[1];k++){
-            int mini_i = 0;
-            int zi = 0;
-            for(int i=0;i< x_shp0_usable; i++){
-               int mini_j = 0;
-               int zj = 0;
-               for(int j=0; j< x_shp1_usable; j++){
-                 dtype_%(x)s * __restrict__ xp = ((dtype_%(x)s*)(PyArray_GETPTR4(%(x)s,b,k,i,j)));
-                 dtype_%(gx)s * __restrict__ gxp = ((dtype_%(gx)s*)(PyArray_GETPTR4(%(gx)s,b,k,i,j)));
-                 dtype_%(z)s * __restrict__ zp = ((dtype_%(z)s*)(PyArray_GETPTR4(%(z)s,b,k,zi,zj)));
-                 dtype_%(gz)s * __restrict__ gzp = ((dtype_%(gz)s*)(PyArray_GETPTR4(%(gz)s,b,k,zi,zj)));
-                 gxp[0] = (zp[0] == xp[0]) ? gzp[0] : 0;
-                 mini_j = (mini_j + 1 == %(ds1)s) ? 0 : mini_j+1;
-                 zj += (mini_j == 0);
-              }//for j
-              mini_i = (mini_i + 1 == %(ds0)s) ? 0 : mini_i+1;
-              zi += (mini_i == 0);
-
-              for (int j = x_shp1_usable; j < %(x)s->dimensions[3]; ++j) {
-                dtype_%(gx)s * gxp = ((dtype_%(gx)s*)(PyArray_GETPTR4(%(gx)s,b,k,i,j)));
-                gxp[0] = 0;
-              }
-            }//for i
-
-            for(int i = x_shp0_usable; i < %(x)s->dimensions[2]; i++){
-                for (int j = 0; j < %(x)s->dimensions[3]; ++j) {
-                    dtype_%(gx)s * gxp = ((dtype_%(gx)s*)(PyArray_GETPTR4(%(gx)s,b,k,i,j)));
-                    gxp[0] = 0;
-                }
-            }
-          }//for k
-        }//for b
-        """ %locals()
-
-    def c_code_cache_version(self):
-        return ()
-
-                
-def max_pool2D(input, ds, ignore_border=False):
-    """
-    Takes as input a N-D tensor, where N >= 2. It downscales the input image by
-    the specified factor, by keeping only the maximum value of non-overlapping
-    patches of size (ds[0],ds[1])
-    :type input: N-D theano tensor of input images. 
-    :param input: input images. Max pooling will be done over the 2 last dimensions.
-    :type ds: tuple of length 2
-    :param ds: factor by which to downscale. (2,2) will halve the image in each
-               dimension.
-    :param ignore_border: boolean value. When True, (5,5) input with ds=(2,2)
-                         will generate a (2,2) output. (3,3) otherwise.
-    """
-    if input.ndim < 2:
-        raise NotImplementedError('max_pool2D requires a dimension >= 2')
-
-    # extract image dimensions
-    img_shape = input.shape[-2:]
-    
-    # count the number of "leading" dimensions, store as dmatrix
-    batch_size = tensor.prod(input.shape[:-2])
-    batch_size = tensor.shape_padright(batch_size,1)
-
-    # store as 4D tensor with shape: (batch_size,1,height,width)
-    new_shape = tensor.cast(tensor.join(0, batch_size, 
-        tensor.as_tensor([1,]), img_shape), 'int64')
-    input_4D = tensor.reshape(input, new_shape, ndim=4)
-
-    # downsample mini-batch of images
-    op = DownsampleFactorMax(ds, ignore_border)
-    output = op(input_4D)
-
-    # restore to original shape
-    outshp = tensor.join(0, input.shape[:-2], output.shape[-2:])
-    return tensor.reshape(output, outshp, ndim=input.ndim)
-
-
-class DownsampleFactorMax(Op):
-    """
-    For N-dimensional tensors, consider that the last two dimensions span images.  
-    This Op downsamples these images by a factor ds, by taking the max over non-
-    overlapping rectangular regions.
-    """
-
-    @staticmethod
-    def out_shape(imgshape, ds, ignore_border=False):
-        """Return the shape of the output from this op, for input of given shape and flags.
-
-        :param imgshape: the shape of a tensor of images. The last two elements are interpreted
-        as the number of rows, and the number of cols.
-        :type imgshape: tuple, list, or similar.
-
-        :param ds: downsample factor over rows and columns
-        :type ds: list or tuple of two ints
-
-        :param ignore_border: if ds doesn't divide imgshape, do we include an extra row/col of
-        partial downsampling (False) or ignore it (True).
-        :type ignore_border: bool
-
-        :rtype: list
-        :returns: the shape of the output from this op, for input of given shape.  This will
-        have the same length as imgshape, but with last two elements reduced as per the
-        downsampling & ignore_border flags.  
-        """
-        if len(imgshape) < 2:
-            raise TypeError('imgshape must have at least two elements (rows, cols)')
-        r, c = imgshape[-2:]
-        rval = list(imgshape[:-2])+[ r/ds[0], c/ds[1]]
-        if not ignore_border:
-            if r % ds[0]:
-                rval[-2] += 1
-            if c % ds[1]:
-                rval[-1] += 1
-        return rval
-
-    def __init__(self, ds, ignore_border=False):
-        """
-        :param ds: downsample factor over rows and columns
-        :type ds: list or tuple of two ints
-
-        :param ignore_border: if ds doesn't divide imgshape, do we include an extra row/col of
-        partial downsampling (False) or ignore it (True).
-        :type ignore_border: bool
-
-        TODO: why is poolsize an op parameter here?
-        """
-        self.ds = tuple(ds)
-        self.ignore_border = ignore_border
-
-    def __eq__(self, other):
-        return type(self) == type(other) and self.ds == other.ds and self.ignore_border == other.ignore_border
-
-    def __hash__(self):
-        return hash(type(self)) ^ hash(self.ds) ^ hash(self.ignore_border)
-
-    def __str__(self):
-        return '%s{%s,%s}' % (self.__class__.__name__, self.ds, self.ignore_border)
-
-    def make_node(self, x):
-        if x.type.ndim != 4:
-            raise TypeError()
-        # TODO: consider restrucing the dtype?
-        return gof.Apply(self, [x], [x.type()])
-
-    def perform(self, node, (x,), (z,)):
-        """
-        """
-        if len(x.shape)!=4:
-            raise NotImplementedError('DownsampleFactorMax requires 4D input for now')
-        if z[0] is None:
-            z[0] = numpy.zeros(self.out_shape(x.shape, self.ds, self.ignore_border)) -float('inf')
-            z[0] = theano._asarray(z[0], dtype=x.dtype)
-        zz=z[0]
-        ds0, ds1 = self.ds
-        if self.ignore_border:
-            x_usable2 = (x.shape[2] / ds0 * ds0)
-        else: x_usable2 = x.shape[2]
-        if self.ignore_border:
-            x_usable3 = (x.shape[3] / ds1 * ds1)
-        else: x_usable3 = x.shape[3]
-        for n in xrange(x.shape[0]):
-            for k in xrange(x.shape[1]):
-                for i in xrange(x_usable2):
-                    zi = i / ds0
-                    for j in xrange(x_usable3):
-                        zj = j / ds1
-                        zz[n,k,zi,zj] = __builtin__.max(zz[n,k,zi,zj], x[n,k,i,j])
-
-    def grad(self,(x,), (gz,)):
-        maxout = self(x)
-        return [DownsampleFactorMaxGrad(self.ds, ignore_border=self.ignore_border)(x, maxout, gz)]
-
-    def c_code(self, node, name, (x,), (z, ), sub):
-        fail=sub['fail']
-        ignore_border = int(self.ignore_border)
-        ds0, ds1 = self.ds
-        return """
-        int typenum = PyArray_ObjectType((PyObject*)%(x)s, 0);
-        int x_shp0_usable;
-        int x_shp1_usable;
-        int z_shp0, z_shp1;
-        if(%(x)s->nd!=4) 
-        {
-            PyErr_SetString(PyExc_ValueError, "x must be a 4d ndarray");
-            %(fail)s;
-        }
-        z_shp0 = %(x)s->dimensions[2] / %(ds0)s;
-        z_shp1 = %(x)s->dimensions[3] / %(ds1)s;
-        if (%(ignore_border)s)
-        {
-            x_shp0_usable = z_shp0 * %(ds0)s;
-            x_shp1_usable = z_shp1 * %(ds1)s;
-        }
-        else
-        {
-            z_shp0 += (%(x)s->dimensions[2] %% %(ds0)s) ? 1 : 0;
-            z_shp1 += (%(x)s->dimensions[3] %% %(ds1)s) ? 1 : 0;
-            x_shp0_usable = %(x)s->dimensions[2];
-            x_shp1_usable = %(x)s->dimensions[3];
-        }
-        if ((!%(z)s)
-          || *PyArray_DIMS(%(z)s)!=4
-          ||(%(z)s->dimensions[0] != %(x)s->dimensions[0])
-          ||(%(z)s->dimensions[1] != %(x)s->dimensions[1])
-          ||(%(z)s->dimensions[2] != z_shp0)
-          ||(%(z)s->dimensions[3] != z_shp1)
-          )
-        {
-          if (%(z)s) Py_XDECREF(%(z)s);
-          npy_intp dims[4] = {0,0,0,0};
-          dims[0]=%(x)s->dimensions[0];
-          dims[1]=%(x)s->dimensions[1];
-          dims[2]=z_shp0;
-          dims[3]=z_shp1;
-          %(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0); //TODO: zeros not necessary
-        }
-
-        if (z_shp0 && z_shp1)
-        {
-            for(int b=0;b<%(x)s->dimensions[0];b++){
-              for(int k=0;k<%(x)s->dimensions[1];k++){
-                int mini_i = 0;
-                int zi = 0;
-                for(int i=0;i< x_shp0_usable; i++){
-                  int mini_j = 0;
-                  int zj = 0;
-                  for(int j=0; j<x_shp1_usable; j++){
-                    dtype_%(x)s a = ((dtype_%(x)s*)(PyArray_GETPTR4(%(x)s,b,k,i,j)))[0];
-                    dtype_%(z)s * __restrict__ z = ((dtype_%(z)s*)(PyArray_GETPTR4(%(z)s,b,k,zi,zj)));
-                    z[0] = (((mini_j|mini_i) == 0) || z[0] < a) ? a : z[0];
-                    mini_j = ((mini_j + 1) == %(ds1)s) ? 0 : mini_j+1;
-                    zj += (mini_j == 0);
-                  }
-                  mini_i = ((mini_i + 1) == %(ds0)s) ? 0 : mini_i+1;
-                  zi += (mini_i == 0);
-                }
-              }
-            }
-        }
-        """ % locals()
-
-    def c_code_cache_version(self):
-        return ()
+from theano.tensor.signal.downsample import *
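Note on this hunk: the pooling code removed here is now imported from theano.tensor.signal.downsample. The output-shape rule of the removed `DownsampleFactorMax.out_shape` can be sketched on its own as below. This is a minimal sketch, not the library code: `pool_out_shape` is a hypothetical name, and `//` is used where the original relies on Python 2 integer `/`.

```python
def pool_out_shape(imgshape, ds, ignore_border=False):
    """Shape produced by max-pooling the last two dimensions of `imgshape`."""
    if len(imgshape) < 2:
        raise TypeError('imgshape must have at least two elements (rows, cols)')
    r, c = imgshape[-2:]
    # Floor division counts the complete ds[0] x ds[1] patches.
    rval = list(imgshape[:-2]) + [r // ds[0], c // ds[1]]
    if not ignore_border:
        # A partial patch at the border adds one extra output row/column.
        if r % ds[0]:
            rval[-2] += 1
        if c % ds[1]:
            rval[-1] += 1
    return rval
```

For a (5,5) image and ds=(2,2) this yields (2,2) when ignore_border is True and (3,3) otherwise, which agrees with the `max_pool2D` docstring.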

--- a/theano/sandbox/test_downsample.py
+++ b/theano/sandbox/test_downsample.py
-import unittest, sys, time
-import numpy as N
-import theano.tensor as T
-from theano.tests import unittest_tools as utt
-from theano.sandbox.downsample import DownsampleFactorMax
-from theano import function, Mode
-
-def max_pool(images=None, imshp=None, maxpoolshp=None, ignore_border=True):
-    """Implements a max pooling layer
-
-    Uses the same API as sp.max_pool but uses the Downsample op instead.
-
-    Takes as input a 2D tensor of shape batch_size x img_size and performs max pooling.
-    Max pooling downsamples by taking the max value in a given area, here defined by
-    maxpoolshp. Outputs a 2D tensor of shape batch_size x output_size.
-
-    Parameters are keyword arguments in order to use func_to_mod.
-
-    @param images: 2D tensor containing images on which to apply convolution.
-                   Assumed to be of shape batch_size x img_size
-    @param imgshp: tuple containing image dimensions
-    @param maxpoolshp: tuple containing shape of area to max pool over
-    
-    @output out1: symbolic result (2D tensor)
-    @output out2: logical shape of the output
-
-    """
-    if len(imshp) == 2:
-        imshp = (1,) + imshp
-    elif len(imshp)!=3:
-        raise NotImplementedError("!")
-    
-    # all these reshapes should happen in place
-    imrshp = T.stack(images.shape[0],
-                          *[T.as_tensor(x) for x in imshp])
-    imtensor = T.reshape(images, imrshp)
-
-    maxpop = DownsampleFactorMax(maxpoolshp, ignore_border)
-    rval = maxpop(imtensor)
-
-    return T.flatten(rval,2), maxpop.out_shape(imshp, maxpoolshp, ignore_border)
-
-class TestDownsampleFactorMax(unittest.TestCase):
-    def test_maxpool(self):
-        # generate flatted images
-        maxpoolshps = ((1,1),(2,2),(3,3),(2,3))
-        imval = N.random.rand(4,10,64,64)
-        images = T.dmatrix()
-        dmatrix4=T.TensorType('float64', (False, False, False, False))
-        images4=dmatrix4()
-        tctot, tpytot, ntot = [],[],[]
-        for maxpoolshp in maxpoolshps:
-            for border in [True,False]:
-                print 'maxpoolshp', maxpoolshp,'border', border
-           
-                # numeric verification
-                xi=0
-                yi=0
-                if not border:
-                    if imval.shape[-2] % maxpoolshp[0]:
-                        xi += 1
-                    if imval.shape[-1] % maxpoolshp[1]:
-                        yi += 1
-                my_output_val = N.zeros((imval.shape[0], imval.shape[1],
-                                         imval.shape[2]/maxpoolshp[0]+xi,
-                                         imval.shape[3]/maxpoolshp[1]+yi))
-            
-                time1=time.time()
-                for n in range(imval.shape[0]):
-                    for k in range(imval.shape[1]):
-                        for i in range(my_output_val.shape[2]):
-                            ii =  i*maxpoolshp[0]
-                            for j in range(my_output_val.shape[3]):
-                                jj = j*maxpoolshp[1]
-                                patch = imval[n,k,ii:ii+maxpoolshp[0],jj:jj+maxpoolshp[1]]
-                                my_output_val[n,k,i,j] = N.max(patch)
-                my_output_val = my_output_val.reshape(imval.shape[0],-1)
-                ntot+=[time.time()-time1]
-
-                # symbolic stuff
-            #### wrapper to DownsampleFactorMax op ####
-                output, outshp = max_pool(images, imval.shape[1:], maxpoolshp, border)
-                assert N.prod(my_output_val.shape[1:]) == N.prod(outshp)
-                assert N.prod(my_output_val.shape[1:]) == N.prod(outshp)
-                f = function([images,],[output,])
-                imval2=imval.reshape(imval.shape[0],-1)
-                output_val = f(imval2)
-                assert N.all(output_val == my_output_val)
-                
-                #DownsampleFactorMax op
-                maxpool_op = DownsampleFactorMax(maxpoolshp, ignore_border=border)(images4)
-                f = function([images4],maxpool_op,mode=Mode(linker="py"))
-                f2 = function([images4],maxpool_op,mode=Mode(linker="c"))
-                f3 = function([images4],maxpool_op)#for when we want to use the debug mode
-                time1=time.time()
-                output_val = f(imval)
-                tctot+=[time.time()-time1]
-                assert (N.abs(my_output_val.flatten()-output_val.flatten())<1e-5).all()
-                time1=time.time()
-                output_val = f2(imval)
-                tpytot+=[time.time()-time1]
-                assert (N.abs(my_output_val.flatten()-output_val.flatten())<1e-5).all()
-                output_val = f3(imval)
-
-        print 'Numpy processing time: %.3fs'%sum(ntot),ntot
-        print 'c Theano(DownsampleFactorMax) processing time: %.3fs'%sum(tctot),tctot
-        print 'py Theano(DownsampleFactorMax) processing time: %.3fs'%sum(tpytot),tpytot
-        d=N.asarray(ntot)/tctot
-        print 'speed up c theano(DownsampleFactorMax) vs manual: %.3f'%d.mean(),d
-        d=N.asarray(ntot)/tpytot
-        print 'speed up py theano(DownsampleFactorMax) vs manual: %.3f'%d.mean(),d
-
-    def test_DownsampleFactorMax_grad(self):
-        # generate flatted images
-        maxpoolshps = ((1,1),(3,2),(2,3))
-        imval = N.random.rand(2,3,3,4) * 10.0 #more variance means numeric gradient will be more accurate
-        do_theano=True
-        for maxpoolshp in maxpoolshps:
-            for border in [True,False]:
-                print 'maxpoolshp', maxpoolshp, 'border', border
-                def mp(input):
-                    return DownsampleFactorMax(maxpoolshp, ignore_border=border)(input)
-                utt.verify_grad(mp, [imval])
-
-if __name__ == '__main__':
-    t = TestDownsampleFactorMax("test_maxpool").run()
-    #t.test_maxpool()
-    from theano.tests import main
-#    main("test_sp")
--- a/theano/tensor/__init__.py
+++ b/theano/tensor/__init__.py
@@ -28,5 +28,3 @@ import nnet # used for softmax, sigmoid, etc.



-
-
--- a/theano/tensor/basic.py
+++ b/theano/tensor/basic.py
@@ -261,11 +261,14 @@ def _wrap_tensor_into_member(x):
 compile.module.register_wrapper(_obj_is_wrappable_as_tensor, _wrap_tensor_into_member)

 if int(config.tensor.cmp_sloppy)>1:
-    # This environment variable is a quick-and-dirty way to get low-precision comparisons.
-    # For a more precise setting of these tolerances set them explicitly in your user code by
-    # assigning, for example, "theano.tensor.basic.float32_atol = ..."
-
-    #when THEANO_CMP_SLOPPY>1 we are even more sloppy. This is usefull to test the gpu as they don't use extended precision and this cause some difference bigger then the normal sloppy.
+    # This config variable is a quick-and-dirty way to get low-precision
+    # comparisons.  For a more precise setting of these tolerances set
+    # them explicitly in your user code by assigning, for example,
+    # "theano.tensor.basic.float32_atol = ..."
+
+    # When config.tensor.cmp_sloppy>1 we are even more sloppy. This is
+    # useful to test the GPU: it doesn't use extended precision, which
+    # causes differences bigger than the normal sloppy tolerances.
    float32_atol = 5e-4
    float32_rtol = 1e-3 
    float64_rtol = 1e-4
@@ -3597,8 +3600,10 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps=None, tol=None, mode=None, cast

        o_fn = function(tensor_pt, o_output)
        o_fn_out = o_fn(*[p.copy() for p in pt])
-        
-        random_projection = rng.rand(*o_fn_out.shape)
+
+        # random_projection should not have elements too small,
+        # otherwise too much precision is lost in numerical gradient
+        random_projection = rng.rand(*o_fn_out.shape) + 0.5
        if cast_to_output_type:
            random_projection = numpy.array(random_projection,
                                            dtype=o_output.dtype)
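The `+ 0.5` in this hunk shifts the random projection from [0, 1) to [0.5, 1.5), so no output element is weighted by a near-zero factor when the scalar cost is differentiated numerically. A minimal sketch of the idea in plain Python follows. It does not use Theano, and `numeric_grad`, `projected_cost` and the test function `f` are hypothetical names chosen for illustration.

```python
import random

def numeric_grad(cost, x, eps=1e-7):
    """Forward-difference gradient of a scalar function of a list of floats."""
    base = cost(x)
    grad = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        grad.append((cost(bumped) - base) / eps)
    return grad

def projected_cost(f, projection):
    """Collapse a vector-valued f into a scalar cost: sum(projection * f(x))."""
    def cost(x):
        return sum(p * o for p, o in zip(projection, f(x)))
    return cost

def f(x):
    # Element-wise square; its Jacobian is diagonal with entries 2 * x[i].
    return [v * v for v in x]

rng = random.Random(0)
x = [1.0, 2.0, 3.0]
# Shifted projection: every weight lies in [0.5, 1.5).
projection = [rng.random() + 0.5 for _ in x]
cost = projected_cost(f, projection)

analytic = [2.0 * p * v for p, v in zip(projection, x)]
numeric = numeric_grad(cost, x)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Because every weight is at least 0.5, each output contributes a comparable amount to the cost, so an error in any single output's gradient stays visible in the comparison.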

--- a/theano/tensor/blas.py
+++ b/theano/tensor/blas.py
@@ -44,7 +44,7 @@ def ldflags(libs=True, flags=False):
    """Return a list of libraries against which an Op's object file should be
    linked to benefit from a BLAS implementation.
    
-    Default: ['blas'], but environment variable THEANO_BLAS_LDFLAGS overrides this.
+    Default: ['blas'], but configuration variable config.blas.ldflags overrides this.
    """
    rval = []
    for t in config.blas.ldflags.split():
@@ -52,9 +52,9 @@ def ldflags(libs=True, flags=False):
            t0, t1, t2 = t[0:3]
            assert t0 == '-'
        except:
-            raise ValueError('invalid token in THEANO_BLAS_LDFLAGS', t)
+            raise ValueError('invalid token in config.blas.ldflags', t)
        if t1 == 'L':
-            raise ValueError('library dir not allowed in THEANO_BLAS_LDFLAGS', t)
+            raise ValueError('library dir not allowed in config.blas.ldflags', t)
        elif libs and t1=='l': # example -lmkl
            rval.append(t[2:])
        elif flags and t1!='l': # example -openmp
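The token rules applied to `config.blas.ldflags` in this hunk can be sketched as a standalone function. This is an illustration under assumptions, not the Theano code: `parse_ldflags` is a hypothetical name, the flag string is passed in as an argument instead of being read from the config, and the body of the last branch (not visible in this hunk) is assumed to keep the whole token.

```python
def parse_ldflags(ldflags_str, libs=True, flags=False):
    """Split a linker-flag string into library names and/or other flags."""
    rval = []
    for t in ldflags_str.split():
        # Each token needs a leading '-' and at least three characters.
        if len(t) < 3 or t[0] != '-':
            raise ValueError('invalid token in ldflags', t)
        if t[1] == 'L':
            raise ValueError('library dir not allowed in ldflags', t)
        elif libs and t[1] == 'l':   # example -lmkl
            rval.append(t[2:])
        elif flags and t[1] != 'l':  # example -openmp
            rval.append(t)           # assumption: keep the flag unchanged
    return rval
```

Called with the defaults it returns only library names, so `"-lblas -lmkl"` gives `['blas', 'mkl']`.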

--- a/theano/tensor/elemwise.py
+++ b/theano/tensor/elemwise.py
@@ -822,7 +822,14 @@ class CAReduce(Op):
        to_reduce = reversed(sorted(axis))
        if to_reduce:
            for dimension in to_reduce:
-                variable = self.ufunc.reduce(variable, dimension)
+                # If it's a zero-size array, use scalar_op.identity if available
+                if variable.shape[dimension] == 0:
+                    if hasattr(self.scalar_op, 'identity'):
+                        variable = self.scalar_op.identity
+                    else:
+                        raise ValueError("Input (%s) has zero-size on axis %s, but self.scalar_op (%s) has no attribute 'identity'" % (variable, dimension, self.scalar_op))
+                else:
+                    variable = self.ufunc.reduce(variable, dimension)
            output[0] = theano._asarray(variable, dtype = node.outputs[0].type.dtype)
        else:
            output[0] = numpy.copy(variable)
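The zero-size branch added to `CAReduce` in this hunk follows the usual rule that reducing an empty sequence returns the operation's identity element, and is an error when there is none. A minimal sketch in plain Python: `reduce_axis` is a hypothetical helper, and a `None` default stands in for the `hasattr(self.scalar_op, 'identity')` check of the original.

```python
import functools
import operator

def reduce_axis(values, fn, identity=None):
    """Reduce `values` with the binary function `fn`, handling the empty case."""
    if len(values) == 0:
        if identity is None:
            # Mirrors the ValueError raised when scalar_op has no identity.
            raise ValueError("zero-size input, but the operation has no identity")
        return identity
    return functools.reduce(fn, values)

empty_sum = reduce_axis([], operator.add, identity=0)
empty_prod = reduce_axis([], operator.mul, identity=1)
```

Addition and multiplication have identities (0 and 1), so their empty reductions are defined. A maximum over an empty axis has no such value and raises.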

--- a/theano/tensor/signal/__init__.py
+++ b/theano/tensor/signal/__init__.py
--- a/theano/tensor/signal/conv.py
+++ b/theano/tensor/signal/conv.py
+"""
+Contains an op for convolving input images with a set of filters. This was
+developed especially for Convolutional Neural Networks.
+"""
+
+__docformat__ = "restructuredtext en"
+
+import numpy
+import theano
+import theano.tensor as tensor
+from theano import gof, Op, tensor, config
+
+import logging
+_logger=logging.getLogger("theano.signal.conv")
+def _debug(*msg):
+    _logger.debug(' '.join(msg))
+def _warn(*msg):
+    _logger.warn(' '.join(msg))
+
+
+def conv2d(input, filters, image_shape=None, filter_shape=None,
+                border_mode='valid', subsample=(1,1), **kargs):
+    """
+    This function will build the symbolic graph for convolving a stack of input
+    images with a set of filters. The implementation is modelled after
+    Convolutional Neural Networks (CNN). It is simply a wrapper to the ConvOp but
+    provides a much cleaner interface.
+
+    :type input: symbolic 4D tensor
+    :param input: mini-batch of feature map stacks, of shape image_shape.
+
+    :type filters: symbolic 4D tensor
+    :param filters: set of filters used in CNN layer of shape filter_shape
+
+    :param border_mode:
+       'valid'-- only apply filter to complete patches of the image. Generates
+                 output of shape: image_shape - filter_shape + 1 
+       'full' -- zero-pads image to multiple of filter shape to generate output of
+                 shape: image_shape + filter_shape - 1
+
+    :type subsample: tuple of len 2
+    :param subsample: factor by which to subsample the output
+                       
+    :type image_shape: tuple of len 4
+    :param image_shape: (batch size, stack size, nb row, nb col)
+    :type filter_shape: tuple of len 4
+    :param filter_shape: (nb filters, stack size, nb row, nb col)
+
+    :param kargs: kargs are passed on to ConvOp. Can be used to set the following:
+                   unroll_batch, unroll_kern, unroll_patch (see ConvOp doc)
+    """
+    if image_shape and filter_shape:
+        assert image_shape[1]==filter_shape[1]
+
+    if filter_shape is not None:
+        nkern = filter_shape[0]
+        kshp = filter_shape[2:]
+    else:
+        nkern, kshp = None, None
+
+    if image_shape is not None:
+        bsize = image_shape[0]
+        imshp = image_shape[1:]
+    else:
+        bsize, imshp = None, None
+
+    op = ConvOp(output_mode=border_mode, dx=subsample[0], dy=subsample[1],
+                imshp=imshp, kshp=kshp, nkern=nkern, bsize=bsize,**kargs)
+
+    return op(input, filters)
+
+
+class ConvOp(Op):
+    r"""
+    This Op serves a dual purpose: it can implement a vanilla 2D convolution
+    (as taught in any signal processing class) or implement the
+    convolutional layers found in Convolutional Neural Networks. 
+    
+    In this setting, a set of 3D images is convolved with a set of 3D kernels,
+    with the particularity that their leading dimensions are of equal length.
+    Vanilla 2D convolution is treated as a special case of this.
+   
+    The input parameter represents a mini-batch of multiple images. Its shape is:
+        batch size x num. input feature maps x image height x image width
+
+    The kernel parameter represents a set of 3D kernels. Its shape is:
+        number of filters x num. input feature maps x filter height x filter width
+
+    The output of ConvOp is a 4D tensor, generated as follows:
+        output[b,k,:,:] = \sum_i input[b,i,:,:] * filter[k,i,:,:] \forall b,k
+    where b is the mini-batch index, k the filter index and * is the convolution
+    operator.  
+    """
+
+    __attrnames = ['imshp', 'kshp', 'nkern', 'bsize', 'dx', 'dy', 'out_mode', 
+            'unroll_batch', 'unroll_kern', 'unroll_patch',
+            'imshp_logical', 'kshp_logical', 'kshp_logical_top_aligned']
+    """These attributes uniquely identify the behaviour of this op for given inputs"""
+
+    @staticmethod
+    def getOutputShape(inshp, kshp, (dx,dy)=(1,1), mode='valid'):
+        """
+        Computes the output dimensions of convolving an image of shape "inshp"
+        with kernels of shape "kshp".
+        
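+        :param inshp: (stacksize,rows,cols) of input image; the leading element is ignored
+        :param kshp: (rows,cols) of filters
+        :param (dx,dy): subsampling step over rows and cols of the output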
+        :param mode: 'valid' or 'full' (see 'border_mode' in conv2d's doc)
+        :return: (rows,cols) of output image
+        """
+        if mode=='valid': s = -1
+        else: s = 1
+        inshp, kshp = numpy.array(inshp), numpy.array(kshp)
+        return  numpy.int64(numpy.ceil((inshp[1:] + s*kshp - s*1)/\
+                numpy.array([dx,dy], dtype='float')))
+
+
+    def __init__(self, imshp=None, kshp=None, nkern=None, bsize=None, 
+            dx=None, dy=None,
+            output_mode='valid', unroll_batch=0,
+            unroll_kern=0,
+            unroll_patch=True,
+            imshp_logical=None,
+            kshp_logical=None,
+            kshp_logical_top_aligned=True,
+            verbose=0,
+            version=-1):
+        """
+        Initializes a ConvOp with given output_mode (full/valid). All other
+        parameters are optional and are only used to generate more optimized c
+        code.
+
+        NOTES ON OPTIMIZATION:
+        If ALL (imshp, kshp, nkern and bsize) parameters are provided, we can
+        generate faster c-code. This makes a significant difference for the
+        'full' output_mode with unroll_patch=True. The current fastest
+        implementation on x86-64 uses {unroll_batch=4, unroll_kern=4,
+        unroll_patch=False} with all other shape parameters being provided.
+
+        For optimizing other architectures, see:
+        Kazushige Goto and Robert A. Van De Geijn, Anatomy of High-Performance
+        Matrix Multiplication, (mr x nr). ACM Transactions on Mathematical
+        Software, May 2008.  
+        Figure 12: (mr x nr). For x86 use 2x4, itanium 8x8, etc.
+
+        :type output_mode: string
+        :param output_mode: 'valid' -- gives an output smaller than the image
+                            'full' -- gives an output bigger than the image
+
+        Optional parameters: (will generate more optimal c code)
+
+        :type imshp: tuple of len 2 or 3: 2 for 2d image, 3 for a stack of 2d images.
+        :param imshp: (stacksize, nb image row, nb image col)
+        :type kshp: tuple of len 2
+        :param kshp: (nb kernel row, nb kernel col)
+        :type nkern: int
+        :param nkern: the number of kernels
+        :type bsize: int
+        :param bsize: the size of the minibatch
+        :type dx: int
+        :param dx: patch stride rows
+        :type dy: int
+        :param dy: patch stride cols
+
+        Params which select the version of code used:
+
+        :type unroll_patch: bool
+        :param unroll_patch: use a version of c_code that unrolls the patch loop. It does not
+        require all shape information to work, but if all shape information is present, it
+        is hardcoded into the generated code to make it faster.
+        :type unroll_batch: int
+        :param unroll_batch: use a version of c_code that unrolls the batch loop (by
+        unroll_batch) and the nkern loop (by unroll_kern). Must be 0 or a divisor of bsize;
+        otherwise it is reduced to a value that divides bsize.
+        :type unroll_kern: int
+        :param unroll_kern: use a version of c_code that unrolls the batch loop (by
+        unroll_batch) and the nkern loop (by unroll_kern). Must be 0 or a divisor of nkern;
+        otherwise it is reduced to a value that divides nkern.
+
+        :type verbose: int
+        :param verbose: passed to GpuConv
+        :type version: int
+        :param version: passed to GpuConv
+
+        :param imshp_logical: used internally when we generate the gradient when dx!=1 or dy!=1
+        :param kshp_logical: idem
+        :param kshp_logical_top_aligned: idem
+        """
+
+        all_shape = imshp is not None and kshp is not None and \
+                    nkern is not None and bsize is not None
+
+        if (unroll_batch>0 or unroll_kern>0) and not all_shape:
+            raise Exception("In ConvOp, when using unroll_batch or unroll_kern, all shapes are needed")
+
+        
+        if not all_shape:
+            unroll_patch = True
+
+        if imshp is not None:
+            imshp = tuple(imshp)
+
+            if len(imshp)==2:
+                imshp = (1,)+imshp
+            elif len(imshp)==3:
+                imshp = imshp
+            else:
+                raise Exception("bad len for imshp")
+
+        self.imshp = imshp
+        if kshp is not None:
+            kshp = tuple(kshp)
+
+        self.kshp = kshp
+        self.nkern = nkern
+        self.bsize=bsize
+        self.dx=dx
+        self.dy=dy
+        self.verbose=verbose
+        self.version=version
+
+        # a triple
+        self.imshp_logical = self.imshp
+        if imshp_logical is not None: self.imshp_logical = tuple(imshp_logical)
+        assert (self.imshp is None and self.imshp_logical is None) or \
+               (len(self.imshp) == len(self.imshp_logical))
+
+        # a pair
+        self.kshp_logical = self.kshp
+        if kshp_logical is not None: self.kshp_logical = tuple(kshp_logical)
+        self.kshp_logical_top_aligned = kshp_logical_top_aligned
+
+        self.unroll_batch=unroll_batch
+        self.unroll_kern=unroll_kern
+        self.unroll_patch=unroll_patch
+
+        if self.unroll_batch>0 and self.bsize % self.unroll_batch!=0:
+
+            if self.bsize<=self.unroll_batch:
+                self.unroll_batch = self.bsize
+            else:
+                #find the maximum value under unroll_batch that would work
+                new=self.unroll_batch
+                assert(new>=1)
+                while self.bsize % new!=0:
+                    new-=1
+
+                warnstr = "OPTIMISATION WARNING: in ConvOp.__init__() unroll_batch(%i) "\
+                      "must be 0 or a divisor of bsize(%i). We revert it to %i. This "\
+                      "won't change the result, but may make it slower."
+                _warn(warnstr % (self.unroll_batch, self.bsize, new))
+
+                self.unroll_batch=new
+
+        if self.unroll_kern>0 and self.nkern % unroll_kern!=0:
+
+            if self.nkern<=self.unroll_kern:
+                self.unroll_kern = self.nkern
+            else:
+                #find the maximum value under unroll_kern that would work
+                new=self.unroll_kern
+                assert(new>=1)
+                while self.nkern % new!=0:
+                    new-=1
+
+                warnstr = "OPTIMISATION WARNING: in ConvOp.__init__() unroll_kern(%i) "\
+                      "should be 0 or a divisor of nkern(%i). We revert it to %i. "\
+                      "This won't change the result, but may make it slower."
+                _warn(warnstr % (self.unroll_kern, self.nkern, new))
+                self.unroll_kern=new
+
+        if all_shape:
+            self.outshp = ConvOp.getOutputShape(self.imshp_logical, self.kshp_logical, (dx,dy), output_mode)
+            self.fulloutshp = ConvOp.getOutputShape(self.imshp_logical, self.kshp_logical, (1,1), output_mode)
+        else:
+            self.outshp = None
+            self.fulloutshp = None
+
+        self.out_mode = output_mode
+
+        if not self.out_mode in ["valid", "full"]:
+            raise Exception("Mode %s not implemented"%self.out_mode)
+       
+        if all_shape and not (self.outshp > 0).all():
+            raise Exception(("Bad size for the output shape. Verify that [post-"\
+                    "supersampling] input shape (%s) and kern shape(%s) are ok. "\
+                    "(Hint: kerns must fit inside image in valid mode)")%
+                    (self.imshp_logical,self.kshp_logical))
+
+        self._rehash()
+        if config.op.set_flops:
+            self.set_flops()
+
+    def __eq__(self, other):
+        if type(self) != type(other):
+            return False
+        for a in self.__attrnames:
+            if getattr(self, a) != getattr(other, a):
+                return False
+        return True
+
+    def __setstate__(self, d):
+        self.__dict__.update(d)
+        self._rehash()
+
+    def _rehash(self):
+        hashval = hash(type(self))
+        for a in self.__attrnames:
+            hashval = hashval ^ hash(getattr(self, a))
+        self.__hashval = hashval
+
+    def __hash__(self):
+        return self.__hashval
+
+    def __str__(self):
+        return "ConvOp{" +",".join(str((a, getattr(self, a))) for a in self.__attrnames)  + "}"
+
+    def set_flops(self):
+        """ Useful with the hack in profilemode to print the MFlops"""
+        if self.out_mode=="valid":
+            self.flops=self.kshp[0]*self.kshp[1]*2#nb mul and add by output pixel
+            self.flops*=self.outshp[0]*self.outshp[1]#nb flops by output image
+            self.flops*=self.imshp[0]*self.nkern*self.bsize#for all outputs images#n_stack==self.imshp[0]
+        else: #full mode not implemented
+
+            self.flops=0
+            for out_row in range(self.outshp[0]):#loop over output row
+                for out_col in range(self.outshp[1]):#loop over output col
+                    for row in range(self.kshp[0]):#loop over kern row
+
+                        if (row+out_row-self.kshp[0]+1<0 or 
+                            row+out_row-self.kshp[0]+1>=self.imshp[1]): 
+                            continue
+
+                        col=0
+                        max_col=self.kshp[1]
+                        img_col=out_col-self.kshp[1]+1
+                        max_col=min(max_col,self.imshp[2]-img_col)
+
+                        if img_col<0:
+                            col=-img_col
+                            img_col+=col
+                        while col < max_col: #loop over kern col
+                            self.flops+=2
+                            col+=1
+            
+            self.flops*=self.imshp[0]*self.nkern*self.bsize#for all outputs images#n_stack==self.imshp[0]
+            
+            assert self.flops == self.bsize * self.nkern * self.imshp[0] * \
+                    self.kshp[0] * self.kshp[1] * self.imshp[1] * self.imshp[2] * 2
+
+    def make_node(self, inputs, kerns):
+        # TODO: find a way to make ConvOp work for N-D (after NIPS09)
+        """
+        inputs - 4 dim: batches x stacksize x rows x cols
+        kerns - 4 dim: nkern x stackidx x rows x cols
+        """
+        outdim = kerns.ndim
+        _inputs = tensor.as_tensor_variable(inputs)
+        _kerns = tensor.as_tensor_variable(kerns)
+        # TODO: lift this restriction by upcasting either inputs or kerns
+        if _inputs.ndim != 4:
+            raise TypeError('make_node requires 4D tensor of inputs')
+        if _kerns.ndim != 4:
+            raise TypeError('make_node requires 4D tensor of kernels')
+        if _inputs.type.dtype != _kerns.type.dtype:
+            raise NotImplementedError("The image and the kernel must have the same type. "
+                            "inputs(%s), kerns(%s)"%(_inputs.dtype, _kerns.dtype))
+        output = tensor.tensor(dtype=_inputs.type.dtype,
+                               broadcastable=[_inputs.broadcastable[0],
+                                   _kerns.broadcastable[0], False, False])
+
+        return gof.Apply(self, [_inputs, _kerns], [output])
+
+    def perform(self,node, (img2d, filtersflipped), (z,)):
+        """
+        By default if len(img2d.shape)==3, we
+        """
+        # TODO: move these back out to global scope when they no longer cause an atexit error
+        from scipy.signal.signaltools import  _valfrommode, _bvalfromboundary
+        from scipy.signal.sigtools import _convolve2d
+        imshp = self.imshp
+        if imshp is None:
+            imshp = tuple(img2d.shape[1:])
+        kshp = self.kshp
+        if kshp is None:
+            kshp = tuple(filtersflipped.shape[2:])
+        bsize = self.bsize
+        if bsize is None:
+            bsize = img2d.shape[0]
+        nkern = self.nkern
+        if nkern is None:
+            nkern = filtersflipped.shape[0]
+        
+        imshp_logical = self.imshp_logical
+        if imshp_logical is None:
+            imshp_logical = imshp
+        kshp_logical = self.kshp_logical
+        if kshp_logical is None:
+            kshp_logical = kshp
+            
+        if self.fulloutshp is not None:
+            fulloutshp = tuple(self.fulloutshp)
+        else:
+            fulloutshp = tuple(ConvOp.getOutputShape(imshp_logical, kshp_logical, (1,1), self.out_mode))
+
+        if z[0] is None:
+            z[0] = numpy.zeros((bsize,)+(nkern,)+fulloutshp,
+                           dtype=img2d.dtype)
+        zz=z[0]
+        val = _valfrommode(self.out_mode)
+        bval = _bvalfromboundary('fill')
+
+        stacklen = imshp[0]
+
+        img2d = img2d.reshape((bsize,)+ imshp)
+        filtersflipped = filtersflipped.reshape((nkern,stacklen)+kshp)
+
+        if self.imshp != self.imshp_logical:
+            # assuming that to get from imshp to imshp logical we insert zeros in missing spots
+            rstride = int(numpy.ceil(imshp_logical[1] / float(imshp[1])))
+            cstride = int(numpy.ceil(imshp_logical[2] / float(imshp[2])))
+            buf = numpy.zeros((bsize,)+ imshp_logical, dtype=img2d.dtype)
+            buf[:,:,::rstride, ::cstride] = img2d
+            img2d = buf
+            del buf, rstride, cstride
+
+        if kshp != kshp_logical:
+            rstride = int(numpy.ceil(kshp_logical[0] / float(kshp[0])))
+            cstride = int(numpy.ceil(kshp_logical[1] / float(kshp[1])))
+            buf = numpy.zeros((nkern,stacklen)+ self.kshp_logical, dtype=filtersflipped.dtype)
+            if self.kshp_logical_top_aligned:
+                roffset=coffset=0
+            else:
+                roffset=(kshp_logical[0] - (kshp[0]*rstride) - 1+rstride) % rstride
+                coffset=(kshp_logical[1] - (kshp[1]*cstride) - 1+cstride) % cstride
+                assert roffset >= 0
+                assert coffset >= 0
+            buf[:,:,roffset::rstride, coffset::cstride] = filtersflipped
+            filtersflipped = buf
+            del buf, rstride, cstride
+
+        for b in range(bsize):
+            for n in range(nkern):
+                zz[b,n,...].fill(0)
+                for im0 in range(stacklen):
+                    zz[b,n,...] +=  _convolve2d(\
+                        img2d[b,im0,...], filtersflipped[n,im0,...],1,val, bval, 0)
+
+        #We copy it to remove the stride mismatch warning from DEBUG_MODE.
+        #The copy makes the returned object have the same strides as the C version.
+        #The copy doesn't affect performance in our experiments, because in that
+        #case we execute the C version, which is much faster.
+        if self.dx>1 or self.dy>1:
+            zz = zz[:,:,0::self.dx,0::self.dy].copy()
+
+        z[0]=zz
+
+
+    def grad(self, (inputs, kerns), (gz,)):
+
+        if self.imshp != self.imshp_logical or self.kshp != self.kshp_logical:
+            raise NotImplementedError('todo')
+
+        #if self.dx!=1 or self.dy!=1:
+            #raise Exception("ERROR: We disable ConvOp.grad now when dx!=1 or "\
+                    #"dy!=1 as we think their is a high probability of bug in it."\
+                    #"We need to raise the error on the gradient to .1!")
+
+        all_shape = self.imshp is not None and self.kshp is not None and \
+                    self.nkern is not None and self.bsize is not None
+
+        if not all_shape and (self.dx!=1 or self.dy!=1):
+            raise Exception("ConvOp.grad when dx!=1 or dy!=1 we must have all "\
+                            "the optional shape information")
+        
+        ####### Determine gradient on kernels ########
+        assert inputs.ndim==4 and kerns.ndim==4
+
+        newin = inputs.dimshuffle((1,0,2,3))
+        newgz = gz.dimshuffle((1,0,2,3))
+    
+        (bsize, nkern) = None, None
+        imshp = None
+        kshp = None
+        un_p = self.unroll_patch
+        imshp_logical = None
+
+        if self.out_mode == 'valid':
+            (img, filters) = (newin, newgz)
+            kshp_logical = self.fulloutshp
+            kshp_logical_top_aligned=False
+            if all_shape:
+                (bsize, nkern) = (self.imshp[0], self.nkern)
+                imshp = (self.bsize, self.imshp[1], self.imshp[2])
+            kshp  = self.outshp
+            un_b = self.unroll_batch
+            un_k = self.unroll_kern
+        elif self.out_mode == 'full':
+            (img, filters) = (newgz, newin)
+            kshp_logical = None
+            kshp_logical_top_aligned=True
+            if all_shape:
+                imshp_logical = (self.bsize, self.fulloutshp[0], self.fulloutshp[1])
+                (bsize, nkern) = (self.nkern, self.imshp[0])
+                imshp = (self.bsize, self.outshp[0], self.outshp[1])
+                kshp  = self.imshp[1:]
+            un_b = self.unroll_kern
+            un_k = self.unroll_batch
+        else:
+            raise NotImplementedError('Only [full,valid] modes are currently supported.')
+
+        filters = filters[:,:,::-1,::-1] #flip them
+        
+        #find good value for the unroll
+        if all_shape and un_b!=0 and bsize%un_b!=0:
+            if bsize<un_b:
+                un_b = bsize
+            else:
+                un_b = 1
+                _warn("OPTIMISATION WARNING: in ConvOp.grad() we can't determine "\
+                      "a good unroll value for the batch. Maybe you can optimize this!")
+
+        if all_shape and un_k!=0 and nkern%un_k!=0:
+            if nkern<un_k:
+                un_k = nkern
+            else:
+                un_k = 1
+                _warn("OPTIMISATION WARNING: in ConvOp.grad() we can't determine "\
+                      "a good unroll value for the kernel. Maybe you can optimize this!")
+
+        dw = ConvOp(imshp, kshp, nkern, bsize, 1,1, output_mode='valid',
+                    unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p,
+                    imshp_logical=imshp_logical,
+                    kshp_logical=kshp_logical,
+                    kshp_logical_top_aligned=kshp_logical_top_aligned,
+                    version=self.version,
+                    verbose=self.verbose)
+
+        if hasattr(self,'flops'):
+            dw.set_flops()
+
+        dw = dw(img,filters)
+
+        if all_shape:
+            assert (dw.owner.op.outshp==self.kshp).all()
+        if self.out_mode == 'valid':
+            # before DimShuffle, dw is of shape visdim x nkern x kshp[0] x kshp[1]
+            dw = dw.dimshuffle((1,0,2,3))
+            dw = dw[:,:,::-1,::-1]
+
+        ####### Determine gradient on inputs ########
+        mode = 'valid'
+        if not self.out_mode == 'full': 
+            mode = 'full'
+
+        filters = kerns.dimshuffle((1,0,2,3))
+        filters = filters[:,:,::-1,::-1]
+        nkern = None
+        imshp = None
+        imshp_logical = None
+        kshp = None
+
+        if all_shape:
+            nkern = self.imshp[0]
+            imshp = (self.nkern, self.outshp[0], self.outshp[1])
+            imshp_logical=(self.nkern, self.fulloutshp[0], self.fulloutshp[1])
+
+        din = ConvOp(imshp, self.kshp, nkern, self.bsize, 
+                     1,1, output_mode=mode,
+                     unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p,
+                     imshp_logical=imshp_logical,
+                     kshp_logical=None,
+                     version=-1,#when we change the mode, we don't forward the version.
+                     verbose=self.verbose)
+
+        if hasattr(self,'flops'):
+            din.set_flops()
+
+        din = din(gz,filters)
+
+        assert (din.owner.op.outshp is None) or \
+               (din.owner.op.outshp==self.imshp[1:]).all()
+        return [din, dw]
+
+    def c_headers(self):
+        return ['<numpy/noprefix.h>', '<iostream>', '<sstream>' ]
+
+    def c_code_cache_version(self):
+        return (1,)
+    
+    def c_support_code(self):
+        return """
+#define STRIDES(arr) ((arr)->strides)
+#define FULL  2
+#define SAME  1
+#define VALID 0
+#define MOD %
+using namespace std;
+""" + tensor.blas.blas_header_text()
+
+    def c_libraries(self):
+        return tensor.blas.ldflags()
+
+    def c_code(self, node, name, (img2d, filtersflipped), (z, ), sub):
+        if node.inputs[0].type.dtype != node.inputs[1].type.dtype:
+            raise NotImplementedError()
+        assert node.inputs[0].type.dtype == node.inputs[1].type.dtype
+        d=locals()
+        d.update(sub)
+
+        all_shape = self.imshp is not None and self.kshp is not None and \
+                    self.nkern is not None and self.bsize is not None
+
+        d["self_out_mode"]=self.out_mode
+        d["self_dx"]=self.dx
+        d["self_dy"]=self.dy
+        d["mode"]=self.out_mode.upper()
+        d["affectation"]="="
+        if all_shape:
+            d["self_bsize"]=self.bsize
+            d["self_nkern"]=self.nkern
+            d["self_outshp0"]=self.outshp[0]
+            d["self_outshp1"]=self.outshp[1]
+            d["self_imshp0"]=self.imshp[0]
+            d["self_imshp1"]=self.imshp[1]
+            d["self_imshp2"]=self.imshp[2]
+            d["self_kshp0"]=self.kshp[0]
+            d["self_kshp1"]=self.kshp[1]
+            d["self_kshp_logical_r"] = self.kshp_logical[0]
+            d["self_kshp_logical_c"] = self.kshp_logical[1]
+            d["self_kshp_logical_stride_r"] = int(numpy.ceil(self.kshp_logical[0] / float(self.kshp[0])))
+            d["self_kshp_logical_stride_c"] = int(numpy.ceil(self.kshp_logical[1] / float(self.kshp[1])))
+            d["self_imshp_logical_r"] = self.imshp_logical[1] #N.B. 1  not 0
+            d["self_imshp_logical_c"] = self.imshp_logical[2]#N.B. 2  not 1
+            d["self_imshp_logical_stride_r"] = int(numpy.ceil(self.imshp_logical[1] / float(self.imshp[1])))
+            d["self_imshp_logical_stride_c"] = int(numpy.ceil(self.imshp_logical[2] / float(self.imshp[2])))
+            if not self.imshp[0]==1: d["affectation"]="+="
+            d["all_shape"]=1
+            d["dim_zz_const"]="const"
+        else:
+            d["self_bsize"]="%(img2d)s->dimensions[0]"%d
+            d["self_nkern"]="%(filtersflipped)s->dimensions[0]"%d
+            d["self_outshp0"]="-1"
+            d["self_outshp1"]="-1"
+            d["self_imshp0"]="%(img2d)s->dimensions[1]"%d
+            d["self_imshp1"]="%(img2d)s->dimensions[2]"%d
+            d["self_imshp2"]="%(img2d)s->dimensions[3]"%d
+            d["self_kshp0"]="%(filtersflipped)s->dimensions[2]"%d
+            d["self_kshp1"]="%(filtersflipped)s->dimensions[3]"%d
+            d["affectation"]="+="
+            d["all_shape"]=0
+            d["dim_zz_const"]=""
+
+        if self.kshp_logical_top_aligned:
+            d["self_kshp_logical_offset_r"] = 0
+            d["self_kshp_logical_offset_c"] = 0
+        elif self.imshp != self.imshp_logical or self.kshp != self.kshp_logical:
+            rstride = d["self_kshp_logical_stride_r"]
+            cstride = d["self_kshp_logical_stride_c"]
+            d["self_kshp_logical_offset_r"] = (self.kshp_logical[0] - (self.kshp[0]*rstride) - 1+rstride) % rstride
+            d["self_kshp_logical_offset_c"] = (self.kshp_logical[1] - (self.kshp[1]*cstride) - 1+cstride) % cstride
+            del rstride, cstride
+        
+        if node.inputs[0].type.dtype=="float32": d["type"]="float"
+        elif node.inputs[0].type.dtype=="float64": d["type"]="double"
+        else: raise Exception("Type %s not implemented"%node.inputs[0].type.dtype)
+        d["gemm"]='dgemm_'
+        if not d["type"]=="double":d["gemm"]='sgemm_'
+
+        if self.imshp != self.imshp_logical or self.kshp != self.kshp_logical:
+            if self.verbose:
+                _debug("return imshp!=imshp_logical or self.kshp != self.kshp_logical shape version")
+            return _conv_op_code_a % d
+
+        if self.unroll_patch:
+            if self.verbose:
+                _debug("return unroll patch version. all_shape=", all_shape)
+            return _conv_op_code_unroll_patch%d
+        if self.unroll_batch>0 or self.unroll_kern>0:
+            if self.unroll_batch<=0: self.unroll_batch=1
+            if self.unroll_kern<=0: self.unroll_kern=1
+            if self.verbose:
+                _debug("return unrolled batch (%s) and kern code (%s)",
+                        str(self.unroll_batch), str(self.unroll_kern))
+            return gen_conv_code_unroll_batch_kern(d, self.unroll_batch,
+                                                   self.unroll_kern)
+
+        #TODO: should we choose the unroll size automatically with the bigger divisor under 5? 
+        if self.out_mode == 'valid' and self.dx==0 and self.dy==0:
+            if self.verbose:
+                _debug("return gemm version")
+            return _conv_op_code_valid_gemm % d
+        else:
+            if self.verbose:
+                _debug("return no gemm version")
+            return _conv_op_code_a % d
+
+
+_conv_op_code_a = """
+const int mode=%(mode)s;
+int typenum=0, typenum_f=0;
+PyArrayObject *ain1=NULL, *ain2=NULL, *filtersflipped_arr=NULL, *img2d_arr=NULL;
+const %(type)s fill_value = 0;
+
+int type_im=PyArray_TYPE(%(img2d)s);
+int type_ker=PyArray_TYPE(%(filtersflipped)s);
+
+npy_intp dim_zz[2]={%(self_outshp0)s,%(self_outshp1)s};
+npy_intp dim_im_phys[2]={%(self_imshp1)s,%(self_imshp2)s};
+npy_intp dim_im_log[2]={%(self_imshp_logical_r)s,%(self_imshp_logical_c)s};
+npy_intp dim_ker_phys[2]={%(self_kshp0)s,%(self_kshp1)s};
+npy_intp dim_ker_log[2]={%(self_kshp_logical_r)s,%(self_kshp_logical_c)s};
+
+PyArray_Dims img2d_shape;
+npy_intp img2d_dim[4]={1,1,0,0};
+img2d_shape.ptr=img2d_dim;
+img2d_shape.len=4;
+
+PyArray_Dims kerns_shape;
+npy_intp kerns_dim[4]={1,1,0,0};
+kerns_shape.ptr=kerns_dim;
+kerns_shape.len=4;
+PyObject *img2d=NULL, *contig, *filtersflipped=NULL;
+
+
+if(%(img2d)s->nd==2){
+  img2d_dim[3]=%(img2d)s->dimensions[1];
+  img2d_dim[2]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==3){
+  img2d_dim[3]=%(img2d)s->dimensions[2];
+  img2d_dim[2]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==4){
+  img2d_dim[3]=%(img2d)s->dimensions[3];
+  img2d_dim[2]=%(img2d)s->dimensions[2];
+  img2d_dim[1]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else {
+    PyErr_SetString(PyExc_ValueError, "img doesn't have a valid shape");
+    %(fail)s;
+}
+
+if(%(filtersflipped)s->nd==3){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else if(%(filtersflipped)s->nd==4){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[3];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[1]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else{
+    std::stringstream temp;
+    temp << "nddim="<<%(filtersflipped)s->nd;
+    std::string param = temp.str();
+    PyErr_SetString(PyExc_ValueError,
+      ("kernel doesn't have a valid shape. " + param).c_str());
+    %(fail)s;
+}
+
+img2d = PyArray_Newshape(%(img2d)s,&img2d_shape, PyArray_CORDER);
+img2d_arr = (PyArrayObject*)img2d;
+if ((img2d_arr->strides[3] != (npy_intp)sizeof(%(type)s))
+     || (img2d_arr->strides[2] != img2d_arr->dimensions[3]*(npy_intp)sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)img2d));
+    Py_DECREF(img2d);
+    img2d = contig;
+    if (!PyArray_ISCONTIGUOUS(img2d)){
+        PyErr_SetString(PyExc_ValueError, "img2d isn't contiguous");
+        %(fail)s;
+    }
+}
+img2d_arr = (PyArrayObject*)img2d;
+
+filtersflipped = PyArray_Newshape(%(filtersflipped)s,&kerns_shape, PyArray_CORDER);
+filtersflipped_arr = (PyArrayObject*)filtersflipped;
+if ((filtersflipped_arr->strides[3] != (npy_intp)sizeof(%(type)s)) 
+     || (filtersflipped_arr->strides[2] != filtersflipped_arr->dimensions[3]*(npy_intp)sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)filtersflipped));
+    Py_DECREF(filtersflipped);
+    filtersflipped = contig;
+    if (!PyArray_ISCONTIGUOUS(filtersflipped)){
+        PyErr_SetString(PyExc_ValueError, "filtersflipped isn't contiguous");
+        %(fail)s;
+    }
+}
+filtersflipped_arr = (PyArrayObject*)filtersflipped;
+
+if(mode != VALID && mode != FULL){
+  PyErr_SetString(PyExc_ValueError, "invalid mode, only full and valid are supported"); %(fail)s;
+}
+typenum = PyArray_ObjectType((PyObject*)%(img2d)s, 0);
+typenum_f = PyArray_ObjectType((PyObject*)%(filtersflipped)s, 0);
+if (typenum < 0) {PyErr_SetString(PyExc_ValueError, "Invalid type"); %(fail)s;}
+if (typenum != typenum_f) {PyErr_SetString(PyExc_ValueError, "Input types must match"); %(fail)s;}
+
+if (!img2d) %(fail)s;
+if (!filtersflipped) %(fail)s;
+if ((!%(z)s)
+  || %(z)s->nd != 4
+  ||(%(z)s->dimensions[0] != %(self_bsize)s)
+  ||(%(z)s->dimensions[1] != %(self_nkern)s)
+  ||(%(z)s->dimensions[2] != dim_zz[0])
+  || (%(z)s->dimensions[3] != dim_zz[1])
+  )
+{
+  {Py_XDECREF(%(z)s);}
+  npy_intp dims[4] = {0,0,0,0};
+  dims[0]=%(self_bsize)s;
+  dims[1]=%(self_nkern)s;
+  dims[2]=dim_zz[0];
+  dims[3]=dim_zz[1];
+  %(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0);
+}else{
+  //PyArray_FILLWBYTE((PyObject*)%(z)s,0);
+}
+
+int Os[2];
+Os[0]=%(self_outshp0)s;
+Os[1]=%(self_outshp1)s;
+
+for(int b=0;b< %(self_bsize)s;b++){
+  for(int n_kern=0;n_kern<%(self_nkern)s;n_kern++){
+
+    //assertions
+    if (%(z)s->strides[0] != %(z)s->dimensions[1] *%(z)s->dimensions[2] *%(z)s->dimensions[3] * (npy_intp)sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[1] != %(z)s->dimensions[2] * %(z)s->dimensions[3] * (npy_intp)sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[2] != %(z)s->dimensions[3] * (npy_intp)sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[3] != (npy_intp)sizeof(%(type)s)) %(fail)s;
+
+    %(type)s * __restrict__ out=(%(type)s *)(PyArray_GETPTR2(%(z)s,b,n_kern));
+    for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i) out[i] = 0;
+
+    for(int stack_size=0;stack_size<%(self_imshp0)s;stack_size++){
+
+      const %(type)s * __restrict__ in=(%(type)s *)(PyArray_GETPTR2(img2d,b,stack_size));
+      const %(type)s * __restrict__ hvals=(%(type)s *)(PyArray_GETPTR2(filtersflipped,n_kern,stack_size));
+
+
+      for (int iter_m=0; iter_m < Os[0]; iter_m++) {
+                                               /// Reposition index into input image based on requested output size
+        int pos_m = iter_m*%(self_dx)s;        //row position in logical output image
+        int new_m;                             //row anchor in logical input image (we will loop upward from here)
+        if (mode == FULL) new_m = pos_m ;
+        else new_m = (pos_m+dim_ker_log[0]-1);
+
+        for (int iter_n=0; iter_n < Os[1]; iter_n++) {  // loop over columns
+          int pos_n=iter_n*%(self_dy)s;        // current col position in logical output image
+          %(type)s sum=0;
+
+          // Sum over kernel, if index into image is out of bounds
+          // fill with the value
+          for (int j_log=0; j_log < %(self_kshp_logical_r)s; j_log++) { // loop over logical rows in kernel
+
+            int ind0_log = (new_m-j_log);                                   // ind0_log: row position in logical input image
+
+            if ((j_log < %(self_kshp_logical_offset_r)s) || (j_log - %(self_kshp_logical_offset_r)s) MOD %(self_kshp_logical_stride_r)s)
+                continue;
+
+            if (ind0_log MOD %(self_imshp_logical_stride_r)s)
+                continue;
+
+            int j_phys = ((j_log- %(self_kshp_logical_offset_r)s) / %(self_kshp_logical_stride_r)s);
+            int ind0_phys = (ind0_log / %(self_imshp_logical_stride_r)s);
+            //std::cerr <<"j_log" << j_log << " j_phys " << j_phys << " " << ind0_phys << "\\n";
+
+            if(mode==FULL){
+              const %(type)s * idx_hvals=&hvals[j_phys*dim_ker_phys[1]]; //This is a pointer to the current row of the kernel
+              if(ind0_log < 0 || ind0_log >= dim_im_log[0]){
+                   // the current row of the kernel is off the image
+              }else{
+                int k = max((int)(pos_n-dim_im_log[1])+1,0);
+                int max_k=min(pos_n+1,(int)dim_ker_log[1]);
+                const %(type)s * idx_in=&in[ind0_phys*dim_im_phys[1]];
+                for (int ind1_log=pos_n-k; k<max_k; k++,ind1_log--) {
+                    if (1)
+                    {
+                                if ((k < %(self_kshp_logical_offset_c)s) || (k - %(self_kshp_logical_offset_c)s) MOD %(self_kshp_logical_stride_c)s)
+                                    continue;
+
+                                if (ind1_log MOD %(self_imshp_logical_stride_c)s)
+                                    continue;
+                    }
+                  sum+= idx_hvals[(k-%(self_kshp_logical_offset_c)s) / %(self_kshp_logical_stride_c)s] * idx_in[ind1_log / %(self_imshp_logical_stride_c)s];
+                }
+              }
+            }else{
+              const %(type)s* idx_in=&in[ind0_phys*dim_im_phys[1]]; //JB: should be dim_im[1] right? (was dim_im[0])
+              const %(type)s* idx_hvals=&hvals[j_phys*dim_ker_phys[1]];
+              int new_n = (pos_n+dim_ker_log[1]-1);
+              if (%(self_imshp_logical_stride_c)s != 1)  // a general loop
+              {
+                  for (int k=0,last=new_n; k < dim_ker_log[1]; k++,last--) {
+                        if ((k < %(self_kshp_logical_offset_c)s) || (k - %(self_kshp_logical_offset_c)s) MOD %(self_kshp_logical_stride_c)s)
+                            continue;
+
+                        else if (last MOD %(self_imshp_logical_stride_c)s)
+                            continue;
+                            else
+                            {
+                    sum+=idx_hvals[(k-%(self_kshp_logical_offset_c)s) / %(self_kshp_logical_stride_c)s]*idx_in[last/%(self_imshp_logical_stride_c)s];
+                    }
+                  }
+              }
+              else  // self_imshp_stride_c == 1
+              {
+                  int offset = %(self_kshp_logical_offset_c)s;
+                  int k_phys=0;
+                  for (int k_log=offset,last=new_n-offset; k_log < dim_ker_log[1]; ) {
+                    sum += idx_hvals[k_phys]*idx_in[last];
+                    ++k_phys;
+                    last -= %(self_kshp_logical_stride_c)s;
+                    k_log += %(self_kshp_logical_stride_c)s;
+                  }
+              }
+            }
+          }//for j
+          out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum;
+        }//for n
+      }//for m
+    }//for stack_size
+    if (0 && (mode==FULL)){
+      for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i) 
+        std::cout << " " << out[i];
+      std::cout << "\\n";
+    }
+  }//for n_kern
+}//for b
+Py_XDECREF(img2d);
+Py_XDECREF(filtersflipped);
+"""
+
+
+#########  
+#########  ConvOp c_code for valid mode (uses gemm)
+#########
+
+_conv_op_code_valid_gemm = """
+int typenum=0, typenum_f=0;
+PyArrayObject *ain1=NULL, *ain2=NULL, *img2d_arr=NULL;
+const int NKERN = %(self_nkern)s;
+
+int type_im=PyArray_TYPE(%(img2d)s);
+int type_ker=PyArray_TYPE(%(filtersflipped)s);
+
+npy_intp dim_zz[2]={%(self_outshp0)s,%(self_outshp1)s};
+npy_intp dim_im[2]={%(self_imshp1)s,%(self_imshp2)s};
+npy_intp dim_ker[2]={%(self_kshp0)s,%(self_kshp1)s};
+
+PyArray_Dims img2d_shape;
+npy_intp img2d_dim[4]={1,1,0,0};
+img2d_shape.ptr=img2d_dim;
+img2d_shape.len=4;
+
+PyArray_Dims kerns_shape;
+npy_intp kerns_dim[4]={1,1,0,0};
+kerns_shape.ptr=kerns_dim;
+kerns_shape.len=4;
+PyObject *img2d=NULL, *contig;
+
+if(%(img2d)s->nd==2){
+  img2d_dim[3]=%(img2d)s->dimensions[1];
+  img2d_dim[2]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==3){
+  img2d_dim[3]=%(img2d)s->dimensions[2];
+  img2d_dim[2]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==4){
+  img2d_dim[3]=%(img2d)s->dimensions[3];
+  img2d_dim[2]=%(img2d)s->dimensions[2];
+  img2d_dim[1]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else {
+    PyErr_SetString(PyExc_ValueError, "img doesn't have a valid shape");
+    %(fail)s;
+}
+
+if(%(filtersflipped)s->nd==3){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else if(%(filtersflipped)s->nd==4){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[3];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[1]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else{
+    std::stringstream temp;
+    temp << "nddim="<<%(filtersflipped)s->nd;
+    std::string param = temp.str();
+    PyErr_SetString(PyExc_ValueError,
+      ("kernel doesn't have a valid shape. " + param).c_str());
+    %(fail)s;
+}
+if (NKERN != kerns_dim[0])
+{
+    PyErr_SetString(PyExc_NotImplementedError, "nonsense nkern");
+    %(fail)s;
+}
+
+img2d = PyArray_Newshape(%(img2d)s,&img2d_shape, PyArray_CORDER);
+img2d_arr = (PyArrayObject*)img2d;
+if ((img2d_arr->strides[3] != (npy_intp)sizeof(%(type)s)) 
+     || (img2d_arr->strides[2] != img2d_arr->dimensions[3]*(npy_intp)sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)img2d));
+    Py_DECREF(img2d);
+    img2d = contig;
+    if (!PyArray_ISCONTIGUOUS(img2d)){
+        PyErr_SetString(PyExc_ValueError, "img2d isn't contiguous");
+        %(fail)s;
+    }
+}
+img2d_arr = (PyArrayObject*)img2d;
+
+typenum = PyArray_ObjectType((PyObject*)%(img2d)s, 0);
+typenum_f = PyArray_ObjectType((PyObject*)%(filtersflipped)s, 0);
+if (typenum < 0) {PyErr_SetString(PyExc_ValueError, "Invalid type"); %(fail)s;}
+if (typenum != typenum_f) {PyErr_SetString(PyExc_ValueError, "Input types must match"); %(fail)s;}
+
+if (!img2d) {
+    PyErr_SetString(PyExc_ValueError, "Null argument img2d");
+    %(fail)s;
+}
+if ((!%(z)s)
+  || %(z)s->nd != 4
+  ||(%(z)s->dimensions[0] != %(self_bsize)s)
+  ||(%(z)s->dimensions[1] != %(self_nkern)s)
+  ||(%(z)s->dimensions[2] != dim_zz[0])
+  || (%(z)s->dimensions[3] != dim_zz[1])
+  )
+{
+  {Py_XDECREF(%(z)s);}
+  npy_intp dims[4] = {0,0,0,0};
+  dims[0]=%(self_bsize)s;
+  dims[1]=%(self_nkern)s;
+  dims[2]=dim_zz[0];
+  dims[3]=dim_zz[1];
+  %(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0);
+}else{
+  PyArray_FILLWBYTE((PyObject*)%(z)s,0);
+}
+
+int Os[2];
+Os[0] = dim_im[0]-dim_ker[0]+1;
+Os[1] = dim_im[1]-dim_ker[1]+1;
+
+// allocate a temporary buffer for storing the inner product of each nth kernel row 
+// with each row of an image
+{
+%(type)s * kbuf = (%(type)s *)malloc((Os[0] * NKERN + PyArray_Size((PyObject*)%(filtersflipped)s))* (npy_intp)sizeof(%(type)s));
+int kbufstride = NKERN;
+%(type)s * myfilters = kbuf + Os[0] * NKERN;
+
+//copy out filtersflipped into filters un-flipped format
+//std::cerr << "__filling myfilters__\\n";
+for(int i=0;i < kerns_dim[0];++i){
+    for(int j=0;j < kerns_dim[1];++j){
+        for(int k=0;k < kerns_dim[2];++k){
+            for(int l=0;l < kerns_dim[3];++l){
+                %(type)s * ff = ((%(filtersflipped)s)->nd == 3)
+                    ? (%(type)s *)PyArray_GETPTR3(%(filtersflipped)s, i, kerns_dim[2]-1-k, kerns_dim[3]-1-l)
+                    : (%(type)s *)PyArray_GETPTR4(%(filtersflipped)s, i, j, kerns_dim[2]-1-k, kerns_dim[3]-1-l);
+                myfilters[i * (kerns_dim[1]*kerns_dim[2]*kerns_dim[3]) 
+                          + j * (kerns_dim[2]*kerns_dim[3])
+                          + k * (kerns_dim[3])
+                          + l] = ff[0];
+                //std::cerr << " " << ff[0];
+            }
+            //std::cerr << "\\n";
+        }
+        //std::cerr << "(end of stack/batch " <<j << "/" << i << "  ) \\n";
+    }
+}
+
+//std::cerr << "-----new loop ----\\n";
+for(int b=0;b< %(self_bsize)s;b++){
+    for (int img_col = 0; img_col < Os[1]; ++img_col){
+        for (int filter_row = 0; filter_row < kerns_dim[2]; ++filter_row){
+            for (int stackidx = 0; stackidx < %(self_imshp0)s; ++stackidx){
+                %(type)s * img_colview = 
+                    (%(type)s *)(PyArray_GETPTR4(img2d, b, stackidx, filter_row, img_col));
+                %(type)s * filter_rows = myfilters + stackidx * (kerns_dim[2]*kerns_dim[3]) +
+                filter_row * kerns_dim[3];
+                //std::cerr << "filterview offset: " << filter_rows - myfilters << "\\n";
+
+                char N = 'N'; char T = 'T';
+                int Nz0 = Os[0]; 
+                int Nz1 = NKERN;
+                int K = kerns_dim[3];
+                %(type)s alpha = 1.0;
+                %(type)s beta = stackidx ? 1.0 : 0.0;
+                int imgview_stride = dim_im[1];
+                int filter_rows_stride =kerns_dim[1]*kerns_dim[2]*kerns_dim[3];
+                //remember, Fortran wants a column-major interpretation
+                assert(img2d->strides[3] == (npy_intp)sizeof(%(type)s));
+
+                if (0){
+                    std::cerr << "b " << b << " img_col " << img_col << " filterrow " << filter_row << " stackidx " <<stackidx << "\\n";
+                    std::cerr << "colview (physical layout) stride: " << imgview_stride << "\\n";
+                    for (int ii = 0; ii < Nz0; ++ii){
+                        for (int jj = 0; jj < K; ++jj){
+                            std::cerr << " " << img_colview[ii * imgview_stride + jj];
+                        }
+                        std::cerr << "\\n";
+                    }
+                    std::cerr << "filterview ("<<filter_row<<"'th rows) stride: " << filter_rows_stride << "\\n";
+                    for (int ii = 0; ii < Nz1; ++ii){
+                        for (int jj = 0; jj < K; ++jj){
+                            std::cerr << " " << filter_rows[ii * filter_rows_stride + jj];
+                        }
+                        std::cerr << "\\n";
+                    }
+
+                    std::cerr << Nz1 << " " << Nz0 << " " << K << "\\n" ;
+                }
+
+                %(gemm)s(&T, &N, 
+                    &Nz1, &Nz0, &K,
+                    &alpha, 
+                    filter_rows, &filter_rows_stride,
+                    img_colview, &imgview_stride, 
+                    &beta, kbuf, &kbufstride);
+
+                if (0){
+                    std::cerr << "z (logical layout) beta" << beta << "\\n";
+                    for (int ii = 0; ii < Nz0; ++ii){
+                        for (int jj = 0; jj < Nz1; ++jj){
+                            std::cerr << " " << kbuf[ii * kbufstride + jj];
+                        }
+                        std::cerr << "\\n";
+                    }
+                }
+            }
+            // kbuf now holds the sum over the stack; accumulate it into the output buffer
+            for (int img_row = 0; img_row < Os[0]; ++img_row) {
+                for (int kernel_idx = 0; kernel_idx < NKERN; ++kernel_idx) {
+                    %(type)s * z_p =  (%(type)s *)PyArray_GETPTR4(%(z)s, b, kernel_idx, img_row, img_col);
+                    if (0)
+                    {
+                        if (b >= %(z)s->dimensions[0]) %(fail)s;
+                        if (kernel_idx >= %(z)s->dimensions[1]) %(fail)s;
+                        if (img_row >= %(z)s->dimensions[2]) %(fail)s;
+                        if (img_col >= %(z)s->dimensions[3]) %(fail)s;
+                    }
+                    z_p[0] += kbuf[img_row * kbufstride + kernel_idx];
+                }
+            }
+        }
+    }
+}
+free(kbuf);
+}
+Py_XDECREF(img2d);
+"""
+
+def gen_conv_code_unroll_batch_kern(d,unroll_bsize=1, unroll_ksize=1):
+    """ c_code for ConvOp that unrolls the batch and kernel loops
+    """
+    assert unroll_bsize>0 and unroll_ksize>0
+    if d.has_key("unroll_bsize") or d.has_key("unroll_ksize") or d.has_key("unroll_iter") or d.has_key("unroll_biter") or d.has_key("unroll_kiter"):
+        raise Exception("We can't use this dictionary as we would overwrite some of its content")
+    d=d.copy()
+        
+    d["unroll_bsize"]=unroll_bsize
+    d["unroll_ksize"]=unroll_ksize
+    def my_dup(st,size):
+        s=""
+        for i in range(size):
+            d["unroll_iter"]=i
+            s+=st%d
+        return s+"\n"
+    def my_dup2(st):
+        s=""
+        iter=0
+        for i in range(unroll_bsize):
+            d["unroll_biter"]=i
+            for j in range(unroll_ksize):
+                d["unroll_kiter"]=j
+                d["unroll_iter"]=iter
+                iter+=1
+                s+=st%d
+        return s+"\n"
+    ret = """
+const int mode=%(mode)s;
+int typenum=0, typenum_f=0;
+PyArrayObject *ain1=NULL, *ain2=NULL, *filtersflipped_arr=NULL, *img2d_arr=NULL;
+const %(type)s fill_value = 0;
+
+int type_im=PyArray_TYPE(%(img2d)s);
+int type_ker=PyArray_TYPE(%(filtersflipped)s);
+
+npy_intp dim_zz[2]={%(self_outshp0)s,%(self_outshp1)s};
+npy_intp dim_im[2]={%(self_imshp1)s,%(self_imshp2)s};
+npy_intp dim_ker[2]={%(self_kshp0)s,%(self_kshp1)s};
+
+PyArray_Dims img2d_shape;
+npy_intp img2d_dim[4]={1,1,0,0};
+img2d_shape.ptr=img2d_dim;
+img2d_shape.len=4;
+
+PyArray_Dims kerns_shape;
+npy_intp kerns_dim[4]={1,1,0,0};
+kerns_shape.ptr=kerns_dim;
+kerns_shape.len=4;
+PyObject *img2d=NULL, *contig, *filtersflipped=NULL;
+
+if(%(img2d)s->nd==2){
+  img2d_dim[3]=%(img2d)s->dimensions[1];
+  img2d_dim[2]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==3){
+  img2d_dim[3]=%(img2d)s->dimensions[2];
+  img2d_dim[2]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==4){
+  img2d_dim[3]=%(img2d)s->dimensions[3];
+  img2d_dim[2]=%(img2d)s->dimensions[2];
+  img2d_dim[1]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else {
+    std::stringstream temp;
+    temp << "ndim="<<%(img2d)s->nd;
+    std::string param = temp.str();
+    PyErr_SetString(PyExc_ValueError,
+      ("img doesn't have a valid shape. " + param).c_str());
+    %(fail)s;
+}
+
+if(%(filtersflipped)s->nd==3){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else if(%(filtersflipped)s->nd==4){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[3];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[1]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else{
+    PyErr_SetString(PyExc_ValueError, "kernel doesn't have a valid shape");
+    %(fail)s;
+}
+
+img2d = PyArray_Newshape(%(img2d)s,&img2d_shape, PyArray_CORDER);
+img2d_arr = (PyArrayObject*)img2d;
+if ((img2d_arr->strides[3] != (npy_intp)sizeof(%(type)s)) 
+     || (img2d_arr->strides[2] != img2d_arr->dimensions[3]*(npy_intp)sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)img2d));
+    Py_DECREF(img2d);
+    img2d = contig;
+    if (!PyArray_ISCONTIGUOUS(img2d)){
+        PyErr_SetString(PyExc_ValueError, "img2d isn't contiguous");
+        %(fail)s;
+    }
+}
+img2d_arr = (PyArrayObject*)img2d;
+
+filtersflipped = PyArray_Newshape(%(filtersflipped)s,&kerns_shape, PyArray_CORDER);
+filtersflipped_arr = (PyArrayObject*)filtersflipped;
+if ((filtersflipped_arr->strides[3] != (npy_intp)sizeof(%(type)s)) 
+     || (filtersflipped_arr->strides[2] != filtersflipped_arr->dimensions[3]*(npy_intp)sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)filtersflipped));
+    Py_DECREF(filtersflipped);
+    filtersflipped = contig;
+    if (!PyArray_ISCONTIGUOUS(filtersflipped)){
+        PyErr_SetString(PyExc_ValueError, "filtersflipped isn't contiguous");
+        %(fail)s;
+    }
+}
+filtersflipped_arr = (PyArrayObject*)filtersflipped;
+
+if(mode != VALID && mode != FULL){
+  PyErr_SetString(PyExc_ValueError, "invalid mode, only full and valid are supported"); %(fail)s;
+}
+typenum = PyArray_ObjectType((PyObject*)%(img2d)s, 0);
+typenum_f = PyArray_ObjectType((PyObject*)%(filtersflipped)s, 0);
+if (typenum < 0) {PyErr_SetString(PyExc_ValueError, "Invalid type"); %(fail)s;}
+if (typenum != typenum_f) {PyErr_SetString(PyExc_ValueError, "Input types must match"); %(fail)s;}
+
+if (!img2d) %(fail)s;
+if (!filtersflipped) %(fail)s;
+if ((!%(z)s)
+  || %(z)s->nd != 4
+  ||(%(z)s->dimensions[0] != %(self_bsize)s)
+  ||(%(z)s->dimensions[1] != %(self_nkern)s)
+  ||(%(z)s->dimensions[2] != dim_zz[0])
+  || (%(z)s->dimensions[3] != dim_zz[1])
+  )
+{
+  {Py_XDECREF(%(z)s);}
+  npy_intp dims[4] = {0,0,0,0};
+  dims[0]=%(self_bsize)s;
+  dims[1]=%(self_nkern)s;
+  dims[2]=dim_zz[0];
+  dims[3]=dim_zz[1];
+  %(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0);
+}else{
+  //PyArray_FILLWBYTE((PyObject*)%(z)s,0);
+}
+
+int Os[2];
+Os[0]=%(self_outshp0)s;
+Os[1]=%(self_outshp1)s;
+
+for(int b=0;b< %(self_bsize)s ;b+=%(unroll_bsize)s){
+  for(int n_kern=0;n_kern<%(self_nkern)s;n_kern+=%(unroll_ksize)s){
+
+    //assertions
+    if (%(z)s->strides[0] != %(z)s->dimensions[1] *%(z)s->dimensions[2] *%(z)s->dimensions[3] * (npy_intp)sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[1] != %(z)s->dimensions[2] * %(z)s->dimensions[3] * (npy_intp)sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[2] != %(z)s->dimensions[3] * (npy_intp)sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[3] != (npy_intp)sizeof(%(type)s)) %(fail)s;
+"""%d
+    ret+=my_dup2("%(type)s * __restrict__ out%(unroll_iter)s=(%(type)s *)(PyArray_GETPTR2(%(z)s,b+%(unroll_biter)s,n_kern+%(unroll_kiter)s));")
+    ret+=my_dup("for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i) out%(unroll_iter)s[i] = 0;",unroll_bsize*unroll_ksize)
+    ret+="""
+    for(int stack_size=0;stack_size<%(self_imshp0)s;stack_size++){
+"""%d
+    ret+=my_dup("const %(type)s * __restrict__ in%(unroll_iter)d=(%(type)s *)(PyArray_GETPTR2(img2d,b+%(unroll_iter)s,stack_size));", unroll_bsize)
+    ret+=my_dup("const %(type)s * __restrict__ hvals%(unroll_iter)s=(%(type)s *)(PyArray_GETPTR2(filtersflipped,n_kern+%(unroll_iter)s,stack_size));",unroll_ksize)
+    ret+="""
+
+      int new_m;
+
+      for (int iter_m=0; iter_m < Os[0]; iter_m++) {
+        // Reposition index into input image based on requested output size
+        int pos_m = iter_m*%(self_dx)s;//The position of the patch in the image
+        if (mode == FULL) new_m = pos_m ;
+        else new_m = (pos_m+dim_ker[0]-1);
+
+        for (int iter_n=0; iter_n < Os[1]; iter_n++) {  // loop over columns 
+          int pos_n=iter_n*%(self_dy)s;
+        """%d
+    ret+=my_dup("%(type)s sum%(unroll_iter)s=0;", unroll_bsize*unroll_ksize)
+    ret+="""
+
+          // Sum over the kernel; where the index into the image is out of
+          // bounds, use fill_value instead
+          for (int j=0; j < dim_ker[0]; j++) {
+            int ind0 = (new_m-j);
+
+            if(mode==FULL){
+"""%d
+    ret+=my_dup("const %(type)s * idx_hvals%(unroll_iter)s=&hvals%(unroll_iter)s[j*dim_ker[1]];",unroll_ksize)
+    ret+="""
+              if(ind0 < 0 || ind0 >= dim_im[0]){
+                if(fill_value!=0)
+                  for (int k=0; k < dim_ker[1]; k++) {
+"""%d
+    ret+=my_dup2("sum%(unroll_iter)s += idx_hvals%(unroll_kiter)s[k] * fill_value;")
+    ret+="""
+                  }
+              }else{
+                //do the part where kernel is to the right of the img
+
+                int k=0,max_k=max((int)(pos_n-dim_im[1])+1,0);
+                if(fill_value!=0){ 
+                
+                  for(k=0;k<max_k;k++){
+"""%d
+    ret+=my_dup2("sum%(unroll_iter)s += idx_hvals%(unroll_kiter)s[k] * fill_value;")
+    ret+="""
+                  }
+                }else {k=max_k;}
+                
+                //do the part where the kernel is on the img
+                max_k=min(pos_n+1,(int)dim_ker[1]);
+"""%d
+    ret+=my_dup("const %(type)s * idx_in%(unroll_iter)s=&in%(unroll_iter)s[ind0*dim_im[1]];", unroll_bsize)
+    ret+="""
+                for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
+
+"""%d
+    ret+=my_dup2("sum%(unroll_iter)s+= idx_hvals%(unroll_kiter)s[k] * idx_in%(unroll_biter)s[ind1];")
+    ret+="""
+                }
+                //do the part to the left of the img
+                if(fill_value!=0)
+                  for(;k<dim_ker[1];k++){
+"""%d
+    ret+=my_dup2("sum%(unroll_iter)s += idx_hvals%(unroll_kiter)s[k] * fill_value;")
+    ret+="""
+                  }
+              }
+            }else{//valid mode
+"""%d
+    ret+=my_dup("const %(type)s* idx_in%(unroll_iter)s=&in%(unroll_iter)s[ind0*dim_im[1]];", unroll_bsize)
+    ret+=my_dup("const %(type)s* idx_hvals%(unroll_iter)s=&hvals%(unroll_iter)s[j*dim_ker[1]];",unroll_ksize)
+    ret+="""
+              int new_n = (pos_n+dim_ker[1]-1);
+
+              for (int k=0,last=new_n; k < dim_ker[1]; k++,last--) {
+"""%d
+    ret+=my_dup2("sum%(unroll_iter)s+=idx_hvals%(unroll_kiter)s[k]*idx_in%(unroll_biter)s[last];")
+    ret+="""
+              }
+            }
+
+          }//for j
+"""%d
+    ret+=my_dup("out%(unroll_iter)s[iter_m*dim_zz[1]+iter_n] %(affectation)s sum%(unroll_iter)s;", unroll_bsize*unroll_ksize)
+    ret+="""
+        }//for n
+      }//for m
+    }//for stack_size
+  }//for n_kern
+}//for b
+Py_XDECREF(img2d);
+Py_XDECREF(filtersflipped);
+"""
+    return ret
+
+_conv_op_code_unroll_patch = """
+const int mode=%(mode)s;
+int typenum=0, typenum_f=0;
+PyArrayObject *ain1=NULL, *ain2=NULL, *filtersflipped_arr=NULL, *img2d_arr=NULL;
+const %(type)s fill_value = 0;//only a fill_value of 0 is currently tested and correctly implemented
+
+int type_im=PyArray_TYPE(%(img2d)s);
+int type_ker=PyArray_TYPE(%(filtersflipped)s);
+
+const npy_intp dim_im[2]={%(self_imshp1)s,%(self_imshp2)s};
+const npy_intp dim_ker[2]={%(self_kshp0)s,%(self_kshp1)s};
+%(dim_zz_const)s npy_intp dim_zz[2]={%(self_outshp0)s,%(self_outshp1)s};
+#if !%(all_shape)s
+  if (mode == FULL) {
+    dim_zz[0] = (int)ceil((dim_im[0]+dim_ker[0]-1)/float(%(self_dx)s));
+    dim_zz[1] = (int)ceil((dim_im[1]+dim_ker[1]-1)/float(%(self_dy)s));
+  } else {
+    dim_zz[0] = (int)ceil((dim_im[0]-dim_ker[0]+1)/float(%(self_dx)s));
+    dim_zz[1] = (int)ceil((dim_im[1]-dim_ker[1]+1)/float(%(self_dy)s));
+  }
+#endif
+PyArray_Dims img2d_shape;
+npy_intp img2d_dim[4]={1,1,0,0};
+img2d_shape.ptr=img2d_dim;
+img2d_shape.len=4;
+
+PyArray_Dims kerns_shape;
+npy_intp kerns_dim[4]={1,1,0,0};
+kerns_shape.ptr=kerns_dim;
+kerns_shape.len=4;
+PyObject *img2d=NULL, *contig, *filtersflipped=NULL;
+
+if(%(img2d)s->nd==2){
+  img2d_dim[3]=%(img2d)s->dimensions[1];
+  img2d_dim[2]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==3){
+  img2d_dim[3]=%(img2d)s->dimensions[2];
+  img2d_dim[2]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else if(%(img2d)s->nd==4){
+  img2d_dim[3]=%(img2d)s->dimensions[3];
+  img2d_dim[2]=%(img2d)s->dimensions[2];
+  img2d_dim[1]=%(img2d)s->dimensions[1];
+  img2d_dim[0]=%(img2d)s->dimensions[0];
+}else {
+    PyErr_Format(PyExc_ValueError,
+      "image has an unsupported number of dimensions: %%d. ", %(img2d)s->nd);
+    %(fail)s;
+}
+
+if(%(filtersflipped)s->nd==3){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else if(%(filtersflipped)s->nd==4){
+  kerns_dim[3]=%(filtersflipped)s->dimensions[3];
+  kerns_dim[2]=%(filtersflipped)s->dimensions[2];
+  kerns_dim[1]=%(filtersflipped)s->dimensions[1];
+  kerns_dim[0]=%(filtersflipped)s->dimensions[0];
+}else{
+    PyErr_Format(PyExc_ValueError,
+      "kernel has an unsupported number of dimensions: %%d. ", %(filtersflipped)s->nd);
+    %(fail)s;
+}
+
+img2d = PyArray_Newshape(%(img2d)s,&img2d_shape, PyArray_CORDER);
+img2d_arr = (PyArrayObject*)img2d;
+if ((img2d_arr->strides[3] != sizeof(%(type)s)) 
+     || (img2d_arr->strides[2] != img2d_arr->dimensions[3]*sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)img2d));
+    Py_DECREF(img2d);
+    img2d = contig;
+    if (!PyArray_ISCONTIGUOUS(img2d)){
+        PyErr_SetString(PyExc_ValueError, "img2d isn't contiguous");
+        %(fail)s;
+    }
+}
+img2d_arr = (PyArrayObject*)img2d;
+
+filtersflipped = PyArray_Newshape(%(filtersflipped)s,&kerns_shape, PyArray_CORDER);
+filtersflipped_arr = (PyArrayObject*)filtersflipped;
+if ((filtersflipped_arr->strides[3] != sizeof(%(type)s)) 
+     || (filtersflipped_arr->strides[2] != filtersflipped_arr->dimensions[3]*sizeof(%(type)s))){
+    contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)filtersflipped));
+    Py_DECREF(filtersflipped);
+    filtersflipped = contig;
+    if (!PyArray_ISCONTIGUOUS(filtersflipped)){
+        PyErr_SetString(PyExc_ValueError, "filtersflipped isn't contiguous");
+        %(fail)s;
+    }
+}
+filtersflipped_arr = (PyArrayObject*)filtersflipped;
+
+if(mode != VALID && mode != FULL){
+  PyErr_SetString(PyExc_ValueError, "invalid mode, only full and valid are supported"); %(fail)s;
+}
+
+if(dim_zz[0]<=0 || dim_zz[1]<=0){
+PyErr_Format(PyExc_ValueError,
+      "Output dimensions are not valid %%ldx%%ld",(long)dim_zz[0],(long)dim_zz[1]);
+      %(fail)s;
+}
+
+typenum = PyArray_ObjectType((PyObject*)%(img2d)s, 0);
+typenum_f = PyArray_ObjectType((PyObject*)%(filtersflipped)s, 0);
+if (typenum < 0) {PyErr_SetString(PyExc_ValueError, "Invalid type"); %(fail)s;}
+if (typenum != typenum_f) {PyErr_SetString(PyExc_ValueError, "Input types must match"); %(fail)s;}
+
+if (!img2d) %(fail)s;
+if (!filtersflipped) %(fail)s;
+if ((!%(z)s)
+  || %(z)s->nd != 4
+  ||(%(z)s->dimensions[0] != %(self_bsize)s)
+  ||(%(z)s->dimensions[1] != %(self_nkern)s)
+  ||(%(z)s->dimensions[2] != dim_zz[0])
+  || (%(z)s->dimensions[3] != dim_zz[1])
+  )
+{
+  if (%(z)s) Py_DECREF(%(z)s);
+  npy_intp dims[4] = {0,0,0,0};
+  if(!dims) %(fail)s;
+  dims[0]=%(self_bsize)s;
+  dims[1]=%(self_nkern)s;
+  dims[2]=dim_zz[0];
+  dims[3]=dim_zz[1];
+  %(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0);
+}else{
+  //PyArray_FILLWBYTE((PyObject*)%(z)s,0);
+}
+
+for(int b=0;b< %(self_bsize)s;b++){
+  for(int n_kern=0;n_kern<%(self_nkern)s;n_kern++){
+
+    //assertions
+    if (%(z)s->strides[0] != %(z)s->dimensions[1] *%(z)s->dimensions[2] *%(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[1] != %(z)s->dimensions[2] * %(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[2] != %(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
+    if (%(z)s->strides[3] != sizeof(%(type)s)) %(fail)s;
+
+    %(type)s * __restrict__ out=(%(type)s *)(PyArray_GETPTR2(%(z)s,b,n_kern));
+    for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i) out[i] = 0;
+
+    for(int stack_size=0;stack_size<%(self_imshp0)s;stack_size++){
+
+      const %(type)s * __restrict__ in=(%(type)s *)(PyArray_GETPTR2(img2d,b,stack_size));
+      const %(type)s * __restrict__ hvals=(%(type)s *)(PyArray_GETPTR2(filtersflipped,n_kern,stack_size));
+
+      int new_m;
+
+      for (int iter_m=0; iter_m < dim_zz[0]; iter_m++) {
+        // Reposition index into input image based on requested output size
+        int pos_m = iter_m*%(self_dx)s;//The position of the patch in the image
+        if (mode == FULL) new_m = pos_m ;
+        else new_m = (pos_m+dim_ker[0]-1);
+
+        for (int iter_n=0; iter_n < dim_zz[1]; iter_n++) {  // loop over columns
+          int pos_n=iter_n*%(self_dy)s;
+          %(type)s sum=0;
+          %(type)s sum2=0;
+          %(type)s sum3=0;
+          %(type)s sum4=0;
+          int nb_sum=0;
+          // Sum over the kernel; where the index into the image is out of
+          // bounds, use fill_value instead
+          for (int j=0; j < dim_ker[0]; j++) {
+            int ind0 = (new_m-j);
+
+            if(mode==FULL){
+              const %(type)s * idx_hvals=&hvals[j*dim_ker[1]];
+              if(ind0 < 0 || ind0 >= dim_im[0]){
+                if(fill_value!=0)
+                  for (int k=0; k < dim_ker[1]; k++) {
+                    sum+= idx_hvals[k] * fill_value;
+                  }
+              }else{
+                //do the part where kernel is to the right of the img
+                int k=0,max_k=max((int)(pos_n-dim_im[1])+1,0);
+                if(fill_value!=0){ 
+                
+                  for(k=0;k<max_k;k++){
+                    sum+= idx_hvals[k]*fill_value;
+                  }
+                }else {k=max_k;}
+                
+                //do the part where the kernel is on the img
+                max_k=min(pos_n+1,(int)dim_ker[1]);
+                const %(type)s * idx_in=&in[ind0*dim_im[1]];
+
+                if(iter_n + 4*%(self_dy)s < dim_zz[1]
+                         && iter_n>dim_ker[1]-1+3 
+                         && iter_n<dim_im[1]-dim_ker[1]+1-3){
+                  nb_sum=4;
+                  for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
+                    sum+=idx_hvals[k]*idx_in[ind1];
+                    sum2+=idx_hvals[k]*idx_in[ind1+%(self_dy)s];
+                    sum3+=idx_hvals[k]*idx_in[ind1+2*%(self_dy)s];
+                    sum4+=idx_hvals[k]*idx_in[ind1+3*%(self_dy)s];
+                  }
+                }else if(iter_n + 2*%(self_dy)s < dim_zz[1] 
+                         && iter_n>dim_ker[1]-1
+                         && iter_n<dim_im[1]-dim_ker[1]+1){
+                  nb_sum=2;
+                  for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
+                    sum+=idx_hvals[k]*idx_in[ind1];
+                    sum2+=idx_hvals[k]*idx_in[ind1+%(self_dy)s];
+                  }
+                }else{
+                  nb_sum=1;
+                  /*
+                  %(type)s sum_=0;
+                  if((k-max_k) & 0x1 != 0){
+                    sum+= idx_hvals[k] * idx_in[pos_n-k];
+                  }
+                  for (int ind1=pos_n-k; k<max_k; k+=2,ind1-=2) {
+                    sum+= idx_hvals[k] * idx_in[ind1];
+                    sum_+= idx_hvals[k+1] * idx_in[ind1-1];
+                  }
+                  sum+=sum_;
+                  */
+                  for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
+                    sum+=idx_hvals[k]*idx_in[ind1];
+                  }
+                }
+                //do the part to the left of the img
+                if(fill_value!=0)
+                  for(;k<dim_ker[1];k++) sum+= idx_hvals[k]*fill_value;
+              }
+            }else{//valid mode
+              const %(type)s* idx_in=&in[ind0*dim_im[1]];
+              const %(type)s* idx_hvals=&hvals[j*dim_ker[1]];
+              if(iter_n + 4*%(self_dy)s < dim_zz[1]){
+                nb_sum=4;
+                for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
+                  sum+=idx_hvals[k]*idx_in[im_idx];
+                  sum2+=idx_hvals[k]*idx_in[im_idx+%(self_dy)s];
+                  sum3+=idx_hvals[k]*idx_in[im_idx+2*%(self_dy)s];
+                  sum4+=idx_hvals[k]*idx_in[im_idx+3*%(self_dy)s];
+                }
+              }else if(iter_n + 2*%(self_dy)s < dim_zz[1]){
+                nb_sum=2;
+                for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
+                  sum+=idx_hvals[k]*idx_in[im_idx];
+                  sum2+=idx_hvals[k]*idx_in[im_idx+%(self_dy)s];
+                }
+              }else{
+                nb_sum=1;
+                for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
+                  sum+=idx_hvals[k]*idx_in[im_idx];
+                }
+              }
+            }//else valid mode
+          }//for j
+          // the cases fall through on purpose, so all nb_sum outputs get written
+          switch(nb_sum){
+          case 4: out[iter_m*dim_zz[1]+iter_n+3] %(affectation)s sum4;
+          case 3: out[iter_m*dim_zz[1]+iter_n+2] %(affectation)s sum3;
+          case 2: out[iter_m*dim_zz[1]+iter_n+1] %(affectation)s sum2;
+          case 1: out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum;
+          }
+          iter_n+=nb_sum-1;
+        }//for iter_n
+      }//for iter_m
+    }//for stack_size
+  }//for n_kern
+}//for b
+Py_XDECREF(img2d);
+Py_XDECREF(filtersflipped);
+"""
--- a/theano/tensor/signal/downsample.py
+++ b/theano/tensor/signal/downsample.py
+""" Ops for downsampling images.
+
+Planned: 
+DownsampleFactorMax, DownsampleAvg, DownsampleSoftmax.
+
+"""
+#This file should move along with conv.py
+
+from theano import gof, Op, tensor, Variable, Apply
+import numpy, theano
+import __builtin__
+
+def max_pool2D(input, ds, ignore_border=False):
+    """
+    Takes as input an N-D tensor, where N >= 2. It downscales the input image by
+    the specified factor, keeping only the maximum value of each non-overlapping
+    patch of size (ds[0],ds[1]).
+
+    :type input: N-D theano tensor of input images.
+    :param input: input images. Max pooling will be done over the 2 last dimensions.
+    :type ds: tuple of length 2
+    :param ds: factor by which to downscale. (2,2) will halve the image in each
+               dimension.
+    :type ignore_border: bool
+    :param ignore_border: When True, (5,5) input with ds=(2,2)
+                         will generate a (2,2) output. (3,3) otherwise.
+    """
+    if input.ndim < 2:
+        raise NotImplementedError('max_pool2D requires an input with at least 2 dimensions')
+
+    # extract image dimensions
+    img_shape = input.shape[-2:]
+
+    # collapse all "leading" dimensions into one batch size, stored as a 1-element vector
+    batch_size = tensor.prod(input.shape[:-2])
+    batch_size = tensor.shape_padright(batch_size,1)
+
+    # store as 4D tensor with shape: (batch_size,1,height,width)
+    new_shape = tensor.cast(tensor.join(0, batch_size, 
+        tensor.as_tensor([1,]), img_shape), 'int64')
+    input_4D = tensor.reshape(input, new_shape, ndim=4)
+
+    # downsample mini-batch of images
+    op = DownsampleFactorMax(ds, ignore_border)
+    output = op(input_4D)
+
+    # restore to original shape
+    outshp = tensor.join(0, input.shape[:-2], output.shape[-2:])
+    return tensor.reshape(output, outshp, ndim=input.ndim)
+
+
+class DownsampleFactorMax(Op):
+    """
+    For N-dimensional tensors, consider that the last two dimensions span images.  
+    This Op downsamples these images by a factor ds, by taking the max over non-
+    overlapping rectangular regions.
+    """
+
+    @staticmethod
+    def out_shape(imgshape, ds, ignore_border=False):
+        """Return the shape of the output from this op, for input of given shape and flags.
+
+        :param imgshape: the shape of a tensor of images. The last two elements are interpreted
+        as the number of rows, and the number of cols.
+        :type imgshape: tuple, list, or similar.
+
+        :param ds: downsample factor over rows and columns
+        :type ds: list or tuple of two ints
+
+        :param ignore_border: if ds doesn't divide imgshape, do we include an extra row/col of
+        partial downsampling (False) or ignore it (True).
+        :type ignore_border: bool
+
+        :rtype: list
+        :returns: the shape of the output from this op, for input of given shape.  This will
+        have the same length as imgshape, but with last two elements reduced as per the
+        downsampling & ignore_border flags.  
+        """
+        if len(imgshape) < 2:
+            raise TypeError('imgshape must have at least two elements (rows, cols)')
+        r, c = imgshape[-2:]
+        rval = list(imgshape[:-2])+[ r/ds[0], c/ds[1]]
+        if not ignore_border:
+            if r % ds[0]:
+                rval[-2] += 1
+            if c % ds[1]:
+                rval[-1] += 1
+        return rval
+
+    def __init__(self, ds, ignore_border=False):
+        """
+        :param ds: downsample factor over rows and columns
+        :type ds: list or tuple of two ints
+
+        :param ignore_border: if ds doesn't divide imgshape, do we include an extra row/col of
+        partial downsampling (False) or ignore it (True).
+        :type ignore_border: bool
+
+        TODO: why is poolsize an op parameter here?
+        """
+        self.ds = tuple(ds)
+        self.ignore_border = ignore_border
+
+    def __eq__(self, other):
+        return type(self) == type(other) and self.ds == other.ds and self.ignore_border == other.ignore_border
+
+    def __hash__(self):
+        return hash(type(self)) ^ hash(self.ds) ^ hash(self.ignore_border)
+
+    def __str__(self):
+        return '%s{%s,%s}' % (self.__class__.__name__, self.ds, self.ignore_border)
+
+    def make_node(self, x):
+        if x.type.ndim != 4:
+            raise TypeError('DownsampleFactorMax requires a 4D input')
+        # TODO: consider restricting the dtype?
+        return gof.Apply(self, [x], [x.type()])
+
+    def perform(self, node, (x,), (z,)):
+        """
+        Pure-Python reference implementation: max over each (ds0, ds1) patch.
+        """
+        if len(x.shape)!=4:
+            raise NotImplementedError('DownsampleFactorMax requires 4D input for now')
+        # Always start from -inf: reused output storage would hold stale maxima.
+        z[0] = numpy.zeros(self.out_shape(x.shape, self.ds, self.ignore_border)) -float('inf')
+        z[0] = theano._asarray(z[0], dtype=x.dtype)
+        zz=z[0]
+        ds0, ds1 = self.ds
+        if self.ignore_border:
+            x_usable2 = (x.shape[2] / ds0 * ds0)
+        else: x_usable2 = x.shape[2]
+        if self.ignore_border:
+            x_usable3 = (x.shape[3] / ds1 * ds1)
+        else: x_usable3 = x.shape[3]
+        for n in xrange(x.shape[0]):
+            for k in xrange(x.shape[1]):
+                for i in xrange(x_usable2):
+                    zi = i / ds0
+                    for j in xrange(x_usable3):
+                        zj = j / ds1
+                        zz[n,k,zi,zj] = __builtin__.max(zz[n,k,zi,zj], x[n,k,i,j])
+
+    def grad(self,(x,), (gz,)):
+        maxout = self(x)
+        return [DownsampleFactorMaxGrad(self.ds, ignore_border=self.ignore_border)(x, maxout, gz)]
+
+    def c_code(self, node, name, (x,), (z, ), sub):
+        fail=sub['fail']
+        ignore_border = int(self.ignore_border)
+        ds0, ds1 = self.ds
+        return """
+        int typenum = PyArray_ObjectType((PyObject*)%(x)s, 0);
+        int x_shp0_usable;
+        int x_shp1_usable;
+        int z_shp0, z_shp1;
+        if(%(x)s->nd!=4) 
+        {
+            PyErr_SetString(PyExc_ValueError, "x must be a 4d ndarray");
+            %(fail)s;
+        }
+        z_shp0 = %(x)s->dimensions[2] / %(ds0)s;
+        z_shp1 = %(x)s->dimensions[3] / %(ds1)s;
+        if (%(ignore_border)s)
+        {
+            x_shp0_usable = z_shp0 * %(ds0)s;
+            x_shp1_usable = z_shp1 * %(ds1)s;
+        }
+        else
+        {
+            z_shp0 += (%(x)s->dimensions[2] %% %(ds0)s) ? 1 : 0;
+            z_shp1 += (%(x)s->dimensions[3] %% %(ds1)s) ? 1 : 0;
+            x_shp0_usable = %(x)s->dimensions[2];
+            x_shp1_usable = %(x)s->dimensions[3];
+        }
+        if ((!%(z)s)
+          || %(z)s->nd != 4
+          ||(%(z)s->dimensions[0] != %(x)s->dimensions[0])
+          ||(%(z)s->dimensions[1] != %(x)s->dimensions[1])
+          ||(%(z)s->dimensions[2] != z_shp0)
+          ||(%(z)s->dimensions[3] != z_shp1)
+          )
+        {
+          if (%(z)s) Py_XDECREF(%(z)s);
+          npy_intp dims[4] = {0,0,0,0};
+          dims[0]=%(x)s->dimensions[0];
+          dims[1]=%(x)s->dimensions[1];
+          dims[2]=z_shp0;
+          dims[3]=z_shp1;
+          %(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0); //TODO: zeros not necessary
+        }
+
+        if (z_shp0 && z_shp1)
+        {
+            for(int b=0;b<%(x)s->dimensions[0];b++){
+              for(int k=0;k<%(x)s->dimensions[1];k++){
+                int mini_i = 0;
+                int zi = 0;
+                for(int i=0;i< x_shp0_usable; i++){
+                  int mini_j = 0;
+                  int zj = 0;
+                  for(int j=0; j<x_shp1_usable; j++){
+                    dtype_%(x)s a = ((dtype_%(x)s*)(PyArray_GETPTR4(%(x)s,b,k,i,j)))[0];
+                    dtype_%(z)s * __restrict__ z = ((dtype_%(z)s*)(PyArray_GETPTR4(%(z)s,b,k,zi,zj)));
+                    z[0] = (((mini_j|mini_i) == 0) || z[0] < a) ? a : z[0];
+                    mini_j = ((mini_j + 1) == %(ds1)s) ? 0 : mini_j+1;
+                    zj += (mini_j == 0);
+                  }
+                  mini_i = ((mini_i + 1) == %(ds0)s) ? 0 : mini_i+1;
+                  zi += (mini_i == 0);
+                }
+              }
+            }
+        }
+        """ % locals()
+
+    def c_code_cache_version(self):
+        return ()
+
+
+class DownsampleFactorMaxGrad(Op):
+
+    def __init__(self, ds, ignore_border):
+        self.ds = tuple(ds)
+        self.ignore_border = ignore_border
+
+    def __eq__(self, other):
+        return type(self) == type(other) and self.ds == other.ds and self.ignore_border == other.ignore_border
+
+    def __hash__(self):
+        return hash(type(self)) ^ hash(self.ds) ^ hash(self.ignore_border)
+
+    def __str__(self):
+        return '%s{%s,%s}' % (self.__class__.__name__, self.ds, self.ignore_border)
+
+    def make_node(self, x, maxout, gz):
+        # make_node should only be called by the grad function of DownsampleFactorMax,
+        # so these asserts should not fail.
+        assert isinstance(x, Variable) and x.ndim==4
+        assert isinstance(maxout, Variable) and maxout.ndim==4
+        assert isinstance(gz, Variable) and gz.ndim==4
+
+        return Apply(self, [x, maxout, gz], [x.type()])
+
+    def perform(self, node, (x, maxout, gz), (gx_stg,)):
+        gx = numpy.zeros_like(x)
+
+        ds0, ds1 = self.ds
+        if self.ignore_border:
+            # only rows/columns covered by a complete pooling region are used
+            shape2 = (x.shape[2] // ds0 * ds0)
+            shape3 = (x.shape[3] // ds1 * ds1)
+        else:
+            shape2 = x.shape[2]
+            shape3 = x.shape[3]
+        for n in xrange(x.shape[0]):
+            for k in xrange(x.shape[1]):
+                for i in xrange(shape2):
+                    zi = i // ds0
+                    for j in xrange(shape3):
+                        zj = j // ds1
+                        if (maxout[n,k,zi,zj] == x[n,k,i,j]):
+                            gx[n,k,i,j] = gz[n,k,zi,zj]
+                        else: gx[n,k,i,j] = 0
+        gx_stg[0] = gx
+
+    def c_code(self, node, name, (x, z, gz), (gx,), sub):
+        fail = sub['fail']
+        ignore_border = int(self.ignore_border)
+        ds0, ds1 = self.ds
+        return """
+        int x_typenum = PyArray_ObjectType((PyObject*)%(x)s, 0);
+        int z_typenum = PyArray_ObjectType((PyObject*)%(z)s, 0);
+        int gz_typenum = PyArray_ObjectType((PyObject*)%(gz)s, 0);
+        int x_shp0_usable;
+        int x_shp1_usable;
+        int z_shp0, z_shp1;
+        if ((x_typenum != z_typenum) || (x_typenum != gz_typenum))
+        {
+            PyErr_SetString(PyExc_ValueError, "input types must all match");
+            %(fail)s;
+        }
+        if(%(x)s->nd!=4) 
+        {
+            PyErr_SetString(PyExc_ValueError, "x must be a 4d ndarray");
+            %(fail)s;
+        }
+        if(%(z)s->nd!=4) 
+        {
+            PyErr_SetString(PyExc_ValueError, "z must be a 4d ndarray");
+            %(fail)s;
+        }
+        if(%(gz)s->nd!=4) 
+        {
+            PyErr_SetString(PyExc_ValueError, "gz must be a 4d ndarray");
+            %(fail)s;
+        }
+        z_shp0 = %(z)s->dimensions[2];
+        z_shp1 = %(z)s->dimensions[3];
+        if (%(ignore_border)s)
+        {
+            x_shp0_usable = z_shp0 * %(ds0)s;
+            x_shp1_usable = z_shp1 * %(ds1)s;
+        }
+        else
+        {
+            x_shp0_usable = %(x)s->dimensions[2];
+            x_shp1_usable = %(x)s->dimensions[3];
+        }
+        if ((!%(gx)s)
+          || %(gx)s->nd != 4
+          ||(%(gx)s->dimensions[0] != %(x)s->dimensions[0])
+          ||(%(gx)s->dimensions[1] != %(x)s->dimensions[1])
+          ||(%(gx)s->dimensions[2] != %(x)s->dimensions[2])
+          ||(%(gx)s->dimensions[3] != %(x)s->dimensions[3])
+          )
+        {
+          Py_XDECREF(%(gx)s);
+          %(gx)s = (PyArrayObject*) PyArray_ZEROS(4, %(x)s->dimensions, x_typenum,0);
+        }
+
+        for(int b=0;b<%(x)s->dimensions[0];b++){
+          for(int k=0;k<%(x)s->dimensions[1];k++){
+            int mini_i = 0;
+            int zi = 0;
+            for(int i=0;i< x_shp0_usable; i++){
+               int mini_j = 0;
+               int zj = 0;
+               for(int j=0; j< x_shp1_usable; j++){
+                 dtype_%(x)s * __restrict__ xp = ((dtype_%(x)s*)(PyArray_GETPTR4(%(x)s,b,k,i,j)));
+                 dtype_%(gx)s * __restrict__ gxp = ((dtype_%(gx)s*)(PyArray_GETPTR4(%(gx)s,b,k,i,j)));
+                 dtype_%(z)s * __restrict__ zp = ((dtype_%(z)s*)(PyArray_GETPTR4(%(z)s,b,k,zi,zj)));
+                 dtype_%(gz)s * __restrict__ gzp = ((dtype_%(gz)s*)(PyArray_GETPTR4(%(gz)s,b,k,zi,zj)));
+                 gxp[0] = (zp[0] == xp[0]) ? gzp[0] : 0;
+                 mini_j = (mini_j + 1 == %(ds1)s) ? 0 : mini_j+1;
+                 zj += (mini_j == 0);
+              }//for j
+              mini_i = (mini_i + 1 == %(ds0)s) ? 0 : mini_i+1;
+              zi += (mini_i == 0);
+
+              for (int j = x_shp1_usable; j < %(x)s->dimensions[3]; ++j) {
+                dtype_%(gx)s * gxp = ((dtype_%(gx)s*)(PyArray_GETPTR4(%(gx)s,b,k,i,j)));
+                gxp[0] = 0;
+              }
+            }//for i
+
+            for(int i = x_shp0_usable; i < %(x)s->dimensions[2]; i++){
+                for (int j = 0; j < %(x)s->dimensions[3]; ++j) {
+                    dtype_%(gx)s * gxp = ((dtype_%(gx)s*)(PyArray_GETPTR4(%(gx)s,b,k,i,j)));
+                    gxp[0] = 0;
+                }
+            }
+          }//for k
+        }//for b
+        """ %locals()
+
+    def c_code_cache_version(self):
+        return ()
--- a/theano/tensor/signal/tests/__init__.py
+++ b/theano/tensor/signal/tests/__init__.py
--- a/theano/tensor/signal/tests/speed_test_conv.py
+++ b/theano/tensor/signal/tests/speed_test_conv.py
+import sys, time, unittest
+
+import numpy
+import numpy as N
+
+from theano.tests import unittest_tools as utt
+
+from theano import function, Mode
+import theano.tensor as T
+from theano.tensor.signal.conv import ConvOp
+
+def flip(kern, kshp):
+    "Flip the kernel, because scipy.signal.convolve2d flips it when convolving."
+    flip = N.zeros(kern.shape)
+    if len(kern.shape)==2:
+        kern=kern.reshape(-1)
+        it = reversed(kern)
+        for i in range(kshp[0]):
+            for j in range(kshp[1]):
+                flip[i,j] = it.next()
+    elif len(kern.shape)==3:
+        kern=kern.reshape(kern.shape[0],-1)
+        for k in range(kern.shape[0]):
+            it = reversed(kern[k,:])
+            for i in range(kshp[0]):
+                for j in range(kshp[1]):
+                    flip[k,i,j] = it.next()
+    elif len(kern.shape)==4:
+        kern=kern.reshape(kern.shape[0],kern.shape[1],-1)
+        for k in range(kern.shape[0]):
+            for m in range(kern.shape[1]):
+                it = reversed(kern[k,m,:])
+                for i in range(kshp[0]):
+                    for j in range(kshp[1]):
+                        flip[k,m,i,j] = it.next()
+    else:
+        raise NotImplementedError()
+    
+    return flip
+
+global_rng = N.random.RandomState(3423489)
+
+dmatrix4=T.TensorType('float64', (False, False, False, False))
+
+def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, 
+        unroll_batch=0, unroll_kern=0, img=T.dmatrix(), validate=True, 
+        conv_op_py=False, do_print=True, repeat=1, 
+        unroll_patch=False, unroll_patch_size=False, verbose=0):
+
+        # build actual input images
+        imgval = global_rng.rand(bsize, imshp[0], imshp[1], imshp[2])
+
+        a=T.dmatrix()
+        kerns = [a for i in nkerns]
+        inputs4=dmatrix4()
+        kerns4=dmatrix4()
+
+        # for each layer
+        ntot=0 
+        tctot=0
+        tpytot=0
+
+        for kshp, kern, nkern, n_layer in zip(kshps, kerns, nkerns, range(len(nkerns))):
+            if do_print:
+                print '************* layer %i ***************' % n_layer
+                
+                print conv_mode, ss, n_layer, kshp, nkern
+
+            # actual values
+            w = global_rng.random_sample(N.r_[nkern,imshp[0],kshp])
+            w_flip = flip(w,kshp).reshape(w.shape)
+
+            ## manual implementation
+            # check first stage
+            padimg = imgval
+            if conv_mode == 'full':
+                padimg_shp = N.array(imshp[1:]) + 2*(N.array(kshp) - N.array([1,1]))
+                padimg = N.zeros(N.r_[bsize,imshp[0],padimg_shp])
+                padimg[:, :, kshp[0]-1:-kshp[0]+1, 
+                             kshp[1]-1:-kshp[1]+1] = imgval
+
+            outshp = N.hstack((nkern, ConvOp.getOutputShape(imshp, kshp, ss, conv_mode)))
+
+            time1 = time.time()
+            outval = N.zeros(N.r_[bsize,outshp])
+            if validate:
+                # causes an atexit problem
+                from scipy.signal.sigtools import _convolve2d
+                from scipy.signal.signaltools import  _valfrommode, _bvalfromboundary
+                val = _valfrommode(conv_mode)
+                bval = _bvalfromboundary('fill')
+                for b in range(bsize): # loop over batches
+                    for n in range(nkern): # loop over filters
+                        for i in range(imshp[0]): # loop over input feature maps
+                            outval[b,n,...] +=  _convolve2d(\
+                                imgval[b,i,...], w_flip[n,i,...],1,val, bval, 0)[0::ss[0],0::ss[1]]
+                ntot += time.time() - time1
+
+            # ConvOp
+            if unroll_patch and not unroll_patch_size:
+                conv_op = ConvOp(dx=ss[0],dy=ss[1], output_mode=conv_mode,
+                                 unroll_patch=unroll_patch, verbose=verbose)(inputs4, kerns4)
+            else:
+                conv_op = ConvOp(imshp, kshp, nkern, bsize, ss[0],ss[1], conv_mode,
+                                 unroll_batch=unroll_batch, unroll_kern=unroll_kern, unroll_patch=unroll_patch, verbose=verbose)(inputs4, kerns4)
+            l1shp=N.hstack((nkern,
+                            ConvOp.getOutputShape(imshp, kshp, ss, conv_mode)))
+            propup2 = function([inputs4, kerns4], conv_op)
+            propup3 = function([inputs4, kerns4], conv_op, mode=Mode(linker="py"))
+
+            time1 = time.time()
+            for i in range(repeat):
+                hidval2_ = propup2(imgval,w_flip)
+            hidval2 = hidval2_#[:,:,0::ss[0],0::ss[1]]
+            tctot += time.time() - time1
+
+            if conv_op_py:
+                time1 = time.time()
+                for i in range(repeat):
+                    hidval3_ = propup3(imgval,w_flip)
+                hidval3 = hidval3_#[:,:,0::ss[0],0::ss[1]]
+                tpytot += time.time() - time1
+                assert (N.abs(hidval2-hidval3)<1e-5).all()
+            else:
+                tpytot += 0
+
+            if validate:
+                temp = N.abs(outval - hidval2)
+                assert (temp < 1e-5).all()
+            if validate and conv_op_py:
+                temp = N.abs(outval - hidval3)
+                assert (temp < 1e-5).all()
+
+            imshp = tuple(outshp)
+            imgval = outval.reshape(bsize,outshp[0],outshp[1],outshp[2])
+
+        return tctot, tpytot, ntot
+
+
+def speed_multilayer_conv():
+        # calculate the speedup of different combinations of unroll
+        # set the parameters below to the values you want to try.
+        
+        validate=False  # skip validating the result so the timing run is much faster
+        verbose=1
+        unroll_batch = [1,2,4,5,10,20]
+        unroll_kern = [1,2,4,5,10,20]
+        unroll_batch = [1,4,5]
+        unroll_kern = [1,4,5]
+        unroll_patch = [True, False]
+        
+        bsize = 20 # batch size
+        imshp_start = (1,48,48)  # image shape: (stack size, rows, columns)
+        kshps = ([11,12],[12,11])  # non-square shapes to test more corner cases
+        nkerns = [20,20] # per output pixel
+        ssizes = [(1,1),]#(1,1)]#(2,2) bugged
+        convmodes = ['valid','full']
+        do_convolve2=False
+        a=T.dmatrix()
+        kerns = [a for i in nkerns]
+
+        assert len(kshps)==len(nkerns)==len(kerns)
+    
+        timing = N.zeros((len(unroll_batch),len(unroll_kern),3,len(convmodes)*len(ssizes)))
+        t_b_k=[]
+        #calculate the timing with unrolling
+
+        print 'time unroll batch kern'
+        t_=[[ 7.60572791,  3.95069814,  3.74271464], [ 4.05631089,  2.90384555,  2.93613672], [ 3.90551591,  2.92595196,  3.00102282]]
+        best=[0.52690219879150391, 2.4266397953033447]
+        worst=[0.92042708396911621, 6.8822150230407715]
+        best=[]
+        worst=[]
+        t_=[]
+        for unroll_b, n_b in zip(unroll_batch,range(len(unroll_batch))):
+            for unroll_k, n_k in zip(unroll_kern,range(len(unroll_kern))):
+                t_b_k.append(str(unroll_b)+"/"+str(unroll_k))
+                if not t_:
+                    tctot, tpytot, ntot=[],[],[]
+                    for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
+                        for ss, n_ss in zip(ssizes,range(len(ssizes))):
+                            tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=unroll_b, unroll_kern=unroll_k, validate=validate, verbose=verbose,do_print=False)
+                            tctot+=[tctot_]
+                            tpytot+=[tpytot_]
+                            ntot+=[ntot_]
+                    if unroll_b==4 and unroll_k==4:
+                        #print "unroll 4/4",tctot
+                        best=tctot
+                    if unroll_b==1 and unroll_k==1:
+                        #print "unroll 1/1",tctot
+                        worst=tctot
+                    timing[n_b,n_k]=[tctot, tpytot, ntot]#[sum(tctot), sum(tpytot), sum(ntot)]
+        if not t_:
+            t=timing[:,:,0,:]#We select only the c timing.
+        else:
+            t=t_
+        t=N.asarray(t)
+        #calculate the old timing
+        print 'time old version'
+        tctot_=[0.52555489540100098, 6.6634182929992676]
+        tctot,tpytot,ntot=[],[],[]
+        tctot_=[]
+        if not tctot_:
+            for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
+                for ss, n_ss in zip(ssizes,range(len(ssizes))):
+                    tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate, verbose=verbose,do_print=False)
+                    tctot+=[tctot_]
+                    tpytot+=[tpytot_]
+                    ntot+=[ntot_]
+        else: tctot=N.asarray(tctot_)
+        print "old code timing %.3fs"%sum(tctot),tctot
+        best=N.asarray(best)
+        worst=N.asarray(worst)
+        print "timing for unrolled version"
+        print t_b_k
+        print t
+        t_detail=t
+        t = t.sum(axis=2)
+        print "max %.3fs"%t.max(), "max param(batch unroll size/kernel unroll size)", t_b_k[t.argmax()]
+        print "min %.3fs"%t.min(), "min param(batch unroll size/kernel unroll size)", t_b_k[t.argmin()]
+        print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t.min(),sum(tctot)/t.min())
+        print worst/best,tctot/best
+
+        #calculate the timing of unroll_patch
+        print 'time unroll_patch'
+        tctot_patch = []
+        tctot_patch_size = []
+        for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
+            for ss, n_ss in zip(ssizes,range(len(ssizes))):
+                tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=True,verbose=verbose,do_print=False)
+                tctot_patch += [tctot_]
+                tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=True,verbose=verbose,do_print=False,unroll_patch_size=True)
+                tctot_patch_size += [tctot_]
+
+        t_patch=sum(tctot_patch)
+        print "unroll_patch without shape time", tctot_patch
+        print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t_patch,sum(tctot)/t_patch)
+        print best/tctot_patch, worst/tctot_patch
+        t_patch_size=sum(tctot_patch_size)
+        print "unroll_patch with shape time", tctot_patch_size
+        print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t_patch_size,sum(tctot)/t_patch_size)
+        print best/tctot_patch_size, worst/tctot_patch_size
+        
+        return
+
+if __name__ == '__main__':
+    speed_multilayer_conv()
--- a/theano/tensor/signal/tests/test_conv.py
+++ b/theano/tensor/signal/tests/test_conv.py
+import sys, time, unittest
+import numpy
+from scipy import signal
+
+import theano
+import theano.tensor as T
+from theano import function, Mode
+from theano.tests import unittest_tools as utt
+
+from theano.tensor.signal import conv
+
+from theano.tensor.basic import _allclose
+
+class TestConv2D(unittest.TestCase):
+
+    def setUp(self):
+        utt.seed_rng()
+        self.input = T.dtensor4('input')
+        self.filters = T.dtensor4('filters')
+
+    def validate(self, image_shape, filter_shape,
+                 border_mode='valid', subsample=(1,1),
+                 N_image_shape=None, N_filter_shape=None,
+                 input=None, filters=None, 
+                 unroll_batch=0, unroll_kern=0, unroll_patch=True,
+                 verify_grad=True):
+
+        if N_image_shape is None:
+            N_image_shape = image_shape
+        if N_filter_shape is None:
+            N_filter_shape = filter_shape
+    
+        if input is None:
+            input = self.input
+        if filters is None:
+            filters = self.filters
+        
+        ############# THEANO IMPLEMENTATION ############
+        
+        # we create a symbolic function so that verify_grad can work
+        def sym_conv2d(input, filters):
+            # define theano graph and function
+            return conv.conv2d(input, filters, image_shape, filter_shape,
+                          border_mode, subsample, unroll_batch=unroll_batch,
+                          unroll_kern=unroll_kern, unroll_patch=unroll_patch)
+
+        output = sym_conv2d(input, filters)
+        theano_conv = theano.function([input, filters], output)
+          
+        # initialize input and compute result
+        image_data  = numpy.random.random(N_image_shape)
+        filter_data = numpy.random.random(N_filter_shape)
+        theano_output = theano_conv(image_data, filter_data)
+
+        ############# REFERENCE IMPLEMENTATION ############
+        s = 1. if border_mode == 'full' else -1.
+        out_shape2d = numpy.array(N_image_shape[-2:]) +\
+                      s*numpy.array(N_filter_shape[-2:]) - s
+        out_shape2d = numpy.ceil(out_shape2d / numpy.array(subsample)).astype(int)
+        out_shape = (N_image_shape[0],N_filter_shape[0]) + tuple(out_shape2d)
+        ref_output = numpy.zeros(out_shape)
+
+        # loop over output feature maps
+        for k in range(N_filter_shape[0]):
+            # loop over input feature maps
+            for l in range(N_filter_shape[1]):
+
+                filter2d = filter_data[k,l,:,:]
+
+                # loop over mini-batches
+                for b in range(N_image_shape[0]):
+                    image2d = image_data[b,l,:,:]
+                    output2d = signal.convolve2d(image2d, filter2d, border_mode)
+
+                    ref_output[b,k,:,:] +=\
+                       output2d[::subsample[0],::subsample[1]]
+
+        self.failUnless(_allclose(theano_output, ref_output))
+
+        ############# TEST GRADIENT ############
+        if verify_grad:
+            utt.verify_grad(sym_conv2d, [image_data, filter_data])
+
+
+    def test_basic(self):
+        """
+        Tests that basic convolutions work for odd and even dimensions of image and filter
+        shapes, as well as rectangular images and filters.
+        """
+        self.validate((3,2,8,8), (4,2,5,5), 'valid')
+        self.validate((3,2,7,5), (5,2,2,3), 'valid')
+        self.validate((3,2,7,5), (5,2,3,2), 'valid')
+        self.validate((3,2,8,8), (4,2,5,5), 'full')
+        self.validate((3,2,7,5), (5,2,2,3), 'full')
+        # test filter same size as input
+        self.validate((3,2,3,3), (4,2,3,3), 'valid')
+
+    def test_unroll_patch_false(self):
+        """
+        unroll_patch is True by default. Test basic convs with False.
+        """
+        self.validate((3,2,7,5), (5,2,2,3), 'valid', unroll_patch=False)
+        self.validate((3,2,7,5), (5,2,2,3), 'full', unroll_patch=False)
+        self.validate((3,2,3,3), (4,2,3,3), 'valid', unroll_patch=False)
+
+    def test_unroll_special(self):
+        """
+        (unroll_kern, unroll_batch) in (0,1),(1,0) is a special case.
+        """
+        self.validate((6,2,3,3), (3,2,2,2), 'valid', unroll_batch=1)
+
+    def test_unroll_batch(self):
+        """
+        Test mini-batch unrolling for various legal values.
+        """
+        # mini-batch of size 6 is multiple of 2 and 3. Should work.
+        self.validate((6,2,3,3), (3,2,2,2), 'valid', unroll_batch=2, verify_grad=False)
+        self.validate((6,2,3,3), (3,2,2,2), 'valid', unroll_batch=3, verify_grad=False)
+
+    def test_unroll_kern(self):
+        """
+        Test kernel unrolling for various legal values.
+        """
+        # 6 filters is a multiple of 2 and 3. Should work.
+        self.validate((2,3,3,3), (6,3,2,2), 'valid', unroll_kern=2, verify_grad=False)
+        self.validate((2,3,3,3), (6,3,2,2), 'valid', unroll_kern=3, verify_grad=False)
+
+    def test_subsample(self):
+        """
+        Tests convolution where subsampling != (1,1)
+        """
+        self.validate((3,2,7,5), (5,2,2,3), 'valid', subsample=(2,2))
+        self.validate((3,2,7,5), (5,2,2,3), 'full', subsample=(2,2))
+        self.validate((3,2,7,5), (5,2,2,3), 'valid', subsample=(2,1))
+
+    def test_invalid_filter_shape(self):
+        """
+        Tests scenario where filter_shape[1] != input_shape[1]
+        """
+        def f():
+            self.validate((3,2,8,8), (4,3,5,5), 'valid')
+        self.failUnlessRaises(AssertionError, f)
+
+    def test_missing_info(self):
+        """
+        Test convolutions for various pieces of missing info.
+        """
+        self.validate(None, None, 
+                      N_image_shape=(3,2,8,8), 
+                      N_filter_shape=(4,2,5,5))
+        self.validate((3,2,None,None), None,
+                      N_image_shape=(3,2,8,8), 
+                      N_filter_shape=(4,2,5,5))
+        self.validate((None,2,None,None), (None,2,5,5),
+                      N_image_shape=(3,2,8,8), 
+                      N_filter_shape=(4,2,5,5))
+
+    def test_full_mode(self):
+        """
+        Tests basic convolution in full mode and case where filter 
+        is larger than the input image.
+        """
+        self.validate((3,2,5,5), (4,2,8,8), 'full')
+        def f():
+            self.validate((3,2,5,5), (4,2,8,8), 'valid')
+        self.failUnlessRaises(Exception, f)
+
+    def test_wrong_input(self):
+        """
+        Make sure errors are raised when image and kernel are not 4D tensors
+        """
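+        # Check each case separately. A bare except would also swallow the
+        # AssertionError raised by self.fail(), so the test could never fail.
+        self.failUnlessRaises(Exception, self.validate, (3,2,8,8), (4,2,5,5), 'valid', input = T.dmatrix())
+        self.failUnlessRaises(Exception, self.validate, (3,2,8,8), (4,2,5,5), 'valid', filters = T.dvector())
+        self.failUnlessRaises(Exception, self.validate, (3,2,8,8), (4,2,5,5), 'valid', input = T.dtensor3())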
--- a/theano/tensor/signal/tests/test_downsample.py
+++ b/theano/tensor/signal/tests/test_downsample.py
+import unittest, sys, time
+import numpy
+import theano.tensor as tensor
+from theano.tests import unittest_tools as utt
+from theano.tensor.signal.downsample import DownsampleFactorMax, max_pool2D
+from theano import function, Mode
+
+
+class TestDownsampleFactorMax(unittest.TestCase):
+    def setUp(self):
+        utt.seed_rng()
+
+    @staticmethod
+    def numpy_max_pool2D(input, ds, ignore_border=False):
+        '''Helper function, implementing max_pool2D in pure numpy'''
+        if len(input.shape) < 2:
+            raise NotImplementedError('input should have at least 2 dim, shape is %s'\
+                    % str(input.shape))
+
+        xi=0
+        yi=0
+        if not ignore_border:
+            if input.shape[-2] % ds[0]:
+                xi += 1
+            if input.shape[-1] % ds[1]:
+                yi += 1
+
+        out_shp = list(input.shape[:-2])
+        out_shp.append(input.shape[-2]//ds[0]+xi)
+        out_shp.append(input.shape[-1]//ds[1]+yi)
+
+        output_val = numpy.zeros(out_shp)
+
+        for k in numpy.ndindex(input.shape[:-2]):
+            for i in range(output_val.shape[-2]):
+                ii =  i*ds[0]
+                for j in range(output_val.shape[-1]):
+                    jj = j*ds[1]
+                    patch = input[k][ii:ii+ds[0],jj:jj+ds[1]]
+                    output_val[k][i,j] = numpy.max(patch)
+        return output_val
+
+    def test_DownsampleFactorMax(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        # generate random images
+        maxpoolshps = ((1,1),(2,2),(3,3),(2,3))
+        imval = rng.rand(4,10,64,64)
+        images = tensor.dtensor4()
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+
+                ## Pure Numpy computation
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                f = function([images,],[output,])
+                output_val = f(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                #DownsampleFactorMax op
+                maxpool_op = DownsampleFactorMax(maxpoolshp, ignore_border=ignore_border)(images)
+                f = function([images], maxpool_op)
+                output_val = f(imval)
+                assert (numpy.abs(output_val - numpy_output_val) < 1e-5).all()
+
+    def test_DownsampleFactorMax_grad(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+        maxpoolshps = ((1,1),(3,2),(2,3))
+        imval = rng.rand(2,3,3,4) * 10.0 #more variance means numeric gradient will be more accurate
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                def mp(input):
+                    return DownsampleFactorMax(maxpoolshp, ignore_border=ignore_border)(input)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+    def test_max_pool2D_2D(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        maxpoolshps = ((1,1),(3,2))
+        imval = rng.rand(4,7)
+        images = tensor.dmatrix()
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                output_val = function([images], output)(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                def mp(input):
+                    return max_pool2D(input, maxpoolshp, ignore_border)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+    def test_max_pool2D_3D(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        maxpoolshps = [(1,2)]
+        imval = rng.rand(2,3,4)
+        images = tensor.dtensor3()
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                output_val = function([images], output)(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                c = tensor.sum(output)
+                c_val = function([images], c)(imval)
+
+                g = tensor.grad(c, images)
+                g_val = function([images], [g.shape, tensor.min(tensor.min(tensor.min(g))), tensor.max(tensor.max(tensor.max(g)))])(imval)
+
+                def mp(input):
+                    return max_pool2D(input, maxpoolshp, ignore_border)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+
+    def test_max_pool2D_6D(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        maxpoolshps = [(3,2)]
+        imval = rng.rand(2,1,1,1,3,4)
+        images = tensor.TensorType('float64', [False]*6)()
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                output_val = function([images], output)(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                def mp(input):
+                    return max_pool2D(input, maxpoolshp, ignore_border)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/theano/tensor/tests/test_elemwise.py
+++ b/theano/tensor/tests/test_elemwise.py
@@ -133,6 +133,8 @@ class test_CAReduce(unittest.TestCase):
                           ((5, 6), (1, )),
                           ((5, 6), ()),
                           ((2, 3, 4, 5), (0, 1, 3)),
+                           ((5, 0), (0, )),
+                           ((5, 0), (1, )),
                           ((), ())]:
            x = TensorType('float64', [(entry == 1) for entry in xsh])('x')
            e = CAReduce(add, axis = tosum)(x)
@@ -149,7 +151,7 @@ class test_CAReduce(unittest.TestCase):

    def test_c(self):
        self.with_linker(gof.CLinker())
-        
+

 if __name__ == '__main__':
    unittest.main()
--- a/theano/tests/unittest_tools.py
+++ b/theano/tests/unittest_tools.py
 import unittest
 import numpy
 import theano.tensor as T
-from theano.configparser import config, AddConfigVar, IntParam
+from theano.configparser import config, AddConfigVar, StrParam
 import os, sys

 AddConfigVar('unittests.rseed',
-        "Seed to use for randomized unit tests",
-        IntParam(666))
+        "Seed to use for randomized unit tests. Special value 'random' means using a seed of None.",
+        StrParam(666))

 def fetch_seed(pseed=None):
    """
    Returns the seed to use for running the unit tests.
-    If an explicit seed is given, it will be used for seending numpy's rng.
-    If not, it will try to get a seed from the THEANO_UNITTEST_SEED variable.
-    If THEANO_UNITTEST_SEED is set to "random", it will seed the rng. with None,
+    If an explicit seed is given, it will be used for seeding numpy's rng.
+    If not, it will use config.unittests.rseed (its default value is 666).
+    If config.unittests.rseed is set to "random", it will seed the rng with None,
    which is equivalent to seeding with a random seed.
-    If THEANO_UNITTEST_SEED is not defined, it will use a default seed of 666.

    Useful for seeding RandomState objects.
    >>> rng = numpy.random.RandomState(unittest_tools.fetch_seed())
@@ -35,7 +34,7 @@ def fetch_seed(pseed=None):
        #backport
        #seed = int(seed) if seed else None
    except ValueError:
-        print >> sys.stderr, 'Error: THEANO_UNITTEST_SEED contains '\
+        print >> sys.stderr, 'Error: config.unittests.rseed contains '\
                'invalid seed, using None instead'
        seed = None

@@ -49,7 +48,7 @@ def seed_rng(pseed=None):

    seed = fetch_seed(pseed)
    if pseed and pseed!=seed:
-        print >> sys.stderr, 'Warning: using seed given by THEANO_UNITTEST_SEED=%i'\
+        print >> sys.stderr, 'Warning: using seed given by config.unittests.rseed=%i '\
                'instead of seed %i given as parameter' % (seed, pseed)
    numpy.random.seed(seed)
    return seed