Commit 67d7d461 authored by Frédéric Bastien

Merge pull request #2356 from abergeron/cudnn_r2

Cudnn r2
......@@ -688,14 +688,13 @@ To help with this, Theano defines a class, ``COp``, from which new C ops
can inherit. The class ``COp`` aims to simplify the process of implementing
C ops by doing the following:
* It allows you to define the C implementation of your op in a distinct
C code file. This makes it easier to keep your Python and C code
readable and well indented.
* It can automatically handle all the methods that return C code,
  as well as :meth:`Op.c_code_cache_version()`, based on the
  provided external C implementation.
To illustrate how much simpler the class ``COp`` makes the process of defining
a new op with a C implementation, let's revisit the second example of this
......@@ -740,7 +739,7 @@ C file named vectorTimesVector.c :
.. code-block:: c
#section support_code
// Support code function
bool vector_same_shape(PyArrayObject* arr1, PyArrayObject* arr2)
......@@ -749,7 +748,7 @@ C file named vectorTimesVector.c :
}
#section support_code_apply
// Apply-specific support function
void APPLY_SPECIFIC(vector_elemwise_mult)(
......@@ -822,43 +821,46 @@ this new version of the VectorTimesVector op :
* Parent class: instead of inheriting from the class :class:`Op`,
  VectorTimesVector inherits from the class ``COp``.
* Constructor: in our new op, the ``__init__()`` method has an
  important use: to inform the constructor of the ``COp`` class
  of the location, on the filesystem, of the C implementation of
  this op. To do this, it gives a list of paths to the files
  containing the C code for this op. To auto-generate the
  ``c_code`` method with a function call, you can specify the
  function name as the second parameter. The paths should be
  given relative to the folder where the descendant of the
  ``COp`` class is defined.
* ``make_node()`` : the ``make_node()`` method is absolutely
identical to the one in our old example. Using the ``COp``
class doesn't change anything here.
* External C code : the external C code implements the various
functions associated with the op. Writing this C code
involves a few subtleties which deserve their own respective
sections.
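The constructor behaviour described above can be illustrated without Theano installed. The ``COp`` stand-in below only mirrors the argument handling described in this section (a single path or a list of paths, plus an optional function name); it is a hypothetical sketch, not the real class:

```python
# Minimal stand-in for theano.gof.COp, mirroring only the constructor
# behaviour described above (illustrative sketch, not the real class).
class COp(object):
    def __init__(self, func_files, func_name=None):
        # A single path is promoted to a one-element list.
        if not isinstance(func_files, list):
            func_files = [func_files]
        self.func_files = func_files
        self.func_name = func_name


class VectorTimesVector(COp):
    def __init__(self):
        # The path is interpreted relative to the folder defining
        # this subclass.
        COp.__init__(self, "vectorTimesVector.c",
                     "APPLY_SPECIFIC(vector_times_vector)")


op = VectorTimesVector()
print(op.func_files)   # -> ['vectorTimesVector.c']
```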
Main function
-------------
If you pass a function name to the ``__init__()`` method of the
``COp`` class, it must respect the following constraints:
* It must return an int. The value of that int indicates whether
  the op could perform its task or not. A value of 0 indicates
  success, while any non-zero value will interrupt the execution
  of the Theano function. When returning non-zero, the function
  must set a Python exception indicating the details of the
  problem.
* It must receive one argument for each input to the op followed
  by one pointer to an argument for each output of the op. The
  types of these arguments depend on the Types (that is,
  Theano Types) of your inputs and outputs.
For example, the main C function of an op that takes two TensorTypes
(which have ``PyArrayObject *`` as their C type) as inputs and returns
both their sum and the difference between them would have four
parameters (two for the op's inputs and two for its outputs) and its
signature would look something like this:
......@@ -870,11 +872,21 @@ signature would look something like this :
Macros
------
For certain section tags, your C code can benefit from a number of
pre-defined macros. These section tags have no macros: ``init_code``,
``support_code``. All other tags will have the support macros
discussed below.
* ``APPLY_SPECIFIC(str)`` which will automatically append a name
  unique to the :ref:`Apply` node that applies the Op at the end
  of the provided ``str``. The use of this macro is discussed
  further below.
For every input which has a :attr:`dtype` attribute (this means
Tensors, and equivalent types on GPU), the following macros will be
defined unless your Op class has an :attr:`Op.check_input` attribute
set to False. In these descriptions, 'i' refers to the position
(indexed from 0) in the inputs list.
* ``DTYPE_INPUT_{i}`` : NumPy dtype of the data in the array.
This is the variable type corresponding to the NumPy dtype, not the
......@@ -889,71 +901,87 @@ defined:
* ``TYPENUM_INPUT_{i}`` : Typenum of the data in the array
* ``ITEMSIZE_INPUT_{i}`` : Size, in bytes, of the elements in
the array.
In the same way, the macros ``DTYPE_OUTPUT_{i}``,
``ITEMSIZE_OUTPUT_{i}`` and ``TYPENUM_OUTPUT_{i}`` are defined for
every output 'i' of the op.
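As a rough illustration of the values behind these macros, the snippet below computes, for a hypothetical input position and dtype name, what ``DTYPE_INPUT_{i}``, ``TYPENUM_INPUT_{i}`` and ``ITEMSIZE_INPUT_{i}`` would expand to. This is a sketch of the mapping described above, not Theano's actual implementation:

```python
import numpy


def input_macros(i, dtype_name):
    # Map a NumPy dtype name to the three macro values described above.
    d = numpy.dtype(dtype_name)
    return {
        "DTYPE_INPUT_%d" % i: "npy_" + dtype_name,   # C-level type name
        "TYPENUM_INPUT_%d" % i: d.num,               # NumPy typenum
        "ITEMSIZE_INPUT_%d" % i: d.itemsize,         # element size in bytes
    }


m = input_macros(0, "float32")
print(m["ITEMSIZE_INPUT_0"])   # -> 4
```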
In addition to these macros, the ``init_code_struct``, ``code``, and
``code_cleanup`` section tags also have the following macros:
* ``FAIL`` : Code to insert at error points. A Python exception
  should be set prior to this code. An invocation looks like this:
.. code-block:: c
if (error) {
// Set python exception
FAIL
}
You can add a semicolon after the macro if it makes your editor
happy.
* ``CONTEXT`` : Name of the context variable for this node (only
  for Ops which have a context, which is discussed elsewhere).
Finally, the ``code`` and ``code_cleanup`` tags have macros to
pass the input and output names. These are named ``INPUT_{i}`` and
``OUTPUT_{i}``, where `i` is the 0-based position in the inputs
and outputs lists respectively.
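The generation of these name-passing macros can be sketched as follows. This is a simplified illustration of what the ``COp`` machinery does; the C variable names used here are hypothetical:

```python
import os


def io_macros(inputs, outputs):
    # Build #define/#undef pairs mapping INPUT_i/OUTPUT_i to the
    # C variable names used in the generated code.
    defines, undefs = [], []
    for i, v in enumerate(inputs):
        defines.append("#define INPUT_%d %s" % (i, v))
        undefs.append("#undef INPUT_%d" % i)
    for i, v in enumerate(outputs):
        defines.append("#define OUTPUT_%d %s" % (i, v))
        undefs.append("#undef OUTPUT_%d" % i)
    return os.linesep.join(defines), os.linesep.join(undefs)


d, u = io_macros(["x_V3", "y_V5"], ["z_V1"])
print(d.splitlines()[2])   # -> #define OUTPUT_0 z_V1
```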
Support code
------------
Certain sections are limited in what you can place in them due to
semantic and syntactic restrictions of the C++ language. Most of
these restrictions apply to the tags that end in ``_struct``.
When we defined the VectorTimesVector op without using the ``COp``
class, we had to make a distinction between two types of support
code: the support code that was apply-specific and the support code
that wasn't. The apply-specific code was defined in the
``c_support_code_apply()`` method and the elements defined in that
code (global variables and functions) had to include the name of the
Apply node in their own names to avoid conflicts between the different
versions of the apply-specific code. The code that wasn't
apply-specific was simply defined in the ``c_support_code()`` method.
To make identifiers that include the :ref:`Apply` node name, use the
``APPLY_SPECIFIC(str)`` macro. In the above example, this macro is
used when defining the functions ``vector_elemwise_mult()`` and
``vector_times_vector()`` as well as when calling the function
``vector_elemwise_mult()`` from inside ``vector_times_vector()``.
:note:
The macro ``APPLY_SPECIFIC(str)`` should only ever be used for
apply-specific code. It should not be used for apply-agnostic code.
When using the ``COp`` class, we still have to make the distinction
between the C code for each of the methods of a C class. These
sections of code are separated by ``#section <tag>`` markers. The tag
determines the name of the method this C code applies to, with the
rule that ``<tag>`` applies to ``c_<tag>``. Unknown tags are an error
and will be reported. Duplicate tags will be merged together in the
order they appear in the C files.
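The splitting and merging behaviour described above can be sketched with a few lines of standalone Python. This is a simplified sketch; the real implementation also validates tags against a known set:

```python
import re

# Same marker pattern the parser uses: a "#section <tag>" line.
SECTION_RE = re.compile(r'^#section ([a-zA-Z0-9_]+)$', re.MULTILINE)


def parse_sections(code):
    # Split on "#section <tag>" lines; the capturing group keeps the tags.
    parts = SECTION_RE.split(code)
    if parts[0].strip() != '':
        raise ValueError("stray code before first #section statement")
    sections = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        # Duplicate tags are concatenated in order of appearance.
        sections[tag] = sections.get(tag, '') + body
    return sections


src = """#section support_code
int helper() { return 1; }
#section support_code_apply
int APPLY_SPECIFIC(f)() { return helper(); }
#section support_code
int helper2() { return 2; }
"""
secs = parse_sections(src)
print(sorted(secs))   # -> ['support_code', 'support_code_apply']
```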
The rules for deciding where a piece of code should go can
sometimes be tricky. The key thing to remember is that things that
can be shared between instances of the op should be apply-agnostic
and go into a section which does not end in ``_apply`` or
``_struct``. The distinction between ``_apply`` and ``_struct``
mostly hinges on how you want to manage the lifetime of the object.
Note that to use an apply-specific object, you have to be in an
apply-specific section, so some portions of the code that might seem
apply-agnostic may still be apply-specific because of the data they
use (this does not include arguments).
In the above example, the function ``vector_same_shape()`` is
apply-agnostic because it uses none of the macros defined by the class
``COp`` and it doesn't rely on any apply-specific code. The function
``vector_elemwise_mult()`` is apply-specific because it uses the
macros defined by ``COp``. Finally, the function
``vector_times_vector()`` is apply-specific because it uses those same
macros and also because it calls ``vector_elemwise_mult()``, which is
an apply-specific function.
Final Note
==========
......
......@@ -17,6 +17,7 @@ import logging
import numpy
import os
import sys
import re
import warnings
import theano
......@@ -973,6 +974,32 @@ int main( int argc, const char* argv[] )
compute_map, no_recycling)
def simple_meth(tag):
def f(self):
if tag in self.code_sections:
return self.code_sections[tag]
else:
raise utils.MethodNotDefined(
'c_' + tag, type(self), type(self).__name__)
f.__name__ = 'c_' + tag
return f
def apply_meth(tag):
def f(self, node, name):
if tag in self.code_sections:
code = self.code_sections[tag]
define_macros, undef_macros = self.get_c_macros(node, name)
return os.linesep.join([define_macros, code,
undef_macros])
else:
raise utils.MethodNotDefined(
'c_' + tag, type(self), type(self).__name__)
f.__name__ = 'c_' + tag
return f
class COp(Op):
""" Class to allow an op to have an external C implementation.
......@@ -981,118 +1008,98 @@ class COp(Op):
the C implementation and the name of the function, in that file, to call
to perform the computations for the op.
"""
section_re = re.compile(r'^#section ([a-zA-Z0-9_]+)$', re.MULTILINE)
backward_re = re.compile(r'^THEANO_(APPLY|SUPPORT)_CODE_SECTION$', re.MULTILINE)
# This is the set of allowed markers
SECTIONS = set([
'init_code', 'init_code_apply', 'init_code_struct',
'support_code', 'support_code_apply', 'support_code_struct',
'cleanup_code_struct',
'code', 'code_cleanup'])
@classmethod
def get_path(cls, f):
"""
Convert a path relative to the location of the class file into
an absolute path. Paths that are already absolute are passed
through unchanged.
"""
if not os.path.isabs(f):
class_file = inspect.getfile(cls)
class_dir = os.path.dirname(class_file)
f = os.path.realpath(os.path.join(class_dir, f))
return f
def __init__(self, func_files, func_name=None):
"""
Sections are loaded from files in order with sections in later
files overriding sections in previous files.
"""
if not isinstance(func_files, list):
func_files = [func_files]
self.func_files = [self.get_path(f) for f in func_files]
self.func_name = func_name
self.load_c_code()
if len(self.code_sections) == 0:
raise ValueError("No sections were defined in C files")
if self.func_name is not None:
if 'code' in self.code_sections:
# maybe a warning instead (and clearing the key)
raise ValueError('Cannot have a "code" section and '
'specify the func_name')
if 'code_cleanup' in self.code_sections:
# maybe a warning instead (and clearing the key)
raise ValueError('Cannot have a "code_cleanup" section '
'and specify the func_name')
def load_c_code(self):
self.func_codes = []
for func_file in self.func_files:
with open(func_file, 'r') as f:
self.func_codes.append(f.read())
self.code_sections = dict()
for i, code in enumerate(self.func_codes):
if ('THEANO_APPLY_CODE_SECTION' in code or
'THEANO_SUPPORT_CODE_SECTION' in code):
# This is backward compat code that will go away in a while
split = self.backward_re.split(code)
n = 1
while n < len(split):
if split[n] == 'APPLY':
self.code_sections['support_code_apply'] = split[n+1]
elif split[n] == 'SUPPORT':
self.code_sections['support_code'] = split[n+1]
n += 2
continue
split = self.section_re.split(code)
if split[0].strip() != '':
raise ValueError('Stray code before first #section '
'statement (in file %s): %s' %
(self.func_files[i], split[0]))
n = 1
while n < len(split):
if split[n] not in self.SECTIONS:
raise ValueError("Unknown section type (in file %s): %s" %
(self.func_files[i], split[n]))
if split[n] not in self.code_sections:
self.code_sections[split[n]] = ""
self.code_sections[split[n]] += split[n+1]
n += 2
def c_code_cache_version(self):
return hash(tuple(self.func_codes))
c_init_code = simple_meth('init_code')
c_init_code_apply = apply_meth('init_code_apply')
c_support_code = simple_meth('support_code')
c_support_code_apply = apply_meth('support_code_apply')
c_support_code_struct = apply_meth('support_code_struct')
c_cleanup_code_struct = apply_meth('cleanup_code_struct')
def format_c_function_args(self, inp, out):
# Generate an string containing the arguments sent to the external C
......@@ -1100,73 +1107,149 @@ class COp(Op):
# "input0, input1, input2, &output0, &output1"
return ", ".join(list(inp) + ["&%s" % o for o in out])
def get_c_macros(self, node, name, check_input=None):
    define_template = "#define %s %s"
    undef_template = "#undef %s"
    define_macros = []
    undef_macros = []

    if check_input is None:
        check_input = getattr(self, 'check_input', True)

    if check_input:
        # Extract the various properties of the input and output variables
        variables = node.inputs + node.outputs
        variable_names = (["INPUT_%i" % i for i in range(len(node.inputs))] +
                          ["OUTPUT_%i" % i for i in range(len(node.outputs))])

        # Generate dtype, typenum and itemsize macros
        for i, v in enumerate(variables):
            if not hasattr(v, 'dtype'):
                continue
            vname = variable_names[i]

            macro_name = "DTYPE_" + vname
            macro_value = "npy_" + v.dtype

            define_macros.append(define_template % (macro_name, macro_value))
            undef_macros.append(undef_template % macro_name)

            d = numpy.dtype(v.dtype)

            macro_name = "TYPENUM_" + vname
            macro_value = d.num

            define_macros.append(define_template % (macro_name, macro_value))
            undef_macros.append(undef_template % macro_name)

            macro_name = "ITEMSIZE_" + vname
            macro_value = d.itemsize

            define_macros.append(define_template % (macro_name, macro_value))
            undef_macros.append(undef_template % macro_name)

    # Generate a macro to mark code as being apply-specific
    define_macros.append(define_template % ("APPLY_SPECIFIC(str)",
                                            "str##_%s" % name))
    undef_macros.append(undef_template % "APPLY_SPECIFIC")

    return os.linesep.join(define_macros), os.linesep.join(undef_macros)
def _lquote_macro(self, txt):
res = []
spl = txt.split('\n')
for l in spl[:-1]:
res.append(l + ' \\')
res.append(spl[-1])
return os.linesep.join(res)
def get_sub_macros(self, sub):
define_macros = []
undef_macros = []
define_macros.append("#define FAIL %s" %
(self._lquote_macro(sub['fail']),))
undef_macros.append("#undef FAIL")
if 'context' in sub:
define_macros.append("#define CONTEXT %s" % (sub['context'],))
undef_macros.append("#undef CONTEXT")
return os.linesep.join(define_macros), os.linesep.join(undef_macros)
def get_io_macros(self, inputs, outputs):
    define_macros = []
    undef_macros = []

    for i, inp in enumerate(inputs):
        define_macros.append("#define INPUT_%d %s" % (i, inp))
        undef_macros.append("#undef INPUT_%d" % (i,))

    for i, out in enumerate(outputs):
        define_macros.append("#define OUTPUT_%d %s" % (i, out))
        undef_macros.append("#undef OUTPUT_%d" % (i,))

    return os.linesep.join(define_macros), os.linesep.join(undef_macros)
def c_init_code_struct(self, node, name, sub):
if 'init_code_struct' in self.code_sections:
op_code = self.code_sections['init_code_struct']
def_macros, undef_macros = self.get_c_macros(node, name)
def_sub, undef_sub = self.get_sub_macros(sub)
return os.linesep.join([def_macros, def_sub,
op_code,
undef_sub, undef_macros])
else:
raise utils.MethodNotDefined(
'c_init_code_struct', type(self), type(self).__name__)
def c_code(self, node, name, inp, out, sub):
    if self.func_name is not None:
        assert 'code' not in self.code_sections

        define_macros, undef_macros = self.get_c_macros(node, name,
                                                        check_input=False)

        # Generate the C code
        return """
%(define_macros)s
{
  if (%(func_name)s(%(func_args)s) != 0) {
    %(fail)s
  }
}
%(undef_macros)s
""" % dict(func_name=self.func_name, fail=sub['fail'],
           func_args=self.format_c_function_args(inp, out),
           define_macros=define_macros, undef_macros=undef_macros)
    else:
        if 'code' in self.code_sections:
            op_code = self.code_sections['code']

            def_macros, undef_macros = self.get_c_macros(node, name)
            def_sub, undef_sub = self.get_sub_macros(sub)
            def_io, undef_io = self.get_io_macros(inp, out)

            return os.linesep.join([def_macros, def_sub, def_io,
                                    op_code,
                                    undef_io, undef_sub, undef_macros])
        else:
            raise utils.MethodNotDefined(
                'c_code', type(self), type(self).__name__)
def c_code_cleanup(self, node, name, inputs, outputs, sub):
if 'code_cleanup' in self.code_sections:
op_code = self.code_sections['code_cleanup']
def_macros, undef_macros = self.get_c_macros(node, name)
def_sub, undef_sub = self.get_sub_macros(sub)
def_io, undef_io = self.get_io_macros(inputs, outputs)
return os.linesep.join([def_macros, def_sub, def_io,
op_code,
undef_io, undef_sub, undef_macros])
else:
raise utils.MethodNotDefined(
'c_code_cleanup', type(self), type(self).__name__)
......@@ -3,6 +3,12 @@
#include <cudnn.h>
#ifndef CUDNN_VERSION
#include <assert.h>
// Here we define the R2 API in terms of functions in the R1 interface
// This is only for what we use
static inline const char *cudnnGetErrorString(cudnnStatus_t err) {
switch (err) {
case CUDNN_STATUS_SUCCESS:
......@@ -28,4 +34,118 @@ static inline const char *cudnnGetErrorString(cudnnStatus_t err) {
}
}
// some macros to help support cudnn R1 while using R2 code.
#define cudnnCreateTensorDescriptor cudnnCreateTensor4dDescriptor
#define cudnnDestroyTensorDescriptor cudnnDestroyTensor4dDescriptor
#define cudnnSetFilter4dDescriptor cudnnSetFilterDescriptor
typedef cudnnTensor4dDescriptor_t cudnnTensorDescriptor_t;
static inline cudnnStatus_t
cudnnGetConvolution2dForwardOutputDim(
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t inputTensorDesc,
const cudnnFilterDescriptor_t filterDesc,
int *n,
int *c,
int *h,
int *w) {
return cudnnGetOutputTensor4dDim(convDesc, CUDNN_CONVOLUTION_FWD,
n, c, h, w);
}
typedef int cudnnConvolutionFwdAlgo_t;
typedef int cudnnConvolutionFwdPreference_t;
#define CUDNN_CONVOLUTION_FWD_NO_WORKSPACE 0
static inline cudnnStatus_t
cudnnGetConvolutionForwardAlgorithm(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t srcDesc,
const cudnnFilterDescriptor_t filterDesc,
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t destDesc,
cudnnConvolutionFwdPreference_t preference,
size_t memoryLimitInbytes,
cudnnConvolutionFwdAlgo_t *algo) {
*algo = 0;
return CUDNN_STATUS_SUCCESS;
}
static inline cudnnStatus_t
cudnnConvolutionForward_v2(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t srcDesc,
const void *srcData,
const cudnnFilterDescriptor_t filterDesc,
const void *filterData,
const cudnnConvolutionDescriptor_t convDesc,
cudnnConvolutionFwdAlgo_t algo,
void *workSpace,
size_t workSpaceSizeInBytes,
const void *beta,
const cudnnTensorDescriptor_t destDesc,
void *destData) {
assert(*(float *)alpha == 1.0);
assert(*(float *)beta == 0.0);
return cudnnConvolutionForward(handle, srcDesc, srcData,
filterDesc, filterData,
convDesc, destDesc, destData,
CUDNN_RESULT_NO_ACCUMULATE);
}
#define cudnnConvolutionForward cudnnConvolutionForward_v2
static inline cudnnStatus_t
cudnnConvolutionBackwardFilter_v2(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t srcDesc,
const void *srcData,
const cudnnTensorDescriptor_t diffDesc,
const void *diffData,
const cudnnConvolutionDescriptor_t convDesc,
const void *beta,
const cudnnFilterDescriptor_t gradDesc,
void *gradData) {
assert(*(float *)alpha == 1.0);
assert(*(float *)beta == 0.0);
return cudnnConvolutionBackwardFilter(handle, srcDesc, srcData,
diffDesc, diffData,
convDesc, gradDesc, gradData,
CUDNN_RESULT_NO_ACCUMULATE);
}
#define cudnnConvolutionBackwardFilter cudnnConvolutionBackwardFilter_v2
static inline cudnnStatus_t
cudnnConvolutionBackwardData_v2(
cudnnHandle_t handle,
const void *alpha,
const cudnnFilterDescriptor_t filterDesc,
const void *filterData,
const cudnnTensorDescriptor_t diffDesc,
const void *diffData,
const cudnnConvolutionDescriptor_t convDesc,
const void *beta,
const cudnnTensorDescriptor_t gradDesc,
void *gradData) {
assert(*(float *)alpha == 1.0);
assert(*(float *)beta == 0.0);
return cudnnConvolutionBackwardData(handle,
(cudnnFilterDescriptor_t)filterDesc,
filterData,
(cudnnTensorDescriptor_t)diffDesc,
diffData,
(cudnnConvolutionDescriptor_t)convDesc,
(cudnnTensorDescriptor_t)gradDesc,
gradData,
CUDNN_RESULT_NO_ACCUMULATE);
}
#define cudnnConvolutionBackwardData cudnnConvolutionBackwardData_v2
#endif
#endif
......@@ -2,8 +2,10 @@ import os
import theano
from theano import Apply, gof, tensor
from theano.scalar import as_scalar
from theano.gradient import DisconnectedType
from theano.gof import Optimizer, local_optimizer, COp
from theano.gof.type import CDataType, Generic
from theano.compat import PY3
from theano.tensor.nnet import SoftmaxGrad
from theano.sandbox.cuda.type import CudaNdarrayType
......@@ -50,12 +52,18 @@ if ((err = cudnnCreate(&_handle)) != CUDNN_STATUS_SUCCESS) {
try_run=True, output=True)
dnn_available.avail = comp and run
if not dnn_available.avail:
dnn_available.msg = (
"Theano is not able to use cuDNN. We got this error: \n" +
str(err))
else:
v = version()
if isinstance(v, tuple) and v[0] != v[1]:
dnn_available.avail = False
dnn_available.msg = ("Mixed dnn version. The header is"
" from one version, but we link with"
" a different version %s" % str(v))
raise RuntimeError(dnn_available.msg)
return dnn_available.avail
......@@ -77,14 +85,25 @@ def c_set_tensor4d(var, desc, err, fail):
CudaNdarray_HOST_STRIDES(%(var)s)[3]?CudaNdarray_HOST_STRIDES(%(var)s)[3]:1
);
if (%(err)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set tensor4d descriptor: %%s",
cudnnGetErrorString(%(err)s));
PyErr_Format(PyExc_RuntimeError,
"could not set tensor4d descriptor: %%s. "
"shapes=%%d %%d %%d %%d strides=%%d %%d %%d %%d",
cudnnGetErrorString(%(err)s),
CudaNdarray_HOST_DIMS(%(var)s)[0],
CudaNdarray_HOST_DIMS(%(var)s)[1],
CudaNdarray_HOST_DIMS(%(var)s)[2],
CudaNdarray_HOST_DIMS(%(var)s)[3],
CudaNdarray_HOST_STRIDES(%(var)s)[0]?CudaNdarray_HOST_STRIDES(%(var)s)[0]:CudaNdarray_HOST_DIMS(%(var)s)[2]*CudaNdarray_HOST_DIMS(%(var)s)[3]*CudaNdarray_HOST_DIMS(%(var)s)[1],
CudaNdarray_HOST_STRIDES(%(var)s)[1]?CudaNdarray_HOST_STRIDES(%(var)s)[1]:CudaNdarray_HOST_DIMS(%(var)s)[2]*CudaNdarray_HOST_DIMS(%(var)s)[3],
CudaNdarray_HOST_STRIDES(%(var)s)[2]?CudaNdarray_HOST_STRIDES(%(var)s)[2]:CudaNdarray_HOST_DIMS(%(var)s)[3],
CudaNdarray_HOST_STRIDES(%(var)s)[3]?CudaNdarray_HOST_STRIDES(%(var)s)[3]:1
);
%(fail)s
}
""" % dict(var=var, err=err, desc=desc, fail=fail)
class DnnBase(GpuOp):
class DnnBase(GpuOp, COp):
"""
Creates a handle for cudnn and pulls in the cudnn libraries and headers.
"""
......@@ -92,6 +111,9 @@ class DnnBase(GpuOp):
# the input broadcasting pattern.
check_broadcast = False
def __init__(self):
COp.__init__(self, "dnn_base.c")
def c_headers(self):
return ['cudnn.h', 'cudnn_helper.h']
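The string ``"dnn_base.c"`` passed to ``COp.__init__`` points at the external C file shown near the end of this diff, which is split on ``#section`` markers (``support_code``, ``init_code``, and so on) and dispatched to the matching ``c_*`` methods. A minimal sketch of that splitting step, using a hypothetical ``split_sections`` helper rather than Theano's actual loader:

```python
import re

def split_sections(src):
    """Split an external C file into {section_name: code} on '#section'
    markers -- a sketch of how a COp-style loader could route chunks of
    dnn_base.c to c_support_code(), c_init_code(), etc. (hypothetical
    helper, not Theano's real implementation)."""
    sections = {}
    name = None
    for line in src.splitlines():
        m = re.match(r'\s*#section\s+(\w+)\s*$', line)
        if m:
            name = m.group(1)
            sections.setdefault(name, [])
        elif name is not None:
            # Lines before the first marker are ignored, like a preamble.
            sections[name].append(line)
    return {k: '\n'.join(v) for k, v in sections.items()}
```

For example, feeding it a file containing ``#section support_code`` followed by ``#section init_code`` yields one string per section name.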
......@@ -101,11 +123,6 @@ class DnnBase(GpuOp):
def c_libraries(self):
return ['cudnn']
def c_support_code(self):
return """
cudnnHandle_t _handle = NULL;
"""
def c_init_code(self):
if PY3:
error_out = "NULL"
......@@ -121,6 +138,52 @@ if ((err = cudnnCreate(&_handle)) != CUDNN_STATUS_SUCCESS) {
}""" % (error_out,)]
class DnnVersion(GpuOp):
def c_compiler(self):
return NVCC_compiler
def c_headers(self):
return ['cudnn.h']
def c_libraries(self):
return ['cudnn']
def make_node(self):
return Apply(self, [], [Generic()()])
def c_code(self, node, name, inputs, outputs, sub):
o = outputs[0]
return """
#if defined(CUDNN_VERSION)
%(o)s = PyTuple_Pack(2, PyInt_FromLong(CUDNN_VERSION), PyInt_FromLong(cudnnGetVersion()));
#else
%(o)s = PyInt_FromLong(-1);
#endif
""" % locals()
def do_constant_folding(self, node):
# Needed as we do not want to cache this information.
return False
def c_code_cache_version(self):
# Not needed, but make it clear that we do not want to cache this.
return None
def version():
"""
Return the current cuDNN version we compile with.
This returns both the header version and the version of the library we link with, so a mismatch between the two can be detected.
"""
if version.v is None:
f = theano.function([], DnnVersion()(),
theano.Mode(optimizer=None))
version.v = f()
return version.v
version.v = None
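``version()`` memoizes its result on a function attribute so the compiled check runs only once per process. The same pattern in isolation, with a stand-in ``compute`` callable instead of the compiled ``DnnVersion`` op:

```python
def version(compute):
    """Sketch of the caching convention used by version() above: the
    result is stored on a function attribute so the (expensive) compiled
    check runs only once."""
    if version.v is None:
        version.v = compute()
    return version.v
version.v = None
```

A second call returns the cached tuple without invoking ``compute`` again.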
class GpuDnnConvDesc(GpuOp):
"""This Op builds a convolution descriptor for use in the other
convolution operations.
......@@ -216,6 +279,15 @@ class GpuDnnConvDesc(GpuOp):
PyErr_SetString(PyExc_ValueError, "bad border mode");
%(fail)s
}
#if defined(CUDNN_VERSION) && CUDNN_VERSION >= 20
err = cudnnSetConvolution2dDescriptor(
%(desc)s,
pad_h%(name)s,
pad_w%(name)s,
%(subsx)d, %(subsy)d, 1, 1,
%(conv_flag)s
);
#else
err = cudnnSetConvolutionDescriptorEx(
%(desc)s,
*(npy_int64 *)PyArray_GETPTR1(%(img_shape)s, 0),
......@@ -230,7 +302,7 @@ class GpuDnnConvDesc(GpuOp):
%(subsx)d, %(subsy)d, 1, 1,
%(conv_flag)s
);
#endif
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set op descriptor: %%s",
cudnnGetErrorString(err));
......@@ -243,153 +315,10 @@ class GpuDnnConvDesc(GpuOp):
pad_h_spec=pad_h_spec, pad_w_spec=pad_w_spec)
def c_code_cache_version(self):
return (2,)
class GpuDnnConvBase(DnnBase):
__props__ = ()
def c_support_code_struct(self, node, name):
return """
cudnnTensor4dDescriptor_t input%(name)s;
cudnnTensor4dDescriptor_t output%(name)s;
cudnnFilterDescriptor_t kerns%(name)s;
""" % dict(name=name)
def c_init_code_struct(self, node, name, sub):
return """
cudnnStatus_t err%(name)s;
input%(name)s = NULL;
output%(name)s = NULL;
kerns%(name)s = NULL;
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(inp): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(out): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateFilterDescriptor(&kerns%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate filter descriptor: %%s",
cudnnGetErrorString(err%(name)s));
%(fail)s
}
""" % dict(name=name, fail=sub['fail'])
def c_cleanup_code_struct(self, node, name):
return """
if (input%(name)s != NULL) {cudnnDestroyTensor4dDescriptor(input%(name)s);}
if (output%(name)s != NULL) {cudnnDestroyTensor4dDescriptor(output%(name)s);}
if (kerns%(name)s != NULL) {cudnnDestroyFilterDescriptor(kerns%(name)s);}
""" % dict(name=name)
def c_set_filter(self, var, desc, err, fail):
return """
%(err)s = cudnnSetFilterDescriptor(
%(desc)s, CUDNN_DATA_FLOAT,
CudaNdarray_HOST_DIMS(%(var)s)[0],
CudaNdarray_HOST_DIMS(%(var)s)[1],
CudaNdarray_HOST_DIMS(%(var)s)[2],
CudaNdarray_HOST_DIMS(%(var)s)[3]
);
if (%(err)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set filter descriptor: %%s",
cudnnGetErrorString(%(err)s));
%(fail)s
}
""" % dict(var=var, desc=desc, err=err, fail=fail)
def c_set_tensor4d(self, *arg):
return c_set_tensor4d(*arg)
def c_code(self, node, name, inputs, outputs, sub):
desc = inputs[2]
out, = outputs
checks = []
for v in inputs[:2]:
checks.append("""
if (!CudaNdarray_is_c_contiguous(%s)) {
PyErr_SetString(PyExc_ValueError, "Only contiguous inputs are supported.");
%s
}
""" % (v, sub['fail']))
sets = []
for p, v, d in zip(inputs[:2], self.conv_inputs, self.conv_types[:2]):
sets.append(getattr(self, 'c_set_'+d)(p, v + name,
'err' + name, sub['fail']))
set_out = getattr(self, 'c_set_' + self.conv_types[2])(
out, self.conv_output + name, 'err' + name,
sub['fail'])
return """
cudnnStatus_t err%(name)s;
%(checks)s
%(sets)s
{
int out_dims[4];
err%(name)s = cudnnGetOutputTensor4dDim(
%(desc)s, %(path)s,
&out_dims[0], &out_dims[1],
&out_dims[2], &out_dims[3]
);
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not get output sizes: %%s",
cudnnGetErrorString(err%(name)s));
%(fail)s
}
// workaround for cudnn R1 bug
if (%(path)s == CUDNN_CONVOLUTION_WEIGHT_GRAD &&
(out_dims[0] != CudaNdarray_HOST_DIMS(%(input2)s)[1] ||
out_dims[1] != CudaNdarray_HOST_DIMS(%(input1)s)[1])) {
out_dims[0] = CudaNdarray_HOST_DIMS(%(input2)s)[1];
out_dims[1] = CudaNdarray_HOST_DIMS(%(input1)s)[1];
// This is a horrible hack that is unfortunately necessary
int *dd = (int *)%(desc)s;
out_dims[2] = dd[5];
out_dims[3] = dd[6];
}
if (CudaNdarray_prep_output(&%(out)s, 4, out_dims) != 0) {
%(fail)s
}
}
%(set_out)s
err%(name)s = %(method)s(
_handle,
%(input1_desc)s, CudaNdarray_DEV_DATA(%(input1)s),
%(input2_desc)s, CudaNdarray_DEV_DATA(%(input2)s),
%(desc)s,
%(output_desc)s, CudaNdarray_DEV_DATA(%(out)s),
CUDNN_RESULT_NO_ACCUMULATE
);
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "error doing operation: %%s",
cudnnGetErrorString(err%(name)s));
%(fail)s
}
""" % dict(out=out, desc=desc, fail=sub['fail'],
name=name, checks='\n'.join(checks), sets='\n'.join(sets),
set_out=set_out, input1=inputs[0], input2=inputs[1],
input1_desc=self.conv_inputs[0]+name,
input2_desc=self.conv_inputs[1]+name,
output_desc=self.conv_output+name,
method=self.conv_op, path=self.path_flag)
def c_code_cache_version(self):
return (8,)
return (2, version())
class GpuDnnConv(GpuDnnConvBase):
class GpuDnnConv(DnnBase, COp):
"""
The forward convolution.
......@@ -398,11 +327,11 @@ class GpuDnnConv(GpuDnnConvBase):
:param descr: the convolution descriptor
"""
conv_inputs = 'input', 'kerns'
conv_output = 'output'
conv_types = 'tensor4d', 'filter', 'tensor4d'
conv_op = 'cudnnConvolutionForward'
path_flag = 'CUDNN_CONVOLUTION_FWD'
__props__ = ()
def __init__(self):
COp.__init__(self, ["dnn_base.c", "dnn_conv_base.c", "dnn_fwd.c"],
"APPLY_SPECIFIC(conv_fwd)")
def make_node(self, img, kern, desc):
img = as_cuda_ndarray_variable(img)
......@@ -428,8 +357,10 @@ class GpuDnnConv(GpuDnnConvBase):
top = gpu_contiguous(top)
d_img = GpuDnnConvGradI()(kerns, top, desc)
d_kerns = GpuDnnConvGradW()(img, top, desc)
d_img = GpuDnnConvGradI()(kerns, top, desc,
img.shape[2], img.shape[3])
d_kerns = GpuDnnConvGradW()(img, top, desc,
kerns.shape[2], kerns.shape[3])
return d_img, d_kerns, theano.gradient.DisconnectedType()()
......@@ -438,7 +369,7 @@ class GpuDnnConv(GpuDnnConvBase):
return [[1], [1], [0]]
class GpuDnnConvGradW(GpuDnnConvBase):
class GpuDnnConvGradW(DnnBase, COp):
"""
The convolution gradient with respect to the weights.
......@@ -447,29 +378,30 @@ class GpuDnnConvGradW(GpuDnnConvBase):
:param descr: the convolution descriptor
"""
__props__ = ()
conv_inputs = 'input', 'output',
conv_output = 'kerns'
conv_types = 'tensor4d', 'tensor4d', 'filter'
path_flag = 'CUDNN_CONVOLUTION_WEIGHT_GRAD'
conv_op = 'cudnnConvolutionBackwardFilter'
def __init__(self):
COp.__init__(self, ["dnn_base.c", "dnn_conv_base.c", "dnn_gw.c"],
"APPLY_SPECIFIC(conv_gw)")
def grad(self, inp, grads):
img, top, desc = inp
img, top, desc, h, w = inp
kerns, = grads
kerns = gpu_contiguous(kerns)
d_img = GpuDnnConvGradI()(kerns, top, desc)
d_img = GpuDnnConvGradI()(kerns, top, desc,
img.shape[2], img.shape[3])
d_top = GpuDnnConv()(img, kerns, desc)
return d_img, d_top, theano.gradient.DisconnectedType()()
return (d_img, d_top, DisconnectedType()(), DisconnectedType()(),
DisconnectedType()())
def connection_pattern(self, node):
# not connected to desc
return [[1], [1], [0]]
# not connected to desc, h, w
return [[1], [1], [0], [0], [0]]
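``connection_pattern`` returns one row per input, each row holding one 0/1 flag per output; the two new rows of zeros declare the ``h`` and ``w`` shape inputs disconnected from the gradient. A plain-Python sketch (hypothetical helper, not Theano machinery) that validates such a pattern and lists the disconnected inputs:

```python
def check_connection_pattern(pattern, n_inputs, n_outputs):
    # One row per input; each row has one 0/1 flag per output.
    assert len(pattern) == n_inputs
    for row in pattern:
        assert len(row) == n_outputs
        assert all(flag in (0, 1) for flag in row)
    # Indices of inputs disconnected from every output.
    return [i for i, row in enumerate(pattern) if row == [0] * n_outputs]

# GpuDnnConvGradW: inputs (img, topgrad, desc, h, w), one output (kerns).
disconnected = check_connection_pattern([[1], [1], [0], [0], [0]], 5, 1)
```

Here ``disconnected`` picks out ``desc``, ``h`` and ``w``, matching the ``DisconnectedType`` returns in ``grad``.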
def make_node(self, img, topgrad, desc):
def make_node(self, img, topgrad, desc, h, w):
img = as_cuda_ndarray_variable(img)
topgrad = as_cuda_ndarray_variable(topgrad)
if img.type.ndim != 4:
......@@ -481,14 +413,18 @@ class GpuDnnConvGradW(GpuDnnConvBase):
or desc.type.ctype != 'cudnnConvolutionDescriptor_t':
raise TypeError('desc must be cudnnConvolutionDescriptor_t')
h = as_scalar(h)
w = as_scalar(w)
broadcastable = [topgrad.type.broadcastable[1],
img.type.broadcastable[1],
False, False]
return Apply(self, [img, topgrad, desc],
return Apply(self, [img, topgrad, desc, h, w],
[CudaNdarrayType(broadcastable)()])
class GpuDnnConvGradI(GpuDnnConvBase):
class GpuDnnConvGradI(DnnBase, COp):
"""
The convolution gradient with respect to the inputs.
......@@ -497,29 +433,29 @@ class GpuDnnConvGradI(GpuDnnConvBase):
:param descr: the convolution descriptor
"""
__props__ = ()
conv_inputs = 'kerns', 'output',
conv_output = 'input'
conv_types = 'filter', 'tensor4d', 'tensor4d'
path_flag = 'CUDNN_CONVOLUTION_DATA_GRAD'
conv_op = 'cudnnConvolutionBackwardData'
def __init__(self):
COp.__init__(self, ["dnn_base.c", "dnn_conv_base.c", "dnn_gi.c"],
"APPLY_SPECIFIC(conv_gi)")
def grad(self, inp, grads):
kerns, top, desc = inp
kerns, top, desc, h, w = inp
img, = grads
img = gpu_contiguous(img)
d_kerns = GpuDnnConvGradW()(img, top, desc)
d_kerns = GpuDnnConvGradW()(img, top, desc,
kerns.shape[2], kerns.shape[3])
d_top = GpuDnnConv()(img, kerns, desc)
return d_kerns, d_top, theano.gradient.DisconnectedType()()
return (d_kerns, d_top, DisconnectedType()(), DisconnectedType()(),
DisconnectedType()())
def connection_pattern(self, node):
# not connected to desc
return [[1], [1], [0]]
# not connected to desc, h, w
return [[1], [1], [0], [0], [0]]
def make_node(self, kern, topgrad, desc):
def make_node(self, kern, topgrad, desc, h, w):
kern = as_cuda_ndarray_variable(kern)
topgrad = as_cuda_ndarray_variable(topgrad)
if kern.type.ndim != 4:
......@@ -531,10 +467,14 @@ class GpuDnnConvGradI(GpuDnnConvBase):
or desc.type.ctype != 'cudnnConvolutionDescriptor_t':
raise TypeError('desc must be cudnnConvolutionDescriptor_t')
h = as_scalar(h)
w = as_scalar(w)
broadcastable = [topgrad.type.broadcastable[0],
kern.type.broadcastable[1],
False, False]
return Apply(self, [kern, topgrad, desc],
return Apply(self, [kern, topgrad, desc, h, w],
[CudaNdarrayType(broadcastable)()])
......@@ -581,7 +521,7 @@ def dnn_conv(img, kerns, border_mode='valid', subsample=(1, 1),
img.shape[3] - kerns.shape[3] + 1)
desc = GpuDnnConvDesc(border_mode='valid', subsample=(1, 1),
conv_mode='cross')(img.shape, shape)
conv = GpuDnnConvGradW()(img, kerns, desc)
conv = GpuDnnConvGradW()(img, kerns, desc, shape[2], shape[3])
return as_cuda_ndarray_variable(conv.dimshuffle(1, 0, 2, 3))
elif (border_mode == 'full' and subsample == (1, 1) and
......@@ -597,7 +537,7 @@ def dnn_conv(img, kerns, border_mode='valid', subsample=(1, 1),
img.shape[3] + kerns.shape[3] - 1)
desc = GpuDnnConvDesc(border_mode='valid', subsample=(1, 1),
conv_mode=conv_mode)(shape, kerns.shape)
return GpuDnnConvGradI()(kerns, img, desc)
return GpuDnnConvGradI()(kerns, img, desc, shape[2], shape[3])
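This ``full`` border mode is expressed as the data gradient of a valid convolution whose input has the enlarged spatial shape ``i + k - 1``. The shape arithmetic as a standalone sketch:

```python
def full_conv_output_shape(img_shape, kern_shape):
    """Spatial output shape of a full-mode 2D convolution: i + k - 1
    per spatial axis (sketch of the shape computation above)."""
    b, c, ih, iw = img_shape
    nf, _, kh, kw = kern_shape
    return (b, nf, ih + kh - 1, iw + kw - 1)
```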
# Standard case: We use GpuDnnConv with suitable padding.
img = gpu_contiguous(img)
......@@ -664,14 +604,22 @@ class GpuDnnPoolDesc(GpuOp):
"descriptor: %%s", cudnnGetErrorString(err));
%(fail)s
}
#ifndef CUDNN_VERSION
err = cudnnSetPoolingDescriptor(
%(desc)s,
%(mode_flag)s,
%(wsX)d, %(wsY)d,
%(stridex)d, %(stridey)d
);
#else
err = cudnnSetPooling2dDescriptor(
%(desc)s,
%(mode_flag)s,
%(wsX)d, %(wsY)d,
0, 0,
%(stridex)d, %(stridey)d
);
#endif
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set op descriptor: %%s",
cudnnGetErrorString(err));
......@@ -683,7 +631,7 @@ class GpuDnnPoolDesc(GpuOp):
stridey=self.stride[1])
def c_code_cache_version(self):
return (1,)
return (1, version())
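The pooling descriptor fixes the window, padding (v2 only; ``0, 0`` above) and stride, which together determine the output spatial size by the usual formula. A sketch, assuming floor division as cuDNN uses for pooling:

```python
def pool_output_dim(size, window, stride, pad=0):
    # floor((size + 2*pad - window) / stride) + 1
    return (size + 2 * pad - window) // stride + 1
```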
class GpuDnnPool(DnnBase):
......@@ -709,8 +657,8 @@ class GpuDnnPool(DnnBase):
def c_support_code_struct(self, node, name):
return """
cudnnTensor4dDescriptor_t input%(name)s;
cudnnTensor4dDescriptor_t output%(name)s;
cudnnTensorDescriptor_t input%(name)s;
cudnnTensorDescriptor_t output%(name)s;
""" % dict(name=name)
def c_init_code_struct(self, node, name, sub):
......@@ -718,12 +666,12 @@ cudnnTensor4dDescriptor_t output%(name)s;
cudnnStatus_t err%(name)s;
input%(name)s = NULL;
output%(name)s = NULL;
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
if ((err%(name)s = cudnnCreateTensorDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(inp): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
if ((err%(name)s = cudnnCreateTensorDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(out): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
......@@ -732,8 +680,8 @@ if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STAT
def c_cleanup_code_struct(self, node, name):
return """
if (input%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(input%(name)s); }
if (output%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(output%(name)s); }
if (input%(name)s != NULL) { cudnnDestroyTensorDescriptor(input%(name)s); }
if (output%(name)s != NULL) { cudnnDestroyTensorDescriptor(output%(name)s); }
""" % dict(name=name)
def c_code(self, node, name, inputs, outputs, sub):
......@@ -759,9 +707,19 @@ if (!CudaNdarray_is_c_contiguous(%(input)s)) {
%(set_in)s
cudnnPoolingMode_t mode;
int wsX, wsY, strideX, strideY;
err%(name)s = cudnnGetPoolingDescriptor(%(desc)s, &mode, &wsX, &wsY, &strideX, &strideY);
int wsX, wsY, vpad, hpad, strideX, strideY;
#ifndef CUDNN_VERSION
err%(name)s = cudnnGetPoolingDescriptor(
%(desc)s, &mode,
&wsX, &wsY,
&strideX, &strideY);
#else
err%(name)s = cudnnGetPooling2dDescriptor(
%(desc)s, &mode,
&wsX, &wsY,
&vpad, &hpad,
&strideX, &strideY);
#endif
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
......@@ -781,13 +739,27 @@ if (CudaNdarray_prep_output(&%(out)s, 4, %(out)s_dims) != 0)
}
%(set_out)s
#ifndef CUDNN_VERSION
err%(name)s = cudnnPoolingForward(
_handle,
%(desc)s,
%(input_desc)s, CudaNdarray_DEV_DATA(%(input)s),
%(output_desc)s, CudaNdarray_DEV_DATA(%(out)s)
);
#else
{
const float alpha = 1;
const float beta = 0;
err%(name)s = cudnnPoolingForward(
_handle,
%(desc)s,
&alpha,
%(input_desc)s, CudaNdarray_DEV_DATA(%(input)s),
&beta,
%(output_desc)s, CudaNdarray_DEV_DATA(%(out)s)
);
}
#endif
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"GpuDnnPool: error doing cudnnPoolingForward operation: %%s",
......@@ -817,7 +789,7 @@ if (err%(name)s != CUDNN_STATUS_SUCCESS) {
return [[1], [0]]
def c_code_cache_version(self):
return (4,)
return (4, version())
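cuDNN v2 threads ``alpha``/``beta`` scaling factors through its compute calls, blending as ``dest = alpha * op(...) + beta * dest``; with ``alpha = 1`` and ``beta = 0``, as used throughout this diff, the destination is simply overwritten. The convention as a sketch:

```python
def apply_with_blend(op_result, dest, alpha=1.0, beta=0.0):
    # cuDNN v2 convention: dest = alpha * op_result + beta * dest
    return [alpha * r + beta * d for r, d in zip(op_result, dest)]
```

With the default ``alpha=1, beta=0`` the previous contents of ``dest`` are ignored, which is why the v1-to-v2 shims above assert those exact values.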
class GpuDnnPoolGrad(DnnBase):
......@@ -853,10 +825,10 @@ class GpuDnnPoolGrad(DnnBase):
def c_support_code_struct(self, node, name):
return """
cudnnTensor4dDescriptor_t input%(name)s;
cudnnTensor4dDescriptor_t input_grad%(name)s;
cudnnTensor4dDescriptor_t output%(name)s;
cudnnTensor4dDescriptor_t output_grad%(name)s;
cudnnTensorDescriptor_t input%(name)s;
cudnnTensorDescriptor_t input_grad%(name)s;
cudnnTensorDescriptor_t output%(name)s;
cudnnTensorDescriptor_t output_grad%(name)s;
""" % dict(name=name)
def c_init_code_struct(self, node, name, sub):
......@@ -866,25 +838,25 @@ input%(name)s = NULL;
input_grad%(name)s = NULL;
output%(name)s = NULL;
output_grad%(name)s = NULL;
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
if ((err%(name)s = cudnnCreateTensorDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
"GpuDnnPoolGrad: could not allocate tensor4d descriptor "
"(input): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
if ((err%(name)s = cudnnCreateTensorDescriptor(&input_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
"GpuDnnPoolGrad: could not allocate tensor4d descriptor "
"(input_grad): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
if ((err%(name)s = cudnnCreateTensorDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
"GpuDnnPoolGrad: could not allocate tensor4d descriptor "
"(output): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
if ((err%(name)s = cudnnCreateTensorDescriptor(&output_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
"GpuDnnPoolGrad: could not allocate tensor4d descriptor "
"(output_grad): %%s", cudnnGetErrorString(err%(name)s));
......@@ -894,10 +866,10 @@ if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output_grad%(name)s)) != CUDNN
def c_cleanup_code_struct(self, node, name):
return """
if (input%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(input%(name)s); }
if (input_grad%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(input_grad%(name)s); }
if (output%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(output%(name)s); }
if (output_grad%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(output_grad%(name)s); }
if (input%(name)s != NULL) { cudnnDestroyTensorDescriptor(input%(name)s); }
if (input_grad%(name)s != NULL) { cudnnDestroyTensorDescriptor(input_grad%(name)s); }
if (output%(name)s != NULL) { cudnnDestroyTensorDescriptor(output%(name)s); }
if (output_grad%(name)s != NULL) { cudnnDestroyTensorDescriptor(output_grad%(name)s); }
""" % dict(name=name)
def c_code(self, node, name, inputs, outputs, sub):
......@@ -949,7 +921,7 @@ if (CudaNdarray_prep_output(&%(output_grad)s, 4,
}
%(set_out)s
#ifndef CUDNN_VERSION
err%(name)s = cudnnPoolingBackward(
_handle,
%(desc)s,
......@@ -958,6 +930,22 @@ _handle,
%(output_desc)s, CudaNdarray_DEV_DATA(%(output)s),
%(output_grad_desc)s, CudaNdarray_DEV_DATA(%(output_grad)s)
);
#else
{
const float alpha = 1;
const float beta = 0;
err%(name)s = cudnnPoolingBackward(
_handle,
%(desc)s,
&alpha,
%(input_desc)s, CudaNdarray_DEV_DATA(%(input)s),
%(input_grad_desc)s, CudaNdarray_DEV_DATA(%(input_grad)s),
%(output_desc)s, CudaNdarray_DEV_DATA(%(output)s),
&beta,
%(output_grad_desc)s, CudaNdarray_DEV_DATA(%(output_grad)s)
);
}
#endif
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"GpuDnnPoolGrad: error doing operation: %%s",
......@@ -974,7 +962,7 @@ if (err%(name)s != CUDNN_STATUS_SUCCESS) {
output_grad_desc="output_grad"+name)
def c_code_cache_version(self):
return (4,)
return (4, version())
def dnn_pool(img, ws, stride=(1, 1), mode='max'):
......@@ -1015,6 +1003,7 @@ class GpuDnnSoftmaxBase(DnnBase):
def __init__(self, tensor_format, algo, mode):
assert(tensor_format in ('bc01', 'b01c'))
DnnBase.__init__(self)
self.tensor_format = tensor_format
assert(algo in ('fast', 'accurate'))
......@@ -1029,14 +1018,14 @@ class GpuDnnSoftmaxBase(DnnBase):
def _define_tensor4d_desc(self, name, id):
return """
cudnnTensor4dDescriptor_t %(id)s_%(name)s;
cudnnTensorDescriptor_t %(id)s_%(name)s;
""" % dict(name=name, id=id)
def _init_tensor4d_desc(self, name, id, fail):
return """
%(id)s_%(name)s = NULL;
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&%(id)s_%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
if ((err%(name)s = cudnnCreateTensorDescriptor(&%(id)s_%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor descriptor "
": %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
......@@ -1045,7 +1034,7 @@ if ((err%(name)s = cudnnCreateTensor4dDescriptor(&%(id)s_%(name)s)) != CUDNN_STA
def _clean_tensor4d_desc(self, name, id):
return """
if(%(id)s_%(name)s!= NULL)
cudnnDestroyTensor4dDescriptor(%(id)s_%(name)s);
cudnnDestroyTensorDescriptor(%(id)s_%(name)s);
""" % dict(name=name, id=id)
def c_support_code_struct(self, node, name):
......@@ -1102,8 +1091,7 @@ if (%(algo)d == 1)
cudnnSoftmaxMode_t mode%(name)s = CUDNN_SOFTMAX_MODE_CHANNEL;
if (%(mode)d == 1)
mode%(name)s = CUDNN_SOFTMAX_MODE_INSTANCE;
""" % dict(name=name,
tensor_format=tensor_format, mode=mode, algo=algo)
""" % dict(name=name, tensor_format=tensor_format, mode=mode, algo=algo)
# Validate the input and build the input variables.
for input_idx, input_name in enumerate(self.softmax_inputs):
......@@ -1134,7 +1122,7 @@ if (CudaNdarray_prep_output(&%(outs)s, 4, CudaNdarray_HOST_DIMS(%(ins)s)) != 0)
return result
def c_code_cache_version(self):
return (0, 6)
return (0, 6, version())
def method(self):
raise NotImplementedError('GpuDnnSoftmaxBase::method')
......@@ -1150,15 +1138,33 @@ class GpuDnnSoftmax(GpuDnnSoftmaxBase):
def method(self):
return """
#ifndef CUDNN_VERSION
err%(name)s = cudnnSoftmaxForward(
_handle,
algo%(name)s,
mode%(name)s,
softmax_input_%(name)s,
CudaNdarray_DEV_DATA(%(ins)s),
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
#else
{
const float alpha = 1.;
const float beta = 0.;
err%(name)s = cudnnSoftmaxForward(
_handle,
algo%(name)s,
mode%(name)s,
(void*) &alpha,
softmax_input_%(name)s,
CudaNdarray_DEV_DATA(%(ins)s),
(void*) &beta,
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
}
#endif
"""
def grad(self, inp, grads):
......@@ -1184,6 +1190,7 @@ class GpuDnnSoftmaxGrad(GpuDnnSoftmaxBase):
def method(self):
return """
#ifndef CUDNN_VERSION
err%(name)s = cudnnSoftmaxBackward(
_handle,
algo%(name)s,
......@@ -1195,7 +1202,26 @@ err%(name)s = cudnnSoftmaxBackward(
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
"""
#else
{
const float alpha = 1.;
const float beta = 0.;
err%(name)s = cudnnSoftmaxBackward(
_handle,
algo%(name)s,
mode%(name)s,
(void*) &alpha,
%(name1)s_%(name)s,
CudaNdarray_DEV_DATA(%(ins1)s),
%(name0)s_%(name)s,
CudaNdarray_DEV_DATA(%(ins0)s),
(void*) &beta,
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
}
#endif
"""
# Indentation for history
......
#section support_code
static cudnnHandle_t _handle = NULL;
static int
c_set_tensor4d(CudaNdarray *var, cudnnTensorDescriptor_t desc) {
cudnnStatus_t err = cudnnSetTensor4dDescriptorEx(
desc, CUDNN_DATA_FLOAT,
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[0]?CudaNdarray_HOST_STRIDES(var)[0]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3]*CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_STRIDES(var)[1]?CudaNdarray_HOST_STRIDES(var)[1]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[2]?CudaNdarray_HOST_STRIDES(var)[2]:CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[3]?CudaNdarray_HOST_STRIDES(var)[3]:1
);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"Could not set tensor4d descriptor: %s. "
"shapes=%d %d %d %d strides=%d %d %d %d",
cudnnGetErrorString(err),
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[0]?CudaNdarray_HOST_STRIDES(var)[0]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3]*CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_STRIDES(var)[1]?CudaNdarray_HOST_STRIDES(var)[1]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[2]?CudaNdarray_HOST_STRIDES(var)[2]:CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[3]?CudaNdarray_HOST_STRIDES(var)[3]:1
);
return -1;
}
return 0;
}
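``c_set_tensor4d`` substitutes a C-contiguous default whenever a stride is zero (CudaNdarray's marker for a broadcastable dimension). The same fallback, in element units, as a plain-Python sketch:

```python
def fixup_strides(dims, strides):
    """Replace zero strides with C-contiguous defaults (in elements),
    mirroring the `stride ? stride : default` ternaries above."""
    n, c, h, w = dims
    defaults = (c * h * w, h * w, w, 1)
    return tuple(s if s else d for s, d in zip(strides, defaults))
```

Non-zero strides pass through untouched, so only broadcastable axes get the contiguous default.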
static int
c_set_filter(CudaNdarray *var, cudnnFilterDescriptor_t desc) {
if (!CudaNdarray_is_c_contiguous(var)) {
PyErr_SetString(PyExc_ValueError,
"Only contiguous filters (kernels) are supported.");
return -1;
}
cudnnStatus_t err = cudnnSetFilter4dDescriptor(
desc, CUDNN_DATA_FLOAT,
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3]
);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"Could not set filter descriptor: %s."
" dims= %d %d %d %d",
cudnnGetErrorString(err),
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3]);
return -1;
}
return 0;
}
#section init_code
{
cudnnStatus_t err;
if ((err = cudnnCreate(&_handle)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not create cuDNN handle: %s",
cudnnGetErrorString(err));
#if PYTHON_MAJOR_VERSION >= 3
return NULL;
#else
return;
#endif
}
}
#section support_code_struct
cudnnTensorDescriptor_t APPLY_SPECIFIC(input);
cudnnTensorDescriptor_t APPLY_SPECIFIC(output);
cudnnFilterDescriptor_t APPLY_SPECIFIC(kerns);
#section init_code_struct
cudnnStatus_t APPLY_SPECIFIC(err);
APPLY_SPECIFIC(input) = NULL;
APPLY_SPECIFIC(output) = NULL;
APPLY_SPECIFIC(kerns) = NULL;
if ((APPLY_SPECIFIC(err) = cudnnCreateTensorDescriptor(&APPLY_SPECIFIC(input))) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor descriptor "
"(inp): %s", cudnnGetErrorString(APPLY_SPECIFIC(err)));
FAIL;
}
if ((APPLY_SPECIFIC(err) = cudnnCreateTensorDescriptor(&APPLY_SPECIFIC(output))) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor descriptor "
"(out): %s", cudnnGetErrorString(APPLY_SPECIFIC(err)));
FAIL;
}
if ((APPLY_SPECIFIC(err) = cudnnCreateFilterDescriptor(&APPLY_SPECIFIC(kerns))) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate filter descriptor: %s",
cudnnGetErrorString(APPLY_SPECIFIC(err)));
FAIL;
}
#section cleanup_code_struct
if (APPLY_SPECIFIC(input) != NULL)
cudnnDestroyTensorDescriptor(APPLY_SPECIFIC(input));
if (APPLY_SPECIFIC(output) != NULL)
cudnnDestroyTensorDescriptor(APPLY_SPECIFIC(output));
if (APPLY_SPECIFIC(kerns) != NULL)
cudnnDestroyFilterDescriptor(APPLY_SPECIFIC(kerns));
#section support_code_struct
int
APPLY_SPECIFIC(conv_fwd)(CudaNdarray *input, CudaNdarray *kerns,
cudnnConvolutionDescriptor_t desc,
CudaNdarray **output) {
cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
if (c_set_tensor4d(input, APPLY_SPECIFIC(input)) == -1)
return 1;
if (c_set_filter(kerns, APPLY_SPECIFIC(kerns)) == -1)
return 1;
{
int out_dims[4];
err = cudnnGetConvolution2dForwardOutputDim(
desc,
APPLY_SPECIFIC(input),
APPLY_SPECIFIC(kerns),
&out_dims[0], &out_dims[1], &out_dims[2], &out_dims[3]);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"GpuDnnConv: error while computing the output shape: %s",
cudnnGetErrorString(err));
return 1;
}
if (CudaNdarray_prep_output(output, 4, out_dims) != 0) {
return 1;
}
}
if (c_set_tensor4d(*output, APPLY_SPECIFIC(output)) == -1)
return 1;
{
cudnnConvolutionFwdAlgo_t algo;
err = cudnnGetConvolutionForwardAlgorithm(
_handle,
APPLY_SPECIFIC(input),
APPLY_SPECIFIC(kerns),
desc,
APPLY_SPECIFIC(output),
CUDNN_CONVOLUTION_FWD_NO_WORKSPACE, // TODO: add op param
0,
&algo);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"GpuDnnConv: Couldn't select convolution algorithm: %s",
cudnnGetErrorString(err));
return 1;
}
const float alpha = 1;
const float beta = 0;
err = cudnnConvolutionForward(
_handle,
(void *)&alpha,
APPLY_SPECIFIC(input), CudaNdarray_DEV_DATA(input),
APPLY_SPECIFIC(kerns), CudaNdarray_DEV_DATA(kerns),
desc,
algo,
NULL, 0,
(void *)&beta,
APPLY_SPECIFIC(output), CudaNdarray_DEV_DATA(*output));
}
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "GpuDnnConv: error doing operation: %s",
cudnnGetErrorString(err));
return 1;
}
return 0;
}
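``cudnnGetConvolution2dForwardOutputDim`` derives the output shape from the input tensor, filter and convolution descriptor; for a pad/stride descriptor it reduces to the standard formula, sketched here as a hedged reimplementation rather than the cuDNN call itself:

```python
def conv_fwd_output_shape(img_shape, kern_shape, pad=(0, 0), subsample=(1, 1)):
    """Forward convolution output shape: (i + 2*pad - k) // stride + 1
    per spatial axis; batch comes from the image, channels from the filters."""
    b, c, ih, iw = img_shape
    nf, _, kh, kw = kern_shape
    oh = (ih + 2 * pad[0] - kh) // subsample[0] + 1
    ow = (iw + 2 * pad[1] - kw) // subsample[1] + 1
    return (b, nf, oh, ow)
```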
#section support_code_struct
int
APPLY_SPECIFIC(conv_gi)(CudaNdarray *kerns, CudaNdarray *output,
cudnnConvolutionDescriptor_t desc,
int h, int w,
CudaNdarray **input) {
cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
if (c_set_tensor4d(output, APPLY_SPECIFIC(output)) == -1)
return 1;
if (c_set_filter(kerns, APPLY_SPECIFIC(kerns)) == -1)
return 1;
{
int out_dims[4];
out_dims[0] = CudaNdarray_HOST_DIMS(output)[0];
out_dims[1] = CudaNdarray_HOST_DIMS(kerns)[1];
out_dims[2] = h;
out_dims[3] = w;
if (CudaNdarray_prep_output(input, 4, out_dims) != 0) {
return 1;
}
}
if (c_set_tensor4d(*input, APPLY_SPECIFIC(input)) == -1)
return 1;
{
const float alpha = 1;
const float beta = 0;
err = cudnnConvolutionBackwardData(
_handle,
(void *)&alpha,
APPLY_SPECIFIC(kerns), CudaNdarray_DEV_DATA(kerns),
APPLY_SPECIFIC(output), CudaNdarray_DEV_DATA(output),
desc,
(void *)&beta,
APPLY_SPECIFIC(input), CudaNdarray_DEV_DATA(*input));
}
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "GpuDnnConvGradI: error doing operation: %s",
cudnnGetErrorString(err));
return 1;
}
return 0;
}
#section support_code_struct
int
APPLY_SPECIFIC(conv_gw)(CudaNdarray *input, CudaNdarray *output,
cudnnConvolutionDescriptor_t desc,
int h, int w,
CudaNdarray **kerns) {
cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
if (c_set_tensor4d(input, APPLY_SPECIFIC(input)) == -1)
return 1;
if (c_set_tensor4d(output, APPLY_SPECIFIC(output)) == -1)
return 1;
{
int out_dims[4];
out_dims[0] = CudaNdarray_HOST_DIMS(output)[1];
out_dims[1] = CudaNdarray_HOST_DIMS(input)[1];
out_dims[2] = h;
out_dims[3] = w;
if (CudaNdarray_prep_output(kerns, 4, out_dims) != 0) {
return 1;
}
}
if (c_set_filter(*kerns, APPLY_SPECIFIC(kerns)) == -1)
return 1;
{
const float alpha = 1;
const float beta = 0;
err = cudnnConvolutionBackwardFilter(
_handle,
(void *)&alpha,
APPLY_SPECIFIC(input), CudaNdarray_DEV_DATA(input),
APPLY_SPECIFIC(output), CudaNdarray_DEV_DATA(output),
desc,
(void *)&beta,
APPLY_SPECIFIC(kerns), CudaNdarray_DEV_DATA(*kerns));
}
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "GpuDnnConvGradW: error doing operation: %s",
cudnnGetErrorString(err));
return 1;
}
return 0;
}
......@@ -27,7 +27,7 @@ from theano.sandbox import cuda
if cuda.cuda_available == False:
raise SkipTest('Optional package cuda disabled')
from theano.sandbox.cuda.dnn import GpuDnnConv, GpuDnnConvBase, dnn_conv
from theano.sandbox.cuda.dnn import GpuDnnConv, DnnBase, dnn_conv
# Needed as the gpu conv ops don't have a perform implementation.
if theano.config.mode == 'FAST_COMPILE':
......@@ -596,7 +596,7 @@ def test_gemm_valid():
def test_dnn_valid():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
for t in _test_valid(GpuDnnConvBase, mode=theano_mode.including("cudnn")):
for t in _test_valid(DnnBase, mode=theano_mode.including("cudnn")):
yield t
......@@ -710,7 +710,7 @@ def test_gemm_full():
def test_dnn_full():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
for t in _test_full(GpuDnnConvBase, mode=theano_mode.including("cudnn")):
for t in _test_full(DnnBase, mode=theano_mode.including("cudnn")):
yield t
......@@ -762,13 +762,13 @@ def test_gemm_subsample():
def test_dnn_subsample():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
for t in _test_subsample(GpuDnnConvBase, theano_mode.including('cudnn')):
for t in _test_subsample(DnnBase, theano_mode.including('cudnn')):
yield t
class TestConv2DGPU(unittest.TestCase):
conv_ops = (cuda.blas.GpuConv,
cuda.dnn.GpuDnnConvBase,
cuda.dnn.DnnBase,
cuda.blas.BaseGpuCorrMM)
def test_logical_shapes(self):
......
......@@ -192,3 +192,9 @@ def test_dnn_tag():
assert cuda.dnn.dnn_available()
assert any([isinstance(n.op, cuda.dnn.GpuDnnPool)
for n in f.maker.fgraph.toposort()])
def test_version():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
assert isinstance(cuda.dnn.version(), (int, tuple))
......@@ -82,6 +82,7 @@ get_scalar_type.cache = {}
def as_scalar(x, name=None):
from ..tensor import TensorType, scalar_from_tensor
if isinstance(x, gof.Apply):
if len(x.outputs) != 1:
raise ValueError("It is ambiguous which output of a multi-output"
......@@ -89,9 +90,12 @@ def as_scalar(x, name=None):
else:
x = x.outputs[0]
if isinstance(x, Variable):
if not isinstance(x.type, Scalar):
if isinstance(x.type, Scalar):
return x
elif isinstance(x.type, TensorType) and x.ndim == 0:
return scalar_from_tensor(x)
else:
raise TypeError("Variable type field must be a Scalar.", x, x.type)
return x
try:
return constant(x)
except TypeError:
......
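The `as_scalar` change in the hunk above adds a branch so that, besides true scalar variables, 0-d tensor variables are also accepted and converted via `scalar_from_tensor`. A minimal stand-in showing the same dispatch (the `ScalarType`, `TensorType`, and `Variable` classes here are dummies for illustration, not Theano's real classes):

```python
class ScalarType:
    pass

class TensorType:
    def __init__(self, ndim):
        self.ndim = ndim

class Variable:
    def __init__(self, type_):
        self.type = type_

def scalar_from_tensor(x):
    # Stand-in for Theano's conversion op: wrap a 0-d tensor as a scalar.
    return Variable(ScalarType())

def as_scalar(x):
    if isinstance(x, Variable):
        if isinstance(x.type, ScalarType):
            # Already a scalar variable: pass it through unchanged.
            return x
        elif isinstance(x.type, TensorType) and x.type.ndim == 0:
            # New branch from the diff: 0-d tensors become scalars.
            return scalar_from_tensor(x)
        else:
            raise TypeError("Variable type field must be a Scalar.", x, x.type)
    return x
```

The point of the rewrite in the diff is that the old code rejected 0-d tensors outright, while the new code converts them, so callers no longer need to do that conversion themselves.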