Commit 67d7d461 authored by Frédéric Bastien

Merge pull request #2356 from abergeron/cudnn_r2

Cudnn r2
...@@ -688,14 +688,13 @@ To help with this, Theano defines a class, ``COp``, from which new C ops
can inherit. The class ``COp`` aims to simplify the process of implementing
C ops by doing the following:

* It allows you to define the C implementation of your op in a distinct
  C code file. This makes it easier to keep your Python and C code
  readable and well indented.

* It can automatically handle all the methods that return C code,
  in addition to :meth:`Op.c_code_cache_version()`, based on the
  provided external C implementation.

To illustrate how much simpler the class ``COp`` makes the process of defining
a new op with a C implementation, let's revisit the second example of this
...@@ -740,7 +739,7 @@ C file named vectorTimesVector.c :

.. code-block:: c

    #section support_code

    // Support code function
    bool vector_same_shape(PyArrayObject* arr1, PyArrayObject* arr2)
...@@ -749,7 +748,7 @@ C file named vectorTimesVector.c :
    }

    #section support_code_apply

    // Apply-specific support function
    void APPLY_SPECIFIC(vector_elemwise_mult)(
...@@ -822,43 +821,46 @@ this new version of the VectorTimesVector op :

* Parent class : instead of inheriting from the class :class:`Op`,
  VectorTimesVector inherits from the class ``COp``.

* Constructor : in our new op, the ``__init__()`` method has an
  important use: to inform the constructor of the ``COp`` class
  of the location, on the filesystem, of the C implementation of
  this op. To do this, it gives a list of paths of the files
  containing the C code for this op. To auto-generate the
  ``c_code()`` method with a function call, you can specify the
  function name as the second parameter. The paths should be
  given as relative paths from the folder where the descendant
  of the ``COp`` class is defined.

* ``make_node()`` : the ``make_node()`` method is absolutely
  identical to the one in our old example. Using the ``COp``
  class doesn't change anything here.

* External C code : the external C code implements the various
  functions associated with the op. Writing this C code
  involves a few subtleties which deserve their own respective
  sections.
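The relative-path resolution described in the constructor bullet can be sketched in plain Python. ``MyVectorOp`` and the file name here are hypothetical; the logic mirrors the ``get_path`` helper that appears later in this diff:

```python
import inspect
import os


class MyVectorOp:
    """Hypothetical stand-in for a ``COp`` subclass; only the
    path-resolution behaviour is sketched here."""

    @classmethod
    def get_path(cls, f):
        # Relative paths are resolved against the directory holding
        # the file that defines the subclass, as the text explains.
        # Absolute paths pass through unchanged.
        if not os.path.isabs(f):
            class_dir = os.path.dirname(inspect.getfile(cls))
            f = os.path.realpath(os.path.join(class_dir, f))
        return f


print(MyVectorOp.get_path("vectorTimesVector.c"))
```

Passing an already-absolute path returns it unchanged, so subclasses may also hard-code full paths.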
Main function
-------------

If you pass a function name to the ``__init__()`` method of the
``COp`` class, it must respect the following constraints:

* It must return an int. The value of that int indicates whether
  the op could perform its task or not. A value of 0 indicates
  success while any non-zero value will interrupt the execution
  of the Theano function. When returning non-zero, the function
  must set a Python exception indicating the details of the
  problem.

* It must receive one argument for each input to the op followed
  by one pointer to an argument for each output of the op. The
  types of the arguments depend on the Types (that is,
  Theano Types) of your inputs and outputs.

For example, the main C function of an op that takes two TensorTypes
(which have ``PyArrayObject *`` as their C type) as inputs and returns
both their sum and the difference between them would have four
parameters (two for the op's inputs and two for its outputs) and its
signature would look something like this :
...@@ -870,11 +872,21 @@ signature would look something like this :
Macros
------

For certain section tags, your C code can benefit from a number of
pre-defined macros. These section tags have no macros: ``init_code``,
``support_code``. All other tags will have the support macros
discussed below.

* ``APPLY_SPECIFIC(str)`` which will automatically append a name
  unique to the :ref:`Apply` node that applies the Op at the end
  of the provided ``str``. The use of this macro is discussed
  further below.

For every input which has a :attr:`dtype` attribute (this means
Tensors, and equivalent types on GPU), the following macros will be
defined unless your Op class has an :attr:`Op.check_input` attribute
defined to False. In these descriptions, 'i' refers to the position
(indexed from 0) in the input array.
* ``DTYPE_INPUT_{i}`` : NumPy dtype of the data in the array.
  This is the variable type corresponding to the NumPy dtype, not the
...@@ -889,71 +901,87 @@ defined:

* ``TYPENUM_INPUT_{i}`` : Typenum of the data in the array

* ``ITEMSIZE_INPUT_{i}`` : Size, in bytes, of the elements in
  the array.

In the same way, the macros ``DTYPE_OUTPUT_{i}``,
``ITEMSIZE_OUTPUT_{i}`` and ``TYPENUM_OUTPUT_{i}`` are defined for
every output 'i' of the op.

In addition to these macros, the ``init_code_struct``, ``code``, and
``code_cleanup`` section tags also have the following macros:
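As a concrete illustration, here is roughly what those macros expand to for a single ``float64`` input. This is a sketch: the macro strings are built from the NumPy dtype exactly as the descriptions above suggest.

```python
import numpy

# Properties of a float64 input, as NumPy reports them.
d = numpy.dtype('float64')

# The three per-variable macros described above, for input 0.
macros = [
    "#define DTYPE_INPUT_0 npy_float64",
    "#define TYPENUM_INPUT_0 %d" % d.num,
    "#define ITEMSIZE_INPUT_0 %d" % d.itemsize,
]

print(macros[-1])  # -> #define ITEMSIZE_INPUT_0 8
```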
* ``FAIL`` : Code to insert at error points. A Python exception
  should be set prior to this code. An invocation looks like this:

  .. code-block:: c

      if (error) {
          // Set python exception
          FAIL
      }

  You can add a semicolon after the macro if it makes your editor
  happy.

* ``CONTEXT`` : Name of the context variable for this node (only
  for Ops which have a context, which is discussed elsewhere).

Finally, the tags ``code`` and ``code_cleanup`` have macros to
pass the input and output names. These are named ``INPUT_{i}`` and
``OUTPUT_{i}`` where `i` is the 0-based index position in the input
and output arrays respectively.
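Because ``FAIL`` may expand to several lines of cleanup code, every line but the last must end with a backslash to remain legal inside a C ``#define``. A minimal Python sketch of that quoting step (mirroring the ``_lquote_macro`` helper in the code further down):

```python
import os


def lquote_macro(txt):
    # Append ' \' to every line except the last so a multi-line
    # failure snippet stays a single C preprocessor definition.
    lines = txt.split('\n')
    return os.linesep.join([l + ' \\' for l in lines[:-1]] + [lines[-1]])


fail_code = '{\nPyErr_SetString(PyExc_RuntimeError, "oops");\nreturn 1;\n}'
print("#define FAIL %s" % lquote_macro(fail_code))
```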
Support code
------------

Certain sections are limited in what you can place in them due to
semantic and syntactic restrictions of the C++ language. Most of
these restrictions apply to the tags that end in ``_struct``.

When we defined the VectorTimesVector op without using the ``COp``
class, we had to make a distinction between two types of support
code: the support code that was apply-specific and the support code
that wasn't. The apply-specific code was defined in the
``c_support_code_apply()`` method and the elements defined in that
code (global variables and functions) had to include the name of the
Apply node in their own names to avoid conflicts between the different
versions of the apply-specific code. The code that wasn't
apply-specific was simply defined in the ``c_support_code()`` method.

To make identifiers that include the :ref:`Apply` node name, use the
``APPLY_SPECIFIC(str)`` macro. In the above example, this macro is
used when defining the functions ``vector_elemwise_mult()`` and
``vector_times_vector()`` as well as when calling the function
``vector_elemwise_mult()`` from inside ``vector_times_vector()``.

When using the ``COp`` class, we still have to make the distinction
between the C code for each of the methods of a C class. These sections
of code are separated by ``#section <tag>`` markers. The tag determines
the name of the method this C code applies to, with the rule that
``<tag>`` applies to ``c_<tag>``. Unknown tags are an error and will be
reported. Duplicate tags will be merged together in the order they
appear in the C files.

The rules for knowing where a piece of code should be put can
sometimes be tricky. The key thing to remember is that things that can
be shared between instances of the op should be apply-agnostic and go
into a section which does not end in ``_apply`` or ``_struct``. The
distinction between ``_apply`` and ``_struct`` mostly hinges on how you
want to manage the lifetime of the object. Note that to use an
apply-specific object, you have to be in an apply-specific section, so
some portions of the code that might seem apply-agnostic may still be
apply-specific because of the data they use (this does not include
arguments).

In the above example, the function ``vector_same_shape()`` is
apply-agnostic because it uses none of the macros defined by the class
``COp`` and it doesn't rely on any apply-specific code. The function
``vector_elemwise_mult()`` is apply-specific because it uses the
macros defined by ``COp``. Finally, the function
``vector_times_vector()`` is apply-specific because it uses those same
macros and also because it calls ``vector_elemwise_mult()`` which is
an apply-specific function.
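The ``#section <tag>`` splitting described above can be reproduced with a short, runnable sketch using the same ``MULTILINE`` regex as the loader (the sample C snippet is hypothetical):

```python
import re

# The pattern the COp loader uses to find section markers.
section_re = re.compile(r'^#section ([a-zA-Z0-9_]+)$', re.MULTILINE)

c_code = """#section support_code

int two(void) { return 2; }

#section support_code_apply

int APPLY_SPECIFIC(f)(void) { return two(); }
"""

split = section_re.split(c_code)
# split[0] holds anything before the first marker (must be empty),
# followed by alternating (tag, body) pairs.
assert split[0].strip() == ''
sections = dict(zip(split[1::2], split[2::2]))
print(sorted(sections))  # -> ['support_code', 'support_code_apply']
```

Stray text before the first marker would land in ``split[0]``, which is exactly the condition the loader turns into a "Stray code before first #section statement" error.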
Final Note
==========
...
...@@ -17,6 +17,7 @@ import logging
import numpy
import os
import sys
import re
import warnings

import theano
...@@ -973,6 +974,32 @@ int main( int argc, const char* argv[] )
                       compute_map, no_recycling)
def simple_meth(tag):
    def f(self):
        if tag in self.code_sections:
            return self.code_sections[tag]
        else:
            raise utils.MethodNotDefined(
                'c_' + tag, type(self), type(self).__name__)
    f.__name__ = 'c_' + tag
    return f


def apply_meth(tag):
    def f(self, node, name):
        if tag in self.code_sections:
            code = self.code_sections[tag]
            define_macros, undef_macros = self.get_c_macros(node, name)
            return os.linesep.join([define_macros, code,
                                    undef_macros])
        else:
            raise utils.MethodNotDefined(
                'c_' + tag, type(self), type(self).__name__)
    f.__name__ = 'c_' + tag
    return f
class COp(Op):
    """ Class to allow an op to have an external C implementation.
...@@ -981,118 +1008,98 @@ class COp(Op):
    the C implementation and the name of the function, in that file, to call
    to perform the computations for the op.
    """
    section_re = re.compile(r'^#section ([a-zA-Z0-9_]+)$', re.MULTILINE)
    backward_re = re.compile(r'^THEANO_(APPLY|SUPPORT)_CODE_SECTION$',
                             re.MULTILINE)

    # This is the set of allowed markers
    SECTIONS = set([
        'init_code', 'init_code_apply', 'init_code_struct',
        'support_code', 'support_code_apply', 'support_code_struct',
        'cleanup_code_struct',
        'code', 'code_cleanup'])
    @classmethod
    def get_path(cls, f):
        """
        Convert a path relative to the location of the class file into
        an absolute path. Paths that are already absolute are passed
        through unchanged.
        """
        if not os.path.isabs(f):
            class_file = inspect.getfile(cls)
            class_dir = os.path.dirname(class_file)
            f = os.path.realpath(os.path.join(class_dir, f))
        return f

    def __init__(self, func_files, func_name=None):
        """
        Sections are loaded from files in order with sections in later
        files overriding sections in previous files.
        """
        if not isinstance(func_files, list):
            func_files = [func_files]

        self.func_files = [self.get_path(f) for f in func_files]
        self.func_name = func_name
        self.load_c_code()

        if len(self.code_sections) == 0:
            raise ValueError("No sections were defined in C files")

        if self.func_name is not None:
            if 'op_code' in self.code_sections:
                # maybe a warning instead (and clearing the key)
                raise ValueError('Cannot have an "op_code" section and '
                                 'specify the func_name')
            if 'op_code_cleanup' in self.code_sections:
                # maybe a warning instead (and clearing the key)
                raise ValueError('Cannot have an "op_code_cleanup" section '
                                 'and specify the func_name')

    def load_c_code(self):
        self.func_codes = []
        for func_file in self.func_files:
            with open(func_file, 'r') as f:
                self.func_codes.append(f.read())

        self.code_sections = dict()
        for i, code in enumerate(self.func_codes):
            if ('THEANO_APPLY_CODE_SECTION' in code or
                    'THEANO_SUPPORT_CODE_SECTION' in code):
                # This is backward compat code that will go away in a while
                split = self.backward_re.split(code)
                n = 1
                while n < len(split):
                    if split[n] == 'APPLY':
                        self.code_sections['support_code_apply'] = split[n+1]
                    elif split[n] == 'SUPPORT':
                        self.code_sections['support_code'] = split[n+1]
                    n += 2
                continue

            split = self.section_re.split(code)
            if split[0].strip() != '':
                raise ValueError('Stray code before first #section '
                                 'statement (in file %s): %s' %
                                 (self.func_files[i], split[0]))
            n = 1
            while n < len(split):
                if split[n] not in self.SECTIONS:
                    raise ValueError("Unknown section type (in file %s): %s" %
                                     (self.func_files[i], split[n]))
                if split[n] not in self.code_sections:
                    self.code_sections[split[n]] = ""
                self.code_sections[split[n]] += split[n+1]
                n += 2
    def c_code_cache_version(self):
        return hash(tuple(self.func_codes))
    c_init_code = simple_meth('init_code')
    c_init_code_apply = apply_meth('init_code_apply')
    c_support_code = simple_meth('support_code')
    c_support_code_apply = apply_meth('support_code_apply')
    c_support_code_struct = apply_meth('support_code_struct')
    c_cleanup_code_struct = apply_meth('cleanup_code_struct')
    def format_c_function_args(self, inp, out):
        # Generate a string containing the arguments sent to the external C
...@@ -1100,73 +1107,149 @@ class COp(Op):
        # "input0, input1, input2, &output0, &output1"
        return ", ".join(list(inp) + ["&%s" % o for o in out])
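A quick runnable check of the argument string this method builds (inputs passed directly, outputs passed by address):

```python
# Standalone copy of format_c_function_args, for illustration only:
# inputs are forwarded as-is, each output gets a '&' prefix.
def format_c_function_args(inp, out):
    return ", ".join(list(inp) + ["&%s" % o for o in out])


print(format_c_function_args(["input0", "input1"], ["output0"]))
# -> input0, input1, &output0
```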
    def get_c_macros(self, node, name, check_input=None):
        define_template = "#define %s %s"
        undef_template = "#undef %s"
        define_macros = []
        undef_macros = []

        if check_input is None:
            check_input = getattr(self, 'check_input', True)

        if check_input:
            # Extract the various properties of the input and output variables
            variables = node.inputs + node.outputs
            variable_names = (["INPUT_%i" % i
                               for i in range(len(node.inputs))] +
                              ["OUTPUT_%i" % i
                               for i in range(len(node.outputs))])

            # Generate dtype macros
            for i, v in enumerate(variables):
                if not hasattr(v, 'dtype'):
                    continue
                vname = variable_names[i]

                macro_name = "DTYPE_" + vname
                macro_value = "npy_" + v.dtype

                define_macros.append(define_template % (macro_name,
                                                        macro_value))
                undef_macros.append(undef_template % macro_name)

                d = numpy.dtype(v.dtype)

                macro_name = "TYPENUM_" + vname
                macro_value = d.num

                define_macros.append(define_template % (macro_name,
                                                        macro_value))
                undef_macros.append(undef_template % macro_name)

                macro_name = "ITEMSIZE_" + vname
                macro_value = d.itemsize

                define_macros.append(define_template % (macro_name,
                                                        macro_value))
                undef_macros.append(undef_template % macro_name)

        # Generate a macro to mark code as being apply-specific
        define_macros.append(define_template % ("APPLY_SPECIFIC(str)",
                                                "str##_%s" % name))
        undef_macros.append(undef_template % "APPLY_SPECIFIC")

        return os.linesep.join(define_macros), os.linesep.join(undef_macros)

    def _lquote_macro(self, txt):
        res = []
        spl = txt.split('\n')
        for l in spl[:-1]:
            res.append(l + ' \\')
        res.append(spl[-1])
        return os.linesep.join(res)

    def get_sub_macros(self, sub):
        define_macros = []
        undef_macros = []
        define_macros.append("#define FAIL %s" %
                             (self._lquote_macro(sub['fail']),))
        undef_macros.append("#undef FAIL")
        if 'context' in sub:
            define_macros.append("#define CONTEXT %s" % (sub['context'],))
            undef_macros.append("#undef CONTEXT")

        return os.linesep.join(define_macros), os.linesep.join(undef_macros)

    def get_io_macros(self, inputs, outputs):
        define_macros = []
        undef_macros = []

        for i, inp in enumerate(inputs):
            define_macros.append("#define INPUT_%d %s" % (i, inp))
            undef_macros.append("#undef INPUT_%d" % (i,))

        for i, out in enumerate(outputs):
            define_macros.append("#define OUTPUT_%d %s" % (i, out))
            undef_macros.append("#undef OUTPUT_%d" % (i,))

        return os.linesep.join(define_macros), os.linesep.join(undef_macros)

    def c_init_code_struct(self, node, name, sub):
        if 'init_code_struct' in self.code_sections:
            op_code = self.code_sections['init_code_struct']

            def_macros, undef_macros = self.get_c_macros(node, name)
            def_sub, undef_sub = self.get_sub_macros(sub)

            return os.linesep.join([def_macros, def_sub,
                                    op_code,
                                    undef_sub, undef_macros])
        else:
            raise utils.MethodNotDefined(
                'c_init_code_struct', type(self), type(self).__name__)
    def c_code(self, node, name, inp, out, sub):
        if self.func_name is not None:
            assert 'code' not in self.code_sections

            define_macros, undef_macros = self.get_c_macros(node, name,
                                                            check_input=False)

            # Generate the C code
            return """
                %(define_macros)s
                {
                  if (%(func_name)s(%(func_args)s) != 0) {
                    %(fail)s
                  }
                }
                %(undef_macros)s
                """ % dict(func_name=self.func_name, fail=sub['fail'],
                           func_args=self.format_c_function_args(inp, out),
                           define_macros=define_macros,
                           undef_macros=undef_macros)
        else:
            if 'code' in self.code_sections:
                op_code = self.code_sections['code']

                def_macros, undef_macros = self.get_c_macros(node, name)
                def_sub, undef_sub = self.get_sub_macros(sub)
                def_io, undef_io = self.get_io_macros(inp, out)

                return os.linesep.join([def_macros, def_sub, def_io,
                                        op_code,
                                        undef_io, undef_sub, undef_macros])
            else:
                raise utils.MethodNotDefined(
                    'c_code', type(self), type(self).__name__)

    def c_code_cleanup(self, node, name, inputs, outputs, sub):
        if 'code_cleanup' in self.code_sections:
            op_code = self.code_sections['code_cleanup']

            def_macros, undef_macros = self.get_c_macros(node, name)
            def_sub, undef_sub = self.get_sub_macros(sub)
            def_io, undef_io = self.get_io_macros(inputs, outputs)

            return os.linesep.join([def_macros, def_sub, def_io,
                                    op_code,
                                    undef_io, undef_sub, undef_macros])
        else:
            raise utils.MethodNotDefined(
                'c_code_cleanup', type(self), type(self).__name__)
...@@ -3,6 +3,12 @@

#include <cudnn.h>

#ifndef CUDNN_VERSION
#include <assert.h>

// Here we define the R2 API in terms of functions in the R1 interface
// This is only for what we use

static inline const char *cudnnGetErrorString(cudnnStatus_t err) {
  switch (err) {
  case CUDNN_STATUS_SUCCESS:
...@@ -28,4 +34,118 @@ static inline const char *cudnnGetErrorString(cudnnStatus_t err) {
  }
}
// some macros to help support cudnn R1 while using R2 code.
#define cudnnCreateTensorDescriptor cudnnCreateTensor4dDescriptor
#define cudnnDestroyTensorDescriptor cudnnDestroyTensor4dDescriptor
#define cudnnSetFilter4dDescriptor cudnnSetFilterDescriptor
typedef cudnnTensor4dDescriptor_t cudnnTensorDescriptor_t;
static inline cudnnStatus_t
cudnnGetConvolution2dForwardOutputDim(
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t inputTensorDesc,
const cudnnFilterDescriptor_t filterDesc,
int *n,
int *c,
int *h,
int *w) {
return cudnnGetOutputTensor4dDim(convDesc, CUDNN_CONVOLUTION_FWD,
n, c, h, w);
}
typedef int cudnnConvolutionFwdAlgo_t;
typedef int cudnnConvolutionFwdPreference_t;
#define CUDNN_CONVOLUTION_FWD_NO_WORKSPACE 0
static inline cudnnStatus_t
cudnnGetConvolutionForwardAlgorithm(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t srcDesc,
const cudnnFilterDescriptor_t filterDesc,
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t destDesc,
cudnnConvolutionFwdPreference_t preference,
size_t memoryLimitInbytes,
cudnnConvolutionFwdAlgo_t *algo) {
*algo = 0;
return CUDNN_STATUS_SUCCESS;
}
static inline cudnnStatus_t
cudnnConvolutionForward_v2(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t srcDesc,
const void *srcData,
const cudnnFilterDescriptor_t filterDesc,
const void *filterData,
const cudnnConvolutionDescriptor_t convDesc,
cudnnConvolutionFwdAlgo_t algo,
void *workSpace,
size_t workSpaceSizeInBytes,
const void *beta,
const cudnnTensorDescriptor_t destDesc,
void *destData) {
assert(*(float *)alpha == 1.0);
assert(*(float *)beta == 0.0);
return cudnnConvolutionForward(handle, srcDesc, srcData,
filterDesc, filterData,
convDesc, destDesc, destData,
CUDNN_RESULT_NO_ACCUMULATE);
}
#define cudnnConvolutionForward cudnnConvolutionForward_v2
static inline cudnnStatus_t
cudnnConvolutionBackwardFilter_v2(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t srcDesc,
const void *srcData,
const cudnnTensorDescriptor_t diffDesc,
const void *diffData,
const cudnnConvolutionDescriptor_t convDesc,
const void *beta,
const cudnnFilterDescriptor_t gradDesc,
void *gradData) {
assert(*(float *)alpha == 1.0);
assert(*(float *)beta == 0.0);
return cudnnConvolutionBackwardFilter(handle, srcDesc, srcData,
diffDesc, diffData,
convDesc, gradDesc, gradData,
CUDNN_RESULT_NO_ACCUMULATE);
}
#define cudnnConvolutionBackwardFilter cudnnConvolutionBackwardFilter_v2
static inline cudnnStatus_t
cudnnConvolutionBackwardData_v2(
cudnnHandle_t handle,
const void *alpha,
const cudnnFilterDescriptor_t filterDesc,
const void *filterData,
const cudnnTensorDescriptor_t diffDesc,
const void *diffData,
const cudnnConvolutionDescriptor_t convDesc,
const void *beta,
const cudnnTensorDescriptor_t gradDesc,
void *gradData) {
assert(*(float *)alpha == 1.0);
assert(*(float *)beta == 0.0);
return cudnnConvolutionBackwardData(handle,
(cudnnFilterDescriptor_t)filterDesc,
filterData,
(cudnnTensorDescriptor_t)diffDesc,
diffData,
(cudnnConvolutionDescriptor_t)convDesc,
(cudnnTensorDescriptor_t)gradDesc,
gradData,
CUDNN_RESULT_NO_ACCUMULATE);
}
#define cudnnConvolutionBackwardData cudnnConvolutionBackwardData_v2
#endif
#endif
@@ -2,8 +2,10 @@ import os
import theano
from theano import Apply, gof, tensor
-from theano.gof import Optimizer, local_optimizer
-from theano.gof.type import CDataType
+from theano.scalar import as_scalar
+from theano.gradient import DisconnectedType
+from theano.gof import Optimizer, local_optimizer, COp
+from theano.gof.type import CDataType, Generic
from theano.compat import PY3
from theano.tensor.nnet import SoftmaxGrad
from theano.sandbox.cuda.type import CudaNdarrayType
@@ -50,12 +52,18 @@ if ((err = cudnnCreate(&_handle)) != CUDNN_STATUS_SUCCESS) {
                              try_run=True, output=True)
        dnn_available.avail = comp and run
-        if dnn_available.avail:
-            dnn_available.msg = "cuDNN should work"
-        else:
+        if not dnn_available.avail:
            dnn_available.msg = (
                "Theano is not able to use cuDNN. We got this error: \n" +
                str(err))
+        else:
+            v = version()
+            if isinstance(v, tuple) and v[0] != v[1]:
+                dnn_available.avail = False
+                dnn_available.msg = ("Mixed dnn version. The header is"
+                                     " from one version, but we link with"
+                                     " a different version %s" % str(v))
+                raise RuntimeError(dnn_available.msg)
    return dnn_available.avail
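The new branch above rejects installs where the compile-time cuDNN header version disagrees with the runtime library version, since `version()` returns a `(header, runtime)` pair under R2 and `-1` under R1. A standalone sketch of that check, with the hypothetical helper name `check_versions`:

```python
# Hedged sketch of the mixed-version check: v is either -1 (cuDNN R1,
# no CUDNN_VERSION macro) or a (compile_time, runtime) pair as produced
# by the DnnVersion op. check_versions is illustrative, not Theano API.

def check_versions(v):
    """Return (ok, message) for a cuDNN version value."""
    if isinstance(v, tuple) and v[0] != v[1]:
        return False, ("Mixed dnn version. The header is from one version, "
                       "but we link with a different version %s" % str(v))
    return True, "cuDNN should work"

# R1: no version info available, accepted
assert check_versions(-1) == (True, "cuDNN should work")
# R2 with matching header and library: accepted
assert check_versions((20, 20))[0] is True
# header/library mismatch: rejected with an explanatory message
ok, msg = check_versions((20, 6025))
assert not ok and "Mixed dnn version" in msg
```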
@@ -77,14 +85,25 @@ def c_set_tensor4d(var, desc, err, fail):
CudaNdarray_HOST_STRIDES(%(var)s)[3]?CudaNdarray_HOST_STRIDES(%(var)s)[3]:1
);
if (%(err)s != CUDNN_STATUS_SUCCESS) {
-  PyErr_Format(PyExc_RuntimeError, "could not set tensor4d descriptor: %%s",
-               cudnnGetErrorString(%(err)s));
+  PyErr_Format(PyExc_RuntimeError,
+               "could not set tensor4d descriptor: %%s"
+               "shapes=%%d %%d %%d %%d strides=%%d %%d %%d %%d",
+               cudnnGetErrorString(%(err)s),
+               CudaNdarray_HOST_DIMS(%(var)s)[0],
+               CudaNdarray_HOST_DIMS(%(var)s)[1],
+               CudaNdarray_HOST_DIMS(%(var)s)[2],
+               CudaNdarray_HOST_DIMS(%(var)s)[3],
+               CudaNdarray_HOST_STRIDES(%(var)s)[0]?CudaNdarray_HOST_STRIDES(%(var)s)[0]:CudaNdarray_HOST_DIMS(%(var)s)[2]*CudaNdarray_HOST_DIMS(%(var)s)[3]*CudaNdarray_HOST_DIMS(%(var)s)[1],
+               CudaNdarray_HOST_STRIDES(%(var)s)[1]?CudaNdarray_HOST_STRIDES(%(var)s)[1]:CudaNdarray_HOST_DIMS(%(var)s)[2]*CudaNdarray_HOST_DIMS(%(var)s)[3],
+               CudaNdarray_HOST_STRIDES(%(var)s)[2]?CudaNdarray_HOST_STRIDES(%(var)s)[2]:CudaNdarray_HOST_DIMS(%(var)s)[3],
+               CudaNdarray_HOST_STRIDES(%(var)s)[3]?CudaNdarray_HOST_STRIDES(%(var)s)[3]:1
+);
%(fail)s
}
""" % dict(var=var, err=err, desc=desc, fail=fail)
-class DnnBase(GpuOp):
+class DnnBase(GpuOp, COp):
    """
    Creates a handle for cudnn and pulls in the cudnn libraries and headers.
    """
@@ -92,6 +111,9 @@ class DnnBase(GpuOp):
    # the input broadcasting pattern.
    check_broadcast = False
def __init__(self):
COp.__init__(self, "dnn_base.c")
    def c_headers(self):
        return ['cudnn.h', 'cudnn_helper.h']
@@ -101,11 +123,6 @@ class DnnBase(GpuOp):
    def c_libraries(self):
        return ['cudnn']
def c_support_code(self):
return """
cudnnHandle_t _handle = NULL;
"""
    def c_init_code(self):
        if PY3:
            error_out = "NULL"
@@ -121,6 +138,52 @@ if ((err = cudnnCreate(&_handle)) != CUDNN_STATUS_SUCCESS) {
}""" % (error_out,)]
class DnnVersion(GpuOp):
def c_compiler(self):
return NVCC_compiler
def c_headers(self):
return ['cudnn.h']
def c_libraries(self):
return ['cudnn']
def make_node(self):
return Apply(self, [], [Generic()()])
def c_code(self, node, name, inputs, outputs, sub):
o = outputs[0]
return """
#if defined(CUDNN_VERSION)
%(o)s = PyTuple_Pack(2, PyInt_FromLong(CUDNN_VERSION), PyInt_FromLong(cudnnGetVersion()));
#else
%(o)s = PyInt_FromLong(-1);
#endif
""" % locals()
def do_constant_folding(self, node):
# Needed as we do not want to cache this information.
return False
def c_code_cache_version(self):
# Not needed, but make it clear that we do not want to cache this.
return None
def version():
    """
    Return the cuDNN version we compile with.

    This only checks the header version, not the library we link with.
    """
if version.v is None:
f = theano.function([], DnnVersion()(),
theano.Mode(optimizer=None))
version.v = f()
return version.v
version.v = None
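The `version()` helper above memoizes its result on a function attribute so the probing Theano function is compiled at most once. A self-contained sketch of the same pattern, where `expensive_probe` is a stand-in for the compile-and-run step and not part of Theano:

```python
# Memoization on a function attribute: the probe runs once, after
# which version() returns the cached value without re-probing.

def expensive_probe():
    # stand-in for compiling and running the DnnVersion op
    expensive_probe.calls += 1
    return (20, 20)
expensive_probe.calls = 0

def version():
    if version.v is None:
        version.v = expensive_probe()
    return version.v
version.v = None

assert version() == (20, 20)
assert version() == (20, 20)
# the probe ran only once despite two calls
assert expensive_probe.calls == 1
```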
class GpuDnnConvDesc(GpuOp):
    """This Op builds a convolution descriptor for use in the other
    convolution operations.
@@ -216,6 +279,15 @@ class GpuDnnConvDesc(GpuOp):
PyErr_SetString(PyExc_ValueError, "bad border mode");
%(fail)s
}
#if defined(CUDNN_VERSION) && CUDNN_VERSION >= 20
err = cudnnSetConvolution2dDescriptor(
%(desc)s,
pad_h%(name)s,
pad_w%(name)s,
%(subsx)d, %(subsy)d, 1, 1,
%(conv_flag)s
);
#else
err = cudnnSetConvolutionDescriptorEx(
%(desc)s,
*(npy_int64 *)PyArray_GETPTR1(%(img_shape)s, 0),
@@ -230,7 +302,7 @@ class GpuDnnConvDesc(GpuOp):
%(subsx)d, %(subsy)d, 1, 1,
%(conv_flag)s
);
#endif
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set op descriptor: %%s",
             cudnnGetErrorString(err));
@@ -243,153 +315,10 @@ class GpuDnnConvDesc(GpuOp):
        pad_h_spec=pad_h_spec, pad_w_spec=pad_w_spec)

    def c_code_cache_version(self):
-        return (2,)
+        return (2, version())
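Adding `version()` to `c_code_cache_version()` makes the compiled-module cache key depend on the cuDNN release, so upgrading cuDNN invalidates cached C code instead of silently reusing stale binaries. A toy sketch of that cache-key idea (`get_compiled` and `cache` are illustrative names, not Theano APIs):

```python
# Version-aware cache key: the same op version with the same cuDNN
# version hits the cache; a new cuDNN version forces a "recompile".
cache = {}

def get_compiled(op_version, dnn_version, source):
    key = (op_version, dnn_version)
    if key not in cache:
        cache[key] = "compiled:" + source  # stand-in for real compilation
    return cache[key]

a = get_compiled(2, (20, 20), "conv_desc")
b = get_compiled(2, (20, 20), "conv_desc")
assert a is b            # same versions reuse the cached module
get_compiled(2, (30, 30), "conv_desc")
assert len(cache) == 2   # a new cuDNN version adds a fresh cache entry
```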
class GpuDnnConvBase(DnnBase):
__props__ = ()
def c_support_code_struct(self, node, name):
return """
cudnnTensor4dDescriptor_t input%(name)s;
cudnnTensor4dDescriptor_t output%(name)s;
cudnnFilterDescriptor_t kerns%(name)s;
""" % dict(name=name)
def c_init_code_struct(self, node, name, sub):
return """
cudnnStatus_t err%(name)s;
input%(name)s = NULL;
output%(name)s = NULL;
kerns%(name)s = NULL;
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(inp): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(out): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
if ((err%(name)s = cudnnCreateFilterDescriptor(&kerns%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate filter descriptor: %%s",
cudnnGetErrorString(err%(name)s));
%(fail)s
}
""" % dict(name=name, fail=sub['fail'])
def c_cleanup_code_struct(self, node, name):
return """
if (input%(name)s != NULL) {cudnnDestroyTensor4dDescriptor(input%(name)s);}
if (output%(name)s != NULL) {cudnnDestroyTensor4dDescriptor(output%(name)s);}
if (kerns%(name)s != NULL) {cudnnDestroyFilterDescriptor(kerns%(name)s);}
""" % dict(name=name)
def c_set_filter(self, var, desc, err, fail):
return """
%(err)s = cudnnSetFilterDescriptor(
%(desc)s, CUDNN_DATA_FLOAT,
CudaNdarray_HOST_DIMS(%(var)s)[0],
CudaNdarray_HOST_DIMS(%(var)s)[1],
CudaNdarray_HOST_DIMS(%(var)s)[2],
CudaNdarray_HOST_DIMS(%(var)s)[3]
);
if (%(err)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set filter descriptor: %%s",
cudnnGetErrorString(%(err)s));
%(fail)s
}
""" % dict(var=var, desc=desc, err=err, fail=fail)
def c_set_tensor4d(self, *arg):
return c_set_tensor4d(*arg)
def c_code(self, node, name, inputs, outputs, sub):
desc = inputs[2]
out, = outputs
checks = []
for v in inputs[:2]:
checks.append("""
if (!CudaNdarray_is_c_contiguous(%s)) {
PyErr_SetString(PyExc_ValueError, "Only contiguous inputs are supported.");
%s
}
""" % (v, sub['fail']))
sets = []
for p, v, d in zip(inputs[:2], self.conv_inputs, self.conv_types[:2]):
sets.append(getattr(self, 'c_set_'+d)(p, v + name,
'err' + name, sub['fail']))
set_out = getattr(self, 'c_set_' + self.conv_types[2])(
out, self.conv_output + name, 'err' + name,
sub['fail'])
return """
cudnnStatus_t err%(name)s;
%(checks)s
%(sets)s
{
int out_dims[4];
err%(name)s = cudnnGetOutputTensor4dDim(
%(desc)s, %(path)s,
&out_dims[0], &out_dims[1],
&out_dims[2], &out_dims[3]
);
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not get output sizes: %%s",
cudnnGetErrorString(err%(name)s));
%(fail)s
}
// workaround for cudnn R1 bug
if (%(path)s == CUDNN_CONVOLUTION_WEIGHT_GRAD &&
(out_dims[0] != CudaNdarray_HOST_DIMS(%(input2)s)[1] ||
out_dims[1] != CudaNdarray_HOST_DIMS(%(input1)s)[1])) {
out_dims[0] = CudaNdarray_HOST_DIMS(%(input2)s)[1];
out_dims[1] = CudaNdarray_HOST_DIMS(%(input1)s)[1];
// This is a horrible hack that is unfortunately necessary
int *dd = (int *)%(desc)s;
out_dims[2] = dd[5];
out_dims[3] = dd[6];
}
if (CudaNdarray_prep_output(&%(out)s, 4, out_dims) != 0) {
%(fail)s
}
}
%(set_out)s
err%(name)s = %(method)s(
_handle,
%(input1_desc)s, CudaNdarray_DEV_DATA(%(input1)s),
%(input2_desc)s, CudaNdarray_DEV_DATA(%(input2)s),
%(desc)s,
%(output_desc)s, CudaNdarray_DEV_DATA(%(out)s),
CUDNN_RESULT_NO_ACCUMULATE
);
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "error doing operation: %%s",
cudnnGetErrorString(err%(name)s));
%(fail)s
}
""" % dict(out=out, desc=desc, fail=sub['fail'],
name=name, checks='\n'.join(checks), sets='\n'.join(sets),
set_out=set_out, input1=inputs[0], input2=inputs[1],
input1_desc=self.conv_inputs[0]+name,
input2_desc=self.conv_inputs[1]+name,
output_desc=self.conv_output+name,
method=self.conv_op, path=self.path_flag)
def c_code_cache_version(self):
return (8,)
-class GpuDnnConv(GpuDnnConvBase):
+class GpuDnnConv(DnnBase, COp):
    """
    The forward convolution.
@@ -398,11 +327,11 @@ class GpuDnnConv(GpuDnnConvBase):
    :param descr: the convolution descriptor
    """
-    conv_inputs = 'input', 'kerns'
-    conv_output = 'output'
-    conv_types = 'tensor4d', 'filter', 'tensor4d'
-    conv_op = 'cudnnConvolutionForward'
-    path_flag = 'CUDNN_CONVOLUTION_FWD'
+    __props__ = ()
+
+    def __init__(self):
+        COp.__init__(self, ["dnn_base.c", "dnn_conv_base.c", "dnn_fwd.c"],
+                     "APPLY_SPECIFIC(conv_fwd)")
    def make_node(self, img, kern, desc):
        img = as_cuda_ndarray_variable(img)
@@ -428,8 +357,10 @@ class GpuDnnConv(GpuDnnConvBase):
        top = gpu_contiguous(top)
-        d_img = GpuDnnConvGradI()(kerns, top, desc)
-        d_kerns = GpuDnnConvGradW()(img, top, desc)
+        d_img = GpuDnnConvGradI()(kerns, top, desc,
+                                  img.shape[2], img.shape[3])
+        d_kerns = GpuDnnConvGradW()(img, top, desc,
+                                    kerns.shape[2], kerns.shape[3])
        return d_img, d_kerns, theano.gradient.DisconnectedType()()
@@ -438,7 +369,7 @@ class GpuDnnConv(GpuDnnConvBase):
        return [[1], [1], [0]]
-class GpuDnnConvGradW(GpuDnnConvBase):
+class GpuDnnConvGradW(DnnBase, COp):
    """
    The convolution gradient with respect to the weights.
@@ -447,29 +378,30 @@ class GpuDnnConvGradW(GpuDnnConvBase):
    :param descr: the convolution descriptor
    """
-    conv_inputs = 'input', 'output',
-    conv_output = 'kerns'
-    conv_types = 'tensor4d', 'tensor4d', 'filter'
-    path_flag = 'CUDNN_CONVOLUTION_WEIGHT_GRAD'
-    conv_op = 'cudnnConvolutionBackwardFilter'
+    __props__ = ()
+
+    def __init__(self):
+        COp.__init__(self, ["dnn_base.c", "dnn_conv_base.c", "dnn_gw.c"],
+                     "APPLY_SPECIFIC(conv_gw)")

    def grad(self, inp, grads):
-        img, top, desc = inp
+        img, top, desc, h, w = inp
        kerns, = grads
        kerns = gpu_contiguous(kerns)
-        d_img = GpuDnnConvGradI()(kerns, top, desc)
+        d_img = GpuDnnConvGradI()(kerns, top, desc,
+                                  img.shape[2], img.shape[3])
        d_top = GpuDnnConv()(img, kerns, desc)
-        return d_img, d_top, theano.gradient.DisconnectedType()()
+        return (d_img, d_top, DisconnectedType()(), DisconnectedType()(),
+                DisconnectedType()())

    def connection_pattern(self, node):
-        # not connected to desc
-        return [[1], [1], [0]]
+        # not connected to desc, h, w
+        return [[1], [1], [0], [0], [0]]
-    def make_node(self, img, topgrad, desc):
+    def make_node(self, img, topgrad, desc, h, w):
        img = as_cuda_ndarray_variable(img)
        topgrad = as_cuda_ndarray_variable(topgrad)
        if img.type.ndim != 4:
@@ -481,14 +413,18 @@ class GpuDnnConvGradW(GpuDnnConvBase):
           or desc.type.ctype != 'cudnnConvolutionDescriptor_t':
            raise TypeError('desc must be cudnnConvolutionDescriptor_t')
+        h = as_scalar(h)
+        w = as_scalar(w)
        broadcastable = [topgrad.type.broadcastable[1],
                         img.type.broadcastable[1],
                         False, False]
-        return Apply(self, [img, topgrad, desc],
+        return Apply(self, [img, topgrad, desc, h, w],
                     [CudaNdarrayType(broadcastable)()])
-class GpuDnnConvGradI(GpuDnnConvBase):
+class GpuDnnConvGradI(DnnBase, COp):
    """
    The convolution gradient with respect to the inputs.
@@ -497,29 +433,29 @@ class GpuDnnConvGradI(GpuDnnConvBase):
    :param descr: the convolution descriptor
    """
-    conv_inputs = 'kerns', 'output',
-    conv_output = 'input'
-    conv_types = 'filter', 'tensor4d', 'tensor4d'
-    path_flag = 'CUDNN_CONVOLUTION_DATA_GRAD'
-    conv_op = 'cudnnConvolutionBackwardData'
+    __props__ = ()
+
+    def __init__(self):
+        COp.__init__(self, ["dnn_base.c", "dnn_conv_base.c", "dnn_gi.c"],
+                     "APPLY_SPECIFIC(conv_gi)")

    def grad(self, inp, grads):
-        kerns, top, desc = inp
+        kerns, top, desc, h, w = inp
        img, = grads
        img = gpu_contiguous(img)
-        d_kerns = GpuDnnConvGradW()(img, top, desc)
+        d_kerns = GpuDnnConvGradW()(img, top, desc,
+                                    kerns.shape[2], kerns.shape[3])
        d_top = GpuDnnConv()(img, kerns, desc)
-        return d_kerns, d_top, theano.gradient.DisconnectedType()()
+        return (d_kerns, d_top, DisconnectedType()(), DisconnectedType()(),
+                DisconnectedType()())

    def connection_pattern(self, node):
-        # not connected to desc
-        return [[1], [1], [0]]
+        # not connected to desc, h, w
+        return [[1], [1], [0], [0], [0]]
-    def make_node(self, kern, topgrad, desc):
+    def make_node(self, kern, topgrad, desc, h, w):
        kern = as_cuda_ndarray_variable(kern)
        topgrad = as_cuda_ndarray_variable(topgrad)
        if kern.type.ndim != 4:
@@ -531,10 +467,14 @@ class GpuDnnConvGradI(GpuDnnConvBase):
           or desc.type.ctype != 'cudnnConvolutionDescriptor_t':
            raise TypeError('desc must be cudnnConvolutionDescriptor_t')
+        h = as_scalar(h)
+        w = as_scalar(w)
        broadcastable = [topgrad.type.broadcastable[0],
                         kern.type.broadcastable[1],
                         False, False]
-        return Apply(self, [kern, topgrad, desc],
+        return Apply(self, [kern, topgrad, desc, h, w],
                     [CudaNdarrayType(broadcastable)()])
@@ -581,7 +521,7 @@ def dnn_conv(img, kerns, border_mode='valid', subsample=(1, 1),
                 img.shape[3] - kerns.shape[3] + 1)
        desc = GpuDnnConvDesc(border_mode='valid', subsample=(1, 1),
                              conv_mode='cross')(img.shape, shape)
-        conv = GpuDnnConvGradW()(img, kerns, desc)
+        conv = GpuDnnConvGradW()(img, kerns, desc, shape[2], shape[3])
        return as_cuda_ndarray_variable(conv.dimshuffle(1, 0, 2, 3))
    elif (border_mode == 'full' and subsample == (1, 1) and
@@ -597,7 +537,7 @@ def dnn_conv(img, kerns, border_mode='valid', subsample=(1, 1),
                 img.shape[3] + kerns.shape[3] - 1)
        desc = GpuDnnConvDesc(border_mode='valid', subsample=(1, 1),
                              conv_mode=conv_mode)(shape, kerns.shape)
-        return GpuDnnConvGradI()(kerns, img, desc)
+        return GpuDnnConvGradI()(kerns, img, desc, shape[2], shape[3])
    # Standard case: We use GpuDnnConv with suitable padding.
    img = gpu_contiguous(img)
@@ -664,14 +604,22 @@ class GpuDnnPoolDesc(GpuOp):
             "descriptor: %%s", cudnnGetErrorString(err));
%(fail)s
}
#ifndef CUDNN_VERSION
err = cudnnSetPoolingDescriptor(
%(desc)s,
%(mode_flag)s,
%(wsX)d, %(wsY)d,
%(stridex)d, %(stridey)d
);
#else
err = cudnnSetPooling2dDescriptor(
%(desc)s,
%(mode_flag)s,
%(wsX)d, %(wsY)d,
0, 0,
%(stridex)d, %(stridey)d
);
#endif
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not set op descriptor: %%s",
             cudnnGetErrorString(err));
@@ -683,7 +631,7 @@ class GpuDnnPoolDesc(GpuOp):
        stridey=self.stride[1])

    def c_code_cache_version(self):
-        return (1,)
+        return (1, version())
class GpuDnnPool(DnnBase):
@@ -709,8 +657,8 @@ class GpuDnnPool(DnnBase):
    def c_support_code_struct(self, node, name):
        return """
-cudnnTensor4dDescriptor_t input%(name)s;
-cudnnTensor4dDescriptor_t output%(name)s;
+cudnnTensorDescriptor_t input%(name)s;
+cudnnTensorDescriptor_t output%(name)s;
""" % dict(name=name)
    def c_init_code_struct(self, node, name, sub):
@@ -718,12 +666,12 @@ cudnnTensor4dDescriptor_t output%(name)s;
cudnnStatus_t err%(name)s;
input%(name)s = NULL;
output%(name)s = NULL;
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
+if ((err%(name)s = cudnnCreateTensorDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
             "(inp): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
+if ((err%(name)s = cudnnCreateTensorDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
             "(out): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
@@ -732,8 +680,8 @@ if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STAT
    def c_cleanup_code_struct(self, node, name):
        return """
-if (input%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(input%(name)s); }
-if (output%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(output%(name)s); }
+if (input%(name)s != NULL) { cudnnDestroyTensorDescriptor(input%(name)s); }
+if (output%(name)s != NULL) { cudnnDestroyTensorDescriptor(output%(name)s); }
""" % dict(name=name)
    def c_code(self, node, name, inputs, outputs, sub):
@@ -759,9 +707,19 @@ if (!CudaNdarray_is_c_contiguous(%(input)s)) {
%(set_in)s
cudnnPoolingMode_t mode;
-int wsX, wsY, strideX, strideY;
+int wsX, wsY, vpad, hpad, strideX, strideY;
#ifndef CUDNN_VERSION
-err%(name)s = cudnnGetPoolingDescriptor(%(desc)s, &mode, &wsX, &wsY, &strideX, &strideY);
+err%(name)s = cudnnGetPoolingDescriptor(
%(desc)s, &mode,
&wsX, &wsY,
&strideX, &strideY);
#else
err%(name)s = cudnnGetPooling2dDescriptor(
%(desc)s, &mode,
&wsX, &wsY,
&vpad, &hpad,
&strideX, &strideY);
#endif
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
@@ -781,13 +739,27 @@ if (CudaNdarray_prep_output(&%(out)s, 4, %(out)s_dims) != 0)
}
%(set_out)s
#ifndef CUDNN_VERSION
err%(name)s = cudnnPoolingForward(
_handle,
%(desc)s,
%(input_desc)s, CudaNdarray_DEV_DATA(%(input)s),
%(output_desc)s, CudaNdarray_DEV_DATA(%(out)s)
);
#else
{
const float alpha = 1;
const float beta = 0;
err%(name)s = cudnnPoolingForward(
_handle,
%(desc)s,
&alpha,
%(input_desc)s, CudaNdarray_DEV_DATA(%(input)s),
&beta,
%(output_desc)s, CudaNdarray_DEV_DATA(%(out)s)
);
}
#endif
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
             "GpuDnnPool: error doing cudnnPoolingForward operation: %%s",
@@ -817,7 +789,7 @@ if (err%(name)s != CUDNN_STATUS_SUCCESS) {
        return [[1], [0]]

    def c_code_cache_version(self):
-        return (4,)
+        return (4, version())
class GpuDnnPoolGrad(DnnBase):
@@ -853,10 +825,10 @@ class GpuDnnPoolGrad(DnnBase):
    def c_support_code_struct(self, node, name):
        return """
-cudnnTensor4dDescriptor_t input%(name)s;
-cudnnTensor4dDescriptor_t input_grad%(name)s;
-cudnnTensor4dDescriptor_t output%(name)s;
-cudnnTensor4dDescriptor_t output_grad%(name)s;
+cudnnTensorDescriptor_t input%(name)s;
+cudnnTensorDescriptor_t input_grad%(name)s;
+cudnnTensorDescriptor_t output%(name)s;
+cudnnTensorDescriptor_t output_grad%(name)s;
""" % dict(name=name)
    def c_init_code_struct(self, node, name, sub):
@@ -866,25 +838,25 @@ input%(name)s = NULL;
input_grad%(name)s = NULL;
output%(name)s = NULL;
output_grad%(name)s = NULL;
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
+if ((err%(name)s = cudnnCreateTensorDescriptor(&input%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
             "GpuDnnPoolGrad: could not allocate tensor4d descriptor "
             "(input): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&input_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
+if ((err%(name)s = cudnnCreateTensorDescriptor(&input_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
             "GpuDnnPoolGrad: could not allocate tensor4d descriptor "
             "(input_grad): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
+if ((err%(name)s = cudnnCreateTensorDescriptor(&output%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
             "GpuDnnPoolGrad: could not allocate tensor4d descriptor "
             "(output): %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
+if ((err%(name)s = cudnnCreateTensorDescriptor(&output_grad%(name)s)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError,
             "GpuDnnPoolGrad: could not allocate tensor4d descriptor "
             "(output_grad): %%s", cudnnGetErrorString(err%(name)s));
@@ -894,10 +866,10 @@ if ((err%(name)s = cudnnCreateTensor4dDescriptor(&output_grad%(name)s)) != CUDNN
    def c_cleanup_code_struct(self, node, name):
        return """
-if (input%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(input%(name)s); }
-if (input_grad%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(input_grad%(name)s); }
-if (output%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(output%(name)s); }
-if (output_grad%(name)s != NULL) { cudnnDestroyTensor4dDescriptor(output_grad%(name)s); }
+if (input%(name)s != NULL) { cudnnDestroyTensorDescriptor(input%(name)s); }
+if (input_grad%(name)s != NULL) { cudnnDestroyTensorDescriptor(input_grad%(name)s); }
+if (output%(name)s != NULL) { cudnnDestroyTensorDescriptor(output%(name)s); }
+if (output_grad%(name)s != NULL) { cudnnDestroyTensorDescriptor(output_grad%(name)s); }
""" % dict(name=name)
    def c_code(self, node, name, inputs, outputs, sub):
@@ -949,7 +921,7 @@ if (CudaNdarray_prep_output(&%(output_grad)s, 4,
}
%(set_out)s
#ifndef CUDNN_VERSION
err%(name)s = cudnnPoolingBackward(
_handle,
%(desc)s,
@@ -958,6 +930,22 @@ _handle,
%(output_desc)s, CudaNdarray_DEV_DATA(%(output)s),
%(output_grad_desc)s, CudaNdarray_DEV_DATA(%(output_grad)s)
);
#else
{
const float alpha = 1;
const float beta = 0;
err%(name)s = cudnnPoolingBackward(
_handle,
%(desc)s,
&alpha,
%(input_desc)s, CudaNdarray_DEV_DATA(%(input)s),
%(input_grad_desc)s, CudaNdarray_DEV_DATA(%(input_grad)s),
%(output_desc)s, CudaNdarray_DEV_DATA(%(output)s),
&beta,
%(output_grad_desc)s, CudaNdarray_DEV_DATA(%(output_grad)s)
);
}
#endif
if (err%(name)s != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
             "GpuDnnPoolGrad: error doing operation: %%s",
@@ -974,7 +962,7 @@ if (err%(name)s != CUDNN_STATUS_SUCCESS) {
        output_grad_desc="output_grad"+name)

    def c_code_cache_version(self):
-        return (4,)
+        return (4, version())
def dnn_pool(img, ws, stride=(1, 1), mode='max'):
@@ -1015,6 +1003,7 @@ class GpuDnnSoftmaxBase(DnnBase):
    def __init__(self, tensor_format, algo, mode):
        assert(tensor_format in ('bc01', 'b01c'))
DnnBase.__init__(self)
        self.tensor_format = tensor_format
        assert(algo in ('fast', 'accurate'))
@@ -1029,14 +1018,14 @@ class GpuDnnSoftmaxBase(DnnBase):
    def _define_tensor4d_desc(self, name, id):
        return """
-cudnnTensor4dDescriptor_t %(id)s_%(name)s;
+cudnnTensorDescriptor_t %(id)s_%(name)s;
""" % dict(name=name, id=id)

    def _init_tensor4d_desc(self, name, id, fail):
        return """
%(id)s_%(name)s = NULL;
-if ((err%(name)s = cudnnCreateTensor4dDescriptor(&%(id)s_%(name)s)) != CUDNN_STATUS_SUCCESS) {
-PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
+if ((err%(name)s = cudnnCreateTensorDescriptor(&%(id)s_%(name)s)) != CUDNN_STATUS_SUCCESS) {
+PyErr_Format(PyExc_MemoryError, "could not allocate tensor descriptor "
             ": %%s", cudnnGetErrorString(err%(name)s));
%(fail)s
}
@@ -1045,7 +1034,7 @@ if ((err%(name)s = cudnnCreateTensor4dDescriptor(&%(id)s_%(name)s)) != CUDNN_STA
def _clean_tensor4d_desc(self, name, id):
return """
if(%(id)s_%(name)s!= NULL)
cudnnDestroyTensorDescriptor(%(id)s_%(name)s);
""" % dict(name=name, id=id)
def c_support_code_struct(self, node, name):
@@ -1102,8 +1091,7 @@ if (%(algo)d == 1)
cudnnSoftmaxMode_t mode%(name)s = CUDNN_SOFTMAX_MODE_CHANNEL;
if (%(mode)d == 1)
mode%(name)s = CUDNN_SOFTMAX_MODE_INSTANCE;
""" % dict(name=name, tensor_format=tensor_format, mode=mode, algo=algo)
# Validate the input and build the input variables.
for input_idx, input_name in enumerate(self.softmax_inputs):
@@ -1134,7 +1122,7 @@ if (CudaNdarray_prep_output(&%(outs)s, 4, CudaNdarray_HOST_DIMS(%(ins)s)) != 0)
return result
def c_code_cache_version(self):
return (0, 6, version())
def method(self):
raise NotImplementedError('GpuDnnSoftmaxBase::method')
@@ -1150,15 +1138,33 @@ class GpuDnnSoftmax(GpuDnnSoftmaxBase):
def method(self):
return """
#ifndef CUDNN_VERSION
err%(name)s = cudnnSoftmaxForward(
_handle,
algo%(name)s,
mode%(name)s,
softmax_input_%(name)s,
CudaNdarray_DEV_DATA(%(ins)s),
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
#else
{
const float alpha = 1.;
const float beta = 0.;
err%(name)s = cudnnSoftmaxForward(
_handle,
algo%(name)s,
mode%(name)s,
(void*) &alpha,
softmax_input_%(name)s,
CudaNdarray_DEV_DATA(%(ins)s),
(void*) &beta,
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
}
#endif
""" """
def grad(self, inp, grads): def grad(self, inp, grads):
...@@ -1184,6 +1190,7 @@ class GpuDnnSoftmaxGrad(GpuDnnSoftmaxBase): ...@@ -1184,6 +1190,7 @@ class GpuDnnSoftmaxGrad(GpuDnnSoftmaxBase):
def method(self): def method(self):
return """ return """
#ifndef CUDNN_VERSION
err%(name)s = cudnnSoftmaxBackward(
_handle,
algo%(name)s,
@@ -1195,7 +1202,26 @@ err%(name)s = cudnnSoftmaxBackward(
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
#else
{
const float alpha = 1.;
const float beta = 0.;
err%(name)s = cudnnSoftmaxBackward(
_handle,
algo%(name)s,
mode%(name)s,
(void*) &alpha,
%(name1)s_%(name)s,
CudaNdarray_DEV_DATA(%(ins1)s),
%(name0)s_%(name)s,
CudaNdarray_DEV_DATA(%(ins0)s),
(void*) &beta,
softmax_output_%(name)s,
CudaNdarray_DEV_DATA(%(outs)s)
);
}
#endif
"""
# Indentation for history
...
#section support_code
static cudnnHandle_t _handle = NULL;
static int
c_set_tensor4d(CudaNdarray *var, cudnnTensorDescriptor_t desc) {
cudnnStatus_t err = cudnnSetTensor4dDescriptorEx(
desc, CUDNN_DATA_FLOAT,
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[0]?CudaNdarray_HOST_STRIDES(var)[0]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3]*CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_STRIDES(var)[1]?CudaNdarray_HOST_STRIDES(var)[1]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[2]?CudaNdarray_HOST_STRIDES(var)[2]:CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[3]?CudaNdarray_HOST_STRIDES(var)[3]:1
);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"Could not set tensor4d descriptor: %s"
"shapes=%d %d %d %d strides=%d %d %d %d",
cudnnGetErrorString(err),
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[0]?CudaNdarray_HOST_STRIDES(var)[0]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3]*CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_STRIDES(var)[1]?CudaNdarray_HOST_STRIDES(var)[1]:CudaNdarray_HOST_DIMS(var)[2]*CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[2]?CudaNdarray_HOST_STRIDES(var)[2]:CudaNdarray_HOST_DIMS(var)[3],
CudaNdarray_HOST_STRIDES(var)[3]?CudaNdarray_HOST_STRIDES(var)[3]:1
);
return -1;
}
return 0;
}
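The stride fallbacks in `c_set_tensor4d` replace zero strides (broadcastable dimensions) with the strides the dimension would have in a C-contiguous layout, since the descriptor call rejects zero strides. A hypothetical Python rendering of that computation (function names are mine):

```python
def default_strides(dims):
    # Element strides of a C-contiguous 4d (n, c, h, w) array.
    n, c, h, w = dims
    return (c * h * w, h * w, w, 1)

def effective_strides(dims, strides):
    # Mirror of the C ternaries above: any stride of 0 falls back to
    # the contiguous value for that dimension.
    return tuple(s if s != 0 else d
                 for s, d in zip(strides, default_strides(dims)))
```

For example a tensor broadcast along its first axis, `effective_strides((2, 3, 4, 5), (0, 20, 5, 1))`, is described to cuDNN as `(60, 20, 5, 1)`.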
static int
c_set_filter(CudaNdarray *var, cudnnFilterDescriptor_t desc) {
if (!CudaNdarray_is_c_contiguous(var)) {
PyErr_SetString(PyExc_ValueError,
"Only contiguous filters (kernels) are supported.");
return -1;
}
cudnnStatus_t err = cudnnSetFilter4dDescriptor(
desc, CUDNN_DATA_FLOAT,
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3]
);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"Could not set filter descriptor: %s."
" dims= %d %d %d %d",
cudnnGetErrorString(err),
CudaNdarray_HOST_DIMS(var)[0],
CudaNdarray_HOST_DIMS(var)[1],
CudaNdarray_HOST_DIMS(var)[2],
CudaNdarray_HOST_DIMS(var)[3]);
return -1;
}
return 0;
}
#section init_code
{
cudnnStatus_t err;
if ((err = cudnnCreate(&_handle)) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "could not create cuDNN handle: %s",
cudnnGetErrorString(err));
#if PY_MAJOR_VERSION >= 3
return NULL;
#else
return;
#endif
}
}
#section support_code_struct
cudnnTensorDescriptor_t APPLY_SPECIFIC(input);
cudnnTensorDescriptor_t APPLY_SPECIFIC(output);
cudnnFilterDescriptor_t APPLY_SPECIFIC(kerns);
#section init_code_struct
cudnnStatus_t APPLY_SPECIFIC(err);
APPLY_SPECIFIC(input) = NULL;
APPLY_SPECIFIC(output) = NULL;
APPLY_SPECIFIC(kerns) = NULL;
if ((APPLY_SPECIFIC(err) = cudnnCreateTensorDescriptor(&APPLY_SPECIFIC(input))) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(inp): %s", cudnnGetErrorString(APPLY_SPECIFIC(err)));
FAIL;
}
if ((APPLY_SPECIFIC(err) = cudnnCreateTensorDescriptor(&APPLY_SPECIFIC(output))) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate tensor4d descriptor "
"(out): %s", cudnnGetErrorString(APPLY_SPECIFIC(err)));
FAIL;
}
if ((APPLY_SPECIFIC(err) = cudnnCreateFilterDescriptor(&APPLY_SPECIFIC(kerns))) != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, "could not allocate filter descriptor: %s",
cudnnGetErrorString(APPLY_SPECIFIC(err)));
FAIL;
}
#section cleanup_code_struct
if (APPLY_SPECIFIC(input) != NULL)
cudnnDestroyTensorDescriptor(APPLY_SPECIFIC(input));
if (APPLY_SPECIFIC(output) != NULL)
cudnnDestroyTensorDescriptor(APPLY_SPECIFIC(output));
if (APPLY_SPECIFIC(kerns) != NULL)
cudnnDestroyFilterDescriptor(APPLY_SPECIFIC(kerns));
#section support_code_struct
int
APPLY_SPECIFIC(conv_fwd)(CudaNdarray *input, CudaNdarray *kerns,
cudnnConvolutionDescriptor_t desc,
CudaNdarray **output) {
cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
if (c_set_tensor4d(input, APPLY_SPECIFIC(input)) == -1)
return 1;
if (c_set_filter(kerns, APPLY_SPECIFIC(kerns)) == -1)
return 1;
{
int out_dims[4];
err = cudnnGetConvolution2dForwardOutputDim(
desc,
APPLY_SPECIFIC(input),
APPLY_SPECIFIC(kerns),
&out_dims[0], &out_dims[1], &out_dims[2], &out_dims[3]);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"GpuDnnConv: error while computing the output shape: %s",
cudnnGetErrorString(err));
return 1;
}
if (CudaNdarray_prep_output(output, 4, out_dims) != 0) {
return 1;
}
}
if (c_set_tensor4d(*output, APPLY_SPECIFIC(output)) == -1)
return 1;
{
cudnnConvolutionFwdAlgo_t algo;
err = cudnnGetConvolutionForwardAlgorithm(
_handle,
APPLY_SPECIFIC(input),
APPLY_SPECIFIC(kerns),
desc,
APPLY_SPECIFIC(output),
CUDNN_CONVOLUTION_FWD_NO_WORKSPACE, // TODO: add op param
0,
&algo);
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError,
"GpuDnnConv: Couldn't select convolution algorithm: %s",
cudnnGetErrorString(err));
return 1;
}
const float alpha = 1;
const float beta = 0;
err = cudnnConvolutionForward(
_handle,
(void *)&alpha,
APPLY_SPECIFIC(input), CudaNdarray_DEV_DATA(input),
APPLY_SPECIFIC(kerns), CudaNdarray_DEV_DATA(kerns),
desc,
algo,
NULL, 0,
(void *)&beta,
APPLY_SPECIFIC(output), CudaNdarray_DEV_DATA(*output));
}
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "GpuDnnConv: error doing operation: %s",
cudnnGetErrorString(err));
return 1;
}
return 0;
}
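`conv_fwd` asks cuDNN for the output shape via `cudnnGetConvolution2dForwardOutputDim`. For the plain case (no dilation), that should reduce to the usual formula; a hypothetical sketch:

```python
def conv2d_output_dims(in_dims, filt_dims, pad=(0, 0), stride=(1, 1)):
    # in_dims: (n, c, h, w) input; filt_dims: (k, c, r, s) filters.
    # Per spatial dimension: out = 1 + (in + 2*pad - filter) // stride.
    n, _, h, w = in_dims
    k, _, r, s = filt_dims
    out_h = 1 + (h + 2 * pad[0] - r) // stride[0]
    out_w = 1 + (w + 2 * pad[1] - s) // stride[1]
    return (n, k, out_h, out_w)
```

E.g. a 32x32 input with 5x5 filters gives a 28x28 valid-mode output, which is what the descriptor-based query is expected to return here.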
#section support_code_struct
int
APPLY_SPECIFIC(conv_gi)(CudaNdarray *kerns, CudaNdarray *output,
cudnnConvolutionDescriptor_t desc,
int h, int w,
CudaNdarray **input) {
cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
if (c_set_tensor4d(output, APPLY_SPECIFIC(output)) == -1)
return 1;
if (c_set_filter(kerns, APPLY_SPECIFIC(kerns)) == -1)
return 1;
{
int out_dims[4];
out_dims[0] = CudaNdarray_HOST_DIMS(output)[0];
out_dims[1] = CudaNdarray_HOST_DIMS(kerns)[1];
out_dims[2] = h;
out_dims[3] = w;
if (CudaNdarray_prep_output(input, 4, out_dims) != 0) {
return 1;
}
}
if (c_set_tensor4d(*input, APPLY_SPECIFIC(input)) == -1)
return 1;
{
const float alpha = 1;
const float beta = 0;
err = cudnnConvolutionBackwardData(
_handle,
(void *)&alpha,
APPLY_SPECIFIC(kerns), CudaNdarray_DEV_DATA(kerns),
APPLY_SPECIFIC(output), CudaNdarray_DEV_DATA(output),
desc,
(void *)&beta,
APPLY_SPECIFIC(input), CudaNdarray_DEV_DATA(*input));
}
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "GpuDnnConvGradI: error doing operation: %s",
cudnnGetErrorString(err));
return 1;
}
return 0;
}
#section support_code_struct
int
APPLY_SPECIFIC(conv_gw)(CudaNdarray *input, CudaNdarray *output,
cudnnConvolutionDescriptor_t desc,
int h, int w,
CudaNdarray **kerns) {
cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
if (c_set_tensor4d(input, APPLY_SPECIFIC(input)) == -1)
return 1;
if (c_set_tensor4d(output, APPLY_SPECIFIC(output)) == -1)
return 1;
{
int out_dims[4];
out_dims[0] = CudaNdarray_HOST_DIMS(output)[1];
out_dims[1] = CudaNdarray_HOST_DIMS(input)[1];
out_dims[2] = h;
out_dims[3] = w;
if (CudaNdarray_prep_output(kerns, 4, out_dims) != 0) {
return 1;
}
}
if (c_set_filter(*kerns, APPLY_SPECIFIC(kerns)) == -1)
return 1;
{
const float alpha = 1;
const float beta = 0;
err = cudnnConvolutionBackwardFilter(
_handle,
(void *)&alpha,
APPLY_SPECIFIC(input), CudaNdarray_DEV_DATA(input),
APPLY_SPECIFIC(output), CudaNdarray_DEV_DATA(output),
desc,
(void *)&beta,
APPLY_SPECIFIC(kerns), CudaNdarray_DEV_DATA(*kerns));
}
if (err != CUDNN_STATUS_SUCCESS) {
PyErr_Format(PyExc_RuntimeError, "GpuDnnConvGradW: error doing operation: %s",
cudnnGetErrorString(err));
return 1;
}
return 0;
}
@@ -27,7 +27,7 @@ from theano.sandbox import cuda
if cuda.cuda_available == False:
raise SkipTest('Optional package cuda disabled')
from theano.sandbox.cuda.dnn import GpuDnnConv, DnnBase, dnn_conv
#needed as the gpu conv don't have a perform implementation.
if theano.config.mode == 'FAST_COMPILE':
@@ -596,7 +596,7 @@ def test_gemm_valid():
def test_dnn_valid():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
for t in _test_valid(DnnBase, mode=theano_mode.including("cudnn")):
yield t
@@ -710,7 +710,7 @@ def test_gemm_full():
def test_dnn_full():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
for t in _test_full(DnnBase, mode=theano_mode.including("cudnn")):
yield t
@@ -762,13 +762,13 @@ def test_gemm_subsample():
def test_dnn_subsample():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
for t in _test_subsample(DnnBase, theano_mode.including('cudnn')):
yield t
class TestConv2DGPU(unittest.TestCase):
conv_ops = (cuda.blas.GpuConv,
cuda.dnn.DnnBase,
cuda.blas.BaseGpuCorrMM)
def test_logical_shapes(self):
...
@@ -192,3 +192,9 @@ def test_dnn_tag():
assert cuda.dnn.dnn_available()
assert any([isinstance(n.op, cuda.dnn.GpuDnnPool)
for n in f.maker.fgraph.toposort()])
def test_version():
if not cuda.dnn.dnn_available():
raise SkipTest(cuda.dnn.dnn_available.msg)
assert isinstance(cuda.dnn.version(), (int, tuple))
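`test_version` exists because the ops above now fold `version()` into their `c_code_cache_version()` tuples. A hypothetical sketch of why that matters for the compilation cache:

```python
def cache_key(op_version, cudnn_version):
    # Appending the installed library version to the op's own version
    # tuple means Theano recompiles the op's C module whenever cuDNN is
    # upgraded (R1 -> R2 changed several signatures), rather than
    # reusing a stale compiled binary.
    return op_version + (cudnn_version,)
```

With this scheme, `cache_key((0, 6), old)` and `cache_key((0, 6), new)` differ, forcing recompilation after an upgrade.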
@@ -82,6 +82,7 @@ get_scalar_type.cache = {}
def as_scalar(x, name=None):
from ..tensor import TensorType, scalar_from_tensor
if isinstance(x, gof.Apply):
if len(x.outputs) != 1:
raise ValueError("It is ambiguous which output of a multi-output"
@@ -89,9 +90,12 @@ def as_scalar(x, name=None):
else:
x = x.outputs[0]
if isinstance(x, Variable):
if isinstance(x.type, Scalar):
return x
elif isinstance(x.type, TensorType) and x.ndim == 0:
return scalar_from_tensor(x)
else:
raise TypeError("Variable type field must be a Scalar.", x, x.type)
try:
return constant(x)
except TypeError:
...
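The `as_scalar` change above accepts 0-d tensor variables (converting them with `scalar_from_tensor`) instead of rejecting them outright. A standalone, hypothetical miniature of the new dispatch, with stand-in type classes:

```python
class Scalar:
    pass

class TensorType:
    def __init__(self, ndim):
        self.ndim = ndim

def classify(var_type):
    # Mirrors the new branching: scalar types pass through, 0-d tensor
    # types are converted, anything else still raises as before.
    if isinstance(var_type, Scalar):
        return "scalar"
    elif isinstance(var_type, TensorType) and var_type.ndim == 0:
        return "converted"
    else:
        raise TypeError("Variable type field must be a Scalar.")
```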