Commit 1b8b9149 authored by Pascal Lamblin, committed by GitHub

Merge pull request #4584 from abergeron/gpua_doc

Add some documentation on how to write gpu ops.
.. _extending_theano_gpu:

==============================
Extending Theano with a GPU Op
==============================

.. note::

   This covers the :ref:`gpuarray <gpuarray>` back-end for the GPU.

This tutorial covers how to extend Theano with an op that offers a GPU
implementation. It assumes you are familiar with how to write new
Theano ops. If that is not the case you should probably follow the
:ref:`extending_theano` and :ref:`extending_theano_c` sections before
continuing on.
Writing a new GPU op can be done in Python for some simple tasks, but
will usually be done in C to access the complete API and avoid paying
the overhead of a Python function call.
Dealing With the Context
========================
One of the major differences with GPU ops is that they require a
context (a.k.a. device) to execute. Most of the time you can infer
the context to run on from your inputs. There is a way for the user
to transfer things between contexts and to tag certain variables for
transfer. It might also be the case that your inputs are not all from
the same context and you would have to choose which one to run on.
In order to support all of those options and have a consistent
interface, :func:`theano.gpuarray.basic_ops.infer_context_name` was
written. An example usage is below::

    def make_node(self, a, b, c):
        ctx = infer_context_name(a, b, c)
        a = as_gpuarray_variable(a, ctx)
        b = as_gpuarray_variable(b, ctx)
        c = as_gpuarray_variable(c, ctx)
        return Apply(self, [a, b, c], [a.type()])
In this example the Op takes three inputs, all on the GPU. In case
one or more of your inputs is not supposed to be on the GPU, you
should not pass it to :func:`infer_context_name` or call
:func:`as_gpuarray_variable` on it.
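For instance, a hypothetical op taking a GPU array and a host scalar
would only involve the GPU input in the context inference (a minimal
sketch; the names are illustrative)::

    def make_node(self, a, b):
        # Only `a` lives on the GPU; the scalar `b` stays on the host.
        ctx = infer_context_name(a)
        a = as_gpuarray_variable(a, ctx)
        b = tensor.as_tensor_variable(b)
        return Apply(self, [a, b], [a.type()])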
Also note that :func:`theano.gpuarray.basic_ops.as_gpuarray_variable`
takes ``context_name`` as a mandatory parameter. This is because it's
not enough to know you want the value to be on the GPU, you also want
to know which GPU to put it on. In almost all cases, you can pass in
the return value of :func:`infer_context_name` there.
If you also need the context during runtime (for example to allocate
the output), you can use the context of one of your inputs to know
which one to use. Here is another example::

    def perform(self, node, inputs, output_storage):
        A, B = inputs
        C, = output_storage
        C[0] = pygpu.empty([A.shape[0], B.shape[1]], dtype=A.dtype,
                           context=A.context)
        pygpu.blas.gemm(1, A, B, 0, C[0], overwrite_c=True)
Finally, if you require the context before perform, for example
during make_thunk() to initialize kernels, you can access the
context of your inputs through the type of the variables::

    def make_thunk(self, node, storage_map, compute_map, no_recycling):
        ctx = node.inputs[0].type.context
Note that ``GpuArrayType`` objects also have a ``context_name``
attribute which is the symbolic equivalent of ``context``. It can't
be used for calls to pygpu or libgpuarray, but it should be used for
theano operations and variables.
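A short sketch of the distinction, assuming ``node`` has a GPU input::

    vtype = node.inputs[0].type
    ctx = vtype.context            # runtime object, for pygpu/libgpuarray
    ctx_name = vtype.context_name  # symbolic name, for theano-level calls
                                   # such as as_gpuarray_variable()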
The last place where you might need the context is in the C
initialization code. For that you will have to use the :ref:`params
<extending_op_params>`. The params type should be
:data:`theano.gpuarray.type.gpu_context_type` and the params object
should be a context object from one of your input variables::

    def get_params(self, node):
        return node.inputs[0].type.context
If you don't have any input variables on the GPU you can follow the
example of :class:`GpuFromHost
<theano.gpuarray.basic_ops.GpuFromHost>` or :class:`GpuEye
<theano.gpuarray.basic_ops.GpuEye>`. This is not a case that you
should encounter often, so it will not be covered further.
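Putting the pieces of this section together, here is a minimal sketch
of a hypothetical op that returns a copy of its input, with the
context handled at each stage; the class is illustrative and not part
of Theano::

    from theano import Op, Apply
    from theano.gpuarray.basic_ops import (infer_context_name,
                                           as_gpuarray_variable)
    from theano.gpuarray.type import gpu_context_type

    class GpuCopyExample(Op):
        __props__ = ()
        params_type = gpu_context_type

        def make_node(self, a):
            ctx = infer_context_name(a)
            a = as_gpuarray_variable(a, ctx)
            return Apply(self, [a], [a.type()])

        def get_params(self, node):
            # Makes the context available at runtime and to the C
            # initialization code.
            return node.inputs[0].type.context

        def perform(self, node, inputs, output_storage, params):
            # `params` is the context object returned by get_params().
            a, = inputs
            out, = output_storage
            out[0] = a.copy()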
Defining New Kernels
====================
If your op needs to do some transformation on the data, chances are
that you will need to write a new kernel. The best way to do this is
to leverage :class:`GpuKernelBase
<theano.gpuarray.basic_ops.GpuKernelBase>` (or :class:`CGpuKernelBase
<theano.gpuarray.basic_ops.CGpuKernelBase>` if you want to use the
:class:`COp <theano.gof.op.COp>` functionality).
For plain :class:`GpuKernelBase
<theano.gpuarray.basic_ops.GpuKernelBase>`, you have to define a
method called ``gpu_kernels`` which returns a list of :class:`Kernel
<theano.gpuarray.basic_ops.Kernel>` objects. You can define as many
kernels as you want for a single op. An example would look like
this::

    def gpu_kernels(self, node, name):
        code = """
    KERNEL void k(GLOBAL_MEM ga_double *a, ga_size n, ga_size m) {
        ga_size nb = n < m ? n : m;
        for (ga_size i = LID_0; i < nb; i += LDIM_0) {
            a[i*m + i] = 1;
        }
    }"""
        return [Kernel(
            code=code, name="k",
            params=[gpuarray.GpuArray, gpuarray.SIZE, gpuarray.SIZE],
            flags=Kernel.get_flags('float64'))]
If you want to use ``COp``, then you should use ``CGpuKernelBase``
instead. It adds a new section to the parsed files whose tag is
``kernels``. Inside that section you can define some kernels with
``#kernel name:params:flags``.
Here ``name`` is the name of the kernel function in the following
code and ``params`` is a comma-separated list of numpy typecode
names.  There are three exceptions: ``size_t``, which should be
written ``size``; ``ssize_t``, which should be written ``ssize``; and
pointers, which should be written ``*``.
``flags`` is a ``|``-separated list of C kernel flag values (can be
empty). The same kernel definition as above would look like this with
``CGpuKernelBase``::

    #section kernels

    #kernel k : *, size, size : GA_USE_DOUBLE

    KERNEL void k(GLOBAL_MEM ga_double *a, ga_size n, ga_size m) {
        ga_size nb = n < m ? n : m;
        for (ga_size i = LID_0; i < nb; i += LDIM_0) {
            a[i*m + i] = 1;
        }
    }
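As a further illustration of the spec line, here is a hypothetical
kernel taking a buffer, a float32 scalar and a size; the ``float32``
entry follows the numpy typecode rule above and corresponds to a
``ga_float`` argument (a sketch, not part of the Theano sources)::

    #section kernels

    #kernel scale : *, float32, size :

    KERNEL void scale(GLOBAL_MEM ga_float *a, ga_float alpha, ga_size n) {
        for (ga_size i = LID_0; i < n; i += LDIM_0) {
            a[i] *= alpha;
        }
    }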
The second method is to handle the kernel compilation and cache on
your own. This is not recommended because there are lots of details
to pay attention to that can cripple your performance if not done
right, which GpuKernelBase handles for you. But if you really want to
go this way, then you can look up the C API for kernels in
libgpuarray.
In any case you will need to call your compiled kernel with some data,
in most cases in your :meth:`c_code` method. This is done using the
`GpuKernel_call()
<http://deeplearning.net/software/libgpuarray/c_api.html#GpuKernel_call>`_
function in your C code. An example calling the above kernel would
be::
    size_t ls, gs;
    size_t dims[2];
    void *args[3];

    // ...

    // Buffer arguments are passed as their ga.data member, scalar
    // arguments as pointers to host values.
    args[0] = input->ga.data;
    args[1] = &dims[0];
    args[2] = &dims[1];
    ls = 1;
    gs = 256;
    // 1-D launch with local size ls, global size gs, no shared memory.
    err = GpuKernel_call(&k_k, 1, &ls, &gs, 0, args);

    // ...
The name of the kernel object depends on the name you passed to
``Kernel()`` when you declared it (or the name in your ``#kernel``
statement).  It defaults to ``'k_' + name``.
For other operations in the C code you should refer to the
`libgpuarray documentation
<http://deeplearning.net/software/libgpuarray/>`_.
A Complete Example
==================
This is a complete example using both approaches for an
implementation of the Eye operation.
GpuKernelBase
-------------
Python File
~~~~~~~~~~~
.. literalinclude:: ../../theano/gpuarray/basic_ops.py
   :language: python
   :pyobject: GpuEye
CGpuKernelBase
--------------
Python File
~~~~~~~~~~~
.. literalinclude:: ../../theano/gpuarray/tests/test_cgpukernelbase.py
   :language: python
   :pyobject: GpuEye
``tstgpueye.c``
~~~~~~~~~~~~~~~
.. literalinclude:: ../../theano/gpuarray/tests/tstgpueye.c
   :language: C
Wrapping Existing Libraries
============================
PyCUDA
------
For things in PyCUDA (or things wrapped with PyCUDA), we usually need
to create a PyCUDA context. This can be done with the following
code::

    with gpuarray_cuda_context:
        pycuda_context = pycuda.driver.Context.attach()
If you don't need to create a context because the library doesn't
require one, you can also just wrap your code in a ``with`` statement
using the pygpu context, as above.  This makes that context the
current context on the CUDA stack.
GpuArray objects are compatible with PyCUDA and will expose the
necessary interface so that they can be used in most things. One
notable exception is PyCUDA kernels which require native objects. If
you need to convert a pygpu GpuArray to a PyCUDA GPUArray, this code
should do the trick::

    assert pygpu_array.flags['IS_C_CONTIGUOUS']
    pycuda_array = pycuda.gpuarray.GPUArray(pygpu_array.shape,
                                            pygpu_array.dtype,
                                            base=pygpu_array,
                                            gpudata=(pygpu_array.gpudata +
                                                     pygpu_array.offset))
As long as the computations happen on the NULL stream, there are no
special considerations to watch for with regard to synchronization.
Otherwise, you will have to make sure that you synchronize the pygpu
objects by calling the ``.sync()`` method before scheduling any work,
and synchronize with the work that happens in the library after all
the work is scheduled.
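A minimal sketch of that pattern, where ``stream`` and the wrapped
call are hypothetical names::

    # Wait for pending pygpu work on the buffers before handing them over.
    pygpu_in.sync()
    pygpu_out.sync()

    # Schedule the library's work on its own (non-NULL) stream...
    wrapped_library_call(pycuda_in, pycuda_out, stream=stream)

    # ...and wait for that work before pygpu touches the buffers again.
    stream.synchronize()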
@@ -45,6 +45,7 @@ with Theano itself.
    ctype
    cop
    using_params
+   extending_theano_gpu
    optimization
    tips
    unittest
...
@@ -29,6 +29,13 @@ Making a purpose-built class may require more upfront work, but can
 pay off if you reuse the type for a lot of Ops, by not having to re-do
 all of the python manipulation.
+
+The params object
+-----------------
+
+The object that you use to store your param values must be hashable
+and comparable for equality, because it will be stored in a dictionary
+at some point.  Apart from those requirements it can be anything that
+matches what you have declared as the params type.
 Defining a params type
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -175,6 +182,14 @@ weights.
         self.alpha = alpha
         self.beta = beta
+
+    def __hash__(self):
+        return hash((type(self), self.alpha, self.beta))
+
+    def __eq__(self, other):
+        return (type(self) == type(other) and
+                self.alpha == other.alpha and
+                self.beta == other.beta)
 class Mix(Op):
     params_type = Generic()
...
 from __future__ import absolute_import, print_function, division
 import os
+import copy
+import re
 import numpy
 from theano import Op, Apply, Type, Variable
@@ -8,7 +9,7 @@ from theano import tensor, config
 from theano.gradient import grad_undefined
 from theano.tensor.basic import Alloc, Join, Split
-from theano.gof import HideC
+from theano.gof import HideC, COp
 from theano.gof.utils import MethodNotDefined
 from collections import deque
@@ -124,6 +125,51 @@ class Kernel(object):
     """
     This class groups together all the attributes of a gpu kernel.
+
+    `params` should contain the data type for each argument.  Buffer
+    arguments should use the GpuArray class as the data type and
+    scalars should use their equivalent numpy dtype.  For ga_size and
+    ga_ssize, use gpuarray.SIZE and gpuarray.SSIZE.
+
+    If the `ctypes` flag is set to `True` then each entry of `params`
+    should be a C string which represents the typecode to use.
+
+    `flags` can contain the following keys whose values are booleans:
+
+        have_double
+            the kernel uses double-typed variables somewhere
+        have_small
+            the kernel uses variables whose type takes less than 4
+            bytes somewhere
+        have_complex
+            the kernel uses complex values somewhere
+        have_half
+            the kernel uses half-floats somewhere
+        ctypes
+            the `params` list consists of C typecodes
+
+    It can also have the key `cflags` which is a string of C flag
+    values like this: `"GA_USE_DOUBLE|GA_USE_CLUDA"`.
+
+    Parameters
+    ----------
+    code: str
+        The source code of the kernel.
+    params: list
+        list of parameter types.
+    name: str
+        the name of the kernel function in the source.
+    flags: dict
+        dictionary of flags.
+    codevar: str
+        the name of the variable for the code object
+        (defaults to `kcode_` + name).
+    binvar: str
+        the name of the variable for the binary object
+        (defaults to `kbin_` + name).
+    objvar: str
+        the name of the variable for the kernel object
+        (defaults to `k_` + name).
+
     """
 
     def __init__(self, code, params, name, flags,
@@ -167,6 +213,8 @@ class Kernel(object):
     def _get_c_flags(self):
         res = []
+        if self.flags.get('cflags', '') != '':
+            res.append(self.flags['cflags'])
         if self.flags.get('cluda', False):
             res.append('GA_USE_CLUDA')
         if self.flags.get('have_double', False):
@@ -176,9 +224,26 @@ class Kernel(object):
         if self.flags.get('have_complex', False):
             res.append('GA_USE_COMPLEX')
         if self.flags.get('have_half', False):
-            res.append('GA_USE_SMALL')
+            res.append('GA_USE_HALF')
         return '|'.join(res)
+
+    def _get_py_flags(self):
+        res = dict(self.flags)
+        cflags = res.pop('cflags', '')
+        for fl in cflags.split('|'):
+            fl = fl.strip()
+            if fl == 'GA_USE_CLUDA':
+                res['cluda'] = True
+            if fl == 'GA_USE_DOUBLE':
+                res['have_double'] = True
+            if fl == 'GA_USE_SMALL':
+                res['have_small'] = True
+            if fl == 'GA_USE_COMPLEX':
+                res['have_complex'] = True
+            if fl == 'GA_USE_HALF':
+                res['have_half'] = True
+        return res
 
     def _get_c_types(self):
         def m(t):
             if t == gpuarray.GpuArray:
@@ -215,7 +280,7 @@ class GpuKernelBase(object):
     def _generate_kernel_bin(self, k, ctx):
         gk = gpuarray.GpuKernel(k.code, k.name, k.params, context=ctx,
-                                **k.flags)
+                                **k._get_py_flags())
         bin = gk._binary
         bcode = ','.join(hex(c) for c in iterbytes(bin))
         return ("""static const char %(bname)s[] = { %(bcode)s };""" %
@@ -313,6 +378,102 @@ class GpuKernelBase(object):
         return (4, self.get_params(node).bin_id)
+
+
+def forward_string_meth(name):
+    def f(*args):
+        res = getattr(GpuKernelBase, name)(*args)
+        try:
+            res = res + '\n' + getattr(COp, name)(*args)
+        except MethodNotDefined:
+            pass
+        return res
+    f.__name__ = name
+    return f
+
+
+def get_dtype(s):
+    if s == '*':
+        return gpuarray.GpuArray
+    if s == 'size':
+        return gpuarray.SIZE
+    if s == 'ssize':
+        return gpuarray.SSIZE
+    else:
+        return numpy.dtype(s)
+
+
+class CGpuKernelBase(COp, GpuKernelBase):
+    """
+    Class to combine GpuKernelBase and COp.
+
+    It adds a new section type 'kernels' where you can define kernels
+    with the '#kernel' tag.
+    """
+    SECTIONS = copy.copy(COp.SECTIONS)
+    SECTIONS.add('kernels')
+
+    kernel_re = re.compile(r'^#kernel ([a-zA-Z_].*?)$', re.MULTILINE)
+
+    c_support_code = forward_string_meth('c_support_code')
+    c_support_code_apply = forward_string_meth('c_support_code_apply')
+    c_support_code_struct = forward_string_meth('c_support_code_struct')
+    c_init_code_struct = forward_string_meth('c_init_code_struct')
+    c_cleanup_code_struct = forward_string_meth('c_cleanup_code_struct')
+
+    def _type_macros(self, node):
+        define_template = "#define %s %s\n"
+        undef_template = "#undef %s\n"
+        define_macros = []
+        undef_macros = []
+        for i, v in enumerate(node.inputs):
+            if isinstance(v.type, GpuArrayType):
+                macro_name = "DTYPE_i%d" % (i,)
+                macro_value = pygpu.gpuarray.dtype_to_ctype(v.dtype)
+                define_macros.append(
+                    define_template %
+                    (macro_name, macro_value))
+                undef_macros.append(undef_template % macro_name)
+        for i, v in enumerate(node.outputs):
+            if isinstance(v.type, GpuArrayType):
+                macro_name = "DTYPE_o%d" % (i,)
+                macro_value = pygpu.gpuarray.dtype_to_ctype(v.dtype)
+                define_macros.append(
+                    define_template %
+                    (macro_name, macro_value))
+                undef_macros.append(undef_template % macro_name)
+
+        return ''.join(define_macros), ''.join(undef_macros)
+
+    def gpu_kernels(self, node, name):
+        if hasattr(self, '_cached_kernels'):
+            return self._cached_kernels
+        if 'kernels' in self.code_sections:
+            code = self.code_sections['kernels']
+            split = self.kernel_re.split(code)
+            if split[0].strip() != '':
+                raise ValueError("Stray code in kernels section before the "
+                                 "first #kernel statement.")
+            def_macros, undef_macros = self._type_macros(node)
+            n = 1
+            res = []
+            while n < len(split):
+                kspec = split[n]
+                kcode = split[n + 1]
+                splt2 = kspec.split(':')
+                if len(splt2) != 3:
+                    raise ValueError("Bad kernel spec: %s" % (kspec,))
+                kname = splt2[0].strip()
+                ktypes = [get_dtype(s.strip()) for s in splt2[1].split(',')]
+                kflags = splt2[2].strip()
+                kcode = def_macros + '\n' + kcode + '\n' + undef_macros
+                res.append(Kernel(kcode, ktypes, kname,
+                                  flags=dict(cluda=True, cflags=kflags)))
+                n += 2
+            self._cached_kernels = res
+            return res
+        else:
+            return GpuKernelBase.gpu_kernels(self, node, name)
+
+
 class HostFromGpu(Op):
     """
     Transfer data to CPU.
...
from __future__ import division, absolute_import, print_function

import numpy

from six.moves import xrange

import theano
from theano import tensor, config, Apply, Op
from theano.gradient import grad_undefined

from .config import mode_with_gpu, test_ctx_name
from ..basic_ops import CGpuKernelBase
from ..type import GpuArrayType, get_context

from pygpu.gpuarray import dtype_to_typecode


# This is an implementation to test that CGpuKernelBase works and also
# to use as an example in the docs.  It is not used for user graphs.
class GpuEye(CGpuKernelBase, Op):
    """
    Eye for GPU.
    """
    __props__ = ('dtype', 'context_name')
    _f16_ok = True

    def __init__(self, dtype=None, context_name=None):
        if dtype is None:
            dtype = config.floatX
        self.dtype = dtype
        self.context_name = context_name
        CGpuKernelBase.__init__(self, ['tstgpueye.c'],
                                'APPLY_SPECIFIC(tstgpueye)')

    def get_params(self, node):
        return get_context(self.context_name)

    def c_headers(self):
        return ['<gpuarray/types.h>', '<gpuarray/kernel.h>']

    def make_node(self, n, m):
        n = tensor.as_tensor_variable(n)
        m = tensor.as_tensor_variable(m)
        assert n.ndim == 0
        assert m.ndim == 0
        otype = GpuArrayType(dtype=self.dtype,
                             broadcastable=(False, False),
                             context_name=self.context_name)
        return Apply(self, [n, m], [otype()])

    def infer_shape(self, node, in_shapes):
        out_shape = [node.inputs[0], node.inputs[1]]
        return [out_shape]

    def grad(self, inp, grads):
        return [grad_undefined(self, i, inp[i])
                for i in xrange(2)]

    def get_op_params(self):
        return [('TYPECODE', str(dtype_to_typecode(self.dtype)))]


def test_cgpukernelbase():
    op = GpuEye(dtype='int32', context_name=test_ctx_name)

    f = theano.function([], op(4, 5), mode=mode_with_gpu)

    r = f()

    assert (numpy.asarray(r) == numpy.eye(4, 5, dtype='int32')).all()
#section kernels

#kernel eye : *, size, size :

/* The eye name will be used to generate supporting objects.  The only
   thing you probably need to care about is the kernel object, which
   will be named 'k_' + <the name above> (k_eye in this case).  This
   name also has to match the kernel function name below.
*/

KERNEL void eye(GLOBAL_MEM DTYPE_o0 *a, ga_size n, ga_size m) {
  ga_size nb = n < m ? n : m;
  for (ga_size i = LID_0; i < nb; i += LDIM_0) {
    a[i*m + i] = 1;
  }
}

#section support_code_struct

int APPLY_SPECIFIC(tstgpueye)(PyArrayObject *n, PyArrayObject *m,
                              PyGpuArrayObject **z, PyGpuContextObject *ctx) {
  size_t dims[2] = {0, 0};
  size_t ls, gs;
  void *args[3];
  int err;

  dims[0] = ((DTYPE_INPUT_0 *)PyArray_DATA(n))[0];
  dims[1] = ((DTYPE_INPUT_1 *)PyArray_DATA(m))[0];

  Py_XDECREF(*z);
  *z = pygpu_zeros(2, dims,
                   TYPECODE,
                   GA_C_ORDER,
                   ctx, Py_None);
  if (*z == NULL)
    return -1;

  args[0] = (*z)->ga.data;
  args[1] = &dims[0];
  args[2] = &dims[1];
  ls = 1;
  gs = 256;
  /* The k_eye name comes from the kernel declaration above. */
  err = GpuKernel_call(&k_eye, 1, &ls, &gs, 0, args);
  if (err != GA_NO_ERROR) {
    PyErr_Format(PyExc_RuntimeError,
                 "gpuarray error: kEye: %s. n=%lu, m=%lu.",
                 GpuKernel_error(&k_eye, err),
                 (unsigned long)dims[0], (unsigned long)dims[1]);
    return -1;
  }
  return 0;
}