First version of the C op tutorial.

c7dbca05 · Pierre Luc Carrier · 025d484e · c7dbca05
--- a/doc/tutorial/extending_theano_c.txt
+++ b/doc/tutorial/extending_theano_c.txt
+.. _extending_theano_c:
+============================
+Extending Theano with a C Op
+============================
+This tutorial covers how to extend Theano with an op that offers a C
+implementation. It does not cover ops that run on a GPU but it does introduce
+many elements and concepts which are relevant for GPU ops. This tutorial is
+aimed at individuals who already know how to extend Theano (see tutorial
+:ref:`extending_theano`) by adding a new op with a Python implementation
+and will only cover the additional knowledge required to also produce ops
+with C implementations.
+Providing a Theano op with a C implementation requires interacting with
+Python's C-API and NumPy's C-API. Thus, the first step of this tutorial is to
+introduce both and highlight the features that are most relevant to the
+task of implementing a C op. This tutorial then introduces the most important
+methods that the op needs to implement in order to provide a usable C
+implementation. Finally, it shows how to combine these elements to write a
+simple C op that multiplies every element in a vector by a scalar.
+Python C-API
+============
+Python provides a C-API to allow the manipulation of Python objects from
+C code. In this API, every Python object is represented by a C structure
+that starts with the fields of the PyObject structure: a reference count and
+a pointer to the object's type. C code can therefore handle any Python
+object through a pointer of type PyObject*.
+As such, manipulating a PyObject instance is often straightforward but it
+is important to properly manage its reference count. Failing to do so can
+lead to undesired behavior in the C code.
+Reference counting
+------------------
+Reference counting is a mechanism for keeping track, for an object, of
+the number of references to it held by other entities. This mechanism is often
+used for garbage collection because it makes it easy to see whether
+an object is still being used by other entities. When the reference count
+for an object drops to 0, it means it is no longer used by anyone and can
+be safely deleted.
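+The counting described above can be observed from Python itself. The sketch
+below is only an illustration and assumes the CPython interpreter, where
+``sys.getrefcount()`` reports the reference count of an object. The names
+``payload``, ``alias`` and ``holder`` are made up for the example. Only
+differences between counts are compared, because the absolute value also
+includes temporary references held by the interpreter.

```python
import sys

# A fresh object, so that no other part of the program refers to it.
payload = object()
base = sys.getrefcount(payload)

# Binding a second name to the object adds one reference.
alias = payload
after_alias = sys.getrefcount(payload)

# Storing the object in a container adds another one.
holder = [payload]
after_holder = sys.getrefcount(payload)

# Dropping both extra references brings the count back to its base value.
del alias
holder.clear()
after_release = sys.getrefcount(payload)

print(after_alias - base, after_holder - base, after_release - base)
```

+In C code, the same bookkeeping has to be done by hand with the macros
+presented in this section.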
+PyObjects implement reference counting and the Python C-API defines a number
+of macros to help manage those reference counts. The definition of these
+macros can be found here: `Python C-API Reference Counting
+<https://docs.python.org/2/c-api/refcounting.html>`_. Listed below are the
+two macros most often used in Theano C ops.
+.. method:: void Py_XINCREF(PyObject *o)
+    Increments the reference count of object o. Has no effect if the object
+    is NULL.
+.. method:: void Py_XDECREF(PyObject *o)
+    Decrements the reference count of object o. If the reference count reaches
+    0, the object's deallocation function is called. Has no effect if the
+    object is NULL.
+The general principle, in the reference counting paradigm, is that the owner
+of a reference to an object is responsible for disposing of it properly.
+This can be done by decrementing the reference count once the reference is no
+longer used or by transferring ownership, that is, passing the reference on
+to a new owner which becomes responsible for it.
+Some functions return "borrowed references"; this means that they return a
+reference to an object **without** transferring ownership of the reference to the
+caller of the function. This means that if you call a function which returns a
+borrowed reference, you do not have the burden of properly disposing of that
+reference. You should **not** call Py_XDECREF() on a borrowed reference.
+Correctly managing the reference counts is important as failing to do so can
+lead to issues ranging from memory leaks to segmentation faults.
+NumPy C-API
+===========
+The NumPy library provides a C-API to allow users to create, access and
+manipulate NumPy arrays from within their own C routines. NumPy's ndarrays
+are used extensively inside Theano and so extending Theano with a C op will
+require interaction with the NumPy C-API.
+This section covers the API's elements that are often required to write code
+for a Theano C op. The full documentation for the API can be found here:
+`NumPy C-API <http://docs.scipy.org/doc/numpy/reference/c-api.html>`_.
+NumPy ndarrays
+--------------
+In the NumPy C-API, NumPy arrays are represented as instances of the
+PyArrayObject class which is a descendant of the PyObject class. This means
+that, as for any other Python object that you manipulate from C code, you
+need to appropriately manage the reference counts of PyArrayObject instances.
+Unlike in a standard multidimensional C array, a NumPy array's internal data
+representation does not have to occupy a contiguous region in memory. In fact,
+it can be C-contiguous, F-contiguous or non-contiguous. C-contiguous means
+that the data is not only contiguous in memory but also that it is organized
+such that the index of the last dimension changes the fastest. If the
+following array x
+.. code-block:: python
+    x = [[1, 2, 3],
+         [4, 5, 6]]
+is C-contiguous, it means that, in memory, the six values contained in the
+array x are stored in the order [1, 2, 3, 4, 5, 6] (the first value is x[0,0],
+the second value is x[0,1], the third value is x[0,2], the fourth value is
+x[1,0], etc). F-contiguous (or Fortran-contiguous) also means that the data is
+contiguous but that it is organized such that the index of the last
+dimension changes the slowest. If the array x is F-contiguous, it means that,
+in memory, the values appear in the order [1, 4, 2, 5, 3, 6] (the first
+value is x[0,0], the second value is x[1,0], the third value is x[0,1], etc).
+Finally, the internal data can be non-contiguous. In this case, it occupies
+a non-contiguous region in memory but it is still stored in an organized
+fashion: the distance between the element x[i,j] and the element x[i+1,j]
+of the array is constant over all valid values of i and j, just as the
+distance between the element x[i,j] and the element x[i,j+1] of the array
+is constant over all valid values of i and j. This distance between
+consecutive elements of an array over a given dimension is called the
+stride of that dimension.
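+The two contiguous layouts can be reproduced with a few lines of plain
+Python. The sketch below does not use NumPy; ``byte_strides`` and
+``memory_layout`` are hypothetical helpers written for this illustration, and
+an item size of 8 bytes is assumed. It computes the strides of the array x
+above for both layouts and lists its values in the order they would occupy
+in memory.

```python
from itertools import product


def byte_strides(shape, itemsize, order="C"):
    """Return the stride, in bytes, of each dimension of a contiguous array."""
    strides = [0] * len(shape)
    step = itemsize
    # In C order the last dimension varies fastest, in F order the first one.
    dims = range(len(shape) - 1, -1, -1) if order == "C" else range(len(shape))
    for d in dims:
        strides[d] = step
        step *= shape[d]
    return tuple(strides)


def memory_layout(x, shape, itemsize, order):
    """List the values of the 2-d nested list x in their memory order."""
    strides = byte_strides(shape, itemsize, order)
    buffer = [None] * (shape[0] * shape[1])
    for i, j in product(range(shape[0]), range(shape[1])):
        offset = i * strides[0] + j * strides[1]  # byte offset of x[i, j]
        buffer[offset // itemsize] = x[i][j]
    return buffer


x = [[1, 2, 3],
     [4, 5, 6]]
c_strides = byte_strides((2, 3), 8, "C")
f_strides = byte_strides((2, 3), 8, "F")
c_layout = memory_layout(x, (2, 3), 8, "C")
f_layout = memory_layout(x, (2, 3), 8, "F")
print(c_strides, c_layout)
print(f_strides, f_layout)
```

+The byte offset of x[i,j] is always ``i * strides[0] + j * strides[1]``; only
+the stride values differ between the two layouts.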
+Accessing NumPy ndarrays' data and properties
+---------------------------------------------
+The following macros serve to access various attributes of NumPy ndarrays.
+.. method:: void* PyArray_DATA(PyArrayObject* arr)
+    Returns a pointer to the first element of the array's data.
+.. method:: int PyArray_NDIM(PyArrayObject* arr)
+    Returns the number of dimensions in the array pointed to by arr.
+.. method:: npy_intp* PyArray_DIMS(PyArrayObject* arr)
+    Returns a pointer to the first element of arr's internal array describing
+    its dimensions. This internal array contains as many elements as the
+    array arr has dimensions.
+    The macro PyArray_SHAPE() is a synonym of PyArray_DIMS(): it has the same
+    effect and is used in an identical way.
+.. method:: npy_intp* PyArray_STRIDES(PyArrayObject* arr)
+    Returns a pointer to the first element of arr's internal array describing
+    the stride of each of its dimensions. This array has as many elements as
+    the number of dimensions in arr. In this array, the strides are expressed
+    in number of bytes.
+.. method:: PyArray_Descr* PyArray_DESCR(PyArrayObject* arr)
+    Returns a reference to the object representing the dtype of the array.
+    The macro PyArray_DTYPE() is a synonym of PyArray_DESCR(): it has the
+    same effect and is used in an identical way.
+    :note:
+        This is a borrowed reference so you do not need to decrement its
+        reference count once you are done with it.
+.. method:: int PyArray_TYPE(PyArrayObject* arr)
+    Returns the typenumber for the elements of the array. Like the dtype, the
+    typenumber is a descriptor for the type of the data in the array. However,
+    the two are not synonyms and, as such, cannot be used in place of the
+    other.
+.. method:: npy_intp PyArray_SIZE(PyArrayObject* arr)
+    Returns the total number of elements in the array.
+.. method:: bool PyArray_CHKFLAGS(PyArrayObject* arr, flags)
+    Returns true if the array has the specified flags. The flags argument
+    should either be a single NumPy array flag or an integer obtained by
+    combining several flags with a bitwise or.
+    The flags that can be used with this macro are:
+    NPY_ARRAY_C_CONTIGUOUS, NPY_ARRAY_F_CONTIGUOUS, NPY_ARRAY_OWNDATA,
+    NPY_ARRAY_ALIGNED, NPY_ARRAY_WRITEABLE, NPY_ARRAY_UPDATEIFCOPY.
+Creating NumPy ndarrays
+-----------------------
+The following functions allow the creation and copy of NumPy arrays:
+.. method:: PyObject* PyArray_Empty(int nd, npy_intp* dims, PyArray_Descr*
+                                    dtype, int fortran)
+    Constructs a new ndarray with the number of dimensions specified by nd,
+    shape specified by dims and data type specified by dtype. If fortran is
+    equal to 0, the data is organized in a C-contiguous layout, otherwise it
+    is organized in an F-contiguous layout. The array elements are not
+    initialized in any way.
+    The macro PyArray_EMPTY() performs the same function as the function
+    PyArray_Empty() but the data type is given as a typenum instead of a
+    pointer to a PyArray_Descr object.
+.. method:: PyObject* PyArray_Zeros(int nd, npy_intp* dims, PyArray_Descr*
+                                    dtype, int fortran)
+    Constructs a new ndarray with the number of dimensions specified by nd,
+    shape specified by dims and data type specified by dtype. If fortran is
+    equal to 0, the data is organized in a C-contiguous layout, otherwise it
+    is organized in an F-contiguous layout. Every element in the array is
+    initialized to 0.
+    The macro PyArray_ZEROS() performs the same function as the function
+    PyArray_Zeros() but the data type is given as a typenum instead of a
+    pointer to a PyArray_Descr object.  
+.. method:: PyArrayObject* PyArray_GETCONTIGUOUS(PyObject* op)
+    Returns a C-contiguous and well-behaved copy of the array op. If op is
+    already C-contiguous and well-behaved, this function simply returns a
+    new reference to op.
+Functions the C Op needs to define
+==================================
+There is a key difference between an op defining a Python implementation for
+its computation and defining a C implementation. In the case of a Python
+implementation, the op defines a function perform() which executes the
+required Python code to realize the op. In the case of a C implementation,
+however, the op does **not** define a function that will execute the C code; it
+instead defines functions that will **return** the C code to the caller.
+This is because calling C code from Python code comes with a significant
+overhead. If every op were responsible for executing its own C code, this
+overhead would occur, every time a Theano function was called, as many times
+as the number of ops with C implementations in the function's computational
+graph.
+To maximize performance, Theano instead requires the C ops to simply return
+the code needed for their execution and takes upon itself the task of
+organizing, linking and compiling the code from the various ops. Through this,
+Theano is able to minimize the number of times C code is called from Python
+code by maximizing the amount of computation that is done every time C code
+is called from Python.
+The following is a very crude example to illustrate how it's possible to
+obtain performance gains with this process. Suppose you need to execute,
+from Python code, 10 different ops, each one having a C implementation. If
+each op was responsible for executing its own C code, the overhead of
+calling C code from Python code would occur 10 times. Consider now the case
+where the ops instead return the C code for their execution. You could get
+the C code from each op and then define your own C module that would call
+the C code from each op in succession. In this case, the overhead would only
+occur once, when calling your custom module itself.
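+The idea of ops that return code instead of running it can be mimicked in
+pure Python. In the sketch below, Python source strings stand in for C code
+and ``exec()`` stands in for the compiler. ``make_scale_op`` and
+``build_function`` are hypothetical helpers invented for this illustration;
+this is not how Theano assembles and links C code. Ten toy ops each
+contribute one line, and the assembled function is entered only once.

```python
def make_scale_op(factor):
    """Hypothetical op: it returns source code instead of running anything."""
    def code(var):
        return "%s = %s * %d" % (var, var, factor)
    return code


def build_function(ops):
    """Assemble the snippets of all the ops into one function, compiled once."""
    lines = ["def fused(value):"]
    for op in ops:
        lines.append("    " + op("value"))
    lines.append("    return value")
    namespace = {}
    exec(compile("\n".join(lines), "<fused>", "exec"), namespace)
    return namespace["fused"]


# Ten ops that each double their input.
ops = [make_scale_op(2) for _ in range(10)]
fused = build_function(ops)

# A single call runs the code of all ten ops.
result = fused(1)
print(result)
```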
+Moreover, the fact that Theano itself takes care of compiling the C code,
+instead of the individual ops, allows Theano to easily cache the compiled C
+code. Code that was already compiled can then be reused instead of being
+compiled again, which reduces compilation times.
+See :ref:`cop` for the full documentation of the various methods of the
+class Op that are related to the C implementation. Of particular interest are:
+*       The functions c_libraries() and c_lib_dirs() to allow your op to use 
+        external libraries.
+*       The function c_code_cleanup() to specify how the op should clean up
+        what it has allocated during its execution.
+*       The functions c_init_code() and c_init_code_apply() to specify code
+        that should be executed once when the module is initialized, before
+        anything else is executed.
+*       The functions c_compile_args() and c_no_compile_args() to specify
+        requirements regarding how the op's C code should be compiled.
+This section describes the functions c_code(), c_support_code() and
+c_code_cache_version() because they are the ones that are most commonly
+used.
+.. method:: c_code(node, name, input_names, output_names, sub)
+    This method returns a string containing the C code to perform the
+    computation required by this op.
+    The ``node`` argument is an :ref:`apply` node representing an
+    application of the current Op on a list of inputs, producing a list of
+    outputs.
+    ``input_names`` is a sequence of strings which contains as many strings
+    as the op has inputs. Each string contains the name of the C variable
+    to which the corresponding input has been assigned. For example, the name
+    of the C variable representing the first input of the op is given by
+    ``input_names[0]``. You should therefore use this name in your
+    C code to interact with that variable. ``output_names`` is used
+    identically to ``input_names``, but for the op's outputs.
+    Finally, ``sub`` is a dictionary of extra parameters to the c_code
+    method. Among other things, it contains ``sub['fail']`` which is a string
+    of C code that you should execute (after ensuring that a Python exception
+    is set) if your C code needs to raise an exception.
+    :note:
+        Your C code should not return the output of the computation but
+        rather put the results in the C variables whose names are contained in
+        ``output_names``.
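+As an illustration of how these arguments are typically used, the sketch
+below fills a small C template with Python string formatting. It is a
+simplified stand-alone function, not an actual op method, and the values
+``'V3'``, ``'V1'`` and ``'goto hypothetical_fail_label'`` are made up; the
+real names and failure code are chosen by Theano.

```python
def render_c_code(input_names, output_names, sub):
    """Fill a C template with the variable names received by c_code()."""
    x, = input_names
    z, = output_names
    fail = sub['fail']
    template = """
    if (PyArray_NDIM(%(x)s) != 1)
    {
        PyErr_SetString(PyExc_ValueError, "x is not a 1d tensor");
        %(fail)s;
    }
    /* ... compute the result and store it in %(z)s ... */
    """
    return template % {'x': x, 'z': z, 'fail': fail}


# Made-up names standing in for what Theano would pass to c_code().
rendered = render_c_code(['V3'], ['V1'],
                         {'fail': 'goto hypothetical_fail_label'})
print(rendered)
```

+The returned string never mentions x or z directly: every occurrence is
+replaced by the name that was received for that variable.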
+.. method:: c_support_code()
+    Returns a string containing the support C code for this op. This code
+    will be included at the global scope level and can be used to define
+    functions and structs that will be used by the op's main C code.
+.. method:: c_code_cache_version()
+    Returns a tuple of integers representing the version of the C code in this
+    op. For example, (1, 4, 0) for version 1.4.0.
+    This tuple is used by Theano to cache the compiled C code for this op. As
+    such, the return value **MUST be CHANGED** every time the C code is altered or
+    else Theano will disregard the change in the code and simply load a
+    previous version of the op from the cache. If you want to avoid caching of
+    the C code of this op, return an empty tuple or do not implement this
+    method.
+    :note:
+        Theano can handle tuples of any hashable objects as return values
+        for this function but, for greater readability and easier management,
+        this function should return a tuple of integers as previously
+        described.
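+The reason why the version must change along with the code can be seen in a
+toy model of a version-keyed cache. ``ToyCodeCache`` is a hypothetical class
+written for this illustration and is much simpler than Theano's actual cache;
+the string ``"compiled<...>"`` merely stands in for a compiled module.

```python
class ToyCodeCache(object):
    """Toy cache keyed on the op name and its code version."""

    def __init__(self):
        self.entries = {}
        self.compile_count = 0

    def _compile(self, c_code):
        self.compile_count += 1
        return "compiled<%s>" % c_code

    def get(self, op_name, version, c_code):
        if not version:
            # An empty version tuple disables caching.
            return self._compile(c_code)
        key = (op_name, version)
        if key not in self.entries:
            self.entries[key] = self._compile(c_code)
        return self.entries[key]


cache = ToyCodeCache()
first = cache.get("VectorTimesScalar", (1, 0), "z = x * y;")
# The code changed but the version did not: the stale entry is returned.
stale = cache.get("VectorTimesScalar", (1, 0), "z = y * x;")
# Changing the version makes the new code visible.
fresh = cache.get("VectorTimesScalar", (1, 1), "z = y * x;")
print(first, stale, fresh, cache.compile_count)
```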
+Complete C Op example
+=====================
+In this section, we put together every concept that was covered in this
+tutorial to generate an op which multiplies every element in a vector
+by a scalar. 
+Notice how the reference count on the output variable is
+managed. Also take note of how the new variables required for the op's
+computation are declared in a new scope to avoid cross-initialization errors.
+:note:
+    Given the simple nature of this op, there was no need to use the 
+    c_support_code() function.
+.. code-block:: python
+    import numpy
+    import theano
+    from theano import gof
+    import theano.tensor as T
+    class VectorTimesScalar(gof.Op):
+        __props__ = ()
+        def __init__(self, **kwargs):
+            gof.Op.__init__(self, **kwargs)
+        def make_node(self, x, y):
+            # Convert the inputs to Theano tensor variables if needed
+            x = T.as_tensor_variable(x)
+            y = T.as_tensor_variable(y)
+            # Validate the inputs' type
+            if x.type.ndim != 1:
+                raise TypeError('x must be a 1-d vector')
+            if y.type.ndim != 0:
+                raise TypeError('y must be a scalar')
+            # Create an output variable of the same type as x
+            output_var = x.type.make_variable()
+            return gof.Apply(self, [x, y], [output_var])
+        def __str__(self):
+            return self.__class__.__name__
+        def c_code_cache_version(self):
+            return (1, 0)
+        def c_code(self, node, name, inp, out, sub):
+            x, y = inp
+            z, = out
+            dtype_x = node.inputs[0].dtype
+            dtype_y = node.inputs[1].dtype
+            dtype_z = node.outputs[0].dtype
+            itemsize_x = numpy.dtype(dtype_x).itemsize
+            itemsize_z = numpy.dtype(dtype_z).itemsize
+            fail = sub['fail']
+            c_code = """
+            // Validate the inputs
+            if (PyArray_NDIM(%(x)s) != 1)
+            {
+                PyErr_SetString(PyExc_ValueError, "x is not a 1d tensor");
+                %(fail)s;
+            }
+            if (PyArray_NDIM(%(y)s) != 0)
+            {
+                PyErr_SetString(PyExc_ValueError, "y is not a scalar");
+                %(fail)s;
+            }
+            // Validate that the output storage exists and has the same
+            // dimension as x.
+            if ((NULL == %(z)s) || PyArray_NDIM(%(z)s) != 1 ||
+                (PyArray_DIMS(%(x)s)[0] != PyArray_DIMS(%(z)s)[0]))
+            {
+                /* Reference received to invalid output variable.
+                Decrease received reference's ref count and allocate new
+                output variable */
+                Py_XDECREF(%(z)s);
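+                /* Use PyArray_EMPTY with a type number: PyArray_Empty would
+                take ownership of a reference to the dtype, and PyArray_DESCR
+                only returns a borrowed one */
+                %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
+                                                    PyArray_DIMS(%(x)s),
+                                                    PyArray_TYPE(%(x)s),
+                                                    0);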
+                if (!%(z)s) {
+                    %(fail)s;
+                }
+            }
+            // Perform the vector multiplication by a scalar
+            {
+                /* The declaration of the following variables is done in a new
+                scope to prevent cross initialization errors */
+                npy_%(dtype_x)s* x_data_ptr = 
+                                (npy_%(dtype_x)s*)PyArray_DATA(%(x)s);
+                npy_%(dtype_z)s* z_data_ptr = 
+                                (npy_%(dtype_z)s*)PyArray_DATA(%(z)s);
+                npy_%(dtype_y)s y_value = 
+                                ((npy_%(dtype_y)s*)PyArray_DATA(%(y)s))[0];
+                int x_stride = PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s;
+                int z_stride = PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s;
+                int x_dim = PyArray_DIMS(%(x)s)[0];
+                for(int i=0; i < x_dim; i++)
+                {
+                    z_data_ptr[i * z_stride] = (x_data_ptr[i * x_stride] * 
+                                                y_value);
+                }
+            }
+            """
+            return c_code % locals()
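+To make the stride arithmetic of the C loop easier to follow, here is a
+pure-Python model of it. It is only a sketch: ``vector_times_scalar`` is a
+hypothetical helper, the ``array`` module stands in for NumPy's data buffer,
+and the output is always freshly allocated, so its stride is one element.

```python
from array import array


def vector_times_scalar(x_buffer, x_stride_bytes, x_dim, y_value):
    """Pure-Python model of the loop in the C code above."""
    itemsize = x_buffer.itemsize
    # Strides are expressed in bytes; dividing by the item size gives a
    # stride expressed in elements, usable as an index step.
    x_stride = x_stride_bytes // itemsize
    z_buffer = array(x_buffer.typecode, [0] * x_dim)  # new output storage
    for i in range(x_dim):
        z_buffer[i] = x_buffer[i * x_stride] * y_value
    return z_buffer


# Six doubles in memory; the input vector is a view on every other element.
storage = array('d', [1.0, -1.0, 2.0, -1.0, 3.0, -1.0])
result = vector_times_scalar(storage, 2 * storage.itemsize, 3, 10.0)
print(list(result))
```

+The values equal to -1.0 are skipped because the input stride spans two
+elements, which is what happens when an op receives a non-contiguous view.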
\ No newline at end of file