Merge pull request #2121 from carriepl/doc_for_c_op

Tutorial for C op

Merge pull request #2121 from carriepl/doc_for_c_op
7d286d04 · Frédéric Bastien · c7e88a00 · 3597e074 · 7d286d04 · 7d286d04
--- a/doc/tutorial/extending_theano_c.txt
+++ b/doc/tutorial/extending_theano_c.txt
+.. _extending_theano_c:
+============================
+Extending Theano with a C Op
+============================
+This tutorial covers how to extend Theano with an op that offers a C
+implementation. It does not cover ops that run on a GPU but it does introduce
+many elements and concepts which are relevant for GPU ops. This tutorial is
+aimed at individuals who already know how to extend Theano (see tutorial
+:ref:`extending_theano`) by adding a new op with a Python implementation
+and will only cover the additional knowledge required to also produce ops
+with C implementations.
+Providing a Theano op with a C implementation requires to interact with
+Python's C-API and Numpy's C-API. Thus, the first step of this tutorial is to
+introduce both and highlight their features which are most relevant to the
+task of implementing a C op. This tutorial then introduces the most important
+methods that the op needs to implement in order to provide a usable C
+implementation. Finally, it shows how to combine these elements to write a
+simple C op for performing the simple task of multiplying every element in a
+vector by a scalar.
+Python C-API
+============
+Python provides a C-API to allows the manipulation of python objects from C
+code. In this API, all variables that represent Python objects are of type
+``PyObject *``. All objects have a pointer to their type object and a reference
+count field (that is shared with the python side). Most python methods have
+an equivalent C function that can be called on the ``PyObject *`` pointer.
+As such, manipulating a PyObject instance is often straight-forward but it
+is important to properly manage its reference count. Failing to do so can
+lead to undesired behavior in the C code.
+Reference counting
+------------------
+Reference counting is a mechanism for keeping track, for an object, of
+the number of references to it held by other entities. This mechanism is often
+used for purposes of garbage collecting because it allows to easily see if
+an object is still being used by other entities. When the reference count
+for an object drops to 0, it means it is not used by anyone any longer and can
+be safely deleted.
+PyObjects implement reference counting and the Python C-API defines a number
+of macros to help manage those reference counts. The definition of these
+macros can be found here : `Python C-API Reference Counting
+<https://docs.python.org/2/c-api/refcounting.html>`_. Listed below are the
+two macros most often used in Theano C ops.
+.. method:: void Py_XINCREF(PyObject *o)
+    Increments the reference count of object ``o``. Without effect if the
+    object is NULL.
+.. method:: void Py_XDECREF(PyObject *o)
+    Decrements the reference count of object ``o``. If the reference count
+    reaches 0, it will trigger a call of the object's deallocation function.
+    Without effect if the object is NULL.
+The general principle, in the reference counting paradigm, is that the owner
+of a reference to an object is responsible for disposing properly of it.
+This can be done by decrementing the reference count once the reference is no
+longer used or by transfering ownership; passing on the reference to a new
+owner which becomes responsible for it.
+Some functions return "borrowed references"; this means that they return a
+reference to an object **without** transfering ownership of the reference to the
+caller of the function. This means that if you call a function which returns a
+borrowed reference, you do not have the burden of properly disposing of that
+reference. You should **not** call Py_XDECREF() on a borrowed reference.
+Correctly managing the reference counts is important as failing to do so can
+lead to issues ranging from memory leaks to segmentation faults.
+NumPy C-API
+===========
+The NumPy library provides a C-API to allow users to create, access and
+manipulate NumPy arrays from within their own C routines. NumPy's ndarrays
+are used extensively inside Theano and so extending Theano with a C op will
+require interaction with the NumPy C-API.
+This sections covers the API's elements that are often required to write code
+for a Theano C op. The full documentation for the API can be found here :
+`NumPy C-API <http://docs.scipy.org/doc/numpy/reference/c-api.html>`_.
+NumPy data types
+----------------
+To allow portability between platforms, the NumPy C-API defines its own data
+types which should be used whenever you are manipulating a NumPy array's
+internal data. The data types most commonly used to implement C ops are the
+following : ``npy_int{8,16,32,64}``, ``npy_uint{8,16,32,64}`` and
+``npy_float{32,64}``.
+You should use these data types when manipulating a NumPy array's internal
+data instead of C primitives because the size of the memory representation
+for C primitives can vary between platforms. For instance, a C ``long`` can be
+represented in memory with 4 bytes but it can also be represented with 8.
+On the other hand, the in-memory size of NumPy data types remains constant
+across platforms. Using them will make your code simpler and more portable.
+The full list of defined data types can be found here :
+`NumPy C-API data types
+<http://docs.scipy.org/doc/numpy/reference/c-api.dtype.html#c-type-names>`_.
+NumPy ndarrays
+--------------
+In the NumPy C-API, NumPy arrays are represented as instances of the
+PyArrayObject class which is a descendant of the PyObject class. This means
+that, as for any other Python object that you manipulate from C code, you
+need to appropriatedly manage the reference counts of PyArrayObject instances.
+Unlike in a standard multidimensionnal C array, a NumPy array's internal data
+representation does not have to occupy a continuous region in memory. In fact,
+it can be C-contiguous, F-contiguous or non-contiguous. C-contiguous means
+that the data is not only contiguous in memory but also that it is organized
+such that the index of the latest dimension changes the fastest. If the
+following array
+.. code-block:: python
+    x = [[1, 2, 3],
+         [4, 5, 6]]
+is C-contiguous, it means that, in memory, the six values contained in the
+array ``x`` are stored in the order ``[1, 2, 3, 4, 5, 6]`` (the first value is
+``x[0,0]``, the second value is ``x[0,1]``, the third value is ``x[0,2]``, the,
+fourth value is ``x[1,0]``, etc). F-contiguous (or Fortran Contiguous) also
+means that the data is contiguous but that it is organized such that the index
+of the latest dimension changes the slowest. If the array ``x`` is
+F-contiguous, it means that, in memory, the values appear in the order
+``[1, 4, 2, 5, 3, 6]`` (the first value is ``x[0,0]``, the second value is
+``x[1,0]``, the third value is ``x[0,1]``, etc).
+Finally, the internal data can be non-contiguous. In this case, it occupies
+a non-contiguous region in memory but it is still stored in an organized
+fashion : the distance between the element ``x[i,j]`` and the element
+``x[i+1,j]`` of the array is constant over all valid values of ``i`` and
+``j``, just as the distance between the element ``x[i,j]`` and the element
+``x[i,j+1]`` of the array is constant over all valid values of ``i`` and ``j``.
+This distance between consecutive elements of an array over a given dimension,
+is called the stride of that dimension.
+Accessing NumPy ndarrays' data and properties
+---------------------------------------------
+The following macros serve to access various attributes of NumPy ndarrays.
+.. method:: void* PyArray_DATA(PyArrayObject* arr)
+    Returns a pointer to the first element of the array's data. The returned
+    pointer must be cast to a pointer of the proper Numpy C-API data type
+    before use.
+.. method:: int PyArray_NDIM(PyArrayObject* arr)
+    Returns the number of dimensions in the the array pointed by ``arr``
+.. method:: npy_intp* PyArray_DIMS(PyArrayObject* arr)
+    Returns a pointer on the first element of ``arr``'s internal array
+    describing its dimensions. This internal array contains as many elements
+    as the array ``arr`` has dimensions.
+    The macro ``PyArray_SHAPE()`` is a synonym of ``PyArray_DIMS()`` : it has
+    the same effect and is used in an identical way.
+.. method:: npy_intp* PyArray_STRIDES(PyArrayObject* arr)
+    Returns a pointer on the first element of ``arr``'s internal array
+    describing the stride for each of its dimension. This array has as many
+    elements as the number of dimensions in ``arr``. In this array, the
+    strides are expressed in number of bytes.
+.. method:: PyArray_Descr* PyArray_DESCR(PyArrayObject* arr)
+    Returns a reference to the object representing the dtype of the array.
+    The macro ``PyArray_DTYPE()`` is a synonym of the ``PyArray_DESCR()`` : it
+    has the same effect and is used in an identical way.
+    :note:
+        This is a borrowed reference so you do not need to decrement its
+        reference count once you are done with it.
+.. method:: int PyArray_TYPE(PyArrayObject* arr)
+    Returns the typenumber for the elements of the array. Like the dtype, the
+    typenumber is a descriptor for the type of the data in the array. However,
+    the two are not synonyms and, as such, cannot be used in place of the
+    other.
+.. method:: npy_intp PyArray_SIZE(PyArrayObject* arr)
+    Returns to total number of elements in the array
+.. method:: bool PyArray_CHKFLAGS(PyArrayObject* arr, flags)
+    Returns true if the array has the specified flags. The variable flag
+    should either be a NumPy array flag or an integer obtained by applying
+    bitwise or to an ensemble of flags.
+    The flags that can be used in with this macro are :
+    NPY_ARRAY_C_CONTIGUOUS, NPY_ARRAY_F_CONTIGUOUS, NPY_ARRAY_OWNDATA,
+    NPY_ARRAY_ALIGNED, NPY_ARRAY_WRITEABLE, NPY_ARRAY_UPDATEIFCOPY.
+Creating NumPy ndarrays
+-----------------------
+The following functions allow the creation and copy of NumPy arrays :
+.. method:: PyObject* PyArray_EMPTY(int nd, npy_intp* dims, typenum dtype,
+                                    int fortran)
+    Constructs a new ndarray with the number of dimensions specified by
+    ``nd``, shape specified by ``dims`` and data type specified by ``dtype``.
+    If ``fortran`` is equal to 0, the data is organized in a C-contiguous
+    layout, otherwise it is organized in a F-contiguous layout. The array
+    elements are not initialized in any way.
+    The function ``PyArray_Empty()`` performs the same function as the macro
+    ``PyArray_EMPTY()`` but the data type is given as a pointer to a
+    ``PyArray_Descr`` object instead of a ``typenum``.
+.. method:: PyObject* PyArray_ZEROS(int nd, npy_intp* dims, typenum dtype,
+                                    int fortran)
+    Constructs a new ndarray with the number of dimensions specified by
+    ``nd``, shape specified by ``dims`` and data type specified by ``dtype``.
+    If ``fortran`` is equal to 0, the data is organized in a C-contiguous
+    layout, otherwise it is organized in a F-contiguous layout. Every element
+    in the array is initialized to 0.
+    The function ``PyArray_Zeros()`` performs the same function as the macro
+    ``PyArray_ZEROS()`` but the data type is given as a pointer to a
+    ``PyArray_Descr`` object instead of a ``typenum``.
+.. method:: PyArrayObject* PyArray_GETCONTIGUOUS(PyObject* op)
+    Returns a C-contiguous and well-behaved copy of the array op. If op is
+    already C-contiguous and well-behaved, this function simply returns a
+    new reference to op.
+Methods the C Op needs to define
+================================
+There is a key difference between an op defining a Python implementation for
+its computation and defining a C implementation. In the case of a Python
+implementation, the op defines a function ``perform()`` which executes the
+required Python code to realize the op. In the case of a C implementation,
+however, the op does **not** define a function that will execute the C code; it
+instead defines functions that will **return** the C code to the caller.
+This is because calling C code from Python code comes with a significant
+overhead. If every op was responsible for executing its own C code, every
+time a Theano function was called, this overhead would occur as many times
+as the number of ops with C implementations in the function's computational
+graph.
+To maximize performance, Theano instead requires the C ops to simply return
+the code needed for their execution and takes upon itself the task of
+organizing, linking and compiling the code from the various ops. Through this,
+Theano is able to minimize the number of times C code is called from Python
+code.
+The following is a very simple example to illustrate how it's possible to
+obtain performance gains with this process. Suppose you need to execute,
+from Python code, 10 different ops, each one having a C implementation. If
+each op was responsible for executing its own C code, the overhead of
+calling C code from Python code would occur 10 times. Consider now the case
+where the ops instead return the C code for their execution. You could get
+the C code from each op and then define your own C module that would call
+the C code from each op in succession. In this case, the overhead would only
+occur once; when calling your custom module itself.
+Moreover, the fact that Theano itself takes care of compiling the C code,
+instead of the individual ops, allows Theano to easily cache the compiled C
+code. This allows for faster compilation times.
+See :ref:`cop` for the full documentation of the various methods of the
+class Op that are related to the C implementation. Of particular interest are:
+*       The methods :meth:`Op.c_libraries` and :meth:`Op.c_lib_dirs` to allow
+        your op to use external libraries.
+*       The method :meth:`Op.c_code_cleanup` to specify how the op should
+        clean up what it has allocated during its execution.
+*       The methods :meth:`Op.c_init_code` and :meth:`Op.c_init_code_apply`
+        to specify code that should be executed once when the module is
+        initialized, before anything else is executed.
+*       The methods :meth:`Op.c_compile_args` and
+        :meth:`Op.c_no_compile_args` to specify requirements regarding how
+        the op's C code should be compiled.
+This section describes the methods :meth:`Op.c_code`,
+:meth:`Op.c_support_code`, :meth:`Op.c_support_code_apply` and
+:meth:`Op.c_code_cache_version` because they are the ones that are most
+commonly used.
+.. method:: c_code(node, name, input_names, output_names, sub)
+    This method returns a string containing the C code to perform the
+    computation required by this op.
+    The ``node`` argument is an :ref:`apply` node representing an
+    application of the current Op on a list of inputs, producing a list of
+    outputs.
+    ``input_names`` is a sequence of strings which contains as many strings
+    as the op has inputs. Each string contains the name of the C variable
+    to which the corresponding input has been assigned. For example, the name
+    of the C variable representing the first input of the op is given by
+    ``input_names[0]``. You should therefore use this name in your
+    C code to interact with that variable. ``output_names`` is used
+    identically to ``input_names``, but for the op's outputs.
+    Finally, ``sub`` is a dictionary of extras parameters to the c_code
+    method. Among other things, it contains ``sub['fail']`` which is a string
+    of C code that you should include in your C code (after ensuring that a
+    Python exception is set) if it needs to raise an exception. Ex:
+    .. code-block:: python
+        c_code = """
+            PyErr_Format(PyExc_ValueError, "X does not have the right value");
+            %(fail)s;
+        """ % {'fail' : sub['fail']}
+    to raise a ValueError Python exception with the specified message.
+    The function ``PyErr_Format()`` supports string formatting so it is
+    possible to tailor the error message to the specifics of the error
+    that occured. If ``PyErr_Format()`` is called with more than two
+    arguments, the subsequent arguments are used to format the error message
+    with the same behavior as the function `PyString_FromFormat()
+    <https://docs.python.org/2/c-api/string.html#c.PyString_FromFormat>`_. The
+    ``%`` characters in the format characters need to be escaped since the C
+    code itself is defined in a string which undergoes string formatting.
+    .. code-block:: python
+        c_code = """
+            PyErr_Format(PyExc_ValueError,
+                         "X==%%i but it should be greater than 0", X);
+            %(fail)s;
+        """ % {'fail' : sub['fail']}
+    :note:
+        Your C code should not return the output of the computation but
+        rather put the results in the C variables whose names are contained in
+        the ``output_names``.
+.. method:: c_support_code()
+    Returns a string containing some support C code for this op. This code
+    will be included at the global scope level and can be used to define
+    functions and structs that will be used by every apply of this op.
+.. method:: c_support_code_apply(node, name)
+    Returns a string containing some support C code for this op. This code
+    will be included at the global scope level and can be used to define
+    functions and structs that will be used by this op. The difference between
+    this method and ``c_support_code()`` is that the C code specified in
+    ``c_support_code_apply()`` should be specific to each apply of the Op,
+    while ``c_support_code()`` is for support code that is not specific to
+    each apply.
+.. method:: c_code_cache_version()
+    Returns a tuple of integers representing the version of the C code in this
+    op. Ex : (1, 4, 0) for version 1.4.0
+    This tuple is used by Theano to cache the compiled C code for this op. As
+    such, the return value **MUST BE CHANGED** every time the C code is altered
+    or else Theano will disregard the change in the code and simply load a
+    previous version of the op from the cache. If you want to avoid caching of
+    the C code of this op, return an empty tuple or do not implement this
+    method.
+    :note:
+        Theano can handle tuples of any hashable objects as return values
+        for this function but, for greater readability and easier management,
+        this function should return a tuple of integers as previously
+        described.
+Simple C Op example
+===================
+In this section, we put together the concepts that were covered in this
+tutorial to generate an op which multiplies every element in a vector
+by a scalar and returns the resulting vector. This is intended to be a simple
+example so the methods ``c_support_code()`` and ``c_support_code_apply()`` are
+not used because they are not required.
+In the C code below notice how the reference count on the output variable is
+managed. Also take note of how the new variables required for the op's
+computation are declared in a new scope to avoid cross-initialization errors.
+Also, in the C code, it is very important to properly validate the inputs
+and outputs storage. Theano guarantees that the inputs exist and have the
+right number of dimensions but it does not guarantee their exact shape. For
+instance, if an op computes the sum of two vectors, it needs to validate that
+its two inputs have the same shape. In our case, we do not need to validate
+the exact shapes of the inputs because we don't have a need that they match
+in any way.
+For the outputs, things are a little bit more subtle. Theano does not
+guarantee that they have been allocated but it does guarantee that, if they
+have been allocated, they have the right number of dimension. Again, Theano
+offers no guarantee on the exact shapes. This means that, in our example, we
+need to validate that the output storage has been allocated and has the same
+shape as our vector input. If it is not the case, we allocate a new output
+storage with the right shape and number of dimensions.
+.. code-block:: python
+    import numpy
+    import theano
+    from theano import gof
+    import theano.tensor as T
+    class VectorTimesScalar(gof.Op):
+        __props__ = ()
+        def make_node(self, x, y):
+            # Validate the inputs' type
+            if x.type.ndim != 1:
+                raise TypeError('x must be a 1-d vector')
+            if y.type.ndim != 0:
+                raise TypeError('y must be a scalar')
+            # Create an output variable of the same type as x
+            output_var = x.type()
+            return gof.Apply(self, [x, y], [output_var])
+        def c_code_cache_version(self):
+            return (1, 0)
+        def c_code(self, node, name, inp, out, sub):
+            x, y = inp
+            z, = out
+            # Extract the dtypes of the inputs and outputs storage to
+            # be able to declare pointers for those dtypes in the C
+            # code.
+            dtype_x = node.inputs[0].dtype
+            dtype_y = node.inputs[1].dtype
+            dtype_z = node.outputs[0].dtype
+            itemsize_x = numpy.dtype(dtype_x).itemsize
+            itemsize_z = numpy.dtype(dtype_z).itemsize
+            fail = sub['fail']
+            c_code = """
+            // Validate that the output storage exists and has the same
+            // dimension as x.
+            if (NULL == %(z)s ||
+                PyArray_DIMS(%(x)s)[0] != PyArray_DIMS(%(z)s)[0])
+            {
+                /* Reference received to invalid output variable.
+                Decrease received reference's ref count and allocate new
+                output variable */
+                Py_XDECREF(%(z)s);
+                %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
+                                                    PyArray_DIMS(%(x)s),
+                                                    PyArray_TYPE(%(x)s),
+                                                    0);
+                if (!%(z)s) {
+                    %(fail)s;
+                }
+            }
+            // Perform the vector multiplication by a scalar
+            {
+                /* The declaration of the following variables is done in a new
+                scope to prevent cross initialization errors */
+                npy_%(dtype_x)s* x_data_ptr =
+                                (npy_%(dtype_x)s*)PyArray_DATA(%(x)s);
+                npy_%(dtype_z)s* z_data_ptr =
+                                (npy_%(dtype_z)s*)PyArray_DATA(%(z)s);
+                npy_%(dtype_y)s y_value =
+                                ((npy_%(dtype_y)s*)PyArray_DATA(%(y)s))[0];
+                int x_stride = PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s;
+                int z_stride = PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s;
+                int x_dim = PyArray_DIMS(%(x)s)[0];
+                for(int i=0; i < x_dim; i++)
+                {
+                    z_data_ptr[i * z_stride] = (x_data_ptr[i * x_stride] *
+                                                y_value);
+                }
+            }
+            """
+            return c_code % locals()
+More complex C Op example
+=========================
+This section introduces a new example, slightly more complex than the previous
+one, with an op to perform an element-wise multiplication between the elements
+of two vectors. This new example differs from the previous one in its use
+of the methods ``c_support_code()`` and ``c_support_code_apply()`` (it does
+not `need` to use them but it does so to explain their use) and its capacity
+to support inputs of different dtypes.
+Recall the method ``c_support_code()`` is meant to produce code that will
+be used for every apply of the op. This means that the C code in this
+method must be valid in every setting your op supports. If the op is meant
+to supports inputs of various dtypes, the C code in this method should be
+generic enough to work with every supported dtype. If the op operates on
+inputs that can be vectors or matrices, the C code in this method should
+be able to accomodate both kinds of inputs.
+In our example, the method ``c_support_code()`` is used to declare a C
+function to validate that two vectors have the same shape. Because our
+op only supports vectors as inputs, this function is allowed to rely
+on its inputs being vectors. However, our op should support multiple
+dtypes so this function cannot rely on a specific dtype in its inputs.
+The method ``c_support_code_apply()``, on the other hand, is allowed
+to depend on the inputs to the op because it is apply-specific. Therefore, we
+use it to define a function to perform the multiplication between two vectors.
+Variables or functions defined in the method ``c_support_code_apply()`` will
+be included at the global scale for every apply of the Op. Because of this,
+the names of those variables and functions should include the name of the op,
+like in the example. Otherwise, using the op twice in the same graph will give
+rise to conflicts as some elements will be declared more than once.
+The last interesting difference occurs in the ``c_code()`` method. Because the
+dtype of the output is variable and not guaranteed to be the same as any of
+the inputs (because of the upcast in the method ``make_node()``), the typenum
+of the output has to be obtained in the Python code and then included in the
+C code.
+.. code-block:: python
+    class VectorTimesVector(gof.Op):
+        __props__ = ()
+        def make_node(self, x, y):
+            # Validate the inputs' type
+            if x.type.ndim != 1:
+                raise TypeError('x must be a 1-d vector')
+            if y.type.ndim != 1:
+                raise TypeError('y must be a 1-d vector')
+            # Create an output variable of the same type as x
+            output_var = theano.tensor.TensorType(
+                            dtype=theano.scalar.upcast(x.dtype, y.dtype),
+                            broadcastable=[False])()
+            return gof.Apply(self, [x, y], [output_var])
+        def c_code_cache_version(self):
+            return (1, 0, 2)
+        def c_support_code(self):
+            c_support_code = """
+            bool vector_same_shape(PyArrayObject* arr1,
+                PyArrayObject* arr2)
+            {
+                return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
+            }
+            """
+            return c_support_code
+        def c_support_code_apply(self, node, name):
+            dtype_x = node.inputs[0].dtype
+            dtype_y = node.inputs[1].dtype
+            dtype_z = node.outputs[0].dtype
+            c_support_code = """
+            void vector_elemwise_mult_%(name)s(npy_%(dtype_x)s* x_ptr,
+                int x_str, npy_%(dtype_y)s* y_ptr, int y_str,
+                npy_%(dtype_z)s* z_ptr, int z_str, int nbElements)
+            {
+                for (int i=0; i < nbElements; i++){
+                    z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
+                }
+            }
+            """
+            return c_support_code % locals()
+        def c_code(self, node, name, inp, out, sub):
+            x, y = inp
+            z, = out
+            dtype_x = node.inputs[0].dtype
+            dtype_y = node.inputs[1].dtype
+            dtype_z = node.outputs[0].dtype
+            itemsize_x = numpy.dtype(dtype_x).itemsize
+            itemsize_y = numpy.dtype(dtype_y).itemsize
+            itemsize_z = numpy.dtype(dtype_z).itemsize
+            typenum_z = numpy.dtype(dtype_z).num
+            fail = sub['fail']
+            c_code = """
+            // Validate that the inputs have the same shape
+            if ( !vector_same_shape(%(x)s, %(y)s))
+            {
+                PyErr_Format(PyExc_ValueError, "Shape mismatch : "
+                            "x.shape[0] and y.shape[0] should match but "
+                            "x.shape[0] == %%i and y.shape[0] == %%i",
+                            PyArray_DIMS(%(x)s)[0], PyArray_DIMS(%(y)s)[0]);
+                %(fail)s;
+            }
+            // Validate that the output storage exists and has the same
+            // dimension as x.
+            if (NULL == %(z)s || !(vector_same_shape(%(x)s, %(z)s)))
+            {
+                /* Reference received to invalid output variable.
+                Decrease received reference's ref count and allocate new
+                output variable */
+                Py_XDECREF(%(z)s);
+                %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
+                                                    PyArray_DIMS(%(x)s),
+                                                    %(typenum_z)s,
+                                                    0);
+                if (!%(z)s) {
+                    %(fail)s;
+                }
+            }
+            // Perform the vector elemwise multiplication
+            vector_elemwise_mult_%(name)s(
+                                    (npy_%(dtype_x)s*)PyArray_DATA(%(x)s),
+                                    PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s,
+                                    (npy_%(dtype_y)s*)PyArray_DATA(%(y)s),
+                                    PyArray_STRIDES(%(y)s)[0] / %(itemsize_y)s,
+                                    (npy_%(dtype_z)s*)PyArray_DATA(%(z)s),
+                                    PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s,
+                                    PyArray_DIMS(%(x)s)[0]);
+            """
+            return c_code % locals()
--- a/doc/tutorial/index.txt
+++ b/doc/tutorial/index.txt
@@ -43,5 +43,6 @@ you out.
    debug_faq
    profiling
    extending_theano
+    extending_theano_c
    python-memory-management
    multi_cores