Commit 00183e72 authored by Olivier Delalleau

Merge pull request #905 from nouiz/add_exerc_docu_rebase

Documentation improvements
......@@ -19,7 +19,7 @@ I wrote a new optimization, but it's not getting used...
Remember that you have to register optimizations with the :ref:`optdb`
for them to get used by the normal modes like FAST_COMPILE, FAST_RUN,
-and DEBUG_MODE.
+and DebugMode.
I wrote a new optimization, and it changed my results even though I'm pretty sure it is correct.
......
......@@ -168,7 +168,7 @@ not modify any of the inputs.
TODO: EXPLAIN DESTROYMAP and VIEWMAP BETTER AND GIVE EXAMPLE.
When developing an Op, you should run computations in DebugMode, by using
-argument ``mode='DEBUG_MODE'`` to ``theano.function``. DebugMode is
+argument ``mode='DebugMode'`` to ``theano.function``. DebugMode is
slow, but it can catch many common violations of the Op contract.
TODO: Like what? How? Talk about Python vs. C too.
......
......@@ -6,15 +6,15 @@ Extending Theano
================
-This documentation is for users who want to extend Theano with new Types, new
+This advanced tutorial is for users who want to extend Theano with new Types, new
Operations (Ops), and new graph optimizations.
Along the way, it also introduces many aspects of how Theano works, so it is
also good for you if you are interested in getting more under the hood with
Theano itself.
-Before tackling this tutorial, it is highly recommended to read the
-:ref:`tutorial`.
+Before tackling this more advanced presentation, it is highly recommended to read the
+introductory :ref:`Tutorial<tutorial>`.
The first few pages will walk you through the definition of a new :ref:`type`,
``double``, and a basic arithmetic :ref:`operations <op>` on that Type. We
......
......@@ -289,7 +289,7 @@ Example:
f = T.function([a,b],[c],mode='FAST_RUN')
m = theano.Module()
-minstance = m.make(mode='DEBUG_MODE')
+minstance = m.make(mode='DebugMode')
Whenever possible, unit tests should omit this parameter. Leaving
out the mode will ensure that unit tests use the default mode.
......@@ -306,7 +306,7 @@ type this:
THEANO_FLAGS='mode=FAST_COMPILE' nosetests
THEANO_FLAGS='mode=FAST_RUN' nosetests
-THEANO_FLAGS='mode=DEBUG_MODE' nosetests
+THEANO_FLAGS='mode=DebugMode' nosetests
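``THEANO_FLAGS`` is a comma-separated list of ``key=value`` assignments. A minimal sketch of how such a string can be split up (``parse_flags`` is a hypothetical helper for illustration, not Theano's actual parser):

```python
def parse_flags(flags):
    """Split a 'key=value,key=value' string into a dict."""
    result = {}
    for item in flags.split(','):
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = item.partition('=')
        result[key.strip()] = value.strip()
    return result

parse_flags('mode=DebugMode,floatX=float32')
# -> {'mode': 'DebugMode', 'floatX': 'float32'}
```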
.. _random_value_in_tests:
......
.. _glossary:
-Glossary of terminology
-=======================
+Glossary
+========
.. glossary::
......
......@@ -190,12 +190,10 @@ Here is the state of that vision as of 24 October 2011 (after Theano release
* Will provide better support for GPU on Windows and use an OpenCL backend on CPU.
* Loops work, but not all related optimizations are currently done.
-* The cvm linker allows lazy evaluation. It works, but some work is still
-  needed before enabling it by default.
+* The cvm linker allows lazy evaluation. It is the current default linker.
* All tests pass with linker=cvm?
-* How to have `DEBUG_MODE` check it? Right now, DebugMode checks the computation non-lazily.
-* The profiler used by cvm is less complete than `PROFILE_MODE`.
+* How to have `DebugMode` check it? Right now, DebugMode checks the computation non-lazily.
+* The profiler used by cvm is less complete than `ProfileMode`.
* SIMD parallelism on the CPU comes from the compiler.
* Multi-core parallelism is only supported for gemv and gemm, and only
......
......@@ -29,7 +29,7 @@ DebugMode can be used as follows:
x = tensor.dvector('x')
-f = theano.function([x], 10*x, mode='DEBUG_MODE')
+f = theano.function([x], 10*x, mode='DebugMode')
f(5)
f(0)
......@@ -42,7 +42,7 @@ It can also be used by passing a DebugMode instance as the mode, as in
If any problem is detected, DebugMode will raise an exception according to
what went wrong, either at call time (``f(5)``) or compile time (
-``f = theano.function(x, 10*x, mode='DEBUG_MODE')``). These exceptions
+``f = theano.function(x, 10*x, mode='DebugMode')``). These exceptions
should *not* be ignored; talk to your local Theano guru or email the
users list if you cannot make the exception go away.
......@@ -51,7 +51,7 @@ In the example above, there is no way to guarantee that a future call to say,
``f(-1)`` won't cause a problem. DebugMode is not a silver bullet.
If you instantiate DebugMode using the constructor ``compile.DebugMode``
-rather than the keyword ``DEBUG_MODE`` you can configure its behaviour via
+rather than the keyword ``DebugMode`` you can configure its behaviour via
constructor arguments.
Reference
......@@ -133,7 +133,7 @@ Reference
-The keyword version of DebugMode (which you get by using ``mode='DEBUG_MODE'``)
+The keyword version of DebugMode (which you get by using ``mode='DebugMode'``)
is quite strict, and can raise several different Exception types.
The following are DebugMode exceptions you might encounter:
......@@ -200,7 +200,7 @@ There following are DebugMode exceptions you might encounter:
in the same order when run several times in a row. This can happen if any
steps are ordered by ``id(object)`` somehow, such as via the default object
hash function. A Stochastic optimization invalidates the pattern of work
-whereby we debug in DEBUG_MODE and then run the full-size jobs in FAST_RUN.
+whereby we debug in DebugMode and then run the full-size jobs in FAST_RUN.
.. class:: InvalidValueError(DebugModeError)
......
.. _libdoc_compile_mode:
======================================
:mod:`mode` -- controlling compilation
======================================
......@@ -17,9 +20,10 @@ Theano defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
-- ``'DEBUG_MODE'``: Verify the correctness of all optimizations, and compare C and python
-  implementations. This mode can take much longer than the other modes,
-  but can identify many kinds of problems.
+- ``'DebugMode'``: A mode for debugging. See :ref:`DebugMode <debugmode>` for details.
+- ``'ProfileMode'``: A mode for profiling. See :ref:`ProfileMode <profilemode>` for details.
+- ``'DEBUG_MODE'``: Deprecated. Use the string DebugMode.
+- ``'PROFILE_MODE'``: Deprecated. Use the string ProfileMode.
The default mode is typically ``FAST_RUN``, but it can be controlled via the
configuration variable :attr:`config.mode`, which can be
......
......@@ -13,7 +13,7 @@
Guide
=====
-The config module contains many attributes that modify Theano's behavior. Many of these
+The config module contains many ``attributes`` that modify Theano's behavior. Many of these
attributes are consulted during the import of the ``theano`` module and many are assumed to be
read-only.
......
......@@ -13,7 +13,7 @@
.. toctree::
:maxdepth: 1
-fgraph
+fg
toolbox
type
......
......@@ -12,18 +12,18 @@
Guide
======
-Symbolic printing: the Print() Op
-----------------------------------
+Printing during execution
+-------------------------
Intermediate values in a computation cannot be printed in
the normal python way with the print statement, because Theano has no *statements*.
-Instead there is the `Print` Op.
+Instead there is the :class:`Print` Op.
>>> x = T.dvector()
->>> hello_world_op = Print('hello world')
+>>> hello_world_op = printing.Print('hello world')
>>> printed_x = hello_world_op(x)
>>> f = function([x], printed_x)
->>> f([1,2,3])
+>>> f([1, 2, 3])
>>> # output: "hello world __str__ = [ 1. 2. 3.]"
If you print more than one thing in a function like `f`, they will not
......@@ -39,15 +39,15 @@ Printing graphs
---------------
Theano provides two functions (:func:`theano.pp` and
-:func:`theano.debugprint`) to print a graph to the terminal before or after
+:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
-Theano also provides :func:`pydotprint` that creates a png image of the function.
+Theano also provides :func:`theano.printing.pydotprint` that creates a png image of the function.
1) The first is :func:`theano.pp`.
>>> x = T.dscalar('x')
->>> y = x**2
+>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy) # print out the gradient prior to optimization
'((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
......@@ -71,56 +71,63 @@ iteration number or other kinds of information in the name.
To make graphs legible, :func:`pp` hides some Ops that are actually in the graph. For example,
automatic DimShuffles are not shown.
-2) The second function to print a graph is :func:`theano.printing.debugprint(variable_or_function, depth=-1)`
+2) The second function to print a graph is :func:`theano.printing.debugprint`
>>> theano.printing.debugprint(f.maker.fgraph.outputs[0])
-Elemwise{mul,no_inplace} 46950805397392
- 2.0 46950805310800
- x 46950804895504
+Elemwise{mul,no_inplace} [@A] ''
+ |TensorConstant{2.0} [@B]
+ |x [@C]
Each line printed represents a Variable in the graph.
-The line `` x 46950804895504`` means the variable named 'x' at memory
-location 46950804895504. If you accidentally have two variables called 'x' in
-your graph, their different memory locations will be your clue.
+The line ``|x [@C]`` means the variable named ``x`` with debugprint identifier
+``[@C]`` is an input of the Elemwise. If you accidentally have two variables called ``x`` in
+your graph, their different debugprint identifiers will be your clue.
-The line `` 2.0 46950805310800`` means that there is a constant 2.0 at the
-given memory location.
+The line ``|TensorConstant{2.0} [@B]`` means that there is a constant 2.0
+with this debugprint identifier.
-The line `` Elemwise{mul,no_inplace} 46950805397392`` is indented less than
+The line ``Elemwise{mul,no_inplace} [@A] ''`` is indented less than
the other ones, because it means there is a variable computed by multiplying
the other (more indented) ones together.
The ``|`` symbols are just there to help read big graphs. They group
together inputs to a node.
Sometimes, you'll see a Variable but not the inputs underneath. That can
happen when that Variable has already been printed. Where else has it been
-printed? Look for the memory address using the Find feature of your text
+printed? Look for the debugprint identifier using the Find feature of your text
editor.
>>> theano.printing.debugprint(gy)
-Elemwise{mul} 46950804894224
- Elemwise{mul} 46950804735120
-  Elemwise{second,no_inplace} 46950804626128
-   Elemwise{pow,no_inplace} 46950804625040
-    x 46950658736720
-    2 46950804039760
-   1.0 46950804625488
-  2 46950804039760
- Elemwise{pow} 46950804737616
-  x 46950658736720
-  Elemwise{sub} 46950804736720
-   2 46950804039760
-   InplaceDimShuffle{} 46950804736016
-    1 46950804735760
+Elemwise{mul} [@A] ''
+ |Elemwise{mul} [@B] ''
+ | |Elemwise{second,no_inplace} [@C] ''
+ | | |Elemwise{pow,no_inplace} [@D] ''
+ | | | |x [@E]
+ | | | |TensorConstant{2} [@F]
+ | | |TensorConstant{1.0} [@G]
+ | |TensorConstant{2} [@F]
+ |Elemwise{pow} [@H] ''
+ | |x [@E]
+ | |Elemwise{sub} [@I] ''
+ | | |TensorConstant{2} [@F]
+ | | |InplaceDimShuffle{} [@J] ''
+ | | | |TensorConstant{1} [@K]
>>> theano.printing.debugprint(gy, depth=2)
-Elemwise{mul} 46950804894224
- Elemwise{mul} 46950804735120
- Elemwise{pow} 46950804737616
+Elemwise{mul} [@A] ''
+ |Elemwise{mul} [@B] ''
+ |Elemwise{pow} [@C] ''
If the depth parameter is provided, it limits the number of levels that are
shown.
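The idea of depth-limited printing can be sketched in plain Python over a nested-tuple expression tree (illustrative only; Theano's ``debugprint`` is considerably more elaborate):

```python
def debug_print(node, depth=-1, level=0, lines=None):
    """Collect one line per node, indenting children, stopping at `depth`."""
    if lines is None:
        lines = []
    if depth != -1 and level >= depth:
        return lines  # depth limit reached: do not descend further
    label, children = node
    lines.append(' ' * level + '|' * (level > 0) + label)
    for child in children:
        debug_print(child, depth, level + 1, lines)
    return lines

# x ** 2 represented as a tiny tree: pow(x, 2)
tree = ('Elemwise{pow}', [('x', []), ('TensorConstant{2}', [])])
full = debug_print(tree)           # the op and its two inputs
top = debug_print(tree, depth=1)   # only the root line
```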
-3) The function :func:`theano.printing.pydotprint(fct, outfile=SOME_DEFAULT_VALUE)` will print a compiled theano function to a png file.
+3) The function :func:`theano.printing.pydotprint` will print a compiled theano function to a png file.
In the image, Apply nodes (the applications of ops) are shown as boxes and variables are shown as ovals.
The number at the end of each label indicates graph position.
......@@ -170,10 +177,13 @@ Reference
running the function will print the value that `x` takes in the graph.
-.. function:: theano.printing.pp(*args)
-
-    TODO
+.. autofunction:: theano.printing.debugprint

.. function:: theano.pp(*args)

    Just a shortcut to :func:`theano.printing.pp`

+.. autofunction:: theano.printing.pp(*args)

.. autofunction:: theano.printing.pydotprint
......@@ -136,19 +136,35 @@ arange must have its length specified at creation time.
Simple accumulation into a scalar, ditching lamba
-------------------------------------------------
-This should be fairly self-explanatory.
+Although this example would seem almost self-explanatory, it stresses a
+pitfall to be careful of: the initial output state that is supplied, that is
+``outputs_info``, must be of a **shape similar to that of the output variable**
+generated at each iteration and moreover, it **must not involve an implicit
+downcast** of the latter.
.. code-block:: python
import numpy as np
import theano
import theano.tensor as T
up_to = T.iscalar("up_to")
# define a named function, rather than using lambda
def accumulate_by_adding(arange_val, sum_to_date):
return sum_to_date + arange_val
seq = T.arange(up_to)
# An unauthorized implicit downcast from the dtype of 'seq', to that of
# 'T.as_tensor_variable(0)' which is of dtype 'int8' by default would occur
# if this instruction were to be used instead of the next one:
# outputs_info = T.as_tensor_variable(0)
outputs_info = T.as_tensor_variable(np.asarray(0, seq.dtype))
scan_result, scan_updates = theano.scan(fn=accumulate_by_adding,
-                                        outputs_info=T.as_tensor_variable(0),
-                                        sequences=T.arange(up_to))
+                                        outputs_info=outputs_info,
+                                        sequences=seq)
triangular_sequence = theano.function(inputs=[up_to], outputs=scan_result)
# test
......@@ -157,7 +173,6 @@ This should be fairly self-explanatory.
print [n * (n + 1) // 2 for n in xrange(some_num)]
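The accumulation that ``scan`` performs here, together with the downcast pitfall, can be checked with plain NumPy (a sketch that does not use Theano itself):

```python
import numpy as np

up_to = 15
seq = np.arange(up_to)

# Accumulating the sequence reproduces the triangular numbers n * (n + 1) / 2
triangular = np.cumsum(seq)
assert list(triangular) == [n * (n + 1) // 2 for n in range(up_to)]

# The pitfall: forcing the accumulator dtype down to int8 silently wraps around
total_full = np.arange(200).sum()               # wide (default) accumulation
total_int8 = np.arange(200).sum(dtype=np.int8)  # 19900 does not fit in int8
assert total_int8 != total_full
```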
Another simple example
----------------------
......
.. currentmodule:: tensor
.. _libdoc_basic_tensor:
===========================
Basic Tensor Functionality
===========================
......@@ -532,7 +534,7 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
-.. function:: shape_padright(x,n_ones = 1)
+.. function:: shape_padright(x, n_ones=1)
Reshape `x` by right padding the shape with `n_ones` 1s. Note that all
this new dimension will be broadcastable. To make them non-broadcastable
......@@ -597,7 +599,7 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
Create a matrix by filling the shape of `a` with `b`
-.. function:: eye(n, m = None, k = 0, dtype=theano.config.floatX)
+.. function:: eye(n, m=None, k=0, dtype=theano.config.floatX)
:param n: number of rows in output (value or theano scalar)
:param m: number of columns in output (value or theano scalar)
......@@ -1065,11 +1067,11 @@ Mathematical
Returns a variable representing the exponential of a, ie e^a.
-.. function:: maximum(a,b)
+.. function:: maximum(a, b)

Returns a variable representing the maximum element by element of a and b

-.. function:: minimum(a,b)
+.. function:: minimum(a, b)
Returns a variable representing the minimum element by element of a and b
......
.. _adding:
-========================================
-Baby steps - Adding two numbers together
-========================================
+====================
+Baby Steps - Algebra
+====================
-Adding two scalars
+Adding two Scalars
==================
-So, to get us started with Theano and get a feel of what we're working with,
+To get us started with Theano and get a feel of what we're working with,
let's make a simple function: add two numbers together. Here is how you do
it:
......@@ -34,12 +33,12 @@ Let's break this down into several steps. The first step is to define
two symbols (*Variables*) representing the quantities that you want
to add. Note that from now on, we will use the term
*Variable* to mean "symbol" (in other words,
-``x``, ``y``, ``z`` are all *Variable* objects). The output of the function
-``f`` is a ``numpy.ndarray`` with zero dimensions.
+*x*, *y*, *z* are all *Variable* objects). The output of the function
+*f* is a ``numpy.ndarray`` with zero dimensions.
If you are following along and typing into an interpreter, you may have
noticed that there was a slight delay in executing the ``function``
-instruction. Behind the scenes, ``f`` was being compiled into C code.
+instruction. Behind the scenes, *f* was being compiled into C code.
.. note:
......@@ -52,12 +51,10 @@ instruction. Behind the scenes, ``f`` was being compiled into C code.
>>> x = theano.tensor.ivector()
>>> y = -x
-``x`` and ``y`` are both Variables, i.e. instances of the
+*x* and *y* are both Variables, i.e. instances of the
``theano.gof.graph.Variable`` class. The
-type of both ``x`` and ``y`` is ``theano.tensor.ivector``.
+type of both *x* and *y* is ``theano.tensor.ivector``.
-------------------------------------------
**Step 1**
......@@ -68,9 +65,9 @@ In Theano, all symbols must be typed. In particular, ``T.dscalar``
is the type we assign to "0-dimensional arrays (`scalar`) of doubles
(`d`)". It is a Theano :ref:`type`.
-``dscalar`` is not a class. Therefore, neither ``x`` nor ``y``
+``dscalar`` is not a class. Therefore, neither *x* nor *y*
are actually instances of ``dscalar``. They are instances of
-:class:`TensorVariable`. ``x`` and ``y``
+:class:`TensorVariable`. *x* and *y*
are, however, assigned the theano Type ``dscalar`` in their ``type``
field, as you can see here:
......@@ -83,52 +80,49 @@ TensorType(float64, scalar)
>>> x.type is T.dscalar
True
-You can learn more about the structures in Theano in :ref:`graphstructures`.
+By calling ``T.dscalar`` with a string argument, you create a
+*Variable* representing a floating-point scalar quantity with the
+given name. If you provide no argument, the symbol will be unnamed. Names
+are not required, but they can help debugging.
+
+More will be said in a moment regarding Theano's inner structure. You
+could also learn more by looking into :ref:`graphstructures`.
-------------------------------------------
**Step 2**
-The second step is to combine ``x`` and ``y`` into their sum ``z``:
+The second step is to combine *x* and *y* into their sum *z*:
>>> z = x + y
-``z`` is yet another *Variable* which represents the addition of
-``x`` and ``y``. You can use the :ref:`pp <libdoc_printing>`
-function to pretty-print out the computation associated to ``z``.
+*z* is yet another *Variable* which represents the addition of
+*x* and *y*. You can use the :ref:`pp <libdoc_printing>`
+function to pretty-print out the computation associated to *z*.
>>> print pp(z)
(x + y)
-------------------------------------------
**Step 3**
-The last step is to create a function taking ``x`` and ``y`` as inputs
-and giving ``z`` as output:
+The last step is to create a function taking *x* and *y* as inputs
+and giving *z* as output:
>>> f = function([x, y], z)
The first argument to :func:`function <function.function>` is a list of Variables
that will be provided as inputs to the function. The second argument
is a single Variable *or* a list of Variables. For either case, the second
-argument is what we want to see as output when we apply the function.
-``f`` may then be used like a normal Python function.
+argument is what we want to see as output when we apply the function. *f* may
+then be used like a normal Python function.
-Adding two matrices
+Adding two Matrices
===================
You might already have guessed how to do this. Indeed, the only change
-from the previous example is that you need to instantiate ``x`` and
-``y`` using the matrix Types:
+from the previous example is that you need to instantiate *x* and
+*y* using the matrix Types:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_adding.test_adding_2
......@@ -138,14 +132,14 @@ from the previous example is that you need to instantiate ``x`` and
>>> z = x + y
>>> f = function([x, y], z)
-``dmatrix`` is the Type for matrices of doubles. And then we can use
+``dmatrix`` is the Type for matrices of doubles. Then we can use
our new function on 2D arrays:
>>> f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
array([[ 11., 22.],
[ 33., 44.]])
-The variable is a numpy array. We can also use numpy arrays directly as
+The variable is a NumPy array. We can also use NumPy arrays directly as
inputs:
>>> import numpy
......@@ -159,18 +153,36 @@ by :ref:`broadcasting <libdoc_tensor_broadcastable>`.
The following types are available:
-* **byte**: bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4
-* **32-bit integers**: iscalar, ivector, imatrix, irow, icol, itensor3, itensor4
-* **64-bit integers**: lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4
-* **float**: fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4
-* **double**: dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
-* **complex**: cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
+* **byte**: ``bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4``
+* **16-bit integers**: ``wscalar, wvector, wmatrix, wrow, wcol, wtensor3, wtensor4``
+* **32-bit integers**: ``iscalar, ivector, imatrix, irow, icol, itensor3, itensor4``
+* **64-bit integers**: ``lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4``
+* **float**: ``fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4``
+* **double**: ``dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4``
+* **complex**: ``cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4``

-The previous list is not exhaustive. A guide to all types compatible
-with numpy arrays may be found :ref:`here <libdoc_tensor_creation>`.
+The previous list is not exhaustive and a guide to all types compatible
+with NumPy arrays may be found here: :ref:`tensor creation<libdoc_tensor_creation>`.
.. note::
You, the user---not the system architecture---have to choose whether your
program will use 32- or 64-bit integers (``i`` prefix vs. the ``l`` prefix)
and floats (``f`` prefix vs. the ``d`` prefix).
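The letter prefixes correspond to NumPy dtypes; a small illustrative mapping (``PREFIX_TO_DTYPE`` and ``dtype_for`` are hypothetical helpers assembled here for convenience, not part of Theano):

```python
import numpy as np

# Illustrative mapping from the constructor name prefixes to NumPy dtypes
PREFIX_TO_DTYPE = {
    'b': np.int8,       # byte
    'w': np.int16,      # 16-bit integer
    'i': np.int32,      # 32-bit integer
    'l': np.int64,      # 64-bit integer
    'f': np.float32,    # float
    'd': np.float64,    # double
    'c': np.complex64,  # complex
}

def dtype_for(name):
    """Return the NumPy dtype for a constructor name such as 'dmatrix'."""
    return PREFIX_TO_DTYPE[name[0]]
```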
-------------------------------------------
**Exercise**
.. code-block:: python
import theano
a = theano.tensor.vector() # declare variable
out = a + a ** 10 # build symbolic expression
f = theano.function([a], out) # compile function
print f([0, 1, 2]) # prints `array([0, 2, 1026])`
Modify and execute this code to compute this expression: a ** 2 + b ** 2 + 2 * a * b.
:download:`Solution<adding_solution_1.py>`
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Baby Steps - Algebra'
import theano
a = theano.tensor.vector() # declare variable
b = theano.tensor.vector() # declare variable
out = a ** 2 + b ** 2 + 2 * a * b # build symbolic expression
f = theano.function([a, b], out) # compile function
print f([1, 2], [4, 5]) # prints [ 25. 49.]
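The algebraic identity behind the exercise, a\ :sup:`2` + b\ :sup:`2` + 2ab = (a + b)\ :sup:`2`, can be checked element-wise with plain NumPy (a sketch that does not require Theano):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 5.0])
out = a ** 2 + b ** 2 + 2 * a * b   # same symbolic expression, evaluated eagerly
assert np.allclose(out, (a + b) ** 2)
```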
......@@ -5,53 +5,52 @@
Understanding Memory Aliasing for Speed and Correctness
=======================================================
-The aggressive reuse of memory is one of the ways Theano makes code fast, and
-it's important for the correctness and speed of your program that you understand
-which buffers Theano might alias to which others.
+The aggressive reuse of memory is one of the ways through which Theano makes code fast, and
+it is important for the correctness and speed of your program that you understand
+how Theano might alias buffers.
-This file describes the principles for how Theano treats memory, and explains
-when you might want to change the default behaviour of some functions and
+This section describes the principles based on which Theano handles memory, and explains
+when you might want to alter the default behaviour of some functions and
methods for faster performance.
-The memory model: 2 spaces
-==========================
+The Memory Model: Two Spaces
+============================
-There are some simple principles that guide Theano's treatment of memory. The
+There are some simple principles that guide Theano's handling of memory. The
main idea is that there is a pool of memory managed by Theano, and Theano tracks
changes to values in that pool.
-1. Theano manages its own memory space, which typically does not overlap with
-   the memory of normal python variables that non-Theano code creates.
+* Theano manages its own memory space, which typically does not overlap with
+  the memory of normal Python variables that non-Theano code creates.

-1. Theano Functions only modify buffers that are in Theano's memory space.
+* Theano functions only modify buffers that are in Theano's memory space.

-1. Theano's memory space includes the buffers allocated to store shared
-   variables and the temporaries used to evaluate Functions.
+* Theano's memory space includes the buffers allocated to store ``shared``
+  variables and the temporaries used to evaluate functions.

-1. Physically, Theano's memory space may be spread across the host, a GPU
-   device(s), and in the future may even include objects on a remote machine.
+* Physically, Theano's memory space may be spread across the host, a GPU
+  device(s), and in the future may even include objects on a remote machine.

-1. The memory allocated for a shared variable buffer is unique: it is never
-   aliased to another shared variable.
+* The memory allocated for a ``shared`` variable buffer is unique: it is never
+  aliased to another ``shared`` variable.

-1. Theano's managed memory is constant while Theano Functions are not running
-   and Theano library code is not running.
+* Theano's managed memory is constant while Theano functions are not running
+  and Theano's library code is not running.

-1. The default behaviour of Function is to return user-space values for
-   outputs, and to expect user-space values for inputs.
+* The default behaviour of a function is to return user-space values for
+  outputs, and to expect user-space values for inputs.
The distinction between Theano-managed memory and user-managed memory can be
-broken down by some Theano functions (e.g. shared, get_value and the
-constructors for In and Out) by using
-a ``borrow=True`` flag. This can make those methods faster (by avoiding copy
-operations) at the expense of risking subtle bugs in the overall program (by
-aliasing memory).
+broken down by some Theano functions (e.g. ``shared``, ``get_value`` and the
+constructors for ``In`` and ``Out``) by using a ``borrow=True`` flag.
+This can make those methods faster (by avoiding copy operations) at the expense
+of risking subtle bugs in the overall program (by aliasing memory).
The rest of this section is aimed at helping you to understand when it is safe
-to use the ``borrow=True`` argument and reap the benefit of faster code.
+to use the ``borrow=True`` argument and reap the benefits of faster code.
-Borrowing when creating shared variables
+Borrowing when Creating Shared Variables
========================================
A ``borrow`` argument can be provided to the shared-variable constructor.
......@@ -69,9 +68,9 @@ A ``borrow`` argument can be provided to the shared-variable constructor.
s_false = theano.shared(np_array, borrow=False)
s_true = theano.shared(np_array, borrow=True)
-By default (``s_default``) and when explicitly setting ``borrow=False``, the
-shared variable we construct gets a [deep] copy of ``np_array``. So changes we
-subsequently make to ``np_array`` have no effect on our shared variable.
+By default (*s_default*) and when explicitly setting ``borrow=False``, the
+shared variable we construct gets a [deep] copy of *np_array*. So changes we
+subsequently make to *np_array* have no effect on our shared variable.
.. code-block:: python
......@@ -82,40 +81,40 @@ subsequently make to ``np_array`` have no effect on our shared variable.
s_true.get_value() # -> array([2.0, 2.0])
If we are running this with the CPU as the device,
-then changes we make to np_array *right away* will show up in
+then changes we make to *np_array* *right away* will show up in
``s_true.get_value()``
-because numpy arrays are mutable, and ``s_true`` is using the ``np_array``
+because NumPy arrays are mutable, and *s_true* is using the *np_array*
object as its internal buffer.
-However, this aliasing of ``np_array`` and ``s_true`` is not guaranteed to occur,
+However, this aliasing of *np_array* and *s_true* is not guaranteed to occur,
and may occur only temporarily even if it occurs at all.
It is not guaranteed to occur because if Theano is using a GPU device, then the
-borrow flag has no effect.
-It may occur only temporarily because
-if we call a Theano function that updates the value of ``s_true`` the aliasing
+``borrow`` flag has no effect. It may occur only temporarily because
+if we call a Theano function that updates the value of *s_true* the aliasing
relationship *may* or *may not* be broken (the function is allowed to
-update the shared variable by modifying its buffer, which will preserve
+update the ``shared`` variable by modifying its buffer, which will preserve
the aliasing, or by changing which buffer the variable points to, which
will terminate the aliasing).
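The difference between copying and borrowing can be imitated with plain NumPy (a sketch of the CPU aliasing behaviour; the names mirror the example above but this is not Theano API):

```python
import numpy as np

np_array = np.ones(2)

# borrow=False behaviour: the container keeps its own deep copy
s_false = np_array.copy()
# borrow=True behaviour (CPU case): the container aliases the caller's buffer
s_true = np_array

np_array += 1.0                      # mutate the original buffer in place

assert list(s_false) == [1.0, 1.0]   # the copy is unaffected
assert list(s_true) == [2.0, 2.0]    # the alias sees the change right away
```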
*Take home message:*
-It is safe practice (and a good idea) to use ``borrow=True`` in a shared
-variable constructor when the shared variable stands for a large object (in
+It is a safe practice (and a good idea) to use ``borrow=True`` in a ``shared``
+variable constructor when the ``shared`` variable stands for a large object (in
terms of memory footprint) and you do not want to create copies of it in
memory.
-It is not a reliable technique to use ``borrow=True`` to modify shared variables
-by side-effect, because with some devices (e.g. GPU devices) this technique will
+It is not a reliable technique to use ``borrow=True`` to modify ``shared`` variables
+through side-effect, because with some devices (e.g. GPU devices) this technique will
not work.
-Borrowing when accessing value of shared variables
+Borrowing when Accessing Value of Shared Variables
==================================================
Retrieving
----------
-A ``borrow`` argument can also be used to control how a shared variable's value is retrieved.
+A ``borrow`` argument can also be used to control how a ``shared`` variable's value is
+retrieved.
.. If you modify this code, also change :
......@@ -136,16 +135,16 @@ When ``borrow=True`` is passed to ``get_value``, it means that the return value
But both of these calls might create copies of the internal memory.
The reason that ``borrow=True`` might still make a copy is that the internal
-representation of a shared variable might not be what you expect. When you
-create a shared variable by passing a numpy array for example, then ``get_value()``
-must return a numpy array too. That's how Theano can make the GPU use
-transparent. But when you are using a GPU (or in future perhaps a remote machine), then the numpy.ndarray
-is not the internal representation of your data.
+representation of a ``shared`` variable might not be what you expect. When you
+create a ``shared`` variable by passing a NumPy array for example, then ``get_value()``
+must return a NumPy array too. That's how Theano can make the GPU use
+transparent. But when you are using a GPU (or in the future perhaps a remote machine),
+then the numpy.ndarray is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never cast the internal object (always return in
constant time), but might return various datatypes depending on contextual
-factors (e.g. the compute device, the dtype of the numpy array).
+factors (e.g. the compute device, the dtype of the NumPy array).
.. code-block:: python
......@@ -156,28 +155,28 @@ It is possible to use ``borrow=False`` in conjunction with
This is primarily for internal debugging, not for typical use.
For the transparent use of the different types of optimization Theano can make,
-there is the policy that get_value() always return by default the same object type
-it received when the shared variable was created. So if you created manually data on
-the gpu and create a shared variable on the gpu with this data, get_value will always
-return gpu data even when return_internal_type=False.
+there is the policy that ``get_value()`` by default always returns the same object type
+it received when the ``shared`` variable was created. So if you manually created data on
+the GPU and created a ``shared`` variable on the GPU with this data, ``get_value`` will always
+return GPU data even when ``return_internal_type=False``.
*Take home message:*
It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
your code does not modify the return value. *Do not use this to modify a shared
variable by side-effect* because it will make your code device-dependent.
Modification of GPU variables through this sort of side-effect is impossible.
Assigning
---------
``Shared`` variables also have a ``set_value`` method that can accept an optional
``borrow=True`` argument. The semantics are similar to those of creating a new
``shared`` variable: ``borrow=False`` is the default and ``borrow=True`` means
that Theano *may* reuse the buffer you provide as the internal storage for the variable.

A standard pattern for manually updating the value of a ``shared`` variable is as
follows:
.. code-block:: python

    s.set_value(
        some_inplace_fn(s.get_value(borrow=True)),
        borrow=True)
This pattern works regardless of the computing device, and when the latter
makes it possible to expose Theano's internal variables without a copy, then it
proceeds as fast as an in-place update.
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Theano 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
  GPU memory for the new value would be allocated before the old memory was released. If you're
  running near the limits of GPU memory, this could cause you to run out of GPU memory
  unnecessarily.

  *Solution*: update to a newer version of Theano.

* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
  you will want to reuse the memory that you allocated the first time if possible - it is both
  faster and more memory efficient.

  *Solution*: upgrade to a recent version of Theano (>0.3.0) and consider padding your source
  data to make sure that every chunk is the same size.
* It is also worth mentioning that current GPU copying routines support only contiguous memory.
  So Theano must make the value you provide *C-contiguous* prior to copying it.
  This can require an extra copy of the data on the host.

  *Solution*: make sure that the value
  you assign to a ``CudaNdarraySharedVariable`` is *already* C-contiguous.
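As a quick illustration of C-contiguity (using NumPy directly, outside Theano), a transposed array is usually a non-contiguous view, and ``numpy.ascontiguousarray`` produces a C-contiguous copy:

```python
import numpy

a = numpy.ones((3, 4))
t = a.T                          # a transposed view; its strides are reversed,
                                 # so t.flags['C_CONTIGUOUS'] is False
c = numpy.ascontiguousarray(t)   # an explicit C-contiguous copy of the same
                                 # values; c.flags['C_CONTIGUOUS'] is True
```

Assigning an already C-contiguous value avoids the hidden extra host-side copy mentioned above.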
(Further information on the current implementation of the GPU version of ``set_value()`` can be found
here: :ref:`libdoc_cuda_var`)
Borrowing when Constructing Function Objects
============================================
A ``borrow`` argument can also be provided to the ``In`` and ``Out`` objects
that control how ``theano.function`` handles its argument[s] and return value[s].
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_aliasing.test_aliasing_3
.. code-block:: python

    import theano, theano.tensor

    x = theano.tensor.matrix()
    y = 2 * x
    f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))
Borrowing an input means that Theano will treat the argument you provide as if
it were part of Theano's pool of temporaries. Consequently, your input may be
reused as a buffer (and overwritten!) during the computation of other variables in the
course of evaluating that function (e.g. ``f``).
Borrowing an output means that Theano will not insist on allocating a fresh
output buffer every time you call the function. It will possibly reuse the same one as
on a previous call, and overwrite the old content. Consequently, it may overwrite
old return values through side-effect.
Those return values may also be overwritten in
the course of evaluating *another compiled function* (for example, the output
may be aliased to a ``shared`` variable). So be careful to use a borrowed return
value right away before calling any more Theano functions.
The default is of course to *not borrow* internal results.
It is also possible to pass a ``return_internal_type=True`` flag to the ``Out``
variable which has the same interpretation as the ``return_internal_type`` flag
to the ``shared`` variable's ``get_value`` function. Unlike ``get_value()``, the
combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
``Out()`` are not guaranteed to avoid copying an output value. They are just
hints that give more flexibility to the compilation and optimization of the
graph.
*Take home message:*
When an input *x* to a function is not needed after the function returns and you
would like to make it available to Theano as additional workspace, then consider
marking it with ``In(x, borrow=True)``. It may make the function faster and
reduce its memory requirement.
When a return value *y* is large (in terms of memory footprint), and you only need to read from it once, right
away when it's returned, then consider marking it with an ``Out(y,
borrow=True)``.
Conditions
==========
IfElse vs Switch
================
- Both ops build a condition over symbolic variables.
- ``IfElse`` takes a *boolean* condition and two variables as inputs.
- ``Switch`` takes a *tensor* as condition and two variables as inputs.
  ``switch`` is an elementwise operation and is thus more general than ``ifelse``.
- Whereas ``switch`` evaluates both *output* variables, ``ifelse`` is lazy and only
  evaluates one variable with respect to the condition.
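The eager-versus-lazy distinction can be sketched in plain Python (a conceptual analogy only; these helper functions are hypothetical and are not how Theano implements its ops):

```python
calls = []

def mean_x():
    calls.append('x')
    return 0.5

def mean_y():
    calls.append('y')
    return 1.5

def eager_switch(cond, a, b):
    # like switch: both branches are evaluated, then one result is selected
    va, vb = a(), b()
    return va if cond else vb

def lazy_ifelse(cond, a, b):
    # like ifelse: only the branch that is actually needed gets evaluated
    return a() if cond else b()

eager_switch(True, mean_x, mean_y)   # both mean_x and mean_y ran
del calls[:]
lazy_ifelse(True, mean_x, mean_y)    # only mean_x ran
```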
**Example**
.. code-block:: python
    from theano import tensor as T
    from theano.ifelse import ifelse
    import theano, time, numpy

    a, b = T.scalars('a', 'b')
    x, y = T.matrices('x', 'y')

    z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
    z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

    f_switch = theano.function([a, b, x, y], z_switch,
                               mode=theano.Mode(linker='vm'))
    f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    val1 = 0.
    val2 = 1.
    big_mat1 = numpy.ones((10000, 1000))
    big_mat2 = numpy.ones((10000, 1000))

    n_times = 10

    tic = time.clock()
    for i in xrange(n_times):
        f_switch(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating both values %f sec' % (time.clock() - tic)

    tic = time.clock()
    for i in xrange(n_times):
        f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
In this example, the ``IfElse`` op spends less time (about half as much) than ``Switch``
since it computes only one variable out of the two.
.. code-block:: python
time spent evaluating one value 0.3500 sec
Unless ``linker='vm'`` or ``linker='cvm'`` is used, ``ifelse`` will compute both
variables and take the same computation time as ``switch``. Although the linker
is not currently set by default to ``cvm``, it will be in the near future.
There is no automatic optimization replacing a ``switch`` with a
broadcasted scalar condition by an ``ifelse``, as this is not always faster. See
this `ticket <http://www.assembla.com/spaces/theano/tickets/764>`_.
Debugging Theano: FAQ and Troubleshooting
=========================================
There are many kinds of bugs that might come up in a computer program.
This page is structured as a FAQ. It provides recipes to tackle common
problems, and introduces some of the tools that we use to find problems in our
own Theano code, and even (it happens) in Theano's internals, in
:ref:`using_debugmode`.
Isolating the Problem/Testing Theano Compiler
---------------------------------------------
You can run your Theano function in a :ref:`DebugMode<using_debugmode>`.
This tests the Theano optimizations and helps to find where NaN, inf and other problems come from.
Using Test Values
-----------------
As of v.0.4.0, Theano has a new mechanism by which graphs are executed
on-the-fly, before a ``theano.function`` is ever compiled. Since optimizations
haven't been applied at this stage, it is easier for the user to locate the
source of some bug. This functionality is enabled through the config flag
``theano.config.compute_test_value``. Its use is best shown through the
following example.

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    theano.config.compute_test_value = 'off'

    # configure shared variables
    W1val = numpy.random.rand(2, 10, 10).astype(theano.config.floatX)
    W1 = theano.shared(W1val, 'W1')
    W2val = numpy.random.rand(15, 20).astype(theano.config.floatX)
    W2 = theano.shared(W2val, 'W2')

    # input which will be of shape (5, 10)
    x = T.matrix('x')

    # transform the shared variable in some way. Theano does not
    # know off hand that the matrix func_of_W1 has shape (20, 10)
    func_of_W1 = W1.dimshuffle(2, 0, 1).flatten(2).T

    # source of error: dot product of 5x10 with 20x10
    h1 = T.dot(x, func_of_W1)

    # do more stuff
    h2 = T.dot(h1, W2.T)

    # compile and call the actual function
    f = theano.function([x], h2)
    f(numpy.random.rand(5, 10))
Running the above code generates the following error message:
_dot22(x, <TensorType(float64, matrix)>), [_dot22.0],
_dot22(x, InplaceDimShuffle{1,0}.0), 'Sequence id of Apply node=4')
Needless to say, the above is not very informative and does not provide much in
the way of guidance. However, by instrumenting the code ever so slightly, we
can get Theano to reveal the exact source of the error.
.. code-block:: python
    ...

    # input which will be of shape (5, 10)
    x = T.matrix('x')
    # provide Theano with a default test-value
    x.tag.test_value = numpy.random.rand(5, 10)
In the above, we are tagging the symbolic matrix *x* with a special test
value. This allows Theano to evaluate symbolic expressions on-the-fly (by
calling the ``perform`` method of each op), as they are being defined. Sources
of error can thus be identified with much more precision and much earlier in
the compilation pipeline. For example, running the above code yields the
following error message, which properly identifies *line 23* as the culprit.
.. code-block:: bash
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: ('matrices are not aligned', (5, 10), (20, 10))
The ``compute_test_value`` mechanism works as follows:
* Theano ``constants`` and ``shared`` variables are used as is. No need to instrument them.
* A Theano *variable* (i.e. ``dmatrix``, ``vector``, etc.) should be
  given a special test value through the attribute ``tag.test_value``.
* Theano automatically instruments intermediate results. As such, any quantity
  derived from *x* will be given a ``tag.test_value`` automatically.
``compute_test_value`` can take the following values:
* ``off``: Default behavior. This debugging mechanism is inactive.
* ``raise``: Compute test values on the fly. Any variable for which a test
  value is required, but not provided by the user, is treated as an error. An
  exception is raised accordingly.
* ``warn``: Idem, but a warning is issued instead of an *Exception*.
* ``ignore``: Silently ignore the computation of intermediate test values, if a
  variable is missing a test value.
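Conceptually, the mechanism amounts to running each op on the test values at graph-construction time, so shape errors surface at the line that builds the faulty node. A rough pure-Python sketch (the ``sym_dot`` helper is hypothetical, not Theano internals):

```python
import numpy

def sym_dot(x_test, y_test):
    # With compute_test_value='raise', building T.dot(x, y) would immediately
    # run the op on the inputs' test values, so an incompatible shape raises
    # here, at graph-construction time, instead of deep inside compiled code.
    return numpy.dot(x_test, y_test)

x_test = numpy.random.rand(5, 10)
w_test = numpy.random.rand(20, 10)

caught = False
try:
    h1_test = sym_dot(x_test, w_test)   # (5, 10) dot (20, 10): misaligned
except ValueError:
    caught = True                       # reported at definition time
```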
.. note::
This feature is currently incompatible with ``Scan`` and also with ops
which do not implement a ``perform`` method.
"How do I Print an Intermediate Value in a Function/Method?"
------------------------------------------------------------
Theano provides a 'Print' op to do this.
.. code-block:: python
f_with_print = theano.function([x], x_printed * 5)
#this runs the graph without any printing
assert numpy.all( f([1, 2, 3]) == [5, 10, 15])
#this runs the graph with the message, and value printed
assert numpy.all( f_with_print([1, 2, 3]) == [5, 10, 15])
Since Theano runs your program in a topological order, you won't have precise
control over the order in which multiple ``Print()`` ops are evaluated. For a more
precise inspection of what's being computed where, when, and how, see the discussion
:ref:`faq_wraplinker`.
.. warning::
to remove them to know if this is the cause or not.
"How do I Print a Graph?" (before or after compilation)
-------------------------------------------------------
.. TODO: dead links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
Theano also provides :func:`theano.printing.pydotprint` that creates a png image of the function.
You can read about them in :ref:`libdoc_printing`.
"The Function I Compiled is Too Slow, what's up?"
-------------------------------------------------
First, make sure you're running in ``FAST_RUN`` mode. Even though
``FAST_RUN`` is the default mode, insist by passing ``mode='FAST_RUN'``
to ``theano.function`` (or ``theano.make``) or by setting :attr:`config.mode`
to ``FAST_RUN``.
Second, try the Theano :ref:`using_profilemode`. This will tell you which
``Apply`` nodes, and which ops are eating up your CPU cycles.
Tips:
* Use the flag ``floatX=float32`` to require type *float32* instead of *float64*;
  use the Theano constructors ``matrix()``, ``vector()``, ... instead of ``dmatrix()``,
  ``dvector()``, ..., since the former use the configurable type *floatX* while the
  latter are fixed to *float64*.
* Check in the ``profile`` mode that there is no ``Dot`` op in the post-compilation
  graph while you are multiplying two matrices of the same type. ``Dot`` should be
  optimized to ``dot22`` when the inputs are matrices and of the same type. This can
  happen when using ``floatX=float32`` if something in the graph makes one of the
  inputs *float64*.
.. _faq_wraplinker:
"How do I Step through a Compiled Function with the WrapLinker?"
----------------------------------------------------------------
This is not exactly a FAQ, but the doc is here for now...
It's pretty easy to roll-your-own evaluation mode.
Check out this one:
wrap_linker = theano.gof.WrapLinkerMany([theano.gof.OpWiseCLinker()], [print_eval])
super(PrintEverythingMode, self).__init__(wrap_linker, optimizer='fast_run')
When you use ``mode=PrintEverythingMode()`` as the mode for ``Function`` or ``Method``,
then you should see [potentially a lot of] output. Every ``Apply`` node will be printed out,
along with its position in the graph, the arguments to ``perform`` or
``c_code`` and the output it computed.
>>> x = T.dscalar('x')
>>> f = function([x], [5 * x], mode=PrintEverythingMode())
>>> f(3)
>>> # print: 0 Elemwise{mul,no_inplace}(5, x) [array(5, dtype=int8), array(3.0)] [array(15.0)]
>>> # print: [array(15.0)]
Admittedly, this may be a huge amount of
output to read through if you are using big tensors... but you can choose to
put logic inside of the *print_eval* function that would, for example, print
something out only if a certain kind of op were used, at a certain program
position, or only if a particular value showed up in one of the inputs or outputs.
Use your imagination :)
.. TODO: documentation for link.WrapLinkerMany
This can be a really powerful debugging tool. Note the call to *fn* inside the call to
*print_eval*; without it, the graph wouldn't get computed at all!
How to Use pdb
--------------
In the majority of cases, you won't be executing from the interactive shell
but from a set of Python scripts. In such cases, the use of the Python
debugger can come in handy, especially as your models become more complex.
Intermediate results don't necessarily have a clear name and you can get
exceptions which are hard to decipher, due to the "compiled" nature of the
functions.
Consider this example script ("ex.py"):
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    a = T.dmatrix('a')
    b = T.dmatrix('b')
    f = theano.function([a, b], [a * b])

    # matrices chosen so dimensions are unsuitable for multiplication
    mat1 = numpy.arange(12).reshape((3, 4))
    mat2 = numpy.arange(25).reshape((5, 5))

    f(mat1, mat2)
This is actually so simple the debugging could be done easily, but it's for
illustrative purposes. As the matrices can't be multiplied element-wise
(unsuitable shapes), we get the following exception:
.. code-block:: text
    File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
    File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute
    ValueError: ('Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains some useful information to trace back the source
of the error. There's the script where the compiled function was called --
but if you're using (improperly parameterized) prebuilt modules, the error
might originate from ops in these modules, not this script. The last line
tells us about the op that caused the exception. In this case it's a "mul"
involving variables with names "a" and "b". But suppose we instead had an
intermediate result to which we hadn't given a name.
After learning a few things about the graph structure in Theano, we can use
the Python debugger (``pdb``) to explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it
abstractly). To get information about the actual parameters, you explore the
"thunk" objects, which bind the storage for the inputs (and outputs) with
the function itself (a "thunk" is a concept related to closures). Here, to
get the current node's first input's shape, you'd therefore do "p
thunk.inputs[0][0].shape", which prints out "(3, 4)".
.. _basictutexamples:
=============
More Examples
=============
At this point it would be wise to begin familiarizing yourself
more systematically with Theano's fundamental objects and operations by browsing
this section of the library: :ref:`libdoc_basic_tensor`.
As the tutorial unfolds, you should also gradually acquaint yourself with the other
relevant areas of the library and with the relevant sections of the documentation
entry page.
Logistic Function
=================
Here's another straightforward example, though a bit more elaborate
array([[ 0.5       ,  0.73105858],
[ 0.26894142, 0.11920292]])
Computing More than one Thing at the Same Time
==============================================
Theano supports functions with multiple outputs. For example, we can
compute the :ref:`elementwise <libdoc_tensor_elementwise>` difference, absolute difference, and
squared difference between two matrices *a* and *b* at the same time:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_3
shortcut for allocating symbolic variables that we will often use in the
tutorials.
When we use the function *f*, it returns the three variables (the printing
was reformatted for readability):
>>> f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
[ 1., 4.]])]
Setting a Default Value for an Argument
=======================================
Let's say you want to define a function that adds two numbers, except
This makes use of the :ref:`Param <function_inputs>` class which allows
you to specify properties of your function's parameters with greater detail. Here we
give a default value of 1 for *y* by creating a ``Param`` instance with
its ``default`` field set to 1.
Inputs with default values must follow inputs without default
values (like Python's functions). There can be multiple inputs with default values. These parameters can
be set positionally or by name, as in standard Python:
array(33.0)
.. note::
``Param`` does not know the name of the local variables *y* and *w*
that are passed as arguments. The symbolic variable objects have name
attributes (set by ``dscalars`` in the example above) and *these* are the
names of the keyword parameters in the functions that we build. This is
the mechanism at work in ``Param(y, default=1)``. In the case of ``Param(w,
default=2, name='w_by_name')``, we override the symbolic variable's name
attribute with a name to be used for this function.
You may like to see :ref:`Function<usingfunction>` in the library for more detail.
.. _functionstateexample:
Using Shared Variables
======================
It is also possible to make a function with an internal state. For
example, let's say we want to make an accumulator: at the beginning,
the state is initialized to zero. Then, on each function call, the state
is incremented by the function's argument.
First let's define the *accumulator* function. It adds its argument to the
internal state, and returns the old state value.
.. If you modify this code, also change :
>>> accumulator = function([inc], state, updates=[(state, state+inc)])
This code introduces a few new concepts. The ``shared`` function constructs
so-called :ref:`shared variables<libdoc_compile_shared>`.
These are hybrid symbolic and non-symbolic variables whose value may be shared
between multiple functions. Shared variables can be used in symbolic expressions just like
the objects returned by ``dmatrices(...)`` but they also have an internal
value that defines the value taken by this symbolic variable in *all* the
functions that use it. It is called a *shared* variable because its value is
shared between many functions. The value can be accessed and modified by the
``.get_value()`` and ``.set_value()`` methods. We will come back to this soon.
The other new thing in this code is the ``updates`` parameter of ``function``.
``updates`` must be supplied with a list of pairs of the form (shared-variable, new expression).
It can also be a dictionary whose keys are shared-variables and values are
the new expressions. Either way, it means "whenever this function runs, it
will replace the ``.value`` of each shared variable with the result of the
corresponding expression". Above, our accumulator replaces the ``state``'s value with the sum
of the state and the increment amount.
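The semantics of ``updates`` can be sketched in plain Python. This is an illustration only, not Theano's implementation; the names ``SharedValue`` and ``make_function`` are invented for the sketch. The key point is that all new values are computed from the *old* state before any of them is written back:

```python
class SharedValue(object):
    """A minimal stand-in for a Theano shared variable."""
    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


def make_function(body, updates):
    """Mimic the updates contract: compute outputs from the old
    state, then commit every (variable, expression) pair."""
    def call(*args):
        result = body(*args)
        # New values are all computed from the old state...
        new_values = [(var, expr(*args)) for var, expr in updates]
        # ...and only then written back.
        for var, value in new_values:
            var.set_value(value)
        return result
    return call


state = SharedValue(0)
# accumulator(inc) returns the old state, then does state <- state + inc
accumulator = make_function(
    body=lambda inc: state.get_value(),
    updates=[(state, lambda inc: state.get_value() + inc)])

print(accumulator(1))      # 0
print(accumulator(300))    # 1
print(state.get_value())   # 301
```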
Let's try it out!
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_8
......@@ -216,7 +225,7 @@ array(-1)
array(2)
As we mentioned above, you can define more than one function to use the same
shared variable. These functions can all update the value.
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_8
......@@ -228,13 +237,13 @@ array(2)
array(0)
You might be wondering why the updates mechanism exists. You can always
achieve a similar result by returning the new expressions, and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Theano has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU<using_gpu>`.
It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the
......@@ -254,15 +263,15 @@ array(7)
>>> state.get_value() # old state still there, but we didn't use it
array(0)
The ``givens`` parameter can be used to replace any symbolic variable, not just a
shared variable. You can replace constants and expressions in general. Be
careful, though, not to allow the expressions introduced by a ``givens``
substitution to be co-dependent: the order of substitution is not defined, so
the substitutions have to work in any order.
In practice, a good way of thinking about ``givens`` is as a mechanism
that allows you to replace any part of your formula with a different
expression that evaluates to a tensor of the same shape and dtype.
.. _using_random_numbers:
......@@ -272,17 +281,19 @@ Using Random Numbers
Because in Theano you first express everything symbolically and
afterwards compile this expression to get functions,
using pseudo-random numbers is not as straightforward as it is in
NumPy, though also not too complicated.
The way to think about putting randomness into Theano's computations is
to put random variables in your graph. Theano will allocate a NumPy
RandomStream object (a random number generator) for each such
variable, and draw from it as necessary. We will call this sort of
sequence of random numbers a *random stream*. *Random streams* are at
their core shared variables, so the observations on shared variables
hold here as well. Theano's random objects are defined and implemented in
:ref:`RandomStreams<libdoc_tensor_shared_randomstreams>` and, at a lower level,
in :ref:`RandomStreamsBase<libdoc_tensor_raw_random>`.
Brief Example
-------------
Here's a brief example. The setup code is:
......@@ -303,7 +314,9 @@ Here's a brief example. The setup code is:
Here, ``rv_u`` represents a random stream of 2x2 matrices of draws from a uniform
distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of
draws from a normal distribution. The distributions that are implemented are
defined in :class:`RandomStreams` and, at a lower level, in :ref:`raw_random<libdoc_tensor_raw_random>`.
.. TODO: repair the latter reference on RandomStreams
Now let's use these objects. If we call f(), we get random uniform numbers.
The internal state of the random number generator is automatically updated,
......@@ -313,22 +326,22 @@ so we get different random numbers every time.
>>> f_val1 = f() #different numbers from f_val0
When we add the extra argument ``no_default_updates=True`` to
``function`` (as in *g*), then the random number generator state is
not affected by calling the returned function. So, for example, calling
*g* multiple times will return the same numbers.
>>> g_val0 = g() # different numbers from f_val0 and f_val1
>>> g_val1 = g() # same numbers as g_val0!
An important remark is that a random variable is drawn at most once during any
single function execution. So the *nearly_zeros* function is guaranteed to
return approximately 0 (except for rounding error) even though the *rv_u*
random variable appears three times in the output expression.
>>> nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
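The "drawn at most once per call" rule can be illustrated with a plain-Python sketch using the standard library's ``random`` module. The function names below are invented for the illustration; this is not Theano's machinery. Note that reusing one draw makes the expression exactly zero, whereas three independent draws almost surely do not cancel:

```python
import random

rng = random.Random(42)


def nearly_zeros():
    # One draw per call; the same sample is reused for every
    # occurrence of the random variable in the expression:
    rv_u = rng.random()
    return rv_u + rv_u - 2 * rv_u


def three_independent_draws():
    # If each occurrence triggered a fresh draw instead, the
    # result would almost surely be nonzero:
    return rng.random() + rng.random() - 2 * rng.random()


print(nearly_zeros())  # 0.0
```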
Seeding Streams
---------------
Random variables can be seeded individually or collectively.
......@@ -346,12 +359,12 @@ of the random variables.
>>> srng.seed(902340) # seeds rv_u and rv_n with different seeds each
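The two kinds of seeding can be sketched with the standard library's ``random`` module. The helper ``seed_all`` is invented for illustration and is not part of Theano; it shows the idea behind collective seeding, where one master seed deterministically generates a distinct seed for each stream:

```python
import random

# Seeding one stream individually: the same seed reproduces the draws.
rv_u = random.Random()
rv_u.seed(222)
first = rv_u.random()
rv_u.seed(222)
assert rv_u.random() == first


def seed_all(streams, master_seed):
    """Seed several streams collectively from one master seed,
    in the spirit of srng.seed(902340)."""
    master = random.Random(master_seed)
    for s in streams:
        # each stream receives its own distinct seed
        s.seed(master.randrange(2 ** 30))


rv_n = random.Random()
seed_all([rv_u, rv_n], 902340)
assert rv_u.random() != rv_n.random()  # the streams stay distinct
```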
Sharing Streams Between Functions
---------------------------------
As usual for shared variables, the random number generators used for random
variables are common between functions. So our *nearly_zeros* function will
update the state of the generators used in function *f* above.
For example:
......@@ -364,7 +377,64 @@ For example:
>>> v2 = f() # v2 != v1
Other Random Distributions
---------------------------
There are :ref:`other distributions implemented <libdoc_tensor_raw_random>`.
.. _logistic_regression:
A Real Example: Logistic Regression
===================================
The preceding elements are featured in this more realistic example. It will be used repeatedly.
.. code-block:: python
    import numpy
    import theano
    import theano.tensor as T
    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability that target = 1
    prediction = p_1 > 0.5                             # The prediction thresholded
    xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to minimize
    gw, gb = T.grad(cost, [w, b])                      # Compute the gradient of the cost
                                                       # (we shall return to this in a
                                                       # following section of this tutorial)

    # Compile
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
    predict = theano.function(inputs=[x], outputs=prediction)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
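For comparison, the same computation can be written in NumPy alone, which makes the gradient step that ``T.grad`` derives automatically explicit. This is a sketch, not Theano code: the gradients ``gw`` and ``gb`` are derived by hand from the cost above, the initial weights are scaled down to keep the sigmoid out of saturation (a deviation from the snippet above), and only 100 steps are run for brevity:

```python
import numpy

rng = numpy.random.RandomState(0)
N, feats = 400, 784
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))

w = 0.01 * rng.randn(feats)   # small weights keep the sigmoid unsaturated
b = 0.0


def cost_of(w, b):
    """Mean cross-entropy plus the 0.01 * sum(w ** 2) penalty."""
    p = 1 / (1 + numpy.exp(-(D[0].dot(w) + b)))
    xent = -D[1] * numpy.log(p) - (1 - D[1]) * numpy.log(1 - p)
    return xent.mean() + 0.01 * (w ** 2).sum()


cost_initial = cost_of(w, b)
for i in range(100):                               # 100 steps instead of 10000
    p_1 = 1 / (1 + numpy.exp(-(D[0].dot(w) + b)))  # P(target = 1)
    delta = p_1 - D[1]                             # d(xent)/d(logit), per example
    gw = D[0].T.dot(delta) / N + 0.02 * w          # hand-derived gradient of cost
    gb = delta.mean()
    w -= 0.1 * gw
    b -= 0.1 * gb
cost_final = cost_of(w, b)

prediction = 1 / (1 + numpy.exp(-(D[0].dot(w) + b))) > 0.5
assert cost_final < cost_initial   # gradient descent reduced the cost
```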
.. _extending_theano:
================
Extending Theano
================
Theano Graphs
=============
- Theano works with symbolic graphs.
- Those graphs are bipartite (they contain two types of nodes).
- The two types of nodes are ``Apply`` and ``Variable`` nodes.
- Each ``Apply`` node has a link to the op that it executes.
Inputs and Outputs are lists of Theano variables.
.. image:: ../hpcs2011_tutorial/pics/apply_node.png
:width: 500 px
......@@ -21,27 +21,29 @@ Inputs and Outputs are lists of Theano variables
.. note::
This tutorial does not cover how to make an op that returns a view or
modifies the values in its inputs. Thus, all ops created with the
instructions described here MUST return newly allocated
memory or reuse the memory provided in the parameter
``output_storage`` of the :func:`perform` function. See :ref:`views_and_inplace`
for an explanation on how to do this.
If your op returns a view or changes the value of its inputs
without doing as prescribed in that page, Theano will run, but will
return correct results for some graphs and wrong results for others.
It is recommended that you run your tests in DebugMode (Theano flag
``mode=DebugMode``) since it verifies whether your op behaves correctly in this
regard.
.. note::
See the :ref:`dev_start_guide` for information regarding the versioning
framework, namely about *git* and *GitHub*, regarding the development workflow and
how to make a quality contribution.
Op Contract
===========
.. code-block:: python
......@@ -66,8 +68,8 @@ Op contract
pass
# C implementation: [see theano web site for other functions]
def c_code(...):
# ...
pass
# others implementation (pycuda, ...):
......@@ -81,7 +83,7 @@ Op contract
def grad(self, inputs, g):
pass
def R_op(self, inputs, eval_points):
pass
def infer_shape(node, (i0_shapes, ...))
......@@ -89,28 +91,28 @@ Op contract
.. ../extending/op.txt
There are two mandatory methods that one needs to implement.
The first one is :func:`make_node`. The second one
would describe the computations that are required to be done
at run time. Currently there are two different possibilities:
implement the :func:`perform`
and/or :func:`c_code <Op.c_code>` methods (and other related :ref:`c methods
<cop>`), or the :func:`make_thunk` method. ``perform`` makes it easy
to wrap an existing Python function into Theano. ``c_code``
and the related methods allow the op to generate C code that will be
compiled and linked by Theano. On the other hand, ``make_thunk``
will be called only once during compilation and should generate
a ``thunk``: a standalone function that when called will do the wanted computations.
This is useful if you want to generate code and compile it yourself. For
example, this allows you to use PyCUDA to compile GPU code.
Also there are two methods whose implementations are highly recommended. They are
needed in order to merge duplicate computations involving your op. So if you
do not want Theano to execute your op multiple times with the same inputs,
do implement them. Those methods are :func:`__eq__` and
:func:`__hash__`.
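Why these two methods enable merging can be sketched with plain Python objects. This is an illustration only; Theano's actual merge optimizer is more involved, and the dictionary-based deduplication below is a simplification:

```python
class DoubleOp(object):
    """A stateless op: any two instances are interchangeable."""
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))


a, b = DoubleOp(), DoubleOp()
assert a == b and hash(a) == hash(b)

# With consistent __eq__/__hash__, an optimizer can use (op, inputs)
# pairs as dictionary keys to spot duplicate computations:
seen = {}
for node_name, op, inputs in [("node0", a, ("x",)), ("node1", b, ("x",))]:
    key = (op, inputs)
    if key in seen:
        print(node_name, "duplicates", seen[key])  # node1 duplicates node0
    else:
        seen[key] = node_name

assert len(seen) == 1  # the two apply nodes were merged
```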
The :func:`infer_shape` method makes it possible to infer the shape of some variable in the
middle of the computational graph without actually computing the outputs (when possible).
This could be helpful if one only needs the shape of the output instead of the actual outputs.
......@@ -118,13 +120,13 @@ The :func:`grad` method is required if you want to differentiate some cost whose
includes your op.
The :func:`__str__` method is useful in order to provide a more meaningful
string representation of your op.
The :func:`R_op` method is needed if you want ``theano.tensor.Rop`` to
work with your op.
Op Example
==========
.. code-block:: python
......@@ -155,7 +157,7 @@ Op example
def grad(self, inputs, output_grads):
return [output_grads[0] * 2]
def R_op(self, inputs, eval_points):
# R_op can receive None as eval_points.
# That means there is no differentiable path through that input.
# If this implies that you cannot compute some outputs,
......@@ -164,7 +166,7 @@ Op example
return eval_points
return self.grad(inputs, eval_points)
You can try it as follows:
.. code-block:: python
......@@ -177,19 +179,20 @@ Try it!
print inp
print out
How To Test It
==============
Theano has some functionalities to simplify testing. These help test the
``infer_shape``, ``grad`` and ``R_op`` methods. Put the following code
in a file and execute it with the ``theano-nose`` program.
Basic Tests
-----------
You perform basic tests simply by using the op and checking that it
returns the right answer. If you detect an error, you must raise an
*exception*. You can use the ``assert`` keyword to automatically raise an
``AssertionError``.
.. code-block:: python
......@@ -210,23 +213,24 @@ exception. You can use the `assert` keyword to automatically raise an
# Compare the result computed to the expected value.
assert numpy.allclose(inp * 2, out)
Testing the infer_shape
-----------------------
When a class inherits from the ``InferShapeTester`` class, it gets the
``self._compile_and_check`` method that tests the op's ``infer_shape``
method. It tests that the op gets optimized out of the graph if only
the shape of the output is needed and not the output
itself. Additionally, it checks that the optimized graph computes
the correct shape, by comparing it to the actual shape of the computed
output.
``self._compile_and_check`` compiles a Theano function. It takes as
parameters the lists of input and output Theano variables, as would be
provided to ``theano.function``, and a list of real values to pass to the
compiled function (do not use symmetric shapes, e.g. (3, 3),
as they can easily hide errors). It also takes the op class as a parameter
in order to verify that no instance of it appears in the shape-optimized graph.
If there is an error, the function raises an exception. If you want to
see it fail, you can implement an incorrect ``infer_shape``.
......@@ -249,10 +253,10 @@ see it fail, you can implement an incorrect ``infer_shape``.
self.op_class)
Testing the gradient
--------------------
The function :ref:`verify_grad <validating_grad>`
verifies the gradient of an op or Theano graph. It compares the
analytic (symbolically computed) gradient and the numeric
gradient (computed through the Finite Difference Method).
......@@ -267,15 +271,16 @@ the multiplication by 2).
[numpy.random.rand(5, 7, 2)])
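The idea behind ``verify_grad`` can be sketched in NumPy: compare an analytic gradient against a centered finite-difference estimate. This is a simplified sketch with an invented name (``simple_verify_grad``); the real utility additionally handles multiple inputs, dtypes, and random output projections:

```python
import numpy


def simple_verify_grad(f, grad_f, x, eps=1e-6, tol=1e-5):
    """Check grad_f against a centered finite difference of the scalar f."""
    x = numpy.asarray(x, dtype=float)
    analytic = grad_f(x)
    numeric = numpy.empty_like(x)
    for i in numpy.ndindex(*x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        # centered difference: error is O(eps ** 2)
        numeric[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    assert numpy.allclose(analytic, numeric, atol=tol), (analytic, numeric)


# The "times two" op: f(x) = sum(2 * x), so df/dx = 2 everywhere.
simple_verify_grad(lambda x: (2 * x).sum(),
                   lambda x: 2 * numpy.ones_like(x),
                   numpy.random.rand(5, 7, 2))
```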
Testing the Rop
---------------
.. TODO: repair defective links in the following paragraph
The class :class:`RopLop_checker` defines the functions
:func:`RopLop_checker.check_mat_rop_lop`, :func:`RopLop_checker.check_rop_lop` and
:func:`RopLop_checker.check_nondiff_rop`. These make it possible to test the
implementation of the Rop method of a particular op.
For instance, to verify the Rop method of the DoubleOp, you can use this:
.. code-block:: python
......@@ -288,20 +293,64 @@ To verify the Rop method of the DoubleOp, you can use this:
def test_double_rop(self):
self.check_rop_lop(DoubleRop()(self.x), self.in_shape)
Testing GPU Ops
---------------
Ops to be executed on the GPU should inherit from the ``theano.sandbox.cuda.GpuOp``
and not ``theano.Op``. This allows Theano to distinguish them. Currently, we
use this to test if the NVIDIA driver works correctly with our sum reduction code on the
GPU.
Running Your Tests
==================
To perform your tests, you may select any one of the following three methods:
theano-nose
-----------
The recommended way to conduct tests is to run the ``theano-nose`` script. In a regular
Theano installation, it will be on the operating system's path and directly accessible
from any folder. Otherwise, it can be found in the ``Theano/bin`` folder. The following command
lines may be used for the corresponding purposes:
* ``theano-nose --theano``: Run every test found in Theano's path.
* ``theano-nose folder_name``: Run every test found in the folder *folder_name*.
* ``theano-nose test_file.py``: Run every test found in the file *test_file.py*.
The following are particularly useful for development purposes since they call for
particular classes or even for particular tests:
* ``theano-nose test_file.py:test_DoubleRop``: Run every test found inside the class *test_DoubleRop*.
* ``theano-nose test_file.py:test_DoubleRop.test_double_op``: Run only the test *test_double_op*
in the class *test_DoubleRop*.
Help with the use and functionalities of ``theano-nose`` may be obtained by running
it with the command-line parameter ``--help`` (``-h``).
nosetests
---------
The command ``nosetests`` can also be used. Although it lacks the useful
functionalities that ``theano-nose`` provides, ``nosetests`` can be called similarly
to ``theano-nose`` from any folder in Python's path like so:
``nosetests [suffix similar to the above]``.
More documentation on ``nosetests`` is available here:
`nosetests <http://readthedocs.org/docs/nose/en/latest/>`_.
In-file
-------
One may also add a block of code similar to the following at the end of the
file containing a specific test of interest and run the file. In this example, the test
*test_double_rop* in the class *test_DoubleRop* would be performed.
.. code-block:: python
......@@ -310,14 +359,30 @@ You can also add this at the end of the test file:
t.setUp()
t.test_double_rop()
We recommend that, when you execute a file, you run all tests in that
file. This can be done by adding the following at the end of your test files:
.. code-block:: python
    if __name__ == '__main__':
        unittest.main()
Exercise
========
- Run the code of the *DoubleOp* example above.
- Modify and execute to compute: ``x * y``.
- Modify and execute the example to return two outputs: ``x + y`` and ``x - y``.

You can omit the Rop functions. Try to implement the testing apparatus described above.
(Notice that Theano's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
SciPy
-----
......@@ -361,18 +426,15 @@ don't forget to call the parent ``setUp`` function.
For more details see :ref:`random_value_in_tests`.
:download:`Solution<extending_theano_solution_1.py>`
Final Note
==========
A more extensive discussion of this section's content may be found in the advanced
tutorial :ref:`Extending Theano<extending>`.
See :ref:`metadocumentation` for some information on how to generate
the documentation.
......
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Extending Theano'
import unittest

import theano


# 1. Op returns x * y

class ProdOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x, y):
        x = theano.tensor.as_tensor_variable(x)
        y = theano.tensor.as_tensor_variable(y)
        outdim = x.ndim
        output = (theano.tensor.TensorType
                  (dtype=theano.scalar.upcast(x.dtype, y.dtype),
                   broadcastable=[False] * outdim)())
        return theano.Apply(self, inputs=[x, y], outputs=[output])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z = output_storage[0]
        z[0] = x * y

    def infer_shape(self, node, i0_shapes):
        return [i0_shapes[0]]

    def grad(self, inputs, output_grads):
        return [output_grads[0] * inputs[1], output_grads[0] * inputs[0]]


# 2. Op returns x + y and x - y

class SumDiffOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x, y):
        x = theano.tensor.as_tensor_variable(x)
        y = theano.tensor.as_tensor_variable(y)
        outdim = x.ndim
        output1 = (theano.tensor.TensorType
                   (dtype=theano.scalar.upcast(x.dtype, y.dtype),
                    broadcastable=[False] * outdim)())
        output2 = (theano.tensor.TensorType
                   (dtype=theano.scalar.upcast(x.dtype, y.dtype),
                    broadcastable=[False] * outdim)())
        return theano.Apply(self, inputs=[x, y], outputs=[output1, output2])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z1, z2 = output_storage
        z1[0] = x + y
        z2[0] = x - y

    def infer_shape(self, node, i0_shapes):
        return [i0_shapes[0], i0_shapes[0]]

    def grad(self, inputs, output_grads):
        og1, og2 = output_grads
        if og1 is None:
            og1 = theano.tensor.zeros_like(og2)
        if og2 is None:
            og2 = theano.tensor.zeros_like(og1)
        return [og1 + og2, og1 - og2]


# 3. Testing apparatus

import numpy
from theano.gof import Op, Apply
from theano import tensor, function, printing
from theano.tests import unittest_tools as utt


class TestProdOp(utt.InferShapeTester):

    rng = numpy.random.RandomState(43)

    def setUp(self):
        super(TestProdOp, self).setUp()
        self.op_class = ProdOp  # case 1

    def test_perform(self):
        x = theano.tensor.matrix()
        y = theano.tensor.matrix()
        f = theano.function([x, y], self.op_class()(x, y))
        x_val = numpy.random.rand(5, 4)
        y_val = numpy.random.rand(5, 4)
        out = f(x_val, y_val)
        assert numpy.allclose(x_val * y_val, out)

    def test_gradient(self):
        utt.verify_grad(self.op_class(), [numpy.random.rand(5, 4),
                                          numpy.random.rand(5, 4)],
                        n_tests=1, rng=TestProdOp.rng)

    def test_infer_shape(self):
        x = tensor.dmatrix()
        y = tensor.dmatrix()
        self._compile_and_check([x, y], [self.op_class()(x, y)],
                                [numpy.random.rand(5, 6),
                                 numpy.random.rand(5, 6)],
                                self.op_class)


class TestSumDiffOp(utt.InferShapeTester):

    rng = numpy.random.RandomState(43)

    def setUp(self):
        super(TestSumDiffOp, self).setUp()
        self.op_class = SumDiffOp

    def test_perform(self):
        x = theano.tensor.matrix()
        y = theano.tensor.matrix()
        f = theano.function([x, y], self.op_class()(x, y))
        x_val = numpy.random.rand(5, 4)
        y_val = numpy.random.rand(5, 4)
        out = f(x_val, y_val)
        assert numpy.allclose([x_val + y_val, x_val - y_val], out)

    def test_gradient(self):
        def output_0(x, y):
            return self.op_class()(x, y)[0]

        def output_1(x, y):
            return self.op_class()(x, y)[1]

        utt.verify_grad(output_0, [numpy.random.rand(5, 4),
                                   numpy.random.rand(5, 4)],
                        n_tests=1, rng=TestSumDiffOp.rng)
        utt.verify_grad(output_1, [numpy.random.rand(5, 4),
                                   numpy.random.rand(5, 4)],
                        n_tests=1, rng=TestSumDiffOp.rng)

    def test_infer_shape(self):
        x = tensor.dmatrix()
        y = tensor.dmatrix()
        # adapt the choice of the next instruction to the op under test
        self._compile_and_check([x, y], self.op_class()(x, y),
                                [numpy.random.rand(5, 6),
                                 numpy.random.rand(5, 6)],
                                self.op_class)


if __name__ == "__main__":
    unittest.main()
......@@ -8,33 +8,46 @@ Frequently Asked Questions
TypeError: object of type 'TensorVariable' has no len()
-------------------------------------------------------
If you receive the following error, it is because the Python function ``__len__`` cannot
be implemented on Theano variables:
.. code-block:: python
TypeError: object of type 'TensorVariable' has no len()
Python requires that ``__len__`` returns an integer, yet it cannot be done as Theano's variables are symbolic. However, ``var.shape[0]`` can be used as a workaround.
This error message cannot be made more explicit because the relevant aspects of Python's
internals cannot be modified.
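That the constraint comes from Python itself can be shown with a plain-Python sketch (the class name ``SymbolicVariable`` is invented for illustration): returning anything other than a non-negative ``int`` from ``__len__`` makes ``len()`` raise ``TypeError``:

```python
class SymbolicVariable(object):
    """A stand-in for a Theano variable: its length is not a known integer."""
    def __len__(self):
        return "unknown"   # anything but a non-negative int


try:
    len(SymbolicVariable())
    raised = False
except TypeError:
    raised = True

assert raised  # Python itself rejects the non-integer length
```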
Faster gcc optimization
-----------------------
You can enable faster gcc optimization with the ``cxxflags``. This list of flags was suggested on the mailing list::
cxxflags=-march=native -O3 -ffast-math -ftree-loop-distribution -funroll-loops -ftracer
Use it at your own risk. Some people warned that the ``-ftree-loop-distribution`` optimization resulted in wrong results in the past.
Also the ``-march=native`` flag must be used with care if you have NFS. In that case, you MUST set the compiledir to a local path of the computer.
Related Projects
----------------
We try to list other Theano-related projects in this `wiki page <https://github.com/Theano/Theano/wiki/Related-projects>`_.
"What are Theano's Limitations?"
--------------------------------
Theano offers a good amount of flexibility, but has some limitations too.
You must answer for yourself the following question: How can my algorithm be cleverly written
so as to make the most of what Theano can do?
Here is a list of some of the known limitations:
- *while*- or *for*-loops within an expression graph are supported, but only via
the :func:`theano.scan` op (which puts restrictions on how the loop body can
interact with the rest of the graph).
- Neither *goto* nor *recursion* is supported or planned within expression graphs.
......@@ -7,54 +7,130 @@ PyCUDA/CUDAMat/Gnumpy compatibility
PyCUDA
======
Currently, PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called *CudaNdarray* and supports
*strides*. However, it supports only the *float32* dtype. PyCUDA's implementation
is called *GPUArray* and doesn't support *strides*. However, it can deal with
all NumPy and CUDA dtypes.
We are currently working on having the same base object for both that will
also mimic NumPy. Until this is ready, here is some information on how to
use both objects in the same script.
Transfer
--------
You can use the ``theano.misc.pycuda_utils`` module to convert GPUArray to and
from CudaNdarray. The functions ``to_cudandarray(x, copyif=False)`` and
``to_gpuarray(x)`` return a new object that occupies the same memory space
as the original. If this is not possible, they raise a *ValueError*. Because GPUArrays don't
support strides, if the CudaNdarray is strided, we could copy it to
have a non-strided copy. The resulting GPUArray won't share the same
memory region. If you want this behavior, set ``copyif=True`` in
``to_gpuarray``.
Compiling with PyCUDA
---------------------
You can use PyCUDA to compile CUDA functions that work directly on
CudaNdarrays. Here is an example from the file ``theano/misc/tests/test_pycuda_theano_simple.py``:
.. code-block:: python

    import numpy
    import theano
    import theano.sandbox.cuda as cuda_ndarray
    import theano.misc.pycuda_init
    import pycuda
    import pycuda.driver as drv
    import pycuda.gpuarray


    def test_pycuda_theano():
        """Simple example with a PyCUDA function and a Theano CudaNdarray object."""
        from pycuda.compiler import SourceModule
        mod = SourceModule("""
        __global__ void multiply_them(float *dest, float *a, float *b)
        {
          const int i = threadIdx.x;
          dest[i] = a[i] * b[i];
        }
        """)
        multiply_them = mod.get_function("multiply_them")

        a = numpy.random.randn(100).astype(numpy.float32)
        b = numpy.random.randn(100).astype(numpy.float32)

        # Test with Theano object
        ga = cuda_ndarray.CudaNdarray(a)
        gb = cuda_ndarray.CudaNdarray(b)
        dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
        # One thread per element of the 100-element arrays.
        multiply_them(dest, ga, gb,
                      block=(100, 1, 1), grid=(1, 1))
        assert (numpy.asarray(dest) == a * b).all()
Theano Op using a PyCUDA function
---------------------------------
You can use a GPU function compiled with PyCUDA in a Theano op:
.. code-block:: python

    import numpy
    import theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda


    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)

        def __hash__(self):
            return hash(type(self))

        def __str__(self):
            return self.__class__.__name__

        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
                cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])

        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
            __global__ void my_fct(float * i0, float * o0, int size) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                if (i < size) {
                    o0[i] = i0[i] * 2;
                }
            }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [storage_map[v] for v in node.inputs]
            outputs = [storage_map[v] for v in node.outputs]

            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape != inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512, 1, 1), grid=grid)
            return thunk
CUDAMat
=======
There are functions for conversion between CUDAMat objects and Theano's CudaNdArray objects.
They obey the same principles as Theano's PyCUDA functions and can be found in
``theano.misc.cudamat_utils.py``.
.. TODO: this statement is unclear:
WARNING: There is a peculiar problem associated with stride/shape with those converters.
In order to work, the test needs a *transpose* and *reshape*...
Gnumpy
======
There are conversion functions between Gnumpy *garray* objects and Theano CudaNdArray objects.
They are also similar to Theano's PyCUDA functions and can be found in ``theano.misc.gnumpy_utils.py``.
......@@ -6,24 +6,26 @@
Derivatives in Theano
=====================
Computing Gradients
===================
Now let's use Theano for a slightly more sophisticated task: create a
function which computes the derivative of some expression *y* with
respect to its parameter *x*. To do this we will use the macro ``T.grad``.
For instance, we can compute the
gradient of :math:`x^2` with respect to :math:`x`. Note that:
:math:`d(x^2)/dx = 2 \cdot x`.
.. TODO: fix the vertical positioning of the expressions in the preceding paragraph
Here is the code to compute this gradient:
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_4
>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy) # print out the gradient prior to optimization
'((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
......@@ -33,10 +35,10 @@ array(8.0)
>>> f(94.2)
array(188.40000000000001)
In this example, we can see from ``pp(gy)`` that we are computing
the correct symbolic gradient.
``fill((x ** 2), 1.0)`` means to make a matrix of the same shape as
``x ** 2`` and fill it with 1.0.
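As a sanity check, the symbolic result can be compared against a centered finite difference. This is plain Python, no Theano involved, and the helper name ``numeric_grad`` is just for illustration:

```python
# Plain-Python sanity check (no Theano): the symbolic gradient of x ** 2
# evaluated at x = 4 should agree with a centered finite difference.
def numeric_grad(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

g = numeric_grad(lambda t: t ** 2, 4.0)
assert abs(g - 8.0) < 1e-4
```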
.. note::
The optimizer simplifies the symbolic gradient expression. You can see
......@@ -56,7 +58,7 @@ logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.
.. figure:: dlogistic.png
A plot of the gradient of the logistic function, with *x* on the x-axis
and :math:`ds(x)/dx` on the y-axis.
......@@ -71,133 +73,137 @@ logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.
array([[ 0.25 , 0.19661193],
[ 0.19661193, 0.10499359]])
In general, for any **scalar** expression *s*, ``T.grad(s, w)`` provides
the Theano expression for computing :math:`\frac{\partial s}{\partial w}`. In
this way Theano can be used for doing **efficient** symbolic differentiation
(as the expression returned by ``T.grad`` will be optimized during compilation), even for
functions with many inputs (see `automatic differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_ for a description
of symbolic differentiation).
.. note::
The second argument of ``T.grad`` can be a list, in which case the
output is also a list. The order in both lists is important: element
*i* of the output list is the gradient of the first argument of
``T.grad`` with respect to the *i*-th element of the list given as second argument.
The first argument of ``T.grad`` has to be a scalar (a tensor
of size 1). For more information on the semantics of the arguments of
``T.grad`` and details about the implementation, see
:ref:`this<libdoc_gradient>` section of the library.
Additional information on the inner workings of differentiation may also be
found in the more advanced tutorial :ref:`Extending Theano<extending>`.
Computing the Jacobian
======================
In Theano's parlance, the term *Jacobian* designates the tensor comprising the
first partial derivatives of the output of a function with respect to its inputs.
(This is a generalization of the so-called Jacobian matrix in mathematics.)
Theano implements the :func:`theano.gradient.jacobian` macro that does all
that is needed to compute the Jacobian. The following text explains how
to do it manually.
In order to manually compute the Jacobian of some function *y* with
respect to some parameter *x* we need to use ``scan``. What we
do is to loop over the entries in *y* and compute the gradient of
*y[i]* with respect to *x*.
.. note::
``scan`` is a generic op in Theano that allows writing in a symbolic
manner all kinds of recurrent equations. While creating
symbolic loops (and optimizing them for performance) is a hard task,
effort is being made to improve the performance of ``scan``. We
shall return to :ref:`scan<tutloop>` later in this tutorial.
>>> x = T.dvector('x')
>>> y = x ** 2
>>> J, updates = theano.scan(lambda i, y, x: T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y, x])
>>> f = function([x], J, updates=updates)
>>> f([4, 4])
array([[ 8., 0.],
[ 0., 8.]])
What we do in this code is to generate a sequence of *ints* from *0* to
``y.shape[0]`` using ``T.arange``. Then we loop through this sequence, and
at each step, we compute the gradient of element ``y[[i]`` with respect to
``x``. ``scan`` automatically concatenates all these rows, generating a
matrix, which corresponds to the Jacobian.
at each step, we compute the gradient of element *y[i]* with respect to
*x*. ``scan`` automatically concatenates all these rows, generating a
matrix which corresponds to the Jacobian.
.. note::
There are some pitfalls to be aware of regarding ``T.grad``. One of them is that you
cannot re-write the above expression of the Jacobian as
``theano.scan(lambda y_i,x: T.grad(y_i,x), sequences=y,
non_sequences=x)``, even though from the documentation of scan this
seems possible. The reason is that *y_i* will not be a function of
*x* anymore, while *y[i]* still is.
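As a quick cross-check of the scan-based result, here is the same Jacobian built directly with NumPy (no Theano involved): for elementwise *y = x \*\* 2* the Jacobian is simply *diag(2x)*.

```python
import numpy

# For elementwise y = x ** 2, dy_i/dx_j = 2 * x_i when i == j and 0 otherwise,
# so the Jacobian is diag(2 * x) -- matching the scan-based doctest above.
x_val = numpy.array([4., 4.])
J_val = numpy.diag(2 * x_val)
assert numpy.allclose(J_val, [[8., 0.], [0., 8.]])
```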
Computing the Hessian
=====================
In Theano, the term *Hessian* has the usual mathematical meaning: it is the
matrix comprising the second-order partial derivatives of a function with scalar
output and vector input. Theano implements the :func:`theano.gradient.hessian` macro that does all
that is needed to compute the Hessian. The following text explains how
to do it manually.
You can compute the Hessian manually similarly to the Jacobian. The only
difference is that now, instead of computing the Jacobian of some expression
*y*, we compute the Jacobian of ``T.grad(cost,x)``, where *cost* is some
scalar.
>>> x = T.dvector('x')
>>> y = x ** 2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy, x: T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
>>> f = function([x], H, updates=updates)
>>> f([4, 4])
array([[ 2., 0.],
[ 0., 2.]])
Jacobian times a Vector
=======================
Sometimes we can express the algorithm in terms of Jacobians times vectors,
or vectors times Jacobians. Compared to evaluating the Jacobian and then
doing the product, there are methods that compute the desired results while
avoiding actual evaluation of the Jacobian. This can bring about significant
performance gains. A description of one such algorithm can be found here:
* Barak A. Pearlmutter, "Fast Exact Multiplication by the Hessian", *Neural
Computation, 1994*
While in principle we would want Theano to identify these patterns automatically for us,
in practice, implementing such optimizations in a generic manner is extremely
difficult. Therefore, we provide special functions dedicated to these tasks.
R-operator
----------
The *R operator* is built to evaluate the product between a Jacobian and a
vector, namely :math:`\frac{\partial f(x)}{\partial x} v`. The formulation
can be extended even to *x* being a matrix, or a tensor in general, in which case
the Jacobian also becomes a tensor and the product becomes some kind
of tensor product. Because in practice we end up needing to compute such
expressions in terms of weight matrices, Theano supports this more generic
form of the operation. In order to evaluate the *R-operation* of
expression *y*, with respect to *x*, multiplying the Jacobian with *v*
you need to do something similar to this:
>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W, V, x], JV)
>>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0, 1])
array([ 2., 2.])
:ref:`List <R_op_list>` of Ops that implement Rop.
......@@ -205,51 +211,50 @@ array([ 2., 2.])
L-operator
----------
Similarly to the *R-operator*, the *L-operator* computes a *row* vector times
the Jacobian. The mathematical formula would be :math:`v \frac{\partial
f(x)}{\partial x}`. The *L-operator* is also supported for generic tensors
(not only for vectors). Similarly, it can be implemented as follows:
>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([W, v, x], VJ)
>>> f([[1, 1], [1, 1]], [2, 2], [0, 1])
array([[ 0., 0.],
[ 2., 2.]])
.. note::
`v`, the *point of evaluation*, differs between the *L-operator* and the *R-operator*.
For the *L-operator*, the point of evaluation needs to have the same shape
as the output, whereas for the *R-operator* this point should
have the same shape as the input parameter. Furthermore, the results of these two
operations differ. The result of the *L-operator* is of the same shape
as the input parameter, while the result of the *R-operator* has a shape similar
to that of the output.
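These shape conventions can be checked with plain NumPy (no Theano involved). For *y = dot(x, W)*, the partial derivative of *y_j* with respect to *W_ij* is *x_i*, so both products reduce to simple expressions:

```python
import numpy

# For y = dot(x, W): dy_j/dW_ij = x_i.
# R-operator: Jacobian times V (V shaped like the input W) -> shape of y.
# L-operator: v times Jacobian (v shaped like the output y) -> shape of W.
x = numpy.array([0., 1.])
V = 2 * numpy.ones((2, 2))      # same shape as W
v = numpy.array([2., 2.])       # same shape as y

Rop_result = x.dot(V)           # shape (2,), like the output y
Lop_result = numpy.outer(x, v)  # shape (2, 2), like the input W

assert numpy.allclose(Rop_result, [2., 2.])    # matches the T.Rop doctest
assert numpy.allclose(Lop_result, [[0., 0.],
                                   [2., 2.]])  # matches the T.Lop doctest
```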
Hessian times a Vector
======================
If you need to compute the *Hessian times a vector*, you can make use of the
above-defined operators to do it more efficiently than actually computing
the exact Hessian and then performing the product. Due to the symmetry of the
Hessian matrix, you have two options that will
give you the same result, though these options might exhibit differing performances.
Hence, we suggest profiling the methods before using either one of the two:
>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> vH = T.grad(T.sum(gy * v), x)
>>> f = theano.function([x, v], vH)
>>> f([4, 4], [2, 2])
array([ 4., 4.])
......@@ -257,10 +262,26 @@ or, making use of the *R-operator*:
>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy, x, v)
>>> f = theano.function([x, v], Hv)
>>> f([4, 4], [2, 2])
array([ 4., 4.])
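Both variants can be cross-checked by hand with NumPy: for *y = sum(x \*\* 2)* the Hessian is *2I*, so multiplying it by *v* just doubles *v*.

```python
import numpy

# For y = sum(x ** 2), the Hessian is 2 * I, so H.v == 2 * v,
# which reproduces the [4., 4.] result of both doctests above.
v = numpy.array([2., 2.])
H = 2. * numpy.eye(2)
assert numpy.allclose(H.dot(v), [4., 4.])
```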
Final Pointers
==============
* The ``grad`` function works symbolically: it receives and returns Theano variables.
* ``grad`` can be compared to a macro since it can be applied repeatedly.
* Only scalar costs can be directly handled by ``grad``; arrays are handled through repeated applications.
* Built-in functions allow computing *vector times Jacobian* and *vector times Hessian* efficiently.
* Work is in progress on the optimizations required to compute efficiently the full
Jacobian and the Hessian matrix as well as the *Jacobian times vector*.
......@@ -5,20 +5,21 @@
Tutorial
========
Let us start an interactive session (e.g. with ``python`` or ``ipython``) and import Theano.
>>> from theano import *
Several of the symbols you will need to use are in the ``tensor`` subpackage
of Theano. Let us import that subpackage under a handy name like
``T`` (the tutorials will frequently use this convention).
>>> import theano.tensor as T
If that succeeded you are ready for the tutorial, otherwise check your
installation (see :ref:`install`).
Throughout the tutorial, bear in mind that there is a :ref:`glossary` as well
as *index* and *modules* links in the upper-right corner of each page to help
you out.
.. toctree::
......@@ -27,18 +28,18 @@ you out.
numpy
adding
examples
symbolic_graphs
printing_drawing
gradients
modes
loading_and_saving
conditions
loop
sparse
using_gpu
gpu_data_convert
aliasing
shape_info
remarks
debug_faq
extending_theano
faq
......@@ -6,8 +6,8 @@ Loading and Saving
==================
Python's standard way of saving class instances and reloading them
is the pickle_ mechanism. Many Theano objects can be *serialized* (and
*deserialized*) by ``pickle``, however, a limitation of ``pickle`` is that
it does not save the code or data of a class along with the instance of
the class being serialized. As a result, reloading objects created by a
previous version of a class can be really problematic.
......@@ -24,7 +24,7 @@ as you would in the course of any other Python program.
.. _pickle: http://docs.python.org/library/pickle.html
The Basics of Pickling
======================
The two modules ``pickle`` and ``cPickle`` have the same functionalities, but
......@@ -45,7 +45,7 @@ You can serialize (or *save*, or *pickle*) objects to a file with
.. note::
If you want your saved object to be stored efficiently, don't forget
to use ``cPickle.HIGHEST_PROTOCOL``. The resulting file can be
dozens of times smaller than with the default protocol.
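This is easy to verify with the standard ``pickle`` module (``cPickle`` is its faster C counterpart in Python 2); the payload below is just an arbitrary list standing in for real saved parameters:

```python
import pickle

# An arbitrary payload standing in for saved parameters.
data = list(range(100000))

text_proto = pickle.dumps(data, 0)                          # oldest, text-based protocol
binary_proto = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)  # compact binary protocol

# The highest (binary) protocol produces a much smaller byte string.
assert len(binary_proto) < len(text_proto)
```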
.. note::
......@@ -81,7 +81,7 @@ For more details about pickle's usage, see
`Python documentation <http://docs.python.org/library/pickle.html#usage>`_.
Short-Term Serialization
========================
If you are confident that the class instance you are serializing will be
......@@ -114,7 +114,7 @@ For instance, you can define functions along the lines of:
self.training_set = cPickle.load(file(self.training_set_file, 'rb'))
Long-Term Serialization
=======================
If the implementation of the class you want to save is quite unstable, for
......@@ -126,7 +126,7 @@ maybe defining the attributes you want to save, rather than the ones you
don't.
For instance, if the only parameters you want to save are a weight
matrix *W* and a bias *b*, you can define:
.. code-block:: python
......@@ -138,8 +138,8 @@ matrix ``W`` and a bias ``b``, you can define:
self.W = W
self.b = b
If at some point in time *W* is renamed to *weights* and *b* to
*bias*, the older pickled files will still be usable, if you update these
functions to reflect the change in name:
.. code-block:: python
......@@ -152,6 +152,6 @@ functions to reflect the change in name:
self.weights = W
self.bias = b
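Putting the idea together, here is a self-contained sketch (the class name ``Model`` and its attributes are hypothetical): because the pickled state stays a plain *(W, b)* tuple, the attribute names inside the class can change freely between saving and loading.

```python
import pickle

class Model(object):
    # Hypothetical class: only the parameters are worth saving.
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def __getstate__(self):
        # The saved state is a plain (W, b) tuple, independent of the
        # current attribute names, so old pickles stay readable.
        return (self.weights, self.bias)

    def __setstate__(self, state):
        W, b = state
        self.weights = W
        self.bias = b

m = Model([1.0, 2.0], 0.5)
m2 = pickle.loads(pickle.dumps(m, pickle.HIGHEST_PROTOCOL))
assert m2.weights == [1.0, 2.0] and m2.bias == 0.5
```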
For more information on advanced use of ``pickle`` and its internals, see Python's
pickle_ documentation.
......@@ -4,4 +4,94 @@
Loop
====
Scan
====
- A general form of *recurrence*, which can be used for looping.
- *Reduction* and *map* (loop over the leading dimensions) are special cases of ``scan``.
- You ``scan`` a function along some input sequence, producing an output at each time-step.
- The function can see the *previous K time-steps* of your function.
- ``sum()`` could be computed by scanning the *z + x(i)* function over a list, given an initial state of *z=0*.
- Often a *for* loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over *for* loops:
  - Allows the number of iterations to be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Theano function.
- Can lower the overall memory usage by detecting the actual amount of memory needed.
The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
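To fix ideas, here is what the recurrence computed by ``scan`` looks like in ordinary Python. This is an illustration only, not how ``scan`` is implemented:

```python
def scan_like(fn, sequence, outputs_info):
    """Plain-Python analogue of scan's recurrence: at each step, fn
    receives the previous output and the current sequence element."""
    outputs = []
    prev = outputs_info
    for x in sequence:
        prev = fn(prev, x)
        outputs.append(prev)
    return outputs

# sum() as a scan: accumulate z + x(i), starting from the state z = 0.
partial_sums = scan_like(lambda z, x: z + x, [1, 2, 3, 4], 0)
assert partial_sums == [1, 3, 6, 10]
```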
**Scan Example: Computing pow(A,k)**
.. code-block:: python

    import theano
    import theano.tensor as T
    theano.config.warn.subtensor_merge_bug = False

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A ** 1 through A ** k.  Keep only the last
    # value. Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [ 0.  1.  4.  9.  16.  25.  36.  49.  64.  81.]
**Scan Example: Calculating a Polynomial**
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T
    theano.config.warn.subtensor_merge_bug = False

    coefficients = theano.tensor.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = theano.tensor.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                      coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)

    polynomial = components.sum()
    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
-------------------------------------------
**Exercise**
Run both examples.
Modify and execute the polynomial example to have the reduction done by ``scan``.
:download:`Solution<loop_solution_1.py>`
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Loop'
import numpy
import theano
import theano.tensor as tt

# 1. First example
theano.config.warn.subtensor_merge_bug = False

k = tt.iscalar("k")
A = tt.vector("A")

def inner_fct(prior_result, A):
    return prior_result * A

# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
                              outputs_info=tt.ones_like(A),
                              non_sequences=A, n_steps=k)

# Scan has provided us with A ** 1 through A ** k.  Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]

power = theano.function(inputs=[A, k], outputs=final_result,
                        updates=updates)

print power(range(10), 2)
# [ 0.  1.  4.  9.  16.  25.  36.  49.  64.  81.]

# 2. Second example
coefficients = tt.vector("coefficients")
x = tt.scalar("x")
max_coefficients_supported = 10000

# Generate the components of the polynomial
full_range = tt.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                  coeff * (free_var ** power),
                                  sequences=[coefficients, full_range],
                                  outputs_info=None,
                                  non_sequences=x)
polynomial = components.sum()
calculate_polynomial1 = theano.function(inputs=[coefficients, x],
                                        outputs=polynomial)

test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print calculate_polynomial1(test_coeff, 3)
# 19.0

# 3. Reduction performed inside scan
theano.config.warn.subtensor_merge_bug = False

coefficients = tt.vector("coefficients")
x = tt.scalar("x")
max_coefficients_supported = 10000

# Generate the components of the polynomial
full_range = tt.arange(max_coefficients_supported)
outputs_info = tt.as_tensor_variable(numpy.asarray(0, 'float64'))
components, updates = theano.scan(fn=lambda coeff, power, prior_value, free_var:
                                  prior_value + (coeff * (free_var ** power)),
                                  sequences=[coefficients, full_range],
                                  outputs_info=outputs_info,
                                  non_sequences=x)
polynomial = components[-1]
calculate_polynomial = theano.function(inputs=[coefficients, x],
                                       outputs=polynomial, updates=updates)

test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print calculate_polynomial(test_coeff, 3)
# 19.0
.. _using_modes:
==========================================
Configuration Settings and Compiling Modes
==========================================
Configuration
=============
The ``config`` module contains several *attributes* that modify Theano's behavior. Many of these
attributes are examined during the import of the ``theano`` module and several are assumed to be
read-only.
*As a rule, the attributes in the* ``config`` *module should not be modified inside the user code.*
Theano's code comes with default values for these attributes, but you can
override them from your ``.theanorc`` file, and override those values in turn by
the :envvar:`THEANO_FLAGS` environment variable.
The order of precedence is:
1. an assignment to theano.config.<property>
2. an assignment in :envvar:`THEANO_FLAGS`
3. an assignment in the .theanorc file (or the file indicated in :envvar:`THEANORC`)
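For instance, the same script can be run under different settings without editing any file by setting :envvar:`THEANO_FLAGS` on the command line (``script.py`` below is a placeholder for your own program):

```shell
# Override floatX and the compilation mode for a single run.  These values
# take precedence over .theanorc, but not over assignments to
# theano.config made inside the script itself.
THEANO_FLAGS='floatX=float32,mode=FAST_RUN' python script.py
```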
You can display the current/effective configuration at any time by printing
theano.config. For example, to see a list of all active configuration
variables, type this from the command-line:
.. code-block:: bash

    python -c 'import theano; print theano.config' | less
For more detail, see :ref:`Configuration <libdoc_config>` in the library.
-------------------------------------------
**Exercise**
Consider the logistic regression:
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5                   # The prediction that is done: 0 or 1
    xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to optimize
    gw, gb = T.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
                inputs=[x, y],
                outputs=[prediction, xent],
                updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
                name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])
Modify and execute this example to run on CPU (the default) with floatX=float32 and
time the execution using the command line ``time python file.py``. Save your code
as it will be useful later on.
.. Note::
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* Cast inputs before storing them into a shared variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
:download:`Solution<modes_solution_1.py>`
-------------------------------------------
Mode
====
Every time :func:`theano.function <function.function>` is called,
the symbolic relationships between the input and output Theano *variables*
are optimized and compiled. The way this compilation occurs
is controlled by the value of the ``mode`` parameter.
Theano defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations, and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
  implementations. This mode can take much longer than the other modes, but can identify
  several kinds of problems.
- ``'ProfileMode'``: Same optimizations as FAST_RUN, but print some profiling information.
The default mode is typically ``FAST_RUN``, but it can be controlled via
the configuration variable :attr:`config.mode`,
which can be overridden by passing the keyword argument to :func:`theano.function <function.function>`.
================= =============================================================== ===============================================================================
short name Full constructor What does it do?
================= =============================================================== ===============================================================================
``FAST_COMPILE``  ``compile.mode.Mode(linker='py', optimizer='fast_compile')``    Python implementations only, quick and cheap graph transformations
``FAST_RUN``      ``compile.mode.Mode(linker='cvm', optimizer='fast_run')``       C implementations where available, all available graph transformations.
``DebugMode``     ``compile.debugmode.DebugMode()``                               Both implementations where available, all available graph transformations.
``ProfileMode``   ``compile.profilemode.ProfileMode()``                           C implementations where available, all available graph transformations, print profile information.
================= =============================================================== ===============================================================================
Linkers
=======
A mode is composed of two things: an optimizer and a linker. Some modes,
like ``ProfileMode`` and ``DebugMode``, add logic around the optimizer and
linker. ``ProfileMode`` and ``DebugMode`` use their own linker.
You can select which linker to use with the Theano flag :attr:`config.linker`.
Here is a table to compare the different linkers.
============= ========= ================= ========= ===
linker gc [#gc]_ Raise error by op Overhead Definition
============= ========= ================= ========= ===
c|py [#cpy1]_ yes       yes               "+++"     Try C code. If none exists for an op, use Python
cvm           yes       yes               "++"      As c|py, but the runtime algo to execute the code is in C
cvm_nogc      no        yes               "+"       As cvm, but without gc
c|py_nogc     no        yes               "++"      As c|py, but without gc
c             no        yes               "+"       Use only C code (if none available for an op, raise an error)
py            yes       yes               "+++"     Use only Python code
c&py [#cpy2]_ no        yes               "+++++"   Use C and Python code
ProfileMode   no        no                "++++"    Compute some extra profiling info
DebugMode     no        yes               VERY HIGH Make many checks on what Theano computes
============= ========= ================= ========= ===
.. [#gc] Garbage collection of intermediate results during computation.
         Otherwise, the memory space used by the ops is kept between
         Theano function calls, in order not to
         reallocate memory, and lower the overhead (make it faster...).

.. [#cpy1] Default
.. [#cpy2] Deprecated
For more detail, see :ref:`Mode<libdoc_compile_mode>` in the library.
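How a mode simply pairs an optimizer with a linker can be sketched in plain
Python. The classes below are illustrative stand-ins, not Theano's actual
internals:

.. code-block:: python

    # Illustrative sketch only: these classes mimic how a mode bundles
    # an optimizer with a linker; they are not Theano's real API.

    class FakeOptimizer(object):
        def __init__(self, name):
            self.name = name

        def optimize(self, graph):
            # A real optimizer rewrites the graph; here we just tag it.
            return graph + ['optimized by ' + self.name]

    class FakeLinker(object):
        def __init__(self, name):
            self.name = name

        def make_callable(self, graph):
            # A real linker turns the graph into executable code.
            return lambda: (graph, 'executed by ' + self.name)

    class FakeMode(object):
        """A mode is just an (optimizer, linker) pair."""
        def __init__(self, optimizer, linker):
            self.optimizer = optimizer
            self.linker = linker

        def compile(self, graph):
            return self.linker.make_callable(self.optimizer.optimize(graph))

    fast_run = FakeMode(FakeOptimizer('fast_run'), FakeLinker('cvm'))
    fn = fast_run.compile(['dot', 'sigmoid'])

A real optimizer rewrites the graph for speed or stability and a real linker
generates and runs C or Python code, but the pairing itself is the whole idea
of a mode.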
.. _using_debugmode:
Using DebugMode
===============
While normally you should use the ``FAST_RUN`` or ``FAST_COMPILE`` mode,
it is useful at first (especially when you are defining new kinds of
expressions or new optimizations) to run your code using the DebugMode
(available via ``mode='DebugMode'``). The DebugMode is designed to
run several self-checks and assertions that can help diagnose
possible programming errors leading to incorrect output. Note that
``DebugMode`` is much slower than ``FAST_RUN`` or ``FAST_COMPILE`` so
use it only during development (not when you launch 1000 processes on a
cluster!).
DebugMode is used as follows:

.. code-block:: python

    x = T.dvector('x')
    f = theano.function([x], 10 * x, mode='DebugMode')

    f([5])
    f([0])
If any problem is detected, DebugMode will raise an exception according to
what went wrong, either at call time (``f([5])``) or compile time (
``f = theano.function([x], 10 * x, mode='DebugMode')``). These exceptions
should *not* be ignored; talk to your local Theano guru or email the
users list if you cannot make the exception go away.
Some kinds of errors can only be detected for certain input value combinations.
In the example above, there is no way to guarantee that a future call to,
say, ``f([-1])``, won't cause a problem. DebugMode is not a silver bullet.
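Conceptually, the per-op check DebugMode performs can be sketched as running a
fast implementation and a reference implementation side by side and failing
loudly on any disagreement. This is a simplified sketch with hypothetical
helper names, not the real DebugMode machinery:

.. code-block:: python

    import math

    def sigmoid_reference(x):
        # Straightforward "Python implementation".
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_fast(x):
        # Stand-in for an optimized "C implementation"; here, an
        # equivalent tanh-based formulation of the same function.
        return 0.5 * (1.0 + math.tanh(x / 2.0))

    def debug_check(f_fast, f_ref, inputs, tol=1e-12):
        """Run both implementations and fail loudly on any mismatch,
        which is roughly what DebugMode does for each op it executes."""
        for x in inputs:
            a, b = f_fast(x), f_ref(x)
            if abs(a - b) > tol:
                raise AssertionError('mismatch at %r: %r vs %r' % (x, a, b))
        return True

Note that, as with DebugMode itself, only the input values actually tried are
checked: an input never exercised can still hide a problem.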
.. TODO: repair the following link
If you instantiate DebugMode using the constructor (see :class:`DebugMode`)
rather than the keyword ``DebugMode`` you can configure its behaviour via
constructor arguments. The keyword version of DebugMode (which you get by using ``mode='DebugMode'``)
is quite strict.
For more detail, see :ref:`DebugMode<debugmode>` in the library.
.. _using_profilemode:
ProfileMode
===========
Besides checking for errors, another important task is to profile your
code. For this Theano uses a special mode called ProfileMode which has
to be passed as an argument to :func:`theano.function <function.function>`.
Using the ProfileMode is a three-step process.
.. note::
   To switch the default accordingly, set the Theano flag
   :attr:`config.mode` to ProfileMode. In that case, when the Python
   process exits, it will automatically print the profiling
   information on the standard output.
The memory profile of the output of each ``apply`` node can be enabled with the
Theano flag :attr:`config.ProfileMode.profile_memory`.
For more detail, see :ref:`ProfileMode <profilemode>` in the library.
Creating a ProfileMode Instance
-------------------------------
First create a ProfileMode instance:
>>> from theano import ProfileMode
>>> profmode = theano.ProfileMode(optimizer='fast_run', linker=theano.gof.OpWiseCLinker())
implementation only, should use the ``gof.PerformLinker`` (or "py" for
short). On the other hand, a user wanting to profile his graph using C
implementations wherever possible should use the ``gof.OpWiseCLinker``
(or "c|py"). For testing the speed of your code we would recommend
using the ``fast_run`` optimizer and the ``gof.OpWiseCLinker`` linker.
Compiling your Graph with ProfileMode
-------------------------------------
Once the ProfileMode instance is created, simply compile your graph as you
would normally, by specifying the mode parameter.
>>> minst = m.make(mode=profmode)
Retrieving Timing Information
-----------------------------
Once your graph is compiled, simply run the program or operation you wish to
profile, then call ``profmode.print_summary()``. This will provide you with
the desired timing information, indicating where your graph is spending most
of its time. This is best shown through an example. Let's use our logistic
regression example.
Compiling the module with ``ProfileMode`` and calling ``profmode.print_summary()``
generates the following output:
.. code-block:: python
"""
This output has two components. In the first section called
*Apply-wise summary*, timing information is provided for the worst
offending ``Apply`` nodes. This corresponds to individual op applications
within your graph which took longest to execute (so if you use
``dot`` twice, you will see two entries there). In the second portion,
the *Op-wise summary*, the execution time of all ``Apply`` nodes executing
the same op are grouped together and the total execution time per op
is shown (so if you use ``dot`` twice, you will see only one entry
there corresponding to the sum of the time spent in each of them).
Finally, notice that the ``ProfileMode`` also shows which ops were running a C
implementation.
For more detail, see :ref:`ProfileMode<libdoc_compile_mode>` in the library.
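The bookkeeping behind such summaries can be sketched with a small per-op
timer. The ``OpProfiler`` class below is a hypothetical helper, not
ProfileMode's actual implementation:

.. code-block:: python

    import time
    from collections import defaultdict

    class OpProfiler(object):
        """Accumulate total time and call count per op name,
        roughly what an Op-wise summary aggregates."""
        def __init__(self):
            self.total = defaultdict(float)
            self.calls = defaultdict(int)

        def run(self, op_name, fn, *args):
            t0 = time.time()
            result = fn(*args)
            self.total[op_name] += time.time() - t0
            self.calls[op_name] += 1
            return result

    prof = OpProfiler()
    prof.run('dot', sum, [1, 2, 3])   # two applications of the same op...
    prof.run('dot', sum, [4, 5, 6])   # ...are grouped under one entry

An Apply-wise summary would instead keep one entry per call site, which is why
using ``dot`` twice shows two entries there but only one in the Op-wise summary.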
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Configuration Settings and Compiling Modes'
import numpy
import theano
import theano.tensor as tt
theano.config.floatX = 'float32'
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = tt.matrix("x")
y = tt.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1) # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b])
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=prediction,
name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
where each example has dimension 5. If this were the input of a
neural network then the weights from the input to the first hidden
layer would represent a matrix of size (5, #hid).
Consider this array:
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])
array([[ 1., 2.],
This is a 3x2 matrix, i.e. there are 3 rows and 2 columns.
To access the entry in the 3rd row (row #2) and the 1st column (column #0):
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])[2, 0]
5.0
array([2., 4., 6.])
The smaller array ``b`` (actually a scalar here, which works like a 0-d array) in this case is *broadcasted* to the same size
as ``a`` during the multiplication. This trick is often useful in
simplifying how expressions are written. More detail about *broadcasting*
can be found in the `numpy user guide <http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html>`__.
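For intuition, the scalar case of this rule can be sketched in pure Python
(numpy implements a much more general version of it):

.. code-block:: python

    def broadcast_mul(a, b):
        # If b is a scalar, stretch it to a's length before multiplying,
        # which is what numpy's broadcasting does implicitly.
        if not isinstance(b, list):
            b = [b] * len(a)
        if len(a) != len(b):
            raise ValueError('shapes do not match and cannot broadcast')
        return [x * y for x, y in zip(a, b)]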
.. _tutorial_printing_drawing:
==============================
Printing/Drawing Theano graphs
==============================
.. TODO: repair the defective links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
Theano also provides :func:`pydotprint` that creates a *png* image of the function.
You can read about them in :ref:`libdoc_printing`.
Consider again the logistic regression example, but notice the additional printing instructions.
The following output depicts the pre- and post-compilation graphs.
.. code-block:: python
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1) # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w, b])
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=prediction,
name="predict")
if any( [x.op.__class__.__name__=='Gemv' for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any( [x.op.__class__.__name__=='GpuGemm' for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
# Print the picture graphs
# after compilation
theano.printing.pydotprint(predict,
outfile="pics/logreg_pydotprint_predic.png",
var_with_name_simple=True)
# before compilation
theano.printing.pydotprint_variables(prediction,
outfile="pics/logreg_pydotprint_prediction.png",
var_with_name_simple=True)
theano.printing.pydotprint(train,
outfile="pics/logreg_pydotprint_train.png",
var_with_name_simple=True)
Pretty Printing
===============
``theano.printing.pprint(variable)``
>>> theano.printing.pprint(prediction) # (pre-compilation)
gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),TensorConstant{0.5})
Debug Printing
==============
``theano.printing.debugprint({fct, variable, list of variables})``
>>> theano.printing.debugprint(prediction) # (pre-compilation)
Elemwise{gt,no_inplace} [@181772236] ''
|Elemwise{true_div,no_inplace} [@181746668] ''
| |InplaceDimShuffle{x} [@181746412] ''
| | |TensorConstant{1} [@181745836]
| |Elemwise{add,no_inplace} [@181745644] ''
| | |InplaceDimShuffle{x} [@181745420] ''
| | | |TensorConstant{1} [@181744844]
| | |Elemwise{exp,no_inplace} [@181744652] ''
| | | |Elemwise{sub,no_inplace} [@181744012] ''
| | | | |Elemwise{neg,no_inplace} [@181730764] ''
| | | | | |dot [@181729676] ''
| | | | | | |x [@181563948]
| | | | | | |w [@181729964]
| | | | |InplaceDimShuffle{x} [@181743788] ''
| | | | | |b [@181730156]
|InplaceDimShuffle{x} [@181771788] ''
| |TensorConstant{0.5} [@181771148]
>>> theano.printing.debugprint(predict) # (post-compilation)
Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] '' 2
|dot [@183018796] '' 1
| |x [@183000780]
| |w [@183000812]
|InplaceDimShuffle{x} [@183133580] '' 0
| |b [@183000876]
|TensorConstant{[ 0.5]} [@183084108]
Picture Printing
================
>>> theano.printing.pydotprint_variables(prediction) # (pre-compilation)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_prediction.png
:width: 800 px
Notice that ``pydotprint()`` requires *Graphviz* and Python's ``pydot``.
>>> theano.printing.pydotprint(predict) # (post-compilation)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_predic.png
:width: 800 px
>>> theano.printing.pydotprint(train) # This is a small train example!
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_train.png
:width: 1500 px
Python tutorial
***************
In this documentation, we suppose that the reader knows Python. Here is a small list of Python
tutorials/exercises if you need to learn it or only need a refresher:
* `Python Challenge <http://www.pythonchallenge.com/>`__
* `Dive into Python <http://diveintopython.net/>`__
.. _tutorial_general_remarks:
====================
Some General Remarks
====================
Theano offers quite a bit of flexibility, but has some limitations too.
How should you write your algorithm to make the most of what Theano can do?
Limitations
-----------
- While- or for-Loops within an expression graph are supported, but only via
the :func:`theano.scan` op (which puts restrictions on how the loop body can
interact with the rest of the graph).
- Neither ``goto`` nor recursion is supported or planned within expression graphs.
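As an illustration of what :func:`theano.scan` expresses, a loop that threads
an accumulator through its body corresponds to the following plain-Python
sketch (the helper name is ours, not Theano's):

.. code-block:: python

    def scan_like(fn, sequence, init):
        """Sketch of the pattern theano.scan expresses: apply `fn` to each
        element, threading an accumulator through the loop body."""
        acc = init
        outputs = []
        for item in sequence:
            acc = fn(item, acc)
            outputs.append(acc)
        return outputs

    # A cumulative sum, the kind of loop you would write with theano.scan:
    cumsum = scan_like(lambda x, acc: x + acc, [1, 2, 3, 4], 0)

The restriction mentioned above is that, in Theano, the loop body ``fn`` must
itself be expressed symbolically rather than as arbitrary Python code.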
.. _shape_info:
==========================================
How Shape Information is Handled by Theano
==========================================
It is not possible to strictly enforce the shape of a Theano variable when
building a graph since the particular value provided at run-time for a parameter of a
Theano function may condition the shape of the Theano variables in its graph.
Currently, information regarding shape is used in two ways in Theano:
- To generate faster C code for the 2d convolution on the CPU and the GPU,
  when the exact output shape is known in advance.
- To remove computations in the graph when we only want to know the
  shape, but not the actual value of a variable. This is done with the
  `Op.infer_shape <http://deeplearning.net/software/theano/extending/cop.html#Op.infer_shape>`_
  method.
Example:
.. code-block:: python
import theano
x = theano.tensor.matrix('x')
f = theano.function([x], (x ** 2).shape)
theano.printing.debugprint(f)
#MakeVector [@43860304] '' 2
# |Shape_i{0} [@43424912] '' 1
# |Shape_i{1} [@43797968] '' 0
# | |x [@43423568]
The output of this compiled function does not contain any multiplication
or power. Theano has removed them to compute directly the shape of the
output.
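The idea can be sketched in plain Python: for an elementwise op such as
``x ** 2``, the output shape equals the input shape, so it can be reported
without evaluating any values (a toy sketch, not Theano's actual
``infer_shape`` mechanism):

.. code-block:: python

    def infer_elemwise_shape(input_shape):
        # Elementwise ops (add, mul, pow, ...) preserve the input shape,
        # so the shape can be returned without touching any values.
        return input_shape

    # Shape of (x ** 2) for a 3 x 4 matrix, with no squaring performed:
    shape = infer_elemwise_shape((3, 4))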
Shape Inference Problem
=======================
Theano propagates information about shape in the graph. Sometimes this
can lead to errors. Consider this example:
.. code-block:: python
import theano
x = theano.tensor.matrix('x')
y = theano.tensor.matrix('y')
z = theano.tensor.join(0, x, y)
xv = numpy.random.rand(5, 4)
yv = numpy.random.rand(3, 3)
f = theano.function([x, y], z.shape)
theano.printing.debugprint(f)
# |y [@44540304]
f(xv, yv)
# Raises a dimensions mismatch error.
As you can see, when asking only for the shape of some computation (``join`` in the
example), an inferred shape is computed directly, without executing
the computation itself (there is no ``join`` in the first output or debugprint).
This makes the computation of the shape faster, but it can also hide errors. In
this example, the computation of the shape of the output of ``join`` is done only
based on the first input Theano variable, which leads to an error.
This might happen with other ops such as ``elemwise`` and ``dot``, for example.
Indeed, to perform some optimizations (for speed or stability, for instance),
Theano assumes that the computation is correct and consistent
in the first place, as it does here.
You can detect those problems by running the code without this
optimization, using the Theano flag
``optimizer_excluding=local_shape_to_shape_i``. You can also obtain the
same effect by running in the modes ``FAST_COMPILE`` (it will not apply this
optimization, nor most other optimizations) or ``DebugMode`` (it will test
before and after all optimizations (much slower)).
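The flawed inference can be sketched as follows: the non-joined dimensions are
copied from the first input only, so inconsistent inputs go unnoticed (a toy
illustration of the behaviour, not Theano's actual code):

.. code-block:: python

    def naive_join_shape(axis, shapes):
        # Sum the sizes along the join axis, but copy every other
        # dimension from the FIRST input only -- mismatches are not checked.
        out = list(shapes[0])
        out[axis] = sum(s[axis] for s in shapes)
        return tuple(out)

    # xv is 5x4 and yv is 3x3: joining them on axis 0 is invalid, yet the
    # naive inference happily reports a shape instead of raising an error.
    inferred = naive_join_shape(0, [(5, 4), (3, 3)])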
Specifying Exact Shape
======================
Currently, specifying a shape is not as easy and flexible as we wish and we plan
some upgrades. Here is the current state of what can be done:
- You can pass the shape info directly to the ``ConvOp`` created
  when calling ``conv2d``. You simply set the parameters ``image_shape``
  and ``filter_shape`` inside the call. They must be tuples of 4
  elements. For example:
.. code-block:: python
theano.tensor.nnet.conv2d(..., image_shape=(7, 3, 5, 5), filter_shape=(2, 3, 4, 4))
- You can use the ``SpecifyShape`` op to add shape information anywhere in the
  graph. This allows Theano to perform some optimizations. In the following example,
  it makes it possible to precompute the Theano function to a constant.
.. code-block:: python
import theano
x = theano.tensor.matrix()
x_specify_shape = theano.tensor.specify_shape(x, (2, 2))
f = theano.function([x], (x_specify_shape ** 2).shape)
theano.printing.debugprint(f)
# [2 2] [@72791376]
Future Plans
============
The parameter "constant shape" will be added to ``theano.shared()``. This is probably
the most frequent use case for ``shared`` variables. It will make the code
simpler and will make it possible to check that the shape does not change when
updating the ``shared`` variable.
Sparse
======
In general, *sparse* matrices provide the same functionality as regular
matrices. The difference lies in the way the elements of *sparse* matrices are
represented and stored in memory. Only the non-zero elements of a sparse matrix are stored.
Graph Structures
================
Theano Graphs
=============
Debugging or profiling code written in Theano is not that simple if you
do not know what goes on under the hood. This chapter is meant to
introduce you to a required minimum of the inner workings of Theano.
For more detail see :ref:`extending`.
The first step in writing Theano code is to write down all mathematical
relations using symbolic placeholders (**variables**). When writing down
these expressions you use operations like ``+``, ``-``, ``**``,
``sum()``, ``tanh()``. All these are represented internally as **ops**.
An *op* represents a certain computation on some type of inputs
producing some type of output. You can see it as a *function definition*
in most programming languages.
Theano builds internally a graph structure composed of interconnected
**variable** nodes, **op** nodes and **apply** nodes. An
*apply* node represents the application of an *op* to some
*variables*. It is important to draw the difference between the
definition of a computation represented by an *op* and its application
to some actual data which is represented by the *apply* node. For more
detail about these building blocks refer to :ref:`variable`, :ref:`op`,
:ref:`apply`. Here is an example of a graph:
**Code**
WARNING: hyper-links and ref's seem to break the PDF build when placed
into this figure caption.
Arrows in this figure represent references to the
Python objects pointed at. The blue
box is an :ref:`Apply` node. Red boxes are :ref:`Variable` nodes. Green
circles are :ref:`Ops <op>`. Purple boxes are :ref:`Types <type>`.
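A minimal sketch of these three building blocks in plain Python may help fix
the picture; the classes below are illustrative stand-ins, not Theano's real
``Variable``, ``Op`` and ``Apply`` classes:

.. code-block:: python

    class Op(object):
        """A function definition: what computation to perform."""
        def __init__(self, name):
            self.name = name

    class Variable(object):
        """A symbolic placeholder; `owner` is None for graph inputs."""
        def __init__(self, name):
            self.name = name
            self.owner = None

    class Apply(object):
        """One application of an op to specific input variables."""
        def __init__(self, op, inputs, outputs):
            self.op = op
            self.inputs = inputs
            self.outputs = outputs
            for out in outputs:
                out.owner = self  # each output points back to its apply node

    x = Variable('x')
    y = Variable('y')
    z = Variable('z')
    Apply(Op('add'), [x, y], [z])   # z = x + y, as a graph

Following ``owner`` pointers from outputs back to inputs is exactly how the
graph is traversed in the examples below.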
Take for example the following code:
.. code-block:: python

    x = T.dmatrix('x')
    y = x * 2.
If you enter ``type(y.owner)`` you get ``<class 'theano.gof.graph.Apply'>``,
which is the apply node that connects the op and the inputs to get this
output. You can now print the name of the op that is applied to get
*y*:
>>> y.owner.op.name
'Elemwise{mul,no_inplace}'
Hence, an elementwise multiplication is used to compute *y*. This
multiplication is done between the inputs:
>>> len(y.owner.inputs)
2
>>> y.owner.inputs[1]
InplaceDimShuffle{x,x}.0
Note that the second input is not 2 as we would have expected. This is
because 2 was first :term:`broadcasted <broadcasting>` to a matrix of
same shape as *x*. This is done by using the op ``DimShuffle`` :
>>> type(y.owner.inputs[1])
<class 'theano.tensor.basic.TensorVariable'>
[2.0]
Starting from this graph structure it is easier to understand how
*automatic differentiation* proceeds and how the symbolic relations
can be *optimized* for performance or stability.
Automatic Differentiation
......@@ -107,16 +111,19 @@ Automatic Differentiation
Having the graph structure, computing automatic differentiation is
simple. The only thing :func:`tensor.grad` has to do is to traverse the
graph from the outputs back towards the inputs through all :ref:`apply`
nodes (:ref:`apply` nodes are those that define which computations the
graph does). For each such :ref:`apply` node, its :ref:`op` defines
how to compute the gradient of the node's outputs with respect to its
inputs. Note that if an :ref:`op` does not provide this information,
it is assumed that the gradient is not defined.
graph from the outputs back towards the inputs through all *apply*
nodes (*apply* nodes are those that define which computations the
graph does). For each such *apply* node, its *op* defines
how to compute the *gradient* of the node's outputs with respect to its
inputs. Note that if an *op* does not provide this information,
it is assumed that the *gradient* is not defined.
Using the
`chain rule <http://en.wikipedia.org/wiki/Chain_rule>`_
these gradients can be composed in order to obtain the expression of the
gradient of the graph's output with respect to the graph's inputs .
*gradient* of the graph's output with respect to the graph's inputs.
A following section of this tutorial will examine the topic of :ref:`differentiation<tutcomputinggrads>`
in greater detail.
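To make the traversal concrete, here is a minimal sketch of reverse-mode differentiation on a toy graph, in plain Python. It is illustrative only (not Theano's implementation), and the names ``Var``, ``mul``, ``add`` and ``backward`` are invented for the example:

```python
# Each Var plays the role of a variable node; 'parents' records, for the
# apply node that produced it, each input together with the op's local
# gradient d(output)/d(input) -- the information each op must provide.

class Var(object):
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0

def mul(a, b):
    # local gradients: d(a*b)/da = b, d(a*b)/db = a
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    # local gradients: d(a+b)/da = 1, d(a+b)/db = 1
    return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])

def backward(out):
    # Traverse from the output back towards the inputs, composing the
    # local gradients with the chain rule. (A real implementation walks
    # nodes in topological order; a plain stack suffices for this graph.)
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad
            stack.append(parent)

x = Var(3.0)
y = add(mul(x, x), x)  # y = x**2 + x
backward(y)
print(x.grad)          # dy/dx = 2*x + 1 = 7.0 at x = 3
```

Theano's :func:`tensor.grad` performs the same composition symbolically, producing a new graph rather than numbers.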
Optimizations
......@@ -124,7 +131,7 @@ Optimizations
When compiling a Theano function, what you give to the
:func:`theano.function <function.function>` is actually a graph
(starting from the outputs variables you can traverse the graph up to
(starting from the output variables you can traverse the graph up to
the input variables). While this graph structure shows how to compute
the output from the input, it also offers the possibility to improve the
way this computation is carried out. The way optimizations work in
......@@ -135,4 +142,27 @@ identical subgraphs and ensure that the same values are not computed
twice or reformulate parts of the graph to a GPU specific version.
For example, one (simple) optimization that Theano uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
the pattern :math:`\frac{xy}{y}` by *x.*
Further information about the optimization
:ref:`process<optimization>` and the specific :ref:`optimizations<optimizations>` that are applicable
is available in the library documentation and on the documentation's entrance page, respectively.
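The flavor of such a rewrite can be sketched with a toy tree transformer in plain Python (an illustration only; Theano's optimizer works on its own graph data structures):

```python
def simplify(expr):
    # expr is a variable name (str) or a tuple (op, left, right)
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)
    # rewrite the pattern (x*y)/y into x
    if op == 'div' and isinstance(a, tuple) and a[0] == 'mul' and a[2] == b:
        return a[1]
    return (op, a, b)

print(simplify(('div', ('mul', 'x', 'y'), 'y')))  # -> x
```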
**Example**
Symbolic programming involves a change of paradigm: it will become clearer
as we apply it. Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0, 1, 2]) # prints `array([0,2,1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
......@@ -5,13 +5,16 @@
Using the GPU
=============
One of the Theano's design goals is to specify computations at an
For an introductory discussion of *Graphics Processing Units* (GPU) and their use for
intensive parallel computation purposes, see `GPGPU <http://en.wikipedia.org/wiki/GPGPU>`_.
One of Theano's design goals is to specify computations at an
abstract level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take advantage of
this flexibility is in carrying out calculations on an Nvidia graphics card when
there is a CUDA-enabled device in your computer.
the device present in the computer is CUDA-enabled.
Setting up CUDA
Setting Up CUDA
----------------
If you have not done so already, you will need to install Nvidia's
......@@ -41,6 +44,7 @@ file and run it.
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
......@@ -52,38 +56,46 @@ file and run it.
else:
print 'Used the gpu'
The program just computes the exp() of a bunch of random numbers.
Note that we use the `shared` function to
make sure that the input `x` are stored on the graphics device.
The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to
make sure that the input *x* is stored on the graphics device.
.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not
identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
As a point of reference, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.
If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
.. code-block:: text
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python thing.py
Looping 1000 times took 7.17374897003 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753 1.62323285]
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python thing.py
Using gpu device 0: GeForce GTX 285
Looping 1000 times took 0.418929815292 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
Note that for now GPU operations in Theano require floatX to be float32 (see below also).
Note that, for now, GPU operations in Theano require ``floatX`` to be *float32* (see also below).
Returning a handle to device-allocated data
Returning a Handle to Device-Allocated Data
-------------------------------------------
The speedup is not greater in the example above because the function is
returning its result as a numpy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in device=gpu, but
if you don't mind being less portable, you might prefer to see a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The gpu_from_host
Op means "copy the input from the host to the gpu" and it is optimized away
after the T.exp(x) is replaced by a GPU version of exp().
The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in ``device=gpu``, but
if you don't mind less portability, you might gain a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The ``gpu_from_host``
op means "copy the input from the host to the GPU" and it is optimized away
after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_2
......@@ -101,6 +113,7 @@ after the T.exp(x) is replaced by a GPU version of exp().
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
......@@ -117,32 +130,42 @@ The output from this program is
.. code-block:: text
Using gpu device 0: GeForce GTX 285
Looping 1000 times took 0.185714006424 seconds
Result is <CudaNdarray object at 0x3e9e970>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
Here we've shaved off about 50% of the run-time by simply not copying the
resulting array back to the host.
The object returned by each function call is now not a numpy array but a
"CudaNdarray" which can be converted to a numpy ndarray by the normal
numpy casting mechanism.
The object returned by each function call is now not a NumPy array but a
"CudaNdarray" which can be converted to a NumPy ndarray by the normal
NumPy casting mechanism.
Running the GPU at Full Speed
------------------------------
To really get maximum performance in this simple example, we need to use an :class:`Out`
instance to tell Theano not to copy the output it returns to us. Theano allocates memory for
internal use like a working buffer, but by default it will never return a result that is
allocated in the working buffer. This is normally what you want, but our example is so simple
that it has the un-wanted side-effect of really slowing things down.
To really get maximum performance in this simple example, we need to use an
:class:`out<function.Out>` instance with the flag ``borrow=True`` to tell Theano not to copy
the output it returns to us. This is because Theano pre-allocates memory for internal use
(like working buffers), and by default will never return a result that is aliased to one of
its internal buffers: instead, it will copy the buffers associated to outputs into newly
allocated memory at each function call. This is to ensure that subsequent function calls will
not overwrite previously computed outputs. Although this is normally what you want, our last
example was so simple that it had the unwanted side-effect of really slowing things down.
..
TODO:
The story here about copying and working buffers is misleading and potentially not correct
... why exactly does borrow=True cut 75% of the runtime ???
.. TODO: Answer by Olivier D: it sounds correct to me -- memory allocations must be slow.
.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_3
.. code-block:: python
......@@ -152,7 +175,7 @@ that it has the un-wanted side-effect of really slowing things down.
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
vlen = 10 * 30 * 768 # 10 x # cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
......@@ -160,6 +183,7 @@ that it has the un-wanted side-effect of really slowing things down.
f = function([],
Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
borrow=True))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
......@@ -172,34 +196,51 @@ that it has the un-wanted side-effect of really slowing things down.
else:
print 'Used the gpu'
Running this version of the code takes just under 0.05 seconds, over 140x faster than
Running this version of the code takes just over 0.05 seconds, about 60x faster than
the CPU implementation!
.. code-block:: text
Using gpu device 0: GeForce GTX 285
Looping 1000 times took 0.0497219562531 seconds
Result is <CudaNdarray object at 0x31eeaf0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
With the flag ``borrow=False``:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python using_gpu_solution_1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.31614613533 seconds
Result is <CudaNdarray object at 0x77e9270>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
With the flag ``borrow=True``:
This version of the code ``using borrow=True`` is slightly less safe because if we had saved
the `r` returned from one function call, we would have to take care and remember that its value might
be over-written by a subsequent function call. Although borrow=True makes a dramatic difference in this example,
be careful! The advantage of
borrow=True is much weaker in larger graphs, and there is a lot of potential for making a
mistake by failing to account for the resulting memory aliasing.
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python using_gpu_solution_1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.0502779483795 seconds
Result is <CudaNdarray object at 0x83e5cb0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
What can be accelerated on the GPU?
------------------------------------
This version of the code including the flag ``borrow=True`` is slightly less safe because if we had saved
the *r* returned from one function call, we would have to take care and remember that its value might
be over-written by a subsequent function call. Although ``borrow=True`` makes a dramatic difference
in this example, be careful! The advantage of ``borrow=True`` is much weaker in larger graphs, and
there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.
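The hazard can be illustrated with plain NumPy (a sketch of the behaviour only, not Theano code; ``buf``, ``f_borrow`` and ``f_copy`` are made-up names):

```python
import numpy

buf = numpy.zeros(3)  # stands in for one of Theano's internal buffers

def f_borrow(v):
    buf[:] = v * 2
    return buf            # returns an alias, like borrow=True

def f_copy(v):
    buf[:] = v * 2
    return buf.copy()     # returns a fresh copy, like the default

r_borrowed = f_borrow(numpy.ones(3))   # r_borrowed is [2, 2, 2]
f_borrow(numpy.full(3, 10.0))          # second call reuses the buffer...
print(r_borrowed)                      # ...so r_borrowed now holds 20s

r_copied = f_copy(numpy.ones(3))
f_copy(numpy.full(3, 10.0))
print(r_copied)                        # still [2, 2, 2]: the copy is safe
```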
What Can Be Accelerated on the GPU
----------------------------------
The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
what to expect right now:
* Only computations
with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but
float64 computations are still relatively slow (Jan 2010).
with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but
*float64* computations are still relatively slow (Jan 2010).
* Matrix
multiplication, convolution, and large element-wise operations can be
accelerated a lot (5-50x) when arguments are large enough to keep 30
......@@ -208,7 +249,7 @@ what to expect right now:
dimension-shuffling and constant-time reshaping will be equally fast on GPU
as on CPU.
* Summation
over rows/columns of tensors can be a little slower on the GPU than on the CPU
over rows/columns of tensors can be a little slower on the GPU than on the CPU.
* Copying
of large quantities of data to and from a device is relatively slow, and
often cancels most of the advantage of one or two accelerated functions on
......@@ -216,38 +257,358 @@ what to expect right now:
the device pay off.
Tips for improving performance on GPU
--------------------------------------
Tips for Improving Performance on GPU
-------------------------------------
* Consider
adding ``floatX = float32`` to your .theanorc file if you plan to do a lot of
adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
GPU work.
* Prefer
constructors like 'matrix' 'vector' and 'scalar' to 'dmatrix', 'dvector' and
'dscalar' because the former will give you float32 variables when
floatX=float32.
constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
``dscalar`` because the former will give you *float32* variables when
``floatX=float32``.
* Ensure
that your output variables have a float32 dtype and not float64. The
more float32 variables are in your graph, the more work the GPU can do for
that your output variables have a *float32* dtype and not *float64*. The
more *float32* variables are in your graph, the more work the GPU can do for
you.
* Minimize
tranfers to the GPU device by using shared 'float32' variables to store
transfers to the GPU device by using ``shared`` *float32* variables to store
frequently-accessed data (see :func:`shared()<shared.shared>`). When using
the GPU, 'float32' tensor shared variables are stored on the GPU by default to
the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try building your functions with
mode='PROFILE_MODE'. This should print some timing information at program
termination (atexit). Is time being used sensibly? If an Op or Apply is
``mode='ProfileMode'``. This should print some timing information at program
termination. Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU
programming have a look at how it's implemented in theano.sandbox.cuda.
Check the line like 'Spent Xs(X%) in cpu Op, Xs(X%) in gpu Op and Xs(X%) transfert Op'
that can tell you if not enough of your graph is on the gpu or if their
is too much memory transfert.
programming, have a look at how it's implemented in theano.sandbox.cuda.
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
This can tell you if not enough of your graph is on the GPU or if there
is too much memory transfer.
Changing the value of shared variables
Changing the Value of Shared Variables
--------------------------------------
To change the value of a shared variable, e.g. to provide new data to process,
To change the value of a ``shared`` variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
-------------------------------------------
**Exercise**
Consider again the logistic regression:
.. code-block:: python
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])
# Compile expressions to functions
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w:w-0.01*gw, b:b-0.01*gb},
name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
Modify and execute this example to run on GPU with ``floatX=float32`` and
time it using the command line ``time python file.py``. (Of course, you may use some of your answer
to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``ProfileMode``)
What can be done to further increase the speed of the GPU version? Put your ideas to the test.
.. Note::
* Only 32 bit floats are currently supported (development is in progress).
* ``shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process.
* Use the Theano flag ``device=gpu`` to require use of the GPU device.
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
* Apply the Theano flag ``floatX=float32``, or set ``theano.config.floatX`` directly in your code.
* Cast inputs before storing them into a ``shared`` variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
:download:`Solution<using_gpu_solution_1.py>`
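The *int32* with *float32* promotion mentioned in the note can be observed directly in NumPy, independently of Theano:

```python
import numpy

i32 = numpy.ones(3, dtype='int32')
f32 = numpy.ones(3, dtype='float32')

# Mixing int32 with float32 promotes the result to float64, which the
# GPU (float32-only for now) cannot execute -- hence the manual casts.
print((i32 + f32).dtype)                    # float64
print((i32.astype('float32') + f32).dtype)  # float32
```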
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Theano, which is a meta-programmer, there are:
* **CUDA**: GPU programming API by NVIDIA, based on an extension to C (CUDA C)
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) are maturing.
* **OpenCL**: multi-vendor version of CUDA
* More general, standardized.
* Fewer libraries, less widespread.
* **PyCUDA**: Python bindings to the CUDA driver interface, giving access to Nvidia's CUDA parallel
computation API from Python
* Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (``pycuda.driver.SourceModule``).
GPU memory buffer (``pycuda.gpuarray.GPUArray``).
Helpful documentation.
* Completeness: Binding to all of CUDA's driver API.
* Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
* Speed: PyCUDA's base layer is written in C++.
* Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
Makes it much easier to write correct, leak- and crash-free code.
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory
allocated in it is also freed).
(This is adapted from PyCUDA's `documentation <http://documen.tician.de/pycuda/index.html>`_
and Andreas Kloeckner's `website <http://mathema.tician.de/software/pycuda>`_ on PyCUDA.)
* **PyOpenCL**: PyCUDA for OpenCL
Learning to Program with PyCUDA
-------------------------------
If you are already proficient in the C programming language, you can easily
leverage that knowledge by learning, first, to program a GPU with the CUDA
extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA API with
a Python wrapper.
The following resources will assist you in this learning process:
* **CUDA API and CUDA C: Introductory**
* `NVIDIA's slides <http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf>`_
* `Stein's (NYU) slides <http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf>`_
* **CUDA API and CUDA C: Advanced**
* `MIT IAP2009 CUDA <https://sites.google.com/site/cudaiap2009/home>`_
(full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
* `Course U. of Illinois <http://courses.engr.illinois.edu/ece498/al/index.html>`_
(full lectures, Kirk-Hwu textbook)
* `NVIDIA's knowledge base <http://www.nvidia.com/content/cuda/cuda-developer-resources.html>`_
(extensive coverage, levels from introductory to advanced)
* `practical issues <http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s>`_
(on the relationship between grids, blocks and threads; see also linked and related issues on same page)
* `CUDA optimisation <http://www.gris.informatik.tu-darmstadt.de/cuda-workshop/slides.html>`_
* **PyCUDA: Introductory**
* `Kloeckner's slides <http://www.gputechconf.com/gtcnew/on-demand-gtc.php?sessionTopic=&searchByKeyword=kloeckner&submit=&select=+&sessionEvent=2&sessionYear=2010&sessionFormat=3>`_
* `Kloeckner's website <http://mathema.tician.de/software/pycuda>`_
* **PyCUDA: Advanced**
* `PyCUDA documentation website <http://documen.tician.de/pycuda/>`_
The following examples give a foretaste of programming a GPU with PyCUDA. Once
you feel competent enough, you can try your hand at the corresponding exercises.
**Example: PyCUDA**
.. code-block:: python
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print dest
-------------------------------------------
**Exercise**
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
-------------------------------------------
.. _pyCUDA_theano:
**Example: Theano + PyCUDA**
.. code-block:: python
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda
class PyCUDADoubleOp(theano.Op):
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def __str__(self):
return self.__class__.__name__
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return theano.Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [storage_map[v] for v in node.inputs]
outputs = [storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape != inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)),1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512,1,1), grid=grid)
return thunk
Use this code to test it:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv=numpy.ones((4,5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
-------------------------------------------
**Exercise**
Run the preceding example.
Modify and execute to multiply two matrices: *x* * *y*.
Modify and execute to return two outputs: *x + y* and *x - y*.
(Notice that Theano's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. so as not to constrain the input to be *C-contiguous*).
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'
# 1. Raw results
#
# same code as in mode_solution_1 but run with following command lines:
# THEANO_FLAGS=mode=FAST_RUN,device=gpu time python program_name.py
# THEANO_FLAGS=mode=FAST_RUN,device=cpu time python program_name.py
# for GPU and CPU respectively
# typical time: 20 sec (CPU), 10 sec (GPU)
import numpy
import theano
import theano.tensor as tt
from theano import sandbox, Out
theano.config.floatX = 'float32'
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = tt.matrix("x")
y = tt.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1) # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b])
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
name="predict")
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=prediction,
name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
"""
# 2. Profiling
#
# same code as above but run with following command lines:
# THEANO_FLAGS=mode=ProfileMode,device=gpu python program_name.py
# THEANO_FLAGS=mode=ProfileMode,device=cpu python program_name.py
# for GPU and CPU
# 2.1 Profiling output for CPU computations
$ THEANO_FLAGS=mode=ProfileMode,device=cpu python program_name.py
Used the cpu
target values for D
prediction on D
Used the cpu
target values for D
prediction on D
ProfileMode.print_summary()
---------------------------
Time since import 12.586s
Theano compile time: 0.000s (0.0% since import)
Optimization time: 0.000s
Linker time: 0.000s
Theano fct call 5.147s (40.9% since import)
Theano Op time 3.595s 28.6%(since import) 69.8%(of fct call)
Theano function overhead in ProfileMode 1.552s 12.3%(since import) 30.2%(of fct call)
20002 Theano fct call, 0.000s per call
Rest of the time since import 7.440s 59.1%
Theano fct summary:
<% total fct time> <total time> <time per call> <nb call> <fct name>
49.9% 2.567s 2.57e-04s 10000 train
0.0% 0.000s 1.24e-04s 1 predict
0.0% 0.000s 1.26e-04s 1 predict
50.1% 2.579s 2.58e-04s 10000 train
Single Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply> <Op name>
59.3% 59.3% 2.133s 2.133s 5.33e-05s * 40002 1 6 <class 'theano.tensor.blas_c.CGemv'>
34.4% 93.8% 1.238s 3.371s 6.19e-06s * 200002 11 22 <class 'theano.tensor.elemwise.Elemwise'>
2.8% 96.6% 0.100s 3.471s 2.51e-06s * 40002 1 6 <class 'theano.tensor.basic.Alloc'>
2.1% 98.7% 0.075s 3.546s 1.26e-06s * 60002 2 8 <class 'theano.tensor.elemwise.DimShuffle'>
0.7% 99.3% 0.024s 3.571s 6.11e-07s * 40002 1 6 <class 'theano.tensor.opt.Shape_i'>
0.7% 100.0% 0.024s 3.595s 1.18e-06s * 20000 1 2 <class 'theano.tensor.elemwise.Sum'>
... (remaining 0 single Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op name>
59.3% 59.3% 2.133s 2.133s 5.33e-05s * 40002 6 CGemv{inplace}
18.1% 77.4% 0.650s 2.783s 3.25e-05s * 20000 2 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}
6.4% 83.9% 0.231s 3.014s 1.16e-05s * 20000 2 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)]
4.0% 87.8% 0.142s 3.157s 7.11e-06s * 20000 2 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)]
2.8% 90.6% 0.100s 3.257s 2.51e-06s * 40002 6 Alloc
1.4% 92.1% 0.052s 3.309s 1.30e-06s * 40002 6 InplaceDimShuffle{x}
1.1% 93.1% 0.038s 3.347s 1.92e-06s * 20000 2 Elemwise{Cast{float32}}
1.1% 94.2% 0.038s 3.386s 1.91e-06s * 20000 2 Elemwise{sub,no_inplace}
1.0% 95.2% 0.036s 3.421s 1.79e-06s * 20000 2 Elemwise{gt,no_inplace}
0.8% 96.0% 0.029s 3.450s 1.44e-06s * 20000 2 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
0.8% 96.8% 0.028s 3.479s 1.42e-06s * 20000 2 Elemwise{neg,no_inplace}
0.7% 97.5% 0.024s 3.503s 6.11e-07s * 40002 6 Shape_i{0}
0.7% 98.1% 0.024s 3.527s 1.18e-06s * 20000 2 Sum
0.6% 98.8% 0.023s 3.550s 1.16e-06s * 20000 2 InplaceDimShuffle{1,0}
0.6% 99.4% 0.023s 3.573s 1.15e-06s * 20000 2 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.6% 100.0% 0.022s 3.595s 1.08e-06s * 20000 2 Elemwise{inv,no_inplace}
0.0% 100.0% 0.000s 3.595s 1.19e-05s * 2 2 Elemwise{Composite{[Composite{[Composite{[Composite{[GT(scalar_sigmoid(i0), i1)]}(neg(i0), i1)]}(sub(i0, i1), i2)]}(neg(i0), i1, i2)]}}
... (remaining 0 Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Apply-wise summary:
<% of local_time spent at this position> <cumulative %%> <apply time> <cumulative seconds> <time per call> [*] <nb_call> <Apply position> <Apply Op name>
14.9% 14.9% 0.536s 0.536s 5.36e-05s * 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{1.0})
14.9% 29.8% 0.534s 1.070s 5.34e-05s * 10000 18 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
14.8% 44.6% 0.532s 1.602s 5.32e-05s * 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{1.0})
14.7% 59.3% 0.530s 2.132s 5.30e-05s * 10000 18 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
9.1% 68.4% 0.327s 2.460s 3.27e-05s * 10000 13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}(y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
9.0% 77.4% 0.323s 2.783s 3.23e-05s * 10000 13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}(y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
3.2% 80.6% 0.116s 2.899s 1.16e-05s * 10000 16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
3.2% 83.9% 0.116s 3.014s 1.16e-05s * 10000 16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
2.0% 85.8% 0.071s 3.086s 7.12e-06s * 10000 14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwise{neg,no_inplace}.0)
2.0% 87.8% 0.071s 3.156s 7.09e-06s * 10000 14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwise{neg,no_inplace}.0)
0.9% 88.8% 0.034s 3.190s 3.38e-06s * 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.9% 89.7% 0.034s 3.224s 3.37e-06s * 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.5% 90.2% 0.019s 3.243s 1.93e-06s * 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.5% 90.8% 0.019s 3.262s 1.92e-06s * 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.5% 91.3% 0.019s 3.282s 1.90e-06s * 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
... (remaining 35 Apply instances account for 8.71%(0.31s) of the runtime)
(*) Op is running a c implementation
Profile of Theano functions memory:
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
We skipped 4 theano function(s). Each of them used less then 1024B(theano flags ProfileMode.min_memory_size) of total intermediate memory size
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
# 2.2 Profiling output for GPU computations
$ THEANO_FLAGS=mode=ProfileMode,device=gpu python program_name.py
Using gpu device 0: GeForce GTX 580
Used the gpu
target values for D
prediction on D
Used the gpu
target values for D
prediction on D
ProfileMode.print_summary()
---------------------------
Time since import 25.682s
Theano compile time: 0.000s (0.0% since import)
Optimization time: 0.000s
Linker time: 0.000s
Theano fct call 17.052s (66.4% since import)
Theano Op time 14.548s 56.6%(since import) 85.3%(of fct call)
Theano function overhead in ProfileMode 2.505s 9.8%(since import) 14.7%(of fct call)
20002 Theano fct call, 0.001s per call
Rest of the time since import 8.630s 33.6%
Theano fct summary:
<% total fct time> <total time> <time per call> <nb call> <fct name>
50.0% 8.526s 8.53e-04s 10000 train
0.0% 0.001s 1.09e-03s 1 predict
50.0% 8.524s 8.52e-04s 10000 train
0.0% 0.001s 1.10e-03s 1 predict
Single Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply> <Op name>
54.8% 54.8% 7.968s 7.968s 1.33e-04s 60002 1 8 <class 'theano.sandbox.cuda.basic_ops.GpuFromHost'>
16.2% 71.0% 2.358s 10.325s 1.47e-05s * 160002 9 18 <class 'theano.sandbox.cuda.basic_ops.GpuElemwise'>
12.3% 83.3% 1.795s 12.120s 4.49e-05s * 40002 1 6 <class 'theano.sandbox.cuda.blas.GpuGemv'>
7.0% 90.4% 1.024s 13.144s 2.56e-05s 40002 1 6 <class 'theano.sandbox.cuda.basic_ops.HostFromGpu'>
5.0% 95.4% 0.728s 13.872s 1.82e-05s * 40002 1 6 <class 'theano.sandbox.cuda.basic_ops.GpuAlloc'>
2.1% 97.4% 0.300s 14.171s 1.50e-05s * 20000 1 2 <class 'theano.sandbox.cuda.basic_ops.GpuSum'>
1.3% 98.7% 0.189s 14.360s 3.15e-06s * 60002 3 8 <class 'theano.sandbox.cuda.basic_ops.GpuDimShuffle'>
0.6% 99.4% 0.094s 14.454s 2.35e-06s * 40002 2 6 <class 'theano.tensor.elemwise.Elemwise'>
0.3% 99.7% 0.048s 14.503s 1.21e-06s * 40002 1 6 <class 'theano.tensor.opt.Shape_i'>
0.3% 100.0% 0.045s 14.548s 2.25e-06s * 20000 1 2 <class 'theano.tensor.elemwise.DimShuffle'>
... (remaining 0 single Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op name>
54.8% 54.8% 7.968s 7.968s 1.33e-04s 60002 8 GpuFromHost
12.3% 67.1% 1.795s 9.763s 4.49e-05s * 40002 6 GpuGemv{inplace}
7.0% 74.1% 1.024s 10.786s 2.56e-05s 40002 6 HostFromGpu
5.0% 79.1% 0.728s 11.514s 1.82e-05s * 40002 6 GpuAlloc
2.3% 81.4% 0.334s 11.848s 1.67e-05s * 20000 2 GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)]
2.2% 83.6% 0.319s 12.167s 1.59e-05s * 20000 2 GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}
2.1% 85.7% 0.301s 12.468s 1.50e-05s * 20000 2 GpuElemwise{neg,no_inplace}
2.1% 87.8% 0.300s 12.768s 1.50e-05s * 20000 2 GpuSum{1}
2.0% 89.8% 0.292s 13.060s 1.46e-05s * 20000 2 GpuElemwise{inv,no_inplace}
1.9% 91.7% 0.283s 13.343s 1.42e-05s * 20000 2 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
1.9% 93.7% 0.281s 13.625s 1.41e-05s * 20000 2 GpuElemwise{sub,no_inplace}
1.9% 95.5% 0.273s 13.898s 1.37e-05s * 20000 2 GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)]
1.9% 97.4% 0.273s 14.171s 1.37e-05s * 20000 2 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
1.0% 98.4% 0.141s 14.313s 7.06e-06s * 20002 4 GpuDimShuffle{x}
0.4% 98.8% 0.057s 14.370s 2.87e-06s * 20002 4 Elemwise{gt,no_inplace}
0.3% 99.1% 0.048s 14.418s 1.21e-06s * 40002 6 Shape_i{0}
0.3% 99.4% 0.045s 14.463s 2.25e-06s * 20000 2 InplaceDimShuffle{x}
0.3% 99.7% 0.037s 14.500s 1.83e-06s * 20000 2 Elemwise{Cast{float32}}
0.2% 99.8% 0.025s 14.525s 1.24e-06s * 20000 2 GpuDimShuffle{0}
0.2% 100.0% 0.023s 14.548s 1.14e-06s * 20000 2 GpuDimShuffle{1,0}
... (remaining 1 Op account for 0.00%(0.00s) of the runtime)
(*) Op is running a c implementation
Apply-wise summary:
<% of local_time spent at this position> <cumulative %%> <apply time> <cumulative seconds> <time per call> [*] <nb_call> <Apply position> <Apply Op name>
24.0% 24.0% 3.493s 3.493s 3.49e-04s 10000 1 GpuFromHost(x)
23.9% 47.9% 3.479s 6.972s 3.48e-04s 10000 1 GpuFromHost(x)
4.3% 52.3% 0.629s 7.602s 6.29e-05s * 10000 24 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
4.3% 56.6% 0.629s 8.231s 6.29e-05s * 10000 24 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417})
1.8% 58.4% 0.269s 8.499s 2.69e-05s * 10000 9 GpuGemv{inplace}(GpuAlloc.0, TensorConstant{1.0}, GpuFromHost.0, w, TensorConstant{1.0})
1.8% 60.3% 0.268s 8.767s 2.68e-05s * 10000 9 GpuGemv{inplace}(GpuAlloc.0, TensorConstant{1.0}, GpuFromHost.0, w, TensorConstant{1.0})
1.8% 62.1% 0.266s 9.033s 2.66e-05s 10000 18 HostFromGpu(GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}.0)
1.8% 63.9% 0.262s 9.296s 2.62e-05s 10000 18 HostFromGpu(GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}.0)
1.8% 65.7% 0.260s 9.555s 2.60e-05s 10000 3 GpuFromHost(y)
1.8% 67.5% 0.258s 9.813s 2.58e-05s 10000 3 GpuFromHost(y)
1.7% 69.2% 0.248s 10.061s 2.48e-05s 10000 20 HostFromGpu(GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0)
1.7% 70.9% 0.247s 10.309s 2.47e-05s 10000 20 HostFromGpu(GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0)
1.6% 72.5% 0.238s 10.547s 2.38e-05s 10000 12 GpuFromHost(Elemwise{Cast{float32}}.0)
1.6% 74.1% 0.237s 10.785s 2.37e-05s 10000 12 GpuFromHost(Elemwise{Cast{float32}}.0)
1.3% 75.4% 0.185s 10.969s 1.85e-05s * 10000 6 GpuAlloc(CudaNdarrayConstant{[ 1.58212732e-09]}, Shape_i{0}.0)
... (remaining 53 Apply instances account for 24.60%(3.58s) of the runtime)
(*) Op is running a c implementation
Some info useful for gpu:
Spent 1.211s(8.324%) in cpu Op, 13.337s(91.676%) in gpu Op and 0.000s(0.000%) transfert Op
Theano function input that are float64
<fct name> <input name> <input type> <str input>
List of apply that don't have float64 as input but have float64 in outputs
(Useful to know if we forgot some cast when using floatX=float32 or gpu code)
<Apply> <Apply position> <fct name> <inputs type> <outputs type>
Profile of Theano functions memory:
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
We skipped 4 theano function(s). Each of them used less then 1024B(theano flags ProfileMode.min_memory_size) of total intermediate memory size
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
# 3. Conclusions
Facts:
Examine and compare the 'Single Op-wise' summaries for the CPU and GPU runs. The transfer ops 'GpuFromHost'
and 'HostFromGpu' by themselves consume a large amount of extra time. Furthermore, notice that on this small
model most of the individual GPU ops take more time per call than their CPU counterparts.
An additional experiment also confirms that adding an 'out' instance in the GPU version brings only a minor
improvement in this situation.
Tentative conclusion:
The large number of external training steps (10000) generates a disproportionate per-call GPU overhead cost.
Tentative solution:
Include the training steps inside the definition of the Theano function.
Implement this solution and put it to the test.
"""
......@@ -30,7 +30,7 @@ _logger = logging.getLogger("theano.printing")
def debugprint(obj, depth=-1, print_type=False,
file=None, ids='CHAR', stop_on_name=False):
"""Print a computation graph to file
"""Print a computation graph as text to stdout or a file.
:type obj: Variable, Apply, or Function instance
:param obj: symbolic thing to print
......@@ -56,12 +56,12 @@ def debugprint(obj, depth=-1, print_type=False,
The first part of the text identifies whether it is an input
(if a name or type is printed) or the output of some Apply (in which case
the Op is printed).
The second part of the text is the memory location of the Variable.
The second part of the text is an identifier of the Variable.
If print_type is True, we add a part containing the type of the Variable
If a Variable is encountered multiple times in the depth-first search,
it is only printed recursively the first time. Later, just the Variable
and its memory location are printed.
identifier is printed.
If an Apply has multiple outputs, then a '.N' suffix will be appended
to the Apply's identifier, to indicate which output a line corresponds to.
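As an illustration of the rules just described, a toy printer that expands each variable only on its first visit in the depth-first walk, and prints just the identifier on later visits, could look like the following (a hypothetical sketch with made-up `id_...` identifiers, not the actual theano.printing code):

```python
# Toy sketch of debugprint-style output: a graph node is ('name',) for an
# input, or ('name', child1, child2, ...) for the output of an apply.
# A variable reached twice is expanded only once; later visits print only
# its identifier.

def debugprint_sketch(var, seen=None, depth=0, lines=None):
    if seen is None:
        seen, lines = set(), []
    name = var[0]
    ident = 'id_' + name                      # stand-in for the real identifier
    if ident in seen:
        lines.append(' ' * depth + ident)     # already expanded: identifier only
        return lines
    seen.add(ident)
    lines.append(' ' * depth + name + ' ' + ident)
    for child in var[1:]:
        debugprint_sketch(child, seen, depth + 1, lines)
    return lines

x = ('x',)
g = ('add', ('mul', x, x), x)
out = debugprint_sketch(g)
```

Here `x` is printed with its name only the first time; its two later occurrences show just `id_x`, mirroring how debugprint abbreviates repeated variables.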
......@@ -461,7 +461,9 @@ pprint.assign(lambda pstate, r: hasattr(pstate, 'target')
LeafPrinter())
pp = pprint
"""
Print a math-like expression to the terminal.
"""
# colors not used: orange, amber#FFBF00, purple, pink,
# used by default: green, blue, grey, red
......@@ -530,7 +532,7 @@ def pydotprint(fct, outfile=None,
blue boxes are outputs variables of the graph
grey boxes are variables that are not outputs and are not used
red ellipses are transfers from/to the gpu (ops with names GpuFromHost,
HostFromGpu)
HostFromGpu)
"""
if colorCodes is None:
......
......@@ -197,11 +197,12 @@ def scan(fn,
* ``initial`` -- Theano variable that represents the initial
state of a given output. In case the output is not computed
recursively (think of a map) and does not require a initial
state this field can be skiped. Given that only the previous
time step of the output is used by ``fn`` the initial state
should have the same shape as the output. If multiple time
taps are used, the initial state should have one extra
recursively (think of a map) and does not require an initial
state this field can be skipped. Given that (only) the previous
time step of the output is used by ``fn``, the initial state
**should have the same shape** as the output and **should not
involve a downcast** of the data type of the output. If multiple
time taps are used, the initial state should have one extra
dimension that should cover all the possible taps. For example
if we use ``-5``, ``-2`` and ``-1`` as past taps, at step 0,
``fn`` will require (by an abuse of notation) ``output[-5]``,
......
......@@ -797,6 +797,7 @@ class T_using_gpu(unittest.TestCase):
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
# print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
......@@ -813,7 +814,6 @@ class T_using_gpu(unittest.TestCase):
assert numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()])
def test_using_gpu_2(self):
if theano.config.device.find('gpu') > -1:
......@@ -829,6 +829,7 @@ class T_using_gpu(unittest.TestCase):
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
# print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
......@@ -844,9 +845,6 @@ class T_using_gpu(unittest.TestCase):
assert not numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()])
def test_using_gpu_3(self):
if theano.config.device.find('gpu') >-1:
......@@ -864,6 +862,7 @@ class T_using_gpu(unittest.TestCase):
f = function([],
Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
borrow=True))
# print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
......