Add a tutorial section on how tu use the new multi-gpu functionality.

7beebd0c · Arnaud Bergeron · d6156c63 · 7beebd0c · 7beebd0c
--- a/doc/tutorial/index.txt
+++ b/doc/tutorial/index.txt
@@ -37,6 +37,7 @@ you out.
    loop
    sparse
    using_gpu
+    using_multi_gpu
    gpu_data_convert
    aliasing
    shape_info

--- a/doc/tutorial/using_multi_gpu.txt
+++ b/doc/tutorial/using_multi_gpu.txt
+
+.. _using_multi_gpu:
+
+===================
+Using multiple GPUs
+===================
+
+Theano has a feature to allow the use of multiple GPUs at the same
+time in one function.  The multiple gpu feature requires the use of
+the :ref:`gpuarray` backend, so make sure that works correctly.
+
+In order to keep a reasonably high level of abstraction you do not
+refer to device names directly for multiple-gpu use.  You instead
+referer to what we call context names.  These are then mapped to a
+device using the theano configuration.  This allows portability of
+models between machines.
+
+.. warning::
+
+   The code is rather new and is still considered experimental at this
+   point.  It has been tested and seems to perform correctly in all
+   cases observed, but make sure to double-check your results before
+   publishing a paper or anything of the sort.
+
+
+Defining the context map
+------------------------
+
+The mapping from context names to devices is done through the
+:attr:`config.contexts` option.  The format looks like this::
+
+    dev0->cuda0;dev1->cuda1
+
+Let's break it down.  First there is a list of mappings.  Each of
+these mappings is separeted by a semicolon ';'.  There can be any
+number of such mappings, but in the example above we have two of them:
+`dev0->cuda0` and `dev1->cuda1`.
+
+The mappings themselves are composed of a context name followed by the
+two characters '->' and the device name.  The context name is a simple
+string which oes not have any special meaning for Theano.  For parsing
+reasons, the context name cannot contain the sequence '->'.  The
+device name is a device in the form that gpuarray expects like 'cuda0'
+or 'opencl0:0'.
+
+If you don't have enough GPUs for a certain model, you can assign the
+same device to more than one name. You can also assign extra names
+that a model doesn't need to some other devices.  However, a
+proliferation of names is not always a good idea since theano often
+assumes that different context names will be on different devices and
+will optimize accordingly.  So you may get faster performance for a
+single name and a single device.
+
+.. note::
+
+   It is often the case that multi-gpu operation requires or assumes
+   that all the GPUs involved are equivalent.  This is not the case
+   for this implementation.  Since the user has the task of
+   distrubuting the jobs across the different device a model can be
+   built on the assumption that one of the GPU is slower or has
+   smaller memory.
+
+
+A simple graph on two GPUs
+--------------------------
+
+The following simple program works on two GPUs.  It builds a function
+which perform two dot products on two different GPUs.
+
+.. code-block:: python
+
+   import numpy
+   import theano
+
+   v01 = theano.shared(numpy.random.random(1024, 1024).astype('float32'),
+                       target='dev0')
+   v02 = theano.shared(numpy.random.random(1024, 1024).astype('float32'),
+                       target='dev0')
+   v11 = theano.shared(numpy.random.random(1024, 1024).astype('float32'),
+                       target='dev1')
+   v12 = theano.shared(numpy.random.random(1024, 1024).astype('float32'),
+                       target='dev1')
+
+   f = theano.function([], [theano.tensor.dot(v01, v02),
+                            theano.tensor.dot(v11, v12)])
+
+   f()
+
+This model requires a context map with assignations for 'dev0' and
+'dev1'.  It should run twice as fast when the devices are different.
+
+.. note::
+
+   While the above *should* be true, there are still some implentation
+   problems which may cause the program to exhibit no speedups while
+   running on two devices.  These are being worked on and will be
+   resolved as soon as possible.
+
+
+Explicit transfers of data
+--------------------------
+
+Since operations themselves cannot work on more than one device, they
+will pick a device to work on based on their inputs and automatically
+insert transfers for any input which is not on the right device.
+
+However you may want some explicit control over where and how these
+transfers are done at some points.  This is done by using the new
+:meth:`transfer` method that is present on variables.  It works for
+moving data between GPUs and also between the host and the GPUs.  Here
+is a example.
+
+.. code-block:: python
+
+   import theano
+
+   v = theano.tensor.fmatrix()
+
+   # Move to the device associated with 'gpudev'
+   gv = v.transfer('gpudev')
+
+   # Move back to the cpu
+   cv = gv.transfer('cpu')
+
+Of course you can mix transfers and operations in any order you
+choose.  However you should try to minimize transfer operations
+because they will introduce overhead any may reduce performance.