Commit 3c9d6446 authored by Arnaud Bergeron

Add a section about GpuArray in the Using GPU section of the tutorial.

Parent 5b1c0b3b
...@@ -69,12 +69,17 @@ The following libraries and software are optional:
    To be able to make pictures of Theano computation graphs.
`NVIDIA CUDA drivers and SDK`_
    Required for GPU code generation/execution on NVIDIA GPUs.
`libgpuarray`_
    Required for GPU/CPU code generation on CUDA and OpenCL devices.

    :note: OpenCL support is still minimal for now.
.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
Linux
-----
...
...@@ -12,12 +12,15 @@ and their use for intensive parallel computation purposes, see `GPGPU
One of Theano's design goals is to specify computations at an abstract
level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations.  One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.

There are currently two ways to use a GPU: one that only supports
NVIDIA cards (:ref:`cuda`), and another, still in development, that
should support any OpenCL device as well as NVIDIA cards
(:ref:`gpuarray`).
.. _cuda:

CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
...@@ -420,6 +423,247 @@ What can be done to further increase the speed of the GPU version? Put your idea
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
.. _gpuarray:

GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit. Instructions for doing so
are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
While all types of devices are supported when using OpenCL, for the
remainder of this section whatever compute device you are using will
be referred to as a GPU.
Testing Theano with GPU
-----------------------
To see if your GPU is being used, cut and paste the following program
into a file and run it.
.. code-block:: python

    from theano import function, config, shared, tensor, sandbox
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], tensor.exp(x))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', r
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'
The program just computes ``exp()`` of a bunch of random numbers.  Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.
.. code-block:: text

    $ THEANO_FLAGS=device=cpu python check1.py
    [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
    Looping 1000 times took 2.6071999073 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the cpu

    $ THEANO_FLAGS=device=cuda0 python check1.py
    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
    Looping 1000 times took 2.28562092781 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the gpu
Returning a Handle to Device-Allocated Data
-------------------------------------------
By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the device some code runs on by only replacing
the value of the ``device`` flag without touching the code.

If you don't mind a loss of flexibility, you can ask Theano to return
the GPU object directly.  The following code is modified to do just that.
.. code-block:: python
    :emphasize-lines: 10,17

    from theano import function, config, shared, tensor, sandbox
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', numpy.asarray(r)
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'
Here the :func:`theano.sandbox.gpuarray.basic_ops.gpu_from_host` call
means "copy the input to the GPU".  However, during the optimization
phase, since the result will already be on the GPU, it is removed.  It
is used here to tell Theano that we want the result on the GPU.
The output is
.. code-block:: text

    $ THEANO_FLAGS=device=cuda0 python check2.py
    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
    Looping 1000 times took 0.455810785294 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the gpu
While the time per call appears to be much lower than the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is in part due to the asynchronous
nature of execution on GPUs, meaning that the work isn't completed
yet, just 'launched'.  We'll talk about that later.
The object returned is a GpuArray from pygpu.  It mostly acts as a
numpy ndarray, with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using the usual numpy casting, such as ``numpy.asarray()``.
Running the GPU at Full Speed
-----------------------------
Theano, in the interest of safety, usually returns a copy of the
internal compute memory from its functions.  If it didn't do that,
there are instances where calling the same function again would
overwrite the returned results, which could cause quite a few
debugging headaches.
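The hazard can be sketched with plain numpy, no GPU involved (the
buffer and function here are purely illustrative, not part of Theano's
API):

.. code-block:: python

    import numpy

    buf = numpy.zeros(3)  # stands in for a function's internal compute buffer

    def f():
        buf[...] = buf + 1  # each call writes into the same internal buffer
        return buf          # returned without a copy, as borrow=True would do

    r1 = f()
    r2 = f()
    # r1 and r2 alias the same memory, so the first result was overwritten

Returning a copy (the default) keeps ``r1`` intact, at the cost of an
extra allocation on every call.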
If you are really sure that it is safe for your program, you can ask
Theano to return the internal buffer.
.. code-block:: python
    :emphasize-lines: 10-11

    from theano import function, config, shared, tensor, sandbox, Out
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], Out(sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)),
                         borrow=True))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', numpy.asarray(r)
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'
Running this version produces the following output
.. code-block:: text

    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
    Looping 1000 times took 0.0259871482849 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the gpu
It is again much faster, but the same explanation about asynchronous
execution applies.
.. note::

    The advantages that ``borrow=True`` confers tend to diminish as the
    graph gets bigger.  It also has the notable disadvantage of
    introducing more potential for bugs.  In order to avoid output
    copies, it is recommended to investigate :ref:`shared variable
    updates <functionstateexample>` instead.
What Can be Accelerated on the GPU
----------------------------------
The performance characteristics will of course vary from device to
device, and also as we refine our implementation.
This backend supports all regular Theano data types (float32, float64,
int, ...); however, GPU support varies and some units can't deal with
double (float64) or small (less than 32 bits, like int16) data types.
You will get an error at compile time or runtime if this is the case.
Complex support is untested and most likely completely broken.
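In practice this means preparing your data in single precision when
targeting such devices; with the earlier examples, setting the Theano
flag ``floatX=float32`` achieves the same through ``config.floatX``.
A minimal numpy-only sketch (the array here is just an illustration):

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(22)
    # cast to float32 up front: single precision is the safest choice,
    # while float64 or sub-32-bit types may be rejected by some devices
    data = numpy.asarray(rng.rand(4), dtype='float32')
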
In general, large operations like matrix multiplication, or
element-wise operations with large inputs, will be significantly
faster.
GPU Async Capabilities
----------------------
By default, all operations on the GPU are run asynchronously.  This
means that they are only scheduled to run and the function returns.
This is handled somewhat transparently by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory
transfers between device and host. Another is introduced when
releasing active memory buffers on the GPU (active buffers are buffers
that are still in use by a kernel).
It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
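As a rough analogy in pure Python (no GPU or Theano involved), a
thread below stands in for a kernel launch; the names are illustrative
only:

.. code-block:: python

    import threading
    import time

    def launch(work):
        # schedule the work and return immediately, like a kernel launch
        t = threading.Thread(target=work)
        t.start()
        return t

    t0 = time.time()
    handle = launch(lambda: time.sleep(0.05))  # the "kernel"
    launch_time = time.time() - t0  # tiny: only the launch was timed
    handle.join()                   # synchronization point, like sync()
    total_time = time.time() - t0   # now includes the actual work

Timing without the synchronization point measures only the launch,
which is why benchmarks should synchronize before reading the clock.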
The forced synchronization points interact with the garbage collection
of the intermediate results.  To get the fastest speed possible, you
should disable the garbage collector by using the Theano flag
``allow_gc=False``.  Be aware that this will increase memory usage,
sometimes significantly.
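For example, the flag can be combined with the device selection used
earlier (the script name is the one from the examples above):

.. code-block:: text

    $ THEANO_FLAGS=device=cuda0,allow_gc=False python check2.py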
-------------------------------------------
...@@ -620,7 +864,6 @@ Use this code to test it:
Exercise
========

Run the preceding example.

Modify and execute to multiply two matrices: *x* * *y*.
...