For an introductory discussion of *Graphical Processing Units* (GPU) and their
use for intensive parallel computation purposes, see `GPGPU
<http://en.wikipedia.org/wiki/GPGPU>`_.
One of Theano's design goals is to specify computations at an abstract level,
so that the internal function compiler has a lot of flexibility about how to
carry out those computations. One of the ways we take advantage of this
flexibility is in carrying out calculations on a graphics card.
There are currently two ways to use a GPU: one backend that supports only
NVIDIA cards (:ref:`cuda`), and another, still in development, that should
support any OpenCL device as well as NVIDIA cards (:ref:`gpuarray`).

.. _cuda:

CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for :ref:`Linux <gpu_linux>`,
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
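
Once CUDA is installed, Theano is typically configured through Theano flags.
As a sketch, a minimal ``.theanorc`` in your home directory that enables the
GPU backend could look like this (the values are illustrative; adjust them to
your setup):

.. code-block:: ini

    [global]
    device = gpu
    floatX = float32

The same flags can also be passed for a single run through the
``THEANO_FLAGS`` environment variable.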

Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~

To see if your GPU is being used, cut and paste the following program into a
file and run it.

.. code-block:: python

    ...

The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to make sure that the input *x* is
stored on the graphics device.

.. the following figures have been measured twice on BART3 on Aug 2nd 2012
   with no other job running simultaneously

If I run this program (in check1.py) with ``device=cpu``, my computer takes a
little over 3 seconds, whereas on the GPU it takes just over 0.64 seconds. The
GPU will not always produce the exact same floating-point numbers as the CPU.
As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about
46 seconds.

.. code-block:: text

    ...

Note that GPU operations in Theano require for now ``floatX`` to be *float32*
(see also below).
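
The 46-second pure-NumPy baseline mentioned above can be reproduced in spirit
with a plain loop. This is only a sketch: the array length and iteration count
below are assumptions, and ``x`` here is an ordinary ndarray rather than a
Theano shared variable:

.. code-block:: python

    import time

    import numpy

    # Assumed problem size and iteration count; the actual benchmark's
    # values may differ.
    vlen = 10 * 30 * 768
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = numpy.asarray(rng.rand(vlen), dtype='float32')

    t0 = time.time()
    for i in range(iters):
        r = numpy.exp(x)  # one fresh exp() over the whole array per iteration
    t1 = time.time()
    print("Looping %d times took %f seconds" % (iters, t1 - t0))

The figure quoted above also includes the per-call ``get_value()`` copy out of
the shared variable, so the timings are not directly comparable.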

Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray, which has already been copied from
the device back to the host.

.. code-block:: python

    ...

The output from this program is

.. code-block:: text

    ...
     1.62323296]
    Used the gpu
Here we've shaved off about 50% of the run-time by simply not copying the
resulting array back to the host. The object returned by each function call is
now not a NumPy array but a "CudaNdarray", which can be converted to a NumPy
ndarray by the normal NumPy casting mechanism, using something like
``numpy.asarray()``.
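
``CudaNdarray`` itself only exists in a GPU-enabled build, but the conversion
mechanism can be illustrated with a stand-in: ``numpy.asarray()`` converts any
object exposing the ``__array__`` protocol, which is the "normal NumPy casting
mechanism" referred to above. The ``FakeDeviceArray`` class below is purely
illustrative and not part of Theano:

.. code-block:: python

    import numpy

    class FakeDeviceArray(object):
        """Illustrative stand-in for a device-held array like CudaNdarray."""

        def __init__(self, data):
            self._data = numpy.asarray(data, dtype='float32')

        def __array__(self, dtype=None, copy=None):
            # A real CudaNdarray would copy device memory back to the host.
            return numpy.array(self._data, dtype=dtype)

    r = FakeDeviceArray([0.5, 1.0, 2.0])
    host = numpy.asarray(r)  # invokes __array__ and yields an ndarray
    print(type(host).__name__)  # -> ndarray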

Running the GPU at Full Speed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To really get maximum performance in this simple example, we need to use an
:class:`out<function.Out>` instance with the flag ``borrow=True`` to tell
Theano not to copy the output it returns to us. This is because Theano
pre-allocates memory for internal use (such as working buffers) and, by
default, will never return a result that is aliased to one of its internal
buffers: instead, it copies the buffers associated with outputs into newly
allocated memory at each function call. This ensures that subsequent function
calls will not overwrite previously computed outputs. Although this is
normally what you want, our last example was so simple that it had the
unwanted side-effect of really slowing things down.
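
The cost being avoided here can be illustrated outside of Theano: in plain
NumPy, writing results into a single preallocated buffer (which is what
``borrow=True`` permits Theano to do with its internal buffer) skips a fresh
allocation on every call. A sketch with assumed sizes, not Theano code:

.. code-block:: python

    import time

    import numpy

    x = numpy.random.rand(10 * 30 * 768).astype('float32')  # assumed size
    buf = numpy.empty_like(x)
    iters = 1000

    t0 = time.time()
    for i in range(iters):
        r = numpy.exp(x)       # allocates a fresh output array on each call
    t_fresh = time.time() - t0

    t0 = time.time()
    for i in range(iters):
        numpy.exp(x, out=buf)  # reuses the one preallocated buffer
    t_reuse = time.time() - t0

    print("fresh: %.3fs  reused: %.3fs" % (t_fresh, t_reuse))

How much the reuse helps depends on the platform and allocator; the point is
only that per-call allocation and copying have a real cost.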

..
   TODO: The story here about copying and working buffers is misleading and
   potentially not correct... why exactly does borrow=True cut 75% of the
   runtime???

.. TODO: Answer by Olivier D: it sounds correct to me -- memory allocations
   must be slow.