One of Theano's design goals is to specify computations at an abstract
level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.

There are currently two ways to use a GPU: one that supports only NVIDIA
cards (:ref:`cuda`), and another, in development, that should support any
OpenCL device as well as NVIDIA cards (:ref:`gpuarray`).

.. _cuda:

CUDA backend
------------

If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
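With the toolchain in place, the usual way to point Theano at the GPU is through its configuration flags, for example via the ``THEANO_FLAGS`` environment variable (the script name below is just a placeholder):

```shell
# Run one script on the GPU with single-precision floats
THEANO_FLAGS=device=gpu,floatX=float32 python your_script.py
```

The same options can be made permanent by putting ``device = gpu`` and ``floatX = float32`` under the ``[global]`` section of ``~/.theanorc``.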
The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to
make sure that the input *x* is stored on the graphics device.

.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously

If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
.. code-block:: text

    ...
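For reference, the NumPy baseline mentioned above can be timed with a plain Python loop like this sketch (the vector length and iteration count here are illustrative, not the ones behind the figures above):

```python
import time

import numpy

vlen = 10 * 30 * 768  # illustrative vector length
iters = 100           # fewer iterations than the benchmark above

rng = numpy.random.RandomState(22)
x = rng.rand(vlen).astype('float32')

t0 = time.time()
for _ in range(iters):
    r = numpy.exp(x)
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
```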
Running the GPU at Full Speed
-----------------------------
To really get maximum performance in this simple example, we need to use an
:class:`Out <function.Out>` instance with the flag ``borrow=True`` to tell Theano not to copy
the output it returns to us. This is because Theano pre-allocates memory for internal use
(like working buffers), and by default will never return a result that is aliased to one of
its internal buffers: instead, it copies the buffers associated with outputs into newly
allocated memory at each function call. This ensures that subsequent function calls will
not overwrite previously computed outputs. Although this is normally what you want, our last
example was so simple that it had the unwanted side-effect of really slowing things down.

..
    TODO:
    The story here about copying and working buffers is misleading and potentially not correct
    ... why exactly does borrow=True cut 75% of the runtime ???
.. code-block:: python

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))