@@ -15,28 +15,41 @@ about how to carry out those computations. One of the ways we take
...
advantage of this flexibility is in carrying out calculations on a
graphics card.
There are currently two ways to use a GPU: one that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend, which
only supports NVIDIA cards (:ref:`cuda`).
.. _gpuarray:

GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit. Instructions for doing so
are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
While OpenCL supports all types of devices, for the remainder of this
section whatever compute device you are using will be referred to as the
GPU.
.. warning::

    The backend was designed to support OpenCL; however, current support is
    incomplete. Many very useful ops still do not support it because they
    were ported from the old backend with minimal changes.
Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program
into a file and run it.
Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
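If you do not want to set the flag on every run, the same options can be
made persistent in a ``.theanorc`` file. A minimal sketch (``cuda0`` is an
assumed device name here; adjust it for your machine):

.. code-block:: cfg

    [global]
    device = cuda0
    floatX = float32

Equivalently, the same options can be passed for a single run through the
``THEANO_FLAGS`` environment variable.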
.. testcode::
    from theano import function, config, shared, tensor
    import numpy
    import time
...
@@ -45,7 +58,7 @@ file and run it.
    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], tensor.exp(x))
    print(f.maker.fgraph.toposort())
    t0 = time.time()
    for i in range(iters):
...
@@ -53,20 +66,16 @@ file and run it.
    t1 = time.time()
    print("Looping %d times took %f seconds" % (iters, t1 - t0))
    print("Result is %s" % (r,))
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print('Used the cpu')
    else:
        print('Used the gpu')
The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.
.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
.. testoutput::
   :hide:
...
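As a point of reference for the CPU numbers above, the timing loop can be
sketched as a standalone NumPy-only program. The ``vlen`` and ``iters``
values below are illustrative, not necessarily those of the full listing:

.. code-block:: python

    import time

    import numpy

    # Illustrative problem size: a vector of random floats and a loop count.
    vlen = 10 * 30 * 768
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = numpy.asarray(rng.rand(vlen), dtype='float32')

    t0 = time.time()
    for i in range(iters):
        # numpy.exp allocates and computes on the CPU every iteration.
        r = numpy.exp(x)
    t1 = time.time()
    print("Looping %d times took %f seconds" % (iters, t1 - t0))

Because ``rand`` draws from [0, 1), every entry of ``r`` lies in [1, e);
timing this loop gives the CPU baseline that the GPU run is compared against.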
@@ -79,40 +88,36 @@ same floating-point numbers as the CPU. As a benchmark, a loop that calls ``nump