Commit 3c9d6446 authored by Arnaud Bergeron

Add a section about GpuArray in the Using GPU section of the tutorial.

Parent 5b1c0b3b
...@@ -69,12 +69,17 @@ The following libraries and software are optional:
    To be able to make pictures of Theano computation graphs.
`NVIDIA CUDA drivers and SDK`_
    Required for GPU code generation/execution on NVIDIA GPUs.
`libgpuarray`_
    Required for GPU/CPU code generation on CUDA and OpenCL devices.

    :note: OpenCL support is still minimal for now.
.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
Linux
-----
...
...@@ -12,12 +12,15 @@ and their use for intensive parallel computation purposes, see `GPGPU
One of Theano's design goals is to specify computations at an abstract
level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations.  One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.

There are currently two ways to use a GPU: one that only supports
NVIDIA cards (:ref:`cuda`), and another, still in development, that
should support any OpenCL device as well as NVIDIA cards
(:ref:`gpuarray`).
.. _cuda:

CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
...@@ -420,6 +423,247 @@ What can be done to further increase the speed of the GPU version? Put your idea
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
.. _gpuarray:

GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit. Instructions for doing so
are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
While all types of devices are supported when using OpenCL, for the
remainder of this section whatever compute device you are using will
be referred to as a GPU.
Testing Theano with GPU
-----------------------
To see if your GPU is being used, cut and paste the following program
into a file and run it.
.. code-block:: python

    from theano import function, config, shared, tensor, sandbox
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], tensor.exp(x))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', r
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'
The program just computes ``exp()`` of a bunch of random numbers.  Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.
.. code-block:: text

    $ THEANO_FLAGS=device=cpu python check1.py
    [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
    Looping 1000 times took 2.6071999073 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the cpu

    $ THEANO_FLAGS=device=cuda0 python check1.py
    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
    Looping 1000 times took 2.28562092781 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the gpu
Returning a Handle to Device-Allocated Data
-------------------------------------------
By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the device some code runs on by only replacing
the value of the ``device`` flag without touching the code.

If you don't mind a loss of flexibility, you can ask Theano to return
the GPU object directly.  The following code is modified to do just that.
.. code-block:: python
    :emphasize-lines: 10,17

    from theano import function, config, shared, tensor, sandbox
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', numpy.asarray(r)
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'
Here the :func:`theano.sandbox.gpuarray.basic_ops.gpu_from_host` call
means "copy the input to the GPU".  However, during the optimization
phase, since the result will already be on the GPU, it is removed.  It
is used here to tell Theano that we want the result on the GPU.
The output is
.. code-block:: text

    $ THEANO_FLAGS=device=cuda0 python check2.py
    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
    Looping 1000 times took 0.455810785294 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the gpu
While the time per call appears to be much lower than the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is in part due to the asynchronous
nature of execution on GPUs, meaning that the work isn't completed
yet, just 'launched'.  We'll talk about that later.
The object returned is a GpuArray from pygpu.  It mostly acts as a
numpy ndarray, with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using the usual numpy casting, such as ``numpy.asarray()``.
Running the GPU at Full Speed
-----------------------------
Theano, in the interest of safety, usually returns a copy of the
internal compute memory from its functions.  If it didn't do that,
there are instances where calling the same function again would
overwrite the returned results, which could cause quite a few
debugging headaches.
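The hazard can be sketched with plain numpy, no GPU involved (the
buffer and function here are purely illustrative, not part of Theano's
API):

.. code-block:: python

    import numpy

    buf = numpy.zeros(3)  # stands in for a function's internal compute buffer

    def f():
        buf[...] = buf + 1  # each call writes into the same internal buffer
        return buf          # returned without a copy, as borrow=True would do

    r1 = f()
    r2 = f()
    # r1 and r2 alias the same memory, so the first result was overwritten

Returning a copy (the default) keeps ``r1`` intact, at the cost of an
extra allocation on every call.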
If you are really sure that it is safe for your program, you can ask
Theano to return the internal buffer.
.. code-block:: python
    :emphasize-lines: 10-11

    from theano import function, config, shared, tensor, sandbox, Out
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], Out(sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)),
                         borrow=True))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', numpy.asarray(r)
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'
Running this version produces the following output
.. code-block:: text

    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
    Looping 1000 times took 0.0259871482849 seconds
    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
      1.62323285]
    Used the gpu
It is again much faster, but the same explanation about asynchronous
execution applies.
.. note::

    The advantages that ``borrow=True`` confers tend to diminish as the
    graph gets bigger.  It also has the notable disadvantage of
    introducing more potential for bugs.  In order to avoid output
    copies, it is recommended to investigate :ref:`shared variable
    updates <functionstateexample>` instead.
What Can be Accelerated on the GPU
----------------------------------
The performance characteristics will of course vary from device to
device, and also as we refine our implementation.
This backend supports all regular Theano data types (float32, float64,
int, ...); however, GPU support varies and some units can't deal with
double (float64) or small (less than 32 bits, like int16) data types.
You will get an error at compile time or runtime if this is the case.
Complex support is untested and most likely completely broken.
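In practice this means preparing your data in single precision when
targeting such devices; with the earlier examples, setting the Theano
flag ``floatX=float32`` achieves the same through ``config.floatX``.
A minimal numpy-only sketch (the array here is just an illustration):

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(22)
    # cast to float32 up front: single precision is the safest choice,
    # while float64 or sub-32-bit types may be rejected by some devices
    data = numpy.asarray(rng.rand(4), dtype='float32')
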
In general, large operations like matrix multiplication, or
element-wise operations with large inputs, will be significantly
faster.
GPU Async Capabilities
----------------------
By default, all operations on the GPU are run asynchronously.  This
means that they are only scheduled to run and the function returns.
This is handled somewhat transparently by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory
transfers between device and host. Another is introduced when
releasing active memory buffers on the GPU (active buffers are buffers
that are still in use by a kernel).
It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
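As a rough analogy in pure Python (no GPU or Theano involved), a
thread below stands in for a kernel launch; the names are illustrative
only:

.. code-block:: python

    import threading
    import time

    def launch(work):
        # schedule the work and return immediately, like a kernel launch
        t = threading.Thread(target=work)
        t.start()
        return t

    t0 = time.time()
    handle = launch(lambda: time.sleep(0.05))  # the "kernel"
    launch_time = time.time() - t0  # tiny: only the launch was timed
    handle.join()                   # synchronization point, like sync()
    total_time = time.time() - t0   # now includes the actual work

Timing without the synchronization point measures only the launch,
which is why benchmarks should synchronize before reading the clock.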
The forced synchronization points interact with the garbage collection
of the intermediate results.  To get the fastest speed possible, you
should disable the garbage collector by using the Theano flag
``allow_gc=False``.  Be aware that this will increase memory usage,
sometimes significantly.
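For example, the flag can be combined with the device selection used
earlier (the script name is the one from the examples above):

.. code-block:: text

    $ THEANO_FLAGS=device=cuda0,allow_gc=False python check2.py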
-------------------------------------------
...@@ -620,7 +864,6 @@ Use this code to test it:
Exercise
========

Run the preceding example.

Modify and execute to multiply two matrices: *x* * *y*.
...