Commit 3e3ba8f8 authored by slefrancois

Correct using_gpu tutorial

Parent 21ae3bd0
@@ -19,6 +19,12 @@ There are two ways currently to use a gpu, one that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend that
only supports NVIDIA cards (:ref:`cuda`).
.. warning::

   If you want to use the new GpuArray backend, make sure to have the
   development version of Theano installed. The 0.8.X releases have not
   been optimized to work correctly with the new backend.
.. _gpuarray:

GpuArray Backend
----------------
@@ -73,7 +79,7 @@ Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
   else:
       print('Used the gpu')
The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.
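For concreteness, the workload being timed is just an element-wise exponential over a long vector. Stripped of Theano's graph compilation and device placement, a plain NumPy sketch of the same computation looks like this (the vector length and iteration count here are illustrative; the tutorial script defines its own):

```python
import time
import numpy

vlen = 10 * 30 * 768  # illustrative vector length
iters = 10            # illustrative; the tutorial script loops 1000 times

rng = numpy.random.RandomState(22)
# float32 mirrors what config.floatX gives when floatX=float32 is set
x = rng.rand(vlen).astype('float32')

t0 = time.time()
for _ in range(iters):
    r = numpy.exp(x)
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
```

Theano compiles the same expression once into a callable function, so on the GPU each call dispatches a single elementwise kernel rather than executing this Python-level loop body.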
@@ -88,21 +94,22 @@ input *x* is stored on the GPU.
.. code-block:: none
   $ THEANO_FLAGS=device=cpu python gpu_tutorial1.py
   [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
   Looping 1000 times took 2.271284 seconds
   Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
   Used the cpu

   $ THEANO_FLAGS=device=cuda0 python gpu_tutorial1.py
   Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
   [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
   Looping 1000 times took 1.202734 seconds
   Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
   Used the gpu
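As an aside, the final ``Used the cpu`` / ``Used the gpu`` message is derived from the optimized graph printed on the first line of each run: the script reports the GPU only when no plain (non-``Gpu``) ``Elemwise`` node remains. That decision can be sketched over the printed op names as follows (``used_gpu`` is an illustrative helper, not a Theano function):

```python
def used_gpu(op_names):
    # GPU was used if no plain (non-Gpu) Elemwise op is left in the graph
    return not any(name.startswith('Elemwise{') for name in op_names)

# Op names as printed by f.maker.fgraph.toposort() in the two runs above
cpu_run = ['Elemwise{exp,no_inplace}']
gpu_run = ['GpuElemwise{exp,no_inplace}', 'HostFromGpu(gpuarray)']

print('Used the gpu' if used_gpu(gpu_run) else 'Used the cpu')  # Used the gpu
print('Used the gpu' if used_gpu(cpu_run) else 'Used the cpu')  # Used the cpu
```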
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -126,8 +133,7 @@ the GPU object directly. The following code is modified to do just that.
   rng = numpy.random.RandomState(22)
   x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
   f = function([], tensor.exp(x).transfer('dev0'))
   print(f.maker.fgraph.toposort())
   t0 = time.time()
   for i in range(iters):
@@ -142,9 +148,9 @@ the GPU object directly. The following code is modified to do just that.
   else:
       print('Used the gpu')
Here ``tensor.exp(x).transfer('dev0')`` means "copy the value of ``exp(x)``
to the GPU" (here, to the context named ``dev0``); passing ``None`` instead
selects the default GPU context when one is not explicitly given.
For information on how to set GPU contexts, see :ref:`tut_using_multi_gpu`.
The output is
@@ -160,15 +166,14 @@ The output is
.. code-block:: none
   $ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
   Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
   [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
   Looping 1000 times took 0.089194 seconds
   Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
   Used the gpu
While the time per call appears to be much lower than the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is in part due to the asynchronous nature
@@ -217,33 +222,24 @@ Tips for Improving Performance on GPU
* Consider adding ``floatX=float32`` (or the type you are using) to your
  ``.theanorc`` file if you plan to do a lot of GPU work.
* The GPU backend supports *float64* variables, but they are still slower
  to compute than *float32*. The more *float32*, the better GPU performance
  you will get.
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
  ``dmatrix``, ``dvector`` and ``dscalar``, because the former will give
  you *float32* variables when ``floatX=float32``, while the latter always
  give *float64* variables and ignore the type set in ``floatX``.
* Minimize transfers to the GPU device by using ``shared`` variables
  to store frequently-accessed data (see :func:`shared()<shared.shared>`).
  When using the GPU, tensor ``shared`` variables are stored on
  the GPU by default to eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with
  the ``profile=True`` flag. This should print some timing information at program
  termination. Is time being used sensibly? If an op or Apply is
  taking more time than its share, then if you know something about GPU
  programming, have a look at how it's implemented in theano.gpuarray.
  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
  Xs(X%) in transfer op*. This can tell you if not enough of your graph is
  on the GPU or if there is too much memory transfer.
* To investigate whether all the Ops in the computational graph are
  running on GPU, it is possible to debug or check your code by providing
  a value to the `assert_no_cpu_op` flag, i.e. `warn` for a warning, `raise` for
...
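Several of the flags from the tips above can be made permanent through ``.theanorc``. A minimal sketch (the device name is illustrative; adjust it to your setup):

```cfg
[global]
device = cuda0
floatX = float32

# warn whenever an op in the graph falls back to the CPU
assert_no_cpu_op = warn
```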