testgroup / pytensor · Commits

Commit 3e3ba8f8
Authored May 16, 2016 by slefrancois

Correct using_gpu tutorial

Parent: 21ae3bd0

Showing 1 changed file with 29 additions and 33 deletions

doc/tutorial/using_gpu.txt  +29  -33
@@ -19,6 +19,12 @@ There are two ways currently to use a gpu, one that should support any OpenCL
 device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend that
 only supports NVIDIA cards (:ref:`cuda`).
 
+.. warning::
+
+    If you want to use the new GpuArray backend, make sure to have the
+    development version of Theano installed. The 0.8.X releases have not
+    been optimized to work correctly with the new backend.
+
 .. _gpuarray:
 
 GpuArray Backend
@@ -73,7 +79,7 @@ Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
     else:
         print('Used the gpu')
 
-The program just compute ``exp()`` of a bunch of random numbers. Note
+The program just computes ``exp()`` of a bunch of random numbers. Note
 that we use the :func:`theano.shared` function to make sure that the
 input *x* is stored on the GPU.
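The script being renamed in this commit (``check1.py`` → ``gpu_tutorial1.py``) times repeated calls of a compiled ``exp`` function. As a rough, NumPy-only sketch of that benchmark loop (an illustration added here, not the commit's code: the real script compiles a Theano function over a ``shared`` variable; ``vlen`` follows the published tutorial, and the reduced ``iters`` is an assumption to keep the sketch fast):

```python
import time
import numpy

# NumPy-only sketch of the tutorial's benchmark loop.
vlen = 10 * 30 * 768   # 10 x # cores x # threads per core (tutorial's value)
iters = 10             # the tutorial loops 1000 times

rng = numpy.random.RandomState(22)
x = rng.rand(vlen)     # the tutorial stores this in a theano.shared variable

t0 = time.time()
for i in range(iters):
    r = numpy.exp(x)   # stands in for calling the compiled Theano function
t1 = time.time()

print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r[:3],))
```

On the GPU the same loop runs the compiled function instead, and the printed toposort reveals whether ``GpuElemwise`` ops were generated.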
@@ -88,21 +94,22 @@ input *x* is stored on the GPU.
 
 .. code-block:: none
 
-    $ THEANO_FLAGS=device=cpu python check1.py
+    $ THEANO_FLAGS=device=cpu python gpu_tutorial1.py
     [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
-    Looping 1000 times took 2.6071999073 seconds
+    Looping 1000 times took 2.271284 seconds
     Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
       1.62323285]
     Used the cpu
 
-    $ THEANO_FLAGS=device=cuda0 python check1.py
-    Using device cuda0: GeForce GTX 275
-    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
-    Looping 1000 times took 2.28562092781 seconds
+    $ THEANO_FLAGS=device=cuda0 python gpu_tutorial1.py
+    Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
+    [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
+    Looping 1000 times took 1.202734 seconds
     Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
       1.62323285]
     Used the gpu
 
 Returning a Handle to Device-Allocated Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -126,8 +133,7 @@ the GPU object directly. The following code is modified to do just that.
     rng = numpy.random.RandomState(22)
     x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-    gx = x.transfer(None) # Transfer variable to GPU
-    f = function([], tensor.exp(gx))
+    f = function([], tensor.exp(x).transfer('dev0'))
     print(f.maker.fgraph.toposort())
     t0 = time.time()
     for i in range(iters):
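The point of returning a handle is to skip the device-to-host copy on every call. A loose CPU-side analogy (an illustration added here, not code from this commit): ``f_copy`` materializes a fresh array per call, while ``f_handle`` hands back the same preallocated buffer, much as leaving the result on the GPU avoids a transfer per call.

```python
import numpy

# Preallocated buffer playing the role of device-resident storage.
x = numpy.linspace(0.0, 1.0, 1_000_000, dtype=numpy.float32)
out = numpy.empty_like(x)

def f_copy():
    # Allocates and fills a brand-new array on every call
    # (analogous to transferring the result back to the host).
    return numpy.exp(x)

def f_handle():
    # Writes into the existing buffer and returns it: the caller
    # receives a handle, not a copy.
    numpy.exp(x, out=out)
    return out

a = f_copy()
b = f_handle()
assert b is out   # same buffer on every call
```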
@@ -142,9 +148,9 @@ the GPU object directly. The following code is modified to do just that.
     else:
         print('Used the gpu')
 
-Here ``gx = x.transfer(None)`` means "copy variable x to the GPU", with
-``None`` the default GPU context when not explicitly given. For information
-on how to set GPU contexts, see :ref:`tut_using_multi_gpu`.
+Here ``tensor.exp(x).transfer('None')`` means "copy ``exp(x)`` to the GPU",
+with ``None`` the default GPU context when not explicitly given. For information
+on how to set GPU contexts, see :ref:`tut_using_multi_gpu`.
 
 The output is
@@ -160,15 +166,14 @@ The output is
 
 .. code-block:: none
 
-    $ THEANO_FLAGS=device=cuda0 python check2.py
-    Using device cuda0: GeForce GTX 275
-    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
-    Looping 1000 times took 0.455810785294 seconds
+    $ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
+    Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
+    [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
+    Looping 1000 times took 0.089194 seconds
     Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
       1.62323285]
     Used the gpu
 
 While the time per call appears to be much lower than the two previous
 invocations (and should indeed be lower, since we avoid a transfer)
 the massive speedup we obtained is in part due to asynchronous nature
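The asynchronous effect the context lines describe can be reproduced in pure Python (an illustration added here, not the tutorial's code): submitting work to a queue returns immediately, so timing only the submission loop understates the real cost, just as GPU kernel launches return before the kernels finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(x):
    time.sleep(0.01)   # stand-in for a GPU kernel
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.time()
    futures = [pool.submit(work, i) for i in range(5)]
    submit_time = time.time() - t0            # work has only been queued

    results = [f.result() for f in futures]   # blocking, like a host transfer
    total_time = time.time() - t0             # includes the actual work

print("queued in %.4fs, finished in %.4fs" % (submit_time, total_time))
```

This is why the loop that keeps its result on the device looks so much faster: the honest comparison must include the point where the host actually waits for the data.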
@@ -217,33 +222,24 @@ Tips for Improving Performance on GPU
 
 * Consider adding ``floatX=float32`` (or the type you are using) to your
   ``.theanorc`` file if you plan to do a lot of GPU work.
-* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`
+* The GPU backend supports *float64* variables, but they are still slower
+  to compute than *float32*. The more *float32*, the better GPU performance
+  you will get.
 * Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
   ``dmatrix``, ``dvector`` and ``dscalar`` because the former will give
-  you *float32* variables when ``floatX=float32``.
-* Ensure that your output variables have a *float32* dtype and not *float64*.
-  The more *float32* variables are in your graph, the more work the GPU can do for
-  you.
-* Minimize transfers to the GPU device by using ``shared`` *float32* variables
+  you *float32* variables and ignore the type given to ``floatX``.
+* Minimize transfers to the GPU device by using ``shared`` variables
   to store frequently-accessed data (see :func:`shared()<shared.shared>`).
-  When using the GPU, *float32* tensor ``shared`` variables are stored on
+  When using the GPU, tensor ``shared`` variables are stored on
   the GPU by default to eliminate transfer time for GPU ops using those variables.
 * If you aren't happy with the performance you see, try running your script with
   ``profile=True`` flag. This should print some timing information at program
   termination. Is time being used sensibly? If an op or Apply is
   taking more time than its share, then if you know something about GPU
-  programming, have a look at how it's implemented in theano.sandbox.cuda.
+  programming, have a look at how it's implemented in theano.gpuarray.
   Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
   Xs(X%) in transfer op*. This can tell you if not enough of your graph is
   on the GPU or if there is too much memory transfer.
-* Use nvcc options. nvcc supports those options to speed up some computations:
-  `-ftz=true` to `flush denormals values to zeros.
-  <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
-  `--prec-div=false` and `--prec-sqrt=false` options to speed up
-  division and square root operation by being less precise. You can
-  enable all of them with the `nvcc.flags=--use_fast_math` Theano
-  flag or you can enable them individually as in this example:
-  `nvcc.flags=-ftz=true --prec-div=false`.
 * To investigate whether all the Ops in the computational graph are
   running on GPU, it is possible to debug or check your code by providing
   a value to `assert_no_cpu_op` flag, i.e. `warn`, for warning, `raise` for
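The tips above map onto a ``.theanorc`` along these lines (an illustrative sketch, not part of the commit; the option names are standard Theano configuration flags):

```ini
[global]
floatX = float32
device = cuda0

# Print timing information at program termination (see the profiling tip):
# profile = True

# Warn (or raise) when an Op unexpectedly runs on the CPU:
# assert_no_cpu_op = warn
```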