testgroup / pytensor · Commits

Commit 30617ff5, authored Oct 23, 2015 by Arnaud Bergeron
Parent: 3bf6f4cb

    Rest of libdoc for gpuarray.

Showing 5 changed files with 240 additions and 50 deletions (+240, -50)
- doc/library/sandbox/gpuarray/dnn.txt (+161, -0)
- doc/library/sandbox/gpuarray/extra.txt (+17, -0)
- doc/library/sandbox/gpuarray/index.txt (+2, -0)
- theano/sandbox/gpuarray/kernel_codegen.py (+35, -28)
- theano/sandbox/gpuarray/opt_util.py (+25, -22)
doc/library/sandbox/gpuarray/dnn.txt (new file, mode 100644)
.. _libdoc_gpuarray_dnn:
===========================================
:mod:`theano.sandbox.gpuarray.dnn` -- cuDNN
===========================================
.. moduleauthor:: LISA
`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library of
primitives for deep neural networks. It provides optimized
implementations of some operations, such as convolution. cuDNN is not
currently bundled with CUDA, so you must download and install it
yourself.
To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:
- The easiest is to include them in your CUDA installation. Copy the
``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
on Linux).
- Alternatively, on Linux, you can set the environment variables
  ``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
  extracted from the download. If needed, separate multiple directories
  with ``:``, as in the ``PATH`` environment variable. For example::

      export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
      export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
      export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH
- And as a third way, also on Linux, you can copy the ``*.h`` files
to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.
By default, Theano detects whether it can use cuDNN and, if so, uses
it. If not, Theano optimizations will not introduce cuDNN ops, so
Theano will still work if you do not introduce them manually.
To get an error when Theano cannot use cuDNN, use this Theano flag:
``optimizer_including=cudnn``.
.. note::
    cuDNN v3 has now been released. cuDNN v2 remains supported, but v3 is
    faster and offers many more options, so we recommend that everybody
    upgrade to v3.
.. note::
Starting in CuDNN v3, multiple convolution implementations are offered and
it is possible to use heuristics to automatically choose a convolution
implementation well suited to the parameters of the convolution.
    The Theano flag ``dnn.conv.algo_fwd`` lets you specify which cuDNN
    convolution implementation Theano should use for forward convolutions.
    Possible values include:

    * ``small`` (default): use a convolution implementation with small
      memory usage.
    * ``none``: use a slower implementation with minimal memory usage.
    * ``large``: use a sometimes faster implementation with large memory
      usage.
    * ``fft``: use the Fast Fourier Transform implementation of convolution
      (very high memory usage).
    * ``guess_once``: the first time a convolution is executed, the
      implementation to use is chosen according to cuDNN's heuristics and
      reused for every subsequent execution of the convolution.
    * ``guess_on_shape_change``: like ``guess_once``, but a new convolution
      implementation is selected whenever the shapes of the inputs and
      kernels do not match the shapes from the last execution.
    * ``time_once``: the first time a convolution is executed, every
      convolution implementation offered by cuDNN is executed and timed.
      The fastest is reused for every subsequent execution of the
      convolution.
    * ``time_on_shape_change``: like ``time_once``, but a new convolution
      implementation is selected whenever the shapes of the inputs and
      kernels do not match the shapes from the last execution.
    The Theano flag ``dnn.conv.algo_bwd`` lets you specify which cuDNN
    convolution implementation Theano should use for gradient convolutions.
    Possible values include:

    * ``none`` (default): use the default non-deterministic convolution
      implementation.
    * ``deterministic``: use a slower but deterministic implementation.
    * ``fft``: use the Fast Fourier Transform implementation of convolution
      (very high memory usage).
    * ``guess_once``: the first time a convolution is executed, the
      implementation to use is chosen according to cuDNN's heuristics and
      reused for every subsequent execution of the convolution.
    * ``guess_on_shape_change``: like ``guess_once``, but a new convolution
      implementation is selected whenever the shapes of the inputs and
      kernels do not match the shapes from the last execution.
    * ``time_once``: the first time a convolution is executed, every
      convolution implementation offered by cuDNN is executed and timed.
      The fastest is reused for every subsequent execution of the
      convolution.
    * ``time_on_shape_change``: like ``time_once``, but a new convolution
      implementation is selected whenever the shapes of the inputs and
      kernels do not match the shapes from the last execution.
``guess_*`` and ``time_*`` flag values take into account the amount of
available memory when selecting an implementation. This means that slower
implementations might be selected if not enough memory is available for the
faster implementations.
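For instance, the flags documented above can be combined in ``THEANO_FLAGS``
before launching a script (a hedged sketch; the flag names are those from
this section, while the script name is a placeholder):

```shell
# Select cuDNN heuristics for forward convolutions, a deterministic
# backward implementation, and error out if cuDNN cannot be used.
export THEANO_FLAGS="optimizer_including=cudnn,dnn.conv.algo_fwd=guess_once,dnn.conv.algo_bwd=deterministic"
# Then run your own script, e.g.: python train.py
```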
.. note::

    Normally you should not call GPU ops directly, but the CPU interface
    currently does not expose all the options supported by cuDNN ops, so
    you may need to call them manually.
.. note::

    The cuDNN documentation states that reproducibility is not guaranteed
    for the default implementations of the following two operations:
    `cudnnConvolutionBackwardFilter` and `cudnnConvolutionBackwardData`.
    These correspond to the gradient with respect to the weights and the
    gradient with respect to the input of the convolution. They are also
    sometimes used in the forward pass, when they give a speed-up.
    The Theano flag ``dnn.conv.algo_bwd`` can be used to force a slower
    but deterministic convolution implementation.
.. note::

    There is a problem we do not yet understand when cuDNN paths contain
    symbolic links, so avoid using them.
.. note::

    ``cudnn.so*`` must be readable and executable by everybody.
    ``cudnn.h`` must be readable by everybody.
Functions
=========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: dnn_conv, dnn_pool
Convolution Ops
===============
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnConvDesc, GpuDnnConv, GpuDnnConvGradW, GpuDnnConvGradI
Pooling Ops
===========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnPoolDesc, GpuDnnPool, GpuDnnPoolGrad
Softmax Ops
===========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnSoftmax, GpuDnnSoftmaxGrad
doc/library/sandbox/gpuarray/extra.txt (new file, mode 100644)
.. _libdoc_gpuarray_extra:
=================
Utility functions
=================
Optimisation
------------
.. automodule:: theano.sandbox.gpuarray.opt_util
:members:
Kernel generation
-----------------
.. automodule:: theano.sandbox.gpuarray.kernel_codegen
:members:
doc/library/sandbox/gpuarray/index.txt

@@ -14,4 +14,6 @@
    :maxdepth: 1

    op
+   dnn
    type
+   extra
theano/sandbox/gpuarray/kernel_codegen.py

@@ -71,17 +71,19 @@ def inline_reduce(N, buf, pos, count, manner_fn):
     count
         Number of executing threads.
     manner_fn
-        A function that accepts strings of arguments a and b, and returns c code
-        for their reduction.
-        Example: return "%(a)s + %(b)s" for a sum reduction.
-    :postcondition:
-        This function leaves the answer in position 0 of the buffer. The
-        rest of the buffer is trashed by this function.
+        A function that accepts strings of arguments a and b, and
+        returns c code for their reduction.
+        Example: return "%(a)s + %(b)s" for a sum reduction.
 
     Notes
     -----
-    buf should be in gpu shared memory, we access it many times.
+    `buf` should be in gpu shared memory, we access it many times.
+
+    This function leaves the answer in position 0 of the buffer. The
+    rest of the buffer is trashed by this function.
 
     """
     loop_line = manner_fn("%s[%s]" % (buf, pos), "%s[i]" % (buf))
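The `manner_fn` contract from the hunk above can be sketched in plain
Python (a hypothetical `sum_manner` helper, not part of the module): it
receives two C expression strings and returns the C code for their
reduction.

```python
# Hypothetical manner_fn for a sum reduction, as described in the
# docstring: takes two C expression strings, returns their combination.
def sum_manner(a, b):
    return "%(a)s + %(b)s" % {"a": a, "b": b}

# Mirrors the loop_line construction shown in the diff.
buf, pos = "buf", "threadIdx.x"
loop_line = sum_manner("%s[%s]" % (buf, pos), "%s[i]" % buf)
print(loop_line)  # buf[threadIdx.x] + buf[i]
```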
@@ -149,6 +151,13 @@ def inline_reduce_prod(N, buf, pos, count):
                          inline_reduce_sum.code_version)
 
 
 def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
     """
+    Generate code for a softmax.
+
+    On entry, `buf` and `buf2` must contain two identical copies of
+    the input to softmax.
+
+    After the code returns, `buf` contains the softmax and `buf2` the
+    un-normalized softmax.
+
     Parameters
     ----------
@@ -161,14 +170,10 @@ def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
     dtype
         Dtype of the softmax's output.
-    :Precondition: buf and buf2 contain two identical copies of the input
-        to softmax
-    :Postcondition: buf contains the softmax, buf2 contains un-normalized
-        softmax
 
     Notes
     -----
-    buf and buf2 should be in gpu shared memory, we access it many times.
+    `buf` and `buf2` should be in gpu shared memory, we access it many
+    times.
 
     We use __i as an int variable in a loop.
@@ -205,6 +210,9 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
     """
     Return C++ code for a function that reduces a contiguous buffer.
 
+    This function leaves the answer in position 0 of the buffer. The
+    rest of the buffer is trashed by this function.
+
     Parameters
     ----------
     N
@@ -230,20 +238,19 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
     dtype
         Optional, the dtype of the output.
     manner_fn
-        A function that accepts strings of arguments a and b, and returns c code
-        for their reduction.
-        Example: return "%(a)s + %(b)s" for a sum reduction.
-    manner_init
-        A function that accepts strings of arguments a and return c code for its
-        initialization.
-    :postcondition:
-        This function leaves the answer in position 0 of the buffer. The rest of the
-        buffer is trashed by this function.
+        A function that accepts strings of arguments a and b, and
+        returns c code for their reduction.
+        Example: return "%(a)s + %(b)s" for a sum reduction.
+    manner_init
+        A function that accepts strings of arguments a and return c
+        code for its initialization.
 
     Notes
     -----
-    buf should be in gpu shared memory, we access it many times.
+    `buf` should be in gpu shared memory, we access it many times.
 
     """
     if b:
@@ -320,6 +327,10 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
                                 b='', stride_b='', load_b='',
                                 dtype="float32"):
     """
+    Generate code to perform softmax with a fixed amount of shared
+    memory.
+
+    On entry, `buf` is assumed to be empty.
+
     Parameters
     ----------
@@ -352,13 +363,9 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
     dtype
         Optional, the dtype of the softmax's output if not float32.
-    :Precondition: buf is empty
-    :Postcondition: buf[0] contains the softmax, buf2 contains un-normalized
-        softmax
 
     Notes
     -----
-    buf should be in gpu shared memory, we access it many times.
+    `buf` should be in gpu shared memory, we access it many times.
 
     We use tx as an int variable in a loop.
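The entry/exit contract of these softmax helpers (two identical copies of
the input on entry; on exit one buffer holds the normalized softmax and
the other the un-normalized exponentials) can be sketched in plain
Python. This is a hedged illustration of the contract only, not the
generated CUDA code:

```python
import math

def softmax_contract(buf):
    # buf and buf2 start as two identical copies of the input.
    buf2 = list(buf)
    m = max(buf2)                            # subtract max for stability
    buf2 = [math.exp(v - m) for v in buf2]   # un-normalized softmax
    s = sum(buf2)
    buf = [v / s for v in buf2]              # normalized softmax
    return buf, buf2

probs, unnorm = softmax_contract([1.0, 2.0, 3.0])
# probs sums to 1.0; unnorm holds exp(x - max(x))
```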
theano/sandbox/gpuarray/opt_util.py

@@ -22,7 +22,7 @@ def grab_cpu_scalar(v, nd):
     Parameters
     ----------
-    v : variable
+    v
         Theano variable to extract the constant value from.
     nd : int
         Expected number of dimensions for the variable (for
@@ -55,7 +55,7 @@ def find_node(v, cls, ignore_clients=False):
     Parameters
     ----------
-    v : variable
+    v
         The variable to dig through
     cls : Op class
         The type of the node we are looking for
@@ -84,9 +84,9 @@ def is_equal(var, val):
     Parameters
     ----------
-    var : variable
+    var
         Variable to compare
-    val : value
+    val
         Python value
     """
@@ -101,11 +101,11 @@ def alpha_merge(cls, alpha_in, beta_in):
     """
     Decorator to merge multiplication by a scalar on the output.
 
-    This will find a pattern of scal * <yourop>(some, params, alpha,
-    beta) and update it so that the scalar multiplication happens as
+    This will find a pattern of `scal * <yourop>(some, params, alpha,
+    beta)` and update it so that the scalar multiplication happens as
     part of your op.
 
-    The op needs to accept an alpha and a beta scalar which act this way:
+    The op needs to accept an alpha and a beta scalar which act this way::
 
         out = Op() * alpha + out_like * beta
@@ -113,7 +113,7 @@ def alpha_merge(cls, alpha_in, beta_in):
     and gets added to the "real" output of the operation. An example
     of an operation that respects this pattern is GEMM from blas.
 
-    The decorated function must have this signature:
+    The decorated function must have this signature::
 
         maker(node, *inputs)
@@ -122,7 +122,7 @@ def alpha_merge(cls, alpha_in, beta_in):
     for your op so that the new version performs the same computation.
     The `*inputs` parameters contains the new inputs for your op. You
     MUST use those inputs instead of the ones on `node`. Note that
-    this function can be as simple as:
+    this function can be as simple as::
 
         def maker(node, *inputs):
             return node.op(*inputs)
@@ -138,8 +138,9 @@ def alpha_merge(cls, alpha_in, beta_in):
     Returns
     -------
-    This returns an unregistered local optimizer that has the same
-    name as the decorated function.
+    local optimizer
+        an unregistered local optimizer that has the same name as the
+        decorated function.
 
     Notes
     -----
@@ -191,11 +192,11 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     """
     Decorator to merge addition by a value on the output.
 
-    This will find a pattern of val * <yourop>(some, params, alpha,
-    beta, out_like) and update it so that the addition happens as
+    This will find a pattern of `val * <yourop>(some, params, alpha,
+    beta, out_like)` and update it so that the addition happens as
     part of your op.
 
-    The op needs to accept an alpha and a beta scalar which act this way:
+    The op needs to accept an alpha and a beta scalar which act this way::
 
         out = Op() * alpha + out_like * beta
@@ -203,7 +204,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     and gets added to the "real" output of the operation. An example
     of an operation that respects this pattern is GEMM from blas.
 
-    The decorated function must have this signature:
+    The decorated function must have this signature::
 
         maker(node, *inputs)
@@ -212,7 +213,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     for your op so that the new version performs the same computation.
     The `*inputs` parameters contains the new inputs for your op. You
     MUST use those inputs instead of the ones on `node`. Note that
-    this function can be as simple as:
+    this function can be as simple as::
 
         def maker(node, *inputs):
             return node.op(*inputs)
@@ -230,8 +231,9 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     Returns
     -------
-    This returns an unregistered local optimizer that has the same
-    name as the decorated function.
+    local optimizer
+        an unregistered local optimizer that has the same name as the
+        decorated function.
 
     Notes
     -----
@@ -281,7 +283,7 @@ def inplace_allocempty(op, idx):
     This will duplicate the alloc input if it has more than one client
     to allow the op to work on it inplace.
 
-    The decorated function must have this signature:
+    The decorated function must have this signature::
 
         maker(node, inputs)
@@ -291,7 +293,7 @@ def inplace_allocempty(op, idx):
     You should also switch the op to work inplace. The `*inputs`
     parameters contains the new inputs for your op. You MUST use
     those inputs instead of the ones on `node`. Note that this
-    function can be as simple as:
+    function can be as simple as::
 
         def maker(node, inputs):
             return [node.op.__class__(inplace=True)(*inputs)]
@@ -305,8 +307,9 @@ def inplace_allocempty(op, idx):
     Returns
     -------
-    This returns an unregistered inplace local optimizer that has the
-    same name as the decorated function.
+    local optimizer
+        an unregistered inplace local optimizer that has the same name
+        as the decorated function.
 
     """
     def wrapper(maker):
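The `out = Op() * alpha + out_like * beta` contract shared by
`alpha_merge`, `output_merge` and GEMM-style ops can be illustrated
numerically. This is a hedged sketch in plain Python, not Theano code;
the helper name is hypothetical:

```python
# Elementwise illustration of the contract: the op's "real" result is
# scaled by alpha, and beta times an existing output buffer is added.
def apply_alpha_beta(op_result, out_like, alpha, beta):
    return [r * alpha + o * beta for r, o in zip(op_result, out_like)]

out = apply_alpha_beta([1.0, 2.0], [10.0, 20.0], alpha=2.0, beta=0.5)
print(out)  # [7.0, 14.0]
```

With `beta=0` the existing output is ignored, which is why ops following
this pattern can fuse a plain scalar multiplication into `alpha`.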