Commit 30617ff5 authored by Arnaud Bergeron

Rest of libdoc for gpuarray.

Parent 3bf6f4cb
.. _libdoc_gpuarray_dnn:

===========================================
:mod:`theano.sandbox.gpuarray.dnn` -- cuDNN
===========================================

.. moduleauthor:: LISA

`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library
with functionality used by deep neural networks. It provides optimized
versions of some operations, like the convolution. cuDNN is not
currently installed with CUDA. You must download and install it
yourself.
To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:

- The easiest is to include them in your CUDA installation. Copy the
  ``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
  ``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
  on Linux).

- Alternatively, on Linux, you can set the environment variables
  ``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
  extracted from the download. If needed, separate multiple directories
  with ``:``, as in the ``PATH`` environment variable.

  Example::

      export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
      export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
      export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH

- As a third option, also on Linux, you can copy the ``*.h`` files
  to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.
By default, Theano will detect if it can use cuDNN. If so, it will use
it. If not, Theano optimizations will not introduce cuDNN ops, so
Theano will still work if the user did not introduce them manually.
To get an error if Theano cannot use cuDNN, use this Theano flag:
``optimizer_including=cudnn``.
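If you prefer not to pass flags on the command line, the same flag can go in your ``.theanorc`` file. A minimal sketch, assuming the standard mapping of flag names to config-file sections:

```ini
[global]
# Fail with an error instead of silently falling back
# when cuDNN cannot be used.
optimizer_including = cudnn
```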
.. note::

   cuDNN v3 has now been released. cuDNN v2 remains supported, but
   cuDNN v3 is faster and offers many more options, so we recommend
   that everybody update to v3.
.. note::

   Starting in cuDNN v3, multiple convolution implementations are
   offered, and it is possible to use heuristics to automatically
   choose a convolution implementation well suited to the parameters
   of the convolution.

   The Theano flag ``dnn.conv.algo_fwd`` allows specifying which cuDNN
   convolution implementation Theano should use for forward
   convolutions. Possible values include:

   * ``small`` (default): use a convolution implementation with small
     memory usage.
   * ``none``: use a slower implementation with minimal memory usage.
   * ``large``: use a sometimes faster implementation with large
     memory usage.
   * ``fft``: use the Fast Fourier Transform implementation of
     convolution (very high memory usage).
   * ``guess_once``: the first time a convolution is executed, the
     implementation to use is chosen according to cuDNN's heuristics
     and reused for every subsequent execution of the convolution.
   * ``guess_on_shape_change``: like ``guess_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.
   * ``time_once``: the first time a convolution is executed, every
     convolution implementation offered by cuDNN is executed and
     timed. The fastest is reused for every subsequent execution of
     the convolution.
   * ``time_on_shape_change``: like ``time_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.

   The Theano flag ``dnn.conv.algo_bwd`` allows specifying which cuDNN
   convolution implementation Theano should use for gradient
   convolutions. Possible values include:

   * ``none`` (default): use the default non-deterministic convolution
     implementation.
   * ``deterministic``: use a slower but deterministic implementation.
   * ``fft``: use the Fast Fourier Transform implementation of
     convolution (very high memory usage).
   * ``guess_once``: the first time a convolution is executed, the
     implementation to use is chosen according to cuDNN's heuristics
     and reused for every subsequent execution of the convolution.
   * ``guess_on_shape_change``: like ``guess_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.
   * ``time_once``: the first time a convolution is executed, every
     convolution implementation offered by cuDNN is executed and
     timed. The fastest is reused for every subsequent execution of
     the convolution.
   * ``time_on_shape_change``: like ``time_once``, but a new
     convolution implementation is selected every time the shapes of
     the inputs and kernels do not match the shapes from the last
     execution.

   The ``guess_*`` and ``time_*`` values take the amount of available
   memory into account when selecting an implementation, so slower
   implementations may be selected when not enough memory is available
   for the faster ones.
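To make the ``guess_once`` / ``guess_on_shape_change`` behaviour concrete, here is a minimal pure-Python sketch of the caching policy. This is not Theano or cuDNN code; ``pick_algo`` is a made-up stand-in for cuDNN's heuristic:

```python
class ConvAlgoCache:
    """Sketch of the guess_once / guess_on_shape_change policies."""

    def __init__(self, policy="guess_once"):
        self.policy = policy
        self.algo = None        # implementation chosen so far
        self.shapes = None      # shapes seen when it was chosen

    def pick_algo(self, img_shape, kern_shape):
        # Stand-in for cuDNN's heuristic: prefer FFT for large kernels.
        return "fft" if kern_shape[-1] >= 7 else "small"

    def get(self, img_shape, kern_shape):
        shapes = (img_shape, kern_shape)
        if self.algo is None or (
                self.policy == "guess_on_shape_change"
                and shapes != self.shapes):
            # Guess (again) only on first use, or on a shape change
            # when the policy asks for it.
            self.algo = self.pick_algo(img_shape, kern_shape)
            self.shapes = shapes
        return self.algo


once = ConvAlgoCache("guess_once")
once.get((8, 3, 32, 32), (16, 3, 3, 3))
print(once.get((8, 3, 64, 64), (16, 3, 9, 9)))  # still "small": never re-guessed

on_change = ConvAlgoCache("guess_on_shape_change")
on_change.get((8, 3, 32, 32), (16, 3, 3, 3))
print(on_change.get((8, 3, 64, 64), (16, 3, 9, 9)))  # re-guessed: "fft"
```

The ``time_*`` policies follow the same caching structure, only replacing the heuristic with actual timing of every implementation.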
.. note::

   Normally you should not call GPU Ops directly, but the CPU
   interface currently does not allow all options supported by cuDNN
   ops, so you may need to call them manually.
.. note::

   The cuDNN documentation states that reproducibility is not
   guaranteed with the default implementation for the following two
   operations: `cudnnConvolutionBackwardFilter` and
   `cudnnConvolutionBackwardData`. Those correspond to the gradient
   with respect to the weights and the gradient with respect to the
   input of the convolution. They are also sometimes used in the
   forward pass, when they give a speed up.

   The Theano flag ``dnn.conv.algo_bwd`` can be used to force the use
   of a slower but deterministic convolution implementation.
.. note::

   There is a problem, not yet understood, when the cuDNN paths
   contain symbolic links, so avoid using symbolic links in those
   paths.
.. note::

   ``cudnn.so*`` must be readable and executable by everybody.
   ``cudnn.h`` must be readable by everybody.
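As a sketch, the required permissions can be set with ``chmod``. The paths below are placeholders demonstrated on temporary files; substitute the real locations of your cuDNN files:

```shell
# Demonstrated on placeholder files in a temp directory;
# point these at the actual cuDNN files on your system.
dir=$(mktemp -d)
touch "$dir/libcudnn.so" "$dir/cudnn.h"
chmod a+rx "$dir"/libcudnn.so*   # cudnn.so*: readable and executable by everybody
chmod a+r  "$dir/cudnn.h"        # cudnn.h: readable by everybody
ls -l "$dir"
```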
Functions
=========

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: dnn_conv, dnn_pool


Convolution Ops
===============

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: GpuDnnConvDesc, GpuDnnConv, GpuDnnConvGradW, GpuDnnConvGradI


Pooling Ops
===========

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: GpuDnnPoolDesc, GpuDnnPool, GpuDnnPoolGrad


Softmax Ops
===========

.. automodule:: theano.sandbox.gpuarray.dnn
   :noindex:
   :members: GpuDnnSoftmax, GpuDnnSoftmaxGrad


.. _libdoc_gpuarray_extra:

=================
Utility functions
=================

Optimisation
------------

.. automodule:: theano.sandbox.gpuarray.opt_util
   :members:


Kernel generation
-----------------

.. automodule:: theano.sandbox.gpuarray.kernel_codegen
   :members:
@@ -14,4 +14,6 @@
   :maxdepth: 1

   op
   dnn
   type
   extra
@@ -71,17 +71,19 @@ def inline_reduce(N, buf, pos, count, manner_fn):
    count
        Number of executing threads.
    manner_fn
        A function that accepts strings of arguments a and b, and
        returns c code for their reduction.

            return "%(a)s + %(b)s"

        for a sum reduction.

    Notes
    -----
    `buf` should be in gpu shared memory, we access it many times.

    This function leaves the answer in position 0 of the buffer. The
    rest of the buffer is trashed by this function.

    """
    loop_line = manner_fn("%s[%s]" % (buf, pos), "%s[i]" % (buf))
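The C code generated by `inline_reduce` performs a strided tree reduction across threads, leaving the result in position 0 and trashing the rest of the buffer. A pure-Python sketch of the same access pattern (not the generated CUDA, and assuming a power-of-two buffer length) is:

```python
def tree_reduce(buf, manner_fn):
    """Reduce buf into buf[0], trashing the rest of the buffer.

    Assumes len(buf) is a power of two, like the shared-memory
    buffers the generated kernel works on.
    """
    n = len(buf)
    stride = 1
    while stride < n:
        # Each "thread" at position pos combines its slot with the
        # slot stride elements away, halving the active slots per pass.
        for pos in range(0, n - stride, 2 * stride):
            buf[pos] = manner_fn(buf[pos], buf[pos + stride])
        stride *= 2
    return buf[0]


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(tree_reduce(data, lambda a, b: a + b))  # 36
```

Passing `max` instead of the sum lambda gives a max-reduction, matching how `manner_fn` parametrizes the generated code.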
@@ -149,6 +151,13 @@ def inline_reduce_prod(N, buf, pos, count):
                                 inline_reduce_sum.code_version)


def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
    """
    Generate code for a softmax.

    On entry, `buf` and `buf2` must contain two identical copies of
    the input to softmax.

    After the code returns `buf` contains the softmax, `buf2` contains
    un-normalized softmax.

    Parameters
    ----------
@@ -161,14 +170,10 @@ def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
    dtype
        Dtype of the softmax's output.

    Notes
    -----
    `buf` and `buf2` should be in gpu shared memory, we access it many
    times.

    We use __i as an int variable in a loop.
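The contract above (on exit `buf` holds the softmax and `buf2` the un-normalized values) can be mimicked in plain Python. This sketch uses the usual max-subtraction for numerical stability, which is an assumption about what the generated kernel does, not a transcription of it:

```python
import math

def softmax_buffers(buf, buf2):
    """buf and buf2 start as identical copies of the input.

    On exit: buf2 holds exp(x - max(x)) (un-normalized softmax),
    buf holds the normalized softmax.
    """
    m = max(buf)                      # first reduction: the maximum
    for i in range(len(buf)):
        buf2[i] = math.exp(buf2[i] - m)
    s = sum(buf2)                     # second reduction: the normalizer
    for i in range(len(buf)):
        buf[i] = buf2[i] / s
    return buf, buf2


x = [1.0, 2.0, 3.0]
p, u = softmax_buffers(list(x), list(x))
print(round(sum(p), 6))  # 1.0
```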
@@ -205,6 +210,9 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
    """
    Return C++ code for a function that reduces a contiguous buffer.

    This function leaves the answer in position 0 of the buffer. The
    rest of the buffer is trashed by this function.

    Parameters
    ----------
    N
@@ -230,20 +238,19 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
    dtype
        Optional, the dtype of the output.
    manner_fn
        A function that accepts strings of arguments a and b, and
        returns c code for their reduction.

            return "%(a)s + %(b)s"

        for a sum reduction.

    manner_init
        A function that accepts strings of arguments a and return c
        code for its initialization.

    Notes
    -----
    `buf` should be in gpu shared memory, we access it many times.

    """
    if b:
@@ -320,6 +327,10 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
                                b='', stride_b='', load_b='',
                                dtype="float32"):
    """
    Generate code to perform softmax with a fixed amount of shared
    memory.

    On entry, `buf` is assumed to be empty.

    Parameters
    ----------
@@ -352,13 +363,9 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
    dtype
        Optional, the dtype of the softmax's output if not float32.

    Notes
    -----
    `buf` should be in gpu shared memory, we access it many times.

    We use tx as an int variable in a loop.
@@ -22,7 +22,7 @@ def grab_cpu_scalar(v, nd):
    Parameters
    ----------
    v
        Theano variable to extract the constant value from.
    nd : int
        Expected number of dimensions for the variable (for
@@ -55,7 +55,7 @@ def find_node(v, cls, ignore_clients=False):
    Parameters
    ----------
    v
        The variable to dig through
    cls : Op class
        The type of the node we are looking for
@@ -84,9 +84,9 @@ def is_equal(var, val):
    Parameters
    ----------
    var
        Variable to compare
    val
        Python value
    """
@@ -101,11 +101,11 @@ def alpha_merge(cls, alpha_in, beta_in):
    """
    Decorator to merge multiplication by a scalar on the output.

    This will find a pattern of `scal * <yourop>(some, params, alpha,
    beta)` and update it so that the scalar multiplication happens as
    part of your op.

    The op needs to accept an alpha and a beta scalar which act this way::

        out = Op() * alpha + out_like * beta

@@ -113,7 +113,7 @@ def alpha_merge(cls, alpha_in, beta_in):
    and gets added to the "real" output of the operation. An example
    of an operation that respects this pattern is GEMM from blas.

    The decorated function must have this signature::

        maker(node, *inputs)

@@ -122,7 +122,7 @@ def alpha_merge(cls, alpha_in, beta_in):
    for your op so that the new version performs the same computation.
    The `*inputs` parameters contains the new inputs for your op. You
    MUST use those inputs instead of the ones on `node`. Note that
    this function can be as simple as::

        def maker(node, *inputs):
            return node.op(*inputs)

@@ -138,8 +138,9 @@ def alpha_merge(cls, alpha_in, beta_in):
    Returns
    -------
    local optimizer
        an unregistered local optimizer that has the same name as the
        decorated function.

    Notes
    -----
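The alpha/beta contract that `alpha_merge` relies on (``out = Op() * alpha + out_like * beta``) can be illustrated with a toy op in plain Python. This is not a Theano op, just the numeric convention; ``toy_gemm_like`` and its ``x * x`` core are made up for the example:

```python
def toy_gemm_like(x, alpha=1.0, beta=0.0, out_like=0.0):
    """Toy op obeying the GEMM-style convention:
    result = core(x) * alpha + out_like * beta.
    """
    core = x * x          # stand-in for the op's real computation
    return core * alpha + out_like * beta


# A pattern like `scal * op(x)` ...
scal = 2.0
unmerged = scal * toy_gemm_like(3.0)        # 2 * 9 = 18
# ... can be merged into the op by folding scal into alpha,
# which is exactly the rewrite the decorator performs.
merged = toy_gemm_like(3.0, alpha=scal)     # also 18
print(unmerged == merged)  # True
```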
@@ -191,11 +192,11 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    """
    Decorator to merge addition by a value on the output.

    This will find a pattern of `val * <yourop>(some, params, alpha,
    beta, out_like)` and update it so that the addition happens as
    part of your op.

    The op needs to accept an alpha and a beta scalar which act this way::

        out = Op() * alpha + out_like * beta

@@ -203,7 +204,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    and gets added to the "real" output of the operation. An example
    of an operation that respects this pattern is GEMM from blas.

    The decorated function must have this signature::

        maker(node, *inputs)

@@ -212,7 +213,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    for your op so that the new version performs the same computation.
    The `*inputs` parameters contains the new inputs for your op. You
    MUST use those inputs instead of the ones on `node`. Note that
    this function can be as simple as::

        def maker(node, *inputs):
            return node.op(*inputs)

@@ -230,8 +231,9 @@ def output_merge(cls, alpha_in, beta_in, out_in):
    Returns
    -------
    local optimizer
        an unregistered local optimizer that has the same name as the
        decorated function.

    Notes
    -----
@@ -281,7 +283,7 @@ def inplace_allocempty(op, idx):
    This will duplicate the alloc input if it has more than one client
    to allow the op to work on it inplace.

    The decorated function must have this signature::

        maker(node, inputs)

@@ -291,7 +293,7 @@ def inplace_allocempty(op, idx):
    You should also switch the op to work inplace. The `*inputs`
    parameters contains the new inputs for your op. You MUST use
    those inputs instead of the ones on `node`. Note that this
    function can be as simple as::

        def maker(node, inputs):
            return [node.op.__class__(inplace=True)(*inputs)]

@@ -305,8 +307,9 @@ def inplace_allocempty(op, idx):
    Returns
    -------
    local optimizer
        an unregistered inplace local optimizer that has the same name
        as the decorated function.

    """
    def wrapper(maker):