Commit 54e96754 authored by Pascal Lamblin

Merge pull request #3559 from abergeron/multi_gpu_doc

Multi gpu doc
......@@ -132,7 +132,7 @@ Roughly in order of what you'll want to check out:
* :ref:`extending` -- Learn to add a Type, Op, or graph optimization.
* :ref:`dev_start_guide` -- How to contribute code to Theano.
* :ref:`developer` -- Primarily of interest to developers of Theano
* :ref:`internal` -- How to maintain Theano, LISA-specific tips, and more...
* :ref:`internal` -- How to maintain Theano and more...
* :ref:`release` -- How our release should work.
* :ref:`acknowledgement` -- What we took from other projects.
* `Related Projects`_ -- link to other projects that implement new functionalities on top of Theano
......
......@@ -5,16 +5,11 @@
Internal Documentation
======================
If you're feeling ambitious, go fix some `pylint
<http://lgcm.iro.umontreal.ca/auto_theano_pylint/pylint_global.html>`_ errors!
.. toctree::
:maxdepth: 2
release
dev_start_guide
lisa_labo
mammouth
metadocumentation
python
how_to_release
.. _lisa_labo:
===============================
LISA Labo specific instructions
===============================
Tips for running at LISA
------------------------
Shell configuration files ``/opt/lisa/os/.local.{bash,csh}rc`` should define
:envvar:`THEANORC` to include ``/opt/lisa/os/.local.theanorc`` as a
configuration file.
``/opt/lisa/os/.local.theanorc`` should include the right default values for
the lab, in particular, ``blas.ldflags`` should contain '-lgoto'.
Tips for running on a cluster
-----------------------------
:ref:`mammouth`
For instructions on running Theano on the mammouth cluster.
.. _mammouth:
===========================
Running Theano on Mammouth
===========================
To run Theano on the Mammouth cluster, follow these simple steps:
* Make sure to source Fred's .local.bashrc file. It contains all
the goodies for using the latest and greatest (optimized) libraries
(numpy, scipy, etc.)
.. code-block:: sh
source /home/bastienf/.local.bashrc
Perhaps even put this in your ``.bashrc``
* set ``config.blas.ldflags`` to ``'-lmkl -lguide -fopenmp'``
(see :mod:`config` to know how)
Note: the -lguide flag works; however, the fix should probably be considered temporary.
Intel has deprecated libguide.so in favor of the newer library libiomp5.so. However,
the two libraries are mutually exclusive, and one component (theano, numpy or scipy?) already
seems to be using libguide.so (hence -liomp5 causes a linking error when compiling thunks)
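A ``.theanorc`` fragment matching the recommendation above might look like this (a sketch only; the exact flags depend on your installation):

```ini
[blas]
ldflags = -lmkl -lguide -fopenmp
```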
......@@ -110,9 +110,6 @@ pylint output is not autogenerated anymore.
Pylint documentation is generated using pylintrc file: ``Theano/doc/pylintrc``
You can see a list of all `pylint messages
<http://www.logilab.org/card/pylintfeatures>`__.
.. _metadocumentation_nightly_build:
......
.. _libdoc_gpuarray_dnn:
===========================================
:mod:`theano.sandbox.gpuarray.dnn` -- cuDNN
===========================================
.. moduleauthor:: LISA
`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library
with functionality used by deep neural networks. It provides optimized
versions of some operations like the convolution. cuDNN is not
currently installed with CUDA. You must download and install it
yourself.
To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:
- The easiest is to include them in your CUDA installation. Copy the
``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
on Linux).
- Alternatively, on Linux, you can set the environment variables
``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
extracted from the download. If needed, separate multiple directories
with ``:`` as in the ``PATH`` environment variable.
Example::
export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH
- And as a third way, also on Linux, you can copy the ``*.h`` files
to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.
By default, Theano will detect if it can use cuDNN. If so, it will use
it. If not, Theano optimizations will not introduce cuDNN ops. So
Theano will still work if the user did not introduce them manually.
To get an error if Theano cannot use cuDNN, use this Theano flag:
``optimizer_including=cudnn``.
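For example, the flag can be set on the command line when launching a script (``my_script.py`` is a placeholder name):

```shell
# Make Theano raise an error instead of silently skipping cuDNN ops.
THEANO_FLAGS="optimizer_including=cudnn" python my_script.py
```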
.. note::
CuDNN v3 has now been released. CuDNN v2 remains supported but CuDNN v3 is
faster and offers many more options. We recommend that everybody update to
v3.
.. note::
Starting in CuDNN v3, multiple convolution implementations are offered and
it is possible to use heuristics to automatically choose a convolution
implementation well suited to the parameters of the convolution.
The Theano flag ``dnn.conv.algo_fwd`` allows you to specify the CuDNN
convolution implementation that Theano should use for forward convolutions.
Possible values include:
* ``small`` (default) : use a convolution implementation with small memory
usage
* ``none`` : use a slower implementation with minimal memory usage
* ``large`` : use a sometimes faster implementation with large memory usage
* ``fft`` : use the Fast Fourier Transform implementation of convolution
(very high memory usage)
* ``guess_once`` : the first time a convolution is executed, the
implementation to use is chosen according to CuDNN's heuristics and reused
for every subsequent execution of the convolution.
* ``guess_on_shape_change`` : like ``guess_once`` but a new convolution
  implementation is selected every time the shapes of the inputs and kernels
  don't match the shapes from the last execution.
* ``time_once`` : the first time a convolution is executed, every convolution
implementation offered by CuDNN is executed and timed. The fastest is
reused for every subsequent execution of the convolution.
* ``time_on_shape_change`` : like ``time_once`` but a new convolution
  implementation is selected every time the shapes of the inputs and kernels
  don't match the shapes from the last execution.
The Theano flag ``dnn.conv.algo_bwd`` allows you to specify the CuDNN
convolution implementation that Theano should use for gradient convolutions.
Possible values include:
* ``none`` (default) : use the default non-deterministic convolution
implementation
* ``deterministic`` : use a slower but deterministic implementation
* ``fft`` : use the Fast Fourier Transform implementation of convolution
(very high memory usage)
* ``guess_once`` : the first time a convolution is executed, the
implementation to use is chosen according to CuDNN's heuristics and reused
for every subsequent execution of the convolution.
* ``guess_on_shape_change`` : like ``guess_once`` but a new convolution
  implementation is selected every time the shapes of the inputs and kernels
  don't match the shapes from the last execution.
* ``time_once`` : the first time a convolution is executed, every convolution
implementation offered by CuDNN is executed and timed. The fastest is
reused for every subsequent execution of the convolution.
* ``time_on_shape_change`` : like ``time_once`` but a new convolution
  implementation is selected every time the shapes of the inputs and kernels
  don't match the shapes from the last execution.
``guess_*`` and ``time_*`` flag values take into account the amount of
available memory when selecting an implementation. This means that slower
implementations might be selected if not enough memory is available for the
faster implementations.
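As an illustration, both flags can be combined in a single ``THEANO_FLAGS`` setting (``train.py`` is a placeholder script name; the flag values are taken from the lists above):

```shell
# Time the forward algorithms once and reuse the fastest one;
# use the deterministic implementation for the gradients.
THEANO_FLAGS="dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd=deterministic" python train.py
```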
.. note::
Normally you should not call GPU Ops directly, but the CPU interface
currently does not allow all options supported by cuDNN ops. So it is
possible that you will need to call them manually.
.. note::
The cuDNN documentation states that, for the two following operations,
reproducibility is not guaranteed with the default implementation:
`cudnnConvolutionBackwardFilter` and `cudnnConvolutionBackwardData`.
Those correspond to the gradient wrt the weights and the gradient wrt the
input of the convolution. They are also sometimes used in the forward
pass, when they give a speed up.
The Theano flag ``dnn.conv.algo_bwd`` can be used to force the use of a
slower but deterministic convolution implementation.
.. note::
There is a problem we do not understand yet when cuDNN paths
contain symbolic links, so avoid using them.
.. note::
cudnn.so* must be readable and executable by everybody.
cudnn.h must be readable by everybody.
Functions
=========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: dnn_conv, dnn_pool
Convolution Ops
===============
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnConvDesc, GpuDnnConv, GpuDnnConvGradW, GpuDnnConvGradI
Pooling Ops
===========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnPoolDesc, GpuDnnPool, GpuDnnPoolGrad
Softmax Ops
===========
.. automodule:: theano.sandbox.gpuarray.dnn
:noindex:
:members: GpuDnnSoftmax, GpuDnnSoftmaxGrad
.. _libdoc_gpuarray_extra:
=================
Utility functions
=================
Optimisation
------------
.. automodule:: theano.sandbox.gpuarray.opt_util
:members:
Kernel generation
-----------------
.. automodule:: theano.sandbox.gpuarray.kernel_codegen
:members:
.. _libdoc_gpuarray:
=======================================================
:mod:`theano.sandbox.gpuarray` -- The (new) GPU backend
=======================================================
.. module:: theano.sandbox.gpuarray
:platform: Unix, Windows
:synopsis: Code for GPU programming (new)
.. moduleauthor:: MILA
.. toctree::
:maxdepth: 1
op
dnn
type
extra
.. _libdoc_gpuarray_op:
================================
List of gpuarray Ops implemented
================================
.. moduleauthor:: LISA
Normally you should not call these Ops directly! Theano should
automatically transform CPU ops to their GPU equivalents, so this list
is mainly useful to show what is implemented on the GPU.
Basic Op
========
.. automodule:: theano.sandbox.gpuarray.basic_ops
:members:
Blas Op
=======
.. automodule:: theano.sandbox.gpuarray.blas
:members:
.. automodule:: theano.sandbox.gpuarray.nerv
:members:
Elemwise Op
===========
.. automodule:: theano.sandbox.gpuarray.elemwise
:members:
Subtensor Op
============
.. automodule:: theano.sandbox.gpuarray.subtensor
:members:
Nnet Op
=======
.. automodule:: theano.sandbox.gpuarray.nnet
:members:
.. automodule:: theano.sandbox.gpuarray.neighbours
:members:
.. _libdoc_gpuarray_type:
===================================================
:mod:`theano.sandbox.gpuarray.type` -- Type classes
===================================================
.. automodule:: theano.sandbox.gpuarray.type
:members:
......@@ -14,6 +14,8 @@
:maxdepth: 1
cuda/index
gpuarray/index
linalg
neighbours
rng_mrg
blocksparse
......@@ -37,6 +37,7 @@ you out.
loop
sparse
using_gpu
using_multi_gpu
gpu_data_convert
aliasing
shape_info
......
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Theano has a feature to allow the use of multiple GPUs at the same
time in one function. The multi-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that it works correctly.
In order to keep a reasonably high level of abstraction you do not
refer to device names directly for multiple-gpu use. You instead
refer to what we call context names. These are then mapped to devices
using the Theano configuration. This allows portability of
models between machines.
.. warning::
The code is rather new and is still considered experimental at this
point. It has been tested and seems to perform correctly in all
cases observed, but make sure to double-check your results before
publishing a paper or anything of the sort.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::
dev0->cuda0;dev1->cuda1
Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Theano. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects, such as 'cuda0' or 'opencl0:0'.
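As an illustration, the parsing rules just described can be sketched in plain Python (this is a hypothetical re-implementation for clarity, not Theano's actual parsing code):

```python
def parse_contexts(spec):
    """Parse a context map string like 'dev0->cuda0;dev1->cuda1'.

    Hypothetical sketch of the rules described above; Theano's real
    parser may differ in its details.
    """
    mapping = {}
    for entry in spec.split(';'):
        # Each mapping is '<context name>-><device name>'.
        name, sep, device = entry.partition('->')
        if not sep or not name or not device:
            raise ValueError("malformed mapping: %r" % entry)
        # Context names starting with 'cuda' or 'opencl' are disallowed.
        if name.startswith(('cuda', 'opencl')):
            raise ValueError("bad context name: %r" % name)
        mapping[name] = device
    return mapping

print(parse_contexts('dev0->cuda0;dev1->cuda1'))
# {'dev0': 'cuda0', 'dev1': 'cuda1'}
```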
.. note::
Since there are a bunch of shell special characters in the syntax,
defining this on the command-line will require proper quoting, like this:
.. code-block:: shell
$ THEANO_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Theano will print the mappings as they are
defined. It will look like this:
.. code-block:: bash
$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X
Mapped name dev1 to device cuda1: GeForce GTX TITAN X
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea, since Theano often
assumes that different context names will be on different devices and
will optimize accordingly. So you may get faster performance with a
single name and a single device.
.. note::
It is often the case that multi-GPU operation requires or assumes
that all the GPUs involved are equivalent. This is not the case
for this implementation. Since the user has the task of
distributing the jobs across the different devices, a model can be
built on the assumption that one of the GPUs is slower or has
smaller memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
which performs two dot products on two different GPUs.
.. code-block:: python
import numpy
import theano
v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev0')
v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev0')
v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev1')
v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev1')
f = theano.function([], [theano.tensor.dot(v01, v02),
theano.tensor.dot(v11, v12)])
f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input which is not on the right device.
However, you may want some explicit control over where and how these
transfers are done at some points. This is done by using the new
:meth:`transfer` method that is present on variables. It works for
moving data between GPUs and also between the host and the GPUs. Here
is an example.
.. code-block:: python
import theano
v = theano.tensor.fmatrix()
# Move to the device associated with 'gpudev'
gv = v.transfer('gpudev')
# Move back to the cpu
cv = gv.transfer('cpu')
Of course you can mix transfers and operations in any order you
choose. However, you should try to minimize transfer operations
because they will introduce overhead and may reduce performance.
......@@ -12,7 +12,6 @@ import numpy
import theano
from theano.sandbox.gpuarray import init_dev
from theano.sandbox.gpuarray.type import gpuarray_shared_constructor as shared
from theano.sandbox.gpuarray.blas import gpu_dot22
......@@ -22,13 +21,13 @@ def main(dev1, dev2):
size = 1024 * 16
data = numpy.random.randn(size, size).astype('float32')
val1a = shared(data, target='ctx1')
val1b = shared(data, target='ctx1')
val1c = shared(data, target='ctx1')
val1d = shared(data, target='ctx1')
val1a = theano.shared(data, target='ctx1')
val1b = theano.shared(data, target='ctx1')
val1c = theano.shared(data, target='ctx1')
val1d = theano.shared(data, target='ctx1')
val2a = shared(data, target='ctx2')
val2b = shared(data, target='ctx2')
val2a = theano.shared(data, target='ctx2')
val2b = theano.shared(data, target='ctx2')
f1 = theano.function([], [gpu_dot22(val1a, val1b),
gpu_dot22(val1c, val1d)])
......
......@@ -27,6 +27,20 @@ from .fp16_help import write_w
def as_gpuarray_variable(x, context_name):
"""
This will attempt to convert `x` into a variable on the GPU.
It can take either a value or another variable. If `x` is already
suitable, it will be returned as-is.
Parameters
----------
x
Object to convert
context_name : str or None
target context name for the result
"""
# If this is already some form of variable, try to avoid an extra transfer
if isinstance(x, Variable):
while True:
......@@ -174,6 +188,13 @@ class Kernel(object):
class GpuKernelBase(object):
"""
Base class for operations that need to compile kernels.
It is not mandatory to use this class, but it helps with a lot of
the small things that you have to pay attention to.
"""
params_type = gpu_context_type
def gpu_kernels(self, node, name):
......@@ -274,10 +295,25 @@ class GpuKernelBase(object):
return (self.c_code_cache_version(), self.kernel_version(node))
def kernel_version(self, node):
"""
If you override :meth:`c_code_cache_version_apply`, call this
method to include the version of the kernel support code and
device.
Parameters
----------
node : apply node
The node that we need the cache version for.
"""
return (3, self.get_params(node).bin_id)
class HostFromGpu(Op):
"""
Transfer data to CPU.
"""
__props__ = ()
_f16_ok = True
......@@ -356,6 +392,10 @@ host_from_gpu = HostFromGpu()
class GpuFromHost(Op):
"""
Transfer data to GPU.
"""
__props__ = ('context_name',)
_f16_ok = True
params_type = gpu_context_type
......@@ -443,6 +483,10 @@ class GpuFromHost(Op):
class GpuToGpu(Op):
"""
Transfer data between GPUs.
"""
__props__ = ('context_name',)
_f16_ok = True
params_type = gpu_context_type
......@@ -494,6 +538,7 @@ class GpuToGpu(Op):
class GpuAlloc(HideC, Alloc):
"""
Allocate initialized memory on the GPU.
Parameters
----------
......@@ -654,6 +699,10 @@ class GpuAlloc(HideC, Alloc):
class GpuAllocEmpty(HideC, Alloc):
"""
Allocate uninitialized memory on the GPU.
"""
__props__ = ('dtype', 'context_name')
_f16_ok = True
params_type = gpu_context_type
......@@ -732,8 +781,10 @@ def empty_like(var):
class GpuContiguous(Op):
"""
Always return a c contiguous output. Copy the input only if it is
not already c contiguous.
Return a C contiguous version of the input.
This may either pass the object as-is (if already C contiguous) or
make a copy.
"""
__props__ = ()
......@@ -793,7 +844,7 @@ gpu_contiguous = GpuContiguous()
class GpuReshape(HideC, tensor.Reshape):
"""
Implement Reshape on the gpu.
Reshape for GPU variables.
"""
......@@ -914,6 +965,10 @@ class GpuReshape(HideC, tensor.Reshape):
class GpuJoin(HideC, Join):
"""
Join for GPU.
"""
_f16_ok = True
params_type = gpu_context_type
......@@ -991,6 +1046,10 @@ gpu_join = GpuJoin()
class GpuSplit(HideC, Split):
"""
Split for GPU.
"""
def make_node(self, x, axis, splits):
node = Split.make_node(self, x, axis, splits)
x = as_gpuarray_variable(x, infer_context_name(x))
......@@ -1002,6 +1061,10 @@ class GpuSplit(HideC, Split):
class GpuEye(GpuKernelBase, Op):
"""
Eye for GPU.
"""
__props__ = ('dtype', 'context_name')
_f16_ok = True
......
......@@ -31,6 +31,10 @@ class BlasOp(Op):
class GpuGemv(BlasOp):
"""
Gemv on the GPU.
"""
__props__ = ('inplace',)
def __init__(self, inplace=False):
......@@ -107,6 +111,10 @@ gpugemv_inplace = GpuGemv(inplace=True)
class GpuGemm(BlasOp):
"""
Gemm on the GPU.
"""
__props__ = ('inplace',)
_f16_ok = True
......@@ -184,6 +192,10 @@ gpugemm_inplace = GpuGemm(inplace=True)
class GpuGer(BlasOp):
"""
Ger on the GPU.
"""
__props__ = ('inplace',)
def __init__(self, inplace=False):
......@@ -256,6 +268,10 @@ gpuger_inplace = GpuGer(inplace=True)
class GpuDot22(BlasOp):
"""
Dot22 on the GPU.
"""
__props__ = ()
def make_node(self, x, y):
......
......@@ -57,6 +57,10 @@ def as_C_string_const(s):
class GpuElemwise(GpuKernelBase, HideC, Elemwise):
"""
Elemwise on the GPU.
"""
nin = property(lambda self: self.scalar_op.nin)
nout = property(lambda self: self.scalar_op.nout)
_f16_ok = True
......@@ -445,6 +449,10 @@ class SupportCodeError(Exception):
class GpuDimShuffle(HideC, DimShuffle):
"""
DimShuffle on the GPU.
"""
_f16_ok = True
def make_node(self, input):
......@@ -548,7 +556,7 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
Parameters
----------
reduce-mask
reduce_mask
The dimensions along which to reduce. The `reduce_mask` is a tuple of
booleans (actually integers 0 or 1) that specify for each input
dimension, whether to reduce it (1) or not (0).
......@@ -1279,14 +1287,6 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
""" % locals()
def c_code_reduce_ccontig(self, sio, node, name, x, z, fail):
"""
WRITEME
IG: I believe, based on how this is called in c_code, that it
is for the case where we are reducing on all axes and x is
C contiguous.
"""
in_dtype = "npy_" + node.inputs[0].dtype
out_dtype = "npy_" + node.outputs[0].dtype
if getattr(self.scalar_op, 'identity', None) == 0:
......@@ -2666,8 +2666,6 @@ class GpuCAReduceCPY(GpuKernelBase, HideC, CAReduceDtype):
"""
CAReduce that reuses the python code from gpuarray.
Too slow for now as it only have a python interface.
"""
def __init__(self, scalar_op, axis=None, dtype=None, acc_dtype=None):
if not hasattr(scalar_op, 'identity'):
......
......@@ -71,17 +71,19 @@ def inline_reduce(N, buf, pos, count, manner_fn):
count
Number of executing threads.
manner_fn
A function that accepts strings of arguments a and b, and returns c code
for their reduction.
Example: return "%(a)s + %(b)s" for a sum reduction.
A function that accepts strings of arguments a and b, and
returns c code for their reduction.
:postcondition:
This function leaves the answer in position 0 of the buffer. The
rest of the buffer is trashed by this function.
return "%(a)s + %(b)s"
for a sum reduction.
Notes
-----
buf should be in gpu shared memory, we access it many times.
`buf` should be in gpu shared memory, we access it many times.
This function leaves the answer in position 0 of the buffer. The
rest of the buffer is trashed by this function.
"""
loop_line = manner_fn("%s[%s]" % (buf, pos), "%s[i]" % (buf))
......@@ -149,6 +151,13 @@ def inline_reduce_prod(N, buf, pos, count):
inline_reduce_sum.code_version)
def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
"""
Generate code for a softmax.
On entry, `buf` and `buf2` must contain two identical copies of
the input to softmax.
After the code returns, `buf` contains the softmax and `buf2`
contains the un-normalized softmax.
Parameters
----------
......@@ -161,14 +170,10 @@ def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
dtype
Dtype of the softmax's output.
:Precondition: buf and buf2 contain two identical copies of the input
to softmax
:Postcondition: buf contains the softmax, buf2 contains un-normalized
softmax
Notes
-----
buf and buf2 should be in gpu shared memory, we access it many times.
`buf` and `buf2` should be in gpu shared memory, we access it many
times.
We use __i as an int variable in a loop.
......@@ -205,6 +210,9 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
"""
Return C++ code for a function that reduces a contiguous buffer.
This function leaves the answer in position 0 of the buffer. The
rest of the buffer is trashed by this function.
Parameters
----------
N
......@@ -230,20 +238,19 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
dtype
Optional, the dtype of the output.
manner_fn
A function that accepts strings of arguments a and b, and returns c code
for their reduction.
Example: return "%(a)s + %(b)s" for a sum reduction.
manner_init
A function that accepts strings of arguments a and return c code for its
initialization.
A function that accepts strings of arguments a and b, and
returns c code for their reduction.
:postcondition:
This function leaves the answer in position 0 of the buffer. The rest of the
buffer is trashed by this function.
return "%(a)s + %(b)s"
for a sum reduction.
manner_init
A function that accepts strings of arguments a and return c
code for its initialization.
Notes
-----
buf should be in gpu shared memory, we access it many times.
`buf` should be in gpu shared memory, we access it many times.
"""
if b:
......@@ -320,6 +327,13 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
b='', stride_b='', load_b='',
dtype="float32"):
"""
Generate code to perform softmax with a fixed amount of shared
memory.
On entry, `buf` is assumed to be empty.
On exit, `buf[0]` contains the softmax and `buf2` contains the
un-normalized softmax.
Parameters
----------
......@@ -352,13 +366,9 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
dtype
Optional, the dtype of the softmax's output if not float32.
:Precondition: buf is empty
:Postcondition: buf[0] contains the softmax, buf2 contains un-normalized
softmax
Notes
-----
buf should be in gpu shared memory, we access it many times.
`buf` should be in gpu shared memory, we access it many times.
We use tx as an int variable in a loop.
......
......@@ -17,6 +17,10 @@ from .type import GpuArrayType
class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
"""
Images2Neibs for the GPU.
"""
def __init__(self, mode='valid'):
if mode not in ['valid', 'ignore_borders', 'wrap_centered']:
raise NotImplementedError("Only the mode valid, ignore_borders"
......
......@@ -41,6 +41,9 @@ def ensure_float(val, name):
class Gemm16(COp):
"""
Gemm for float16 using the Nervana kernels.
"""
__props__ = ('relu', 'inplace')
_f16_ok = True
params_type = gpu_context_type
......
......@@ -22,7 +22,7 @@ def grab_cpu_scalar(v, nd):
Parameters
----------
v : variable
v
Theano variable to extract the constant value from.
nd : int
Expected number of dimensions for the variable (for
......@@ -55,7 +55,7 @@ def find_node(v, cls, ignore_clients=False):
Parameters
----------
v : variable
v
The variable to dig through
cls : Op class
The type of the node we are looking for
......@@ -84,9 +84,9 @@ def is_equal(var, val):
Parameters
----------
var : variable
var
Variable to compare
val : value
val
Python value
"""
......@@ -101,11 +101,11 @@ def alpha_merge(cls, alpha_in, beta_in):
"""
Decorator to merge multiplication by a scalar on the output.
This will find a pattern of scal * <yourop>(some, params, alpha,
beta) and update it so that the scalar multiplication happens as
This will find a pattern of `scal * <yourop>(some, params, alpha,
beta)` and update it so that the scalar multiplication happens as
part of your op.
The op needs to accept an alpha and a beta scalar which act this way:
The op needs to accept an alpha and a beta scalar which act this way::
out = Op() * alpha + out_like * beta
......@@ -113,7 +113,7 @@ def alpha_merge(cls, alpha_in, beta_in):
and gets added to the "real" output of the operation. An example
of an operation that respects this pattern is GEMM from blas.
The decorated function must have this signature:
The decorated function must have this signature::
maker(node, *inputs)
......@@ -122,7 +122,7 @@ def alpha_merge(cls, alpha_in, beta_in):
for your op so that the new version performs the same computation.
The `*inputs` parameters contains the new inputs for your op. You
MUST use those inputs instead of the ones on `node`. Note that
this function can be as simple as:
this function can be as simple as::
def maker(node, *inputs):
return node.op(*inputs)
......@@ -138,8 +138,9 @@ def alpha_merge(cls, alpha_in, beta_in):
Returns
-------
This returns an unregistered local optimizer that has the same
name as the decorated function.
local optimizer
an unregistered local optimizer that has the same name as the
decorated function.
Notes
-----
......@@ -191,11 +192,11 @@ def output_merge(cls, alpha_in, beta_in, out_in):
"""
Decorator to merge addition by a value on the output.
This will find a pattern of val * <yourop>(some, params, alpha,
beta, out_like) and update it so that the addtition happens as
This will find a pattern of `val * <yourop>(some, params, alpha,
beta, out_like)` and update it so that the addition happens as
part of your op.
The op needs to accept an alpha and a beta scalar which act this way:
The op needs to accept an alpha and a beta scalar which act this way::
out = Op() * alpha + out_like * beta
......@@ -203,7 +204,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
and gets added to the "real" output of the operation. An example
of an operation that respects this pattern is GEMM from blas.
The decorated function must have this signature:
The decorated function must have this signature::
maker(node, *inputs)
......@@ -212,7 +213,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
for your op so that the new version performs the same computation.
The `*inputs` parameters contains the new inputs for your op. You
MUST use those inputs instead of the ones on `node`. Note that
this function can be as simple as:
this function can be as simple as::
def maker(node, *inputs):
return node.op(*inputs)
......@@ -230,8 +231,9 @@ def output_merge(cls, alpha_in, beta_in, out_in):
Returns
-------
This returns an unregistered local optimizer that has the same
name as the decorated function.
local optimizer
an unregistered local optimizer that has the same name as the
decorated function.
Notes
-----
......@@ -281,7 +283,7 @@ def inplace_allocempty(op, idx):
This will duplicate the alloc input if it has more than one client
to allow the op to work on it inplace.
The decorated function must have this signature:
The decorated function must have this signature::
maker(node, inputs)
......@@ -291,7 +293,7 @@ def inplace_allocempty(op, idx):
You should also switch the op to work inplace. The `*inputs`
parameters contains the new inputs for your op. You MUST use
those inputs instead of the ones on `node`. Note that this
function can be as simple as:
function can be as simple as::
def maker(node, inputs):
return [node.op.__class__(inplace=True)(*inputs)]
......@@ -305,8 +307,9 @@ def inplace_allocempty(op, idx):
Returns
-------
This returns an unregistered inplace local optimizer that has the
same name as the decorated function.
local optimizer
an unregistered inplace local optimizer that has the same name
as the decorated function.
"""
def wrapper(maker):
......
......@@ -24,6 +24,9 @@ from .elemwise import GpuElemwise
class GpuSubtensor(HideC, Subtensor):
"""
Subtensor on the GPU.
"""
_f16_ok = True
def make_node(self, x, *inputs):
......@@ -173,8 +176,8 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
The optimization to make this inplace is in tensor/opt.
The same optimization handles IncSubtensor and GpuIncSubtensor.
This Op has c_code too; it inherits tensor.IncSubtensor's c_code.
The helper methods like do_type_checking, copy_of_x, etc. specialize
the c_code for this Op.
The helper methods like :meth:`do_type_checking`,
:meth:`copy_of_x`, etc. specialize the c_code for this Op.
"""
......@@ -405,6 +408,9 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
class GpuAdvancedSubtensor1(HideC, tensor.AdvancedSubtensor1):
"""
AdvancedSubtensor1 on the GPU.
"""
def make_node(self, x, ilist):
ctx_name = infer_context_name(x, ilist)
x_ = as_gpuarray_variable(x, ctx_name)
......@@ -580,8 +586,10 @@ class GpuAdvancedIncSubtensor1_dev20(GpuKernelBase, GpuAdvancedIncSubtensor1):
_f16_ok = True
def make_node(self, x, y, ilist):
"""It defer from GpuAdvancedIncSubtensor1 in that it make sure
the index are of type long.
"""
It differs from GpuAdvancedIncSubtensor1 in that it makes sure
the indexes are of type long.
"""
ctx_name = infer_context_name(x, y, ilist)
x_ = as_gpuarray_variable(x, ctx_name)
......
......@@ -67,6 +67,7 @@ def get_context(name):
def list_contexts():
"""
Return an iterable of all the registered context names.
"""
return _context_reg.keys()
......@@ -85,6 +86,54 @@ def _unreg_context(name):
class GpuArrayType(Type):
"""
The type that represents an array on a gpu.
The `dtype` indicates what scalar data type the elements of
variables of this type will be.
`broadcastable` indicates whether each dimension is broadcastable
or not (to be broadcastable a dimension must always be of length
1).
The `context_name` is the name of the context on which values of
variables of this type will be stored.
Parameters
----------
dtype : str
The name of a numpy dtype
broadcastable : tuple of bools
A tuple that indicates both the number of dimensions (by its
length) and whether those dimensions are broadcastable or not
(by the boolean values).
context_name : str
The name of the context that this type is attached to
(default: None, which is the context specified by
config.device).
name : str, optional
A name for the type that will be used in printouts.
Attributes
----------
dtype : str
Data type used for scalar elements of variables.
broadcastable : tuple of bools
Indicates whether the dimensions are broadcastable or not.
ndim : int
The number of dimensions
context_name : str
The name of the gpu context on which values of variables of this type are stored.
name : str
A string used to print the type if given.
typecode : int
The gpuarray typecode for `dtype`
See Also
--------
theano.gof.type.PureType
"""
def __init__(self, dtype, broadcastable, context_name=None, name=None):
# In case this was not provided and no global value is available
self.dtype = str(dtype)
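The meaning of `broadcastable` can be illustrated with plain NumPy, which follows the same convention: a dimension marked broadcastable must always have length 1, and only such a dimension stretches to match another operand.

```python
# Illustration (plain NumPy, not GpuArrayType) of the `broadcastable`
# semantics documented above: a broadcastable dimension always has
# length 1 and stretches against any other length.
import numpy as np

a = np.zeros((3, 1))   # broadcastable pattern (False, True)
b = np.ones((3, 4))    # broadcastable pattern (False, False)
c = a + b              # the length-1 axis broadcasts to length 4
print(c.shape)         # -> (3, 4)
```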
......@@ -111,6 +160,11 @@ class GpuArrayType(Type):
# This is a property to keep the type pickleable
@property
def context(self):
"""
The context object mapped to the type's :attr:`context_name`.
This is a property.
"""
return get_context(self.context_name)
def __repr__(self):
......@@ -306,8 +360,6 @@ class GpuArrayType(Type):
This function is used internally as part of C code generation.
"""
# TODO: add more type correspondances for e.g. int32, int64, float32,
# complex64, etc.
try:
return {
'float16': (float, 'npy_float16', 'NPY_FLOAT16'),
......@@ -321,8 +373,8 @@ class GpuArrayType(Type):
'int32': (int, 'npy_int32', 'NPY_INT32'),
'uint64': (int, 'npy_uint64', 'NPY_UINT64'),
'int64': (int, 'npy_int64', 'NPY_INT64'),
'complex128': (complex, 'theano_complex128', 'NPY_COMPLEX128'),
'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
# 'complex128': (complex, 'theano_complex128', 'NPY_COMPLEX128'),
# 'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
}[self.dtype]
except KeyError:
raise TypeError("Unsupported dtype for %s: %s" %
......@@ -420,10 +472,21 @@ class _operators(_tensor_py_operators):
class GpuArrayVariable(_operators, Variable):
"""
A variable representing a computation on a certain GPU.
This supports all the operations that :class:`TensorType`
supports.
See Also
--------
Variable
"""
# override the default
def __repr_test_value__(self):
return repr(numpy.array(theano.gof.op.get_test_value(self)))
pass
GpuArrayType.Variable = GpuArrayVariable
......@@ -436,6 +499,17 @@ class GpuArraySignature(tensor.TensorConstantSignature):
class GpuArrayConstant(_operators, Constant):
"""
A constant representing a value on a certain GPU.
This supports all the operations that :class:`TensorType`
supports.
See Also
--------
Constant
"""
def signature(self):
return GpuArraySignature((self.type, numpy.asarray(self.data)))
......@@ -453,6 +527,17 @@ GpuArrayType.Constant = GpuArrayConstant
class GpuArraySharedVariable(_operators, SharedVariable):
"""
A variable representing a shared value on a certain GPU.
This supports all the operations that :class:`TensorType`
supports.
See Also
--------
SharedVariable
"""
def get_value(self, borrow=False, return_internal_type=False):
if return_internal_type:
if borrow:
......@@ -481,6 +566,8 @@ def gpuarray_shared_constructor(value, name=None, strict=False,
"""
SharedVariable constructor for GpuArrayType.
See :func:`theano.shared`.
"""
if target == 'gpu' or target == 'cpu':
raise TypeError('not for me')
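The `TypeError('not for me')` above reflects the shared-constructor dispatch convention: `theano.shared` tries each registered constructor in turn, and a constructor declines a target by raising `TypeError`. A hedged sketch with assumed names (`cpu_ctor`, `gpu_ctor`, `shared_sketch`), not Theano's actual internals:

```python
# Sketch of the dispatch convention above: each constructor raises
# TypeError to decline a target, and the driver falls through to the
# next registered constructor.
def cpu_ctor(value, target=None):
    if target not in (None, 'cpu'):
        raise TypeError('not for me')
    return ('cpu-shared', value)

def gpu_ctor(value, target=None):
    if target != 'gpu':
        raise TypeError('not for me')
    return ('gpu-shared', value)

def shared_sketch(value, target=None, constructors=(gpu_ctor, cpu_ctor)):
    for ctor in constructors:
        try:
            return ctor(value, target=target)
        except TypeError:
            continue
    raise TypeError('No constructor accepted target %r' % (target,))

print(shared_sketch(1.0, target='gpu'))  # -> ('gpu-shared', 1.0)
print(shared_sketch(1.0))                # -> ('cpu-shared', 1.0)
```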
......@@ -596,6 +683,13 @@ theano.compile.register_specify_shape_c_code(
class GpuContextType(Type):
"""
Minimal type used for passing contexts to nodes.
This Type is not a complete type and should never be used for
regular graph operations.
"""
def filter(self, data, strict=False, allow_downcast=None):
if not isinstance(data, gpuarray.GpuContext):
raise TypeError('context is not a GpuContext')
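The `filter` method is a Type's validation hook: it either returns data acceptable for the type or raises `TypeError`. A self-contained sketch using a stand-in `Context` class in place of `gpuarray.GpuContext`:

```python
# Sketch of the `filter` contract shown above, with a stand-in Context
# class instead of gpuarray.GpuContext.
class Context:
    pass

class ContextTypeSketch:
    def filter(self, data, strict=False, allow_downcast=None):
        # Accept only genuine context objects; everything else is
        # rejected, mirroring GpuContextType.filter above.
        if not isinstance(data, Context):
            raise TypeError('context is not a GpuContext')
        return data

t = ContextTypeSketch()
ctx = Context()
print(t.filter(ctx) is ctx)  # -> True
```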
......@@ -652,4 +746,8 @@ Py_INCREF(%(name)s);
# Variable, Constant, ... not declared
"""
Instance of :class:`GpuContextType` to use for the context_type
declaration of an operation.
"""
gpu_context_type = GpuContextType()