testgroup / pytensor · Commits

Commit 54e96754, authored 12月 17, 2015 by Pascal Lamblin

Merge pull request #3559 from abergeron/multi_gpu_doc

Multi gpu doc

Parents: 4ad36ddc, c5084ac8
Showing 23 changed files with 678 additions and 136 deletions.
doc/index.txt (+1 −1)
doc/internal/index.txt (+0 −5)
doc/internal/lisa_labo.txt (+0 −22)
doc/internal/mammouth.txt (+0 −25)
doc/internal/metadocumentation.txt (+0 −3)
doc/library/sandbox/gpuarray/dnn.txt (+161 −0)
doc/library/sandbox/gpuarray/extra.txt (+17 −0)
doc/library/sandbox/gpuarray/index.txt (+19 −0)
doc/library/sandbox/gpuarray/op.txt (+47 −0)
doc/library/sandbox/gpuarray/type.txt (+8 −0)
doc/library/sandbox/index.txt (+2 −0)
doc/tutorial/index.txt (+1 −0)
doc/tutorial/using_multi_gpu.txt (+140 −0)
theano/misc/check_multi_gpu.py (+6 −7)
theano/sandbox/gpuarray/basic_ops.py (+66 −3)
theano/sandbox/gpuarray/blas.py (+16 −0)
theano/sandbox/gpuarray/elemwise.py (+9 −11)
theano/sandbox/gpuarray/kernel_codegen.py (+38 −28)
theano/sandbox/gpuarray/neighbours.py (+4 −0)
theano/sandbox/gpuarray/nerv.py (+3 −0)
theano/sandbox/gpuarray/opt_util.py (+25 −22)
theano/sandbox/gpuarray/subtensor.py (+12 −4)
theano/sandbox/gpuarray/type.py (+103 −5)
doc/index.txt
@@ -132,7 +132,7 @@ Roughly in order of what you'll want to check out:
* :ref:`extending` -- Learn to add a Type, Op, or graph optimization.
* :ref:`dev_start_guide` -- How to contribute code to Theano.
* :ref:`developer` -- Primarily of interest to developers of Theano
- * :ref:`internal` -- How to maintain Theano, LISA-specific tips, and more...
+ * :ref:`internal` -- How to maintain Theano and more...
* :ref:`release` -- How our release should work.
* :ref:`acknowledgement` -- What we took from other projects.
* `Related Projects`_ -- link to other projects that implement new functionalities on top of Theano
...
...
doc/internal/index.txt
@@ -5,16 +5,11 @@
Internal Documentation
======================

If you're feeling ambitious, go fix some `pylint
<http://lgcm.iro.umontreal.ca/auto_theano_pylint/pylint_global.html>`_ errors!

.. toctree::
    :maxdepth: 2

    release
    dev_start_guide
    lisa_labo
    mammouth
    metadocumentation
    python
    how_to_release
doc/internal/lisa_labo.txt (deleted, 100644 → 0)

.. _lisa_labo:

===============================
LISA Labo specific instructions
===============================

Tips for running at LISA
------------------------

Shell configuration files ``/opt/lisa/os/.local.{bash,csh}rc`` should define
:envvar:`THEANORC` to include ``/opt/lisa/os/.local.theanorc`` as a
configuration file.

``/opt/lisa/os/.local.theanorc`` should include the right default values for
the lab; in particular, ``blas.ldflags`` should contain ``'-lgoto'``.

Tips for running on a cluster
-----------------------------

:ref:`mammouth`
    For instructions on running Theano on the mammouth cluster.
doc/internal/mammouth.txt (deleted, 100644 → 0)

.. _mammouth:

===========================
Running Theano on Mammouth
===========================

To run Theano on the Mammouth cluster, follow these simple steps:

* Make sure to source Fred's .local.bashrc file. It contains all
  the goodies for using the latest and greatest (optimized) libraries
  (numpy, scipy, etc.)

  .. code-block:: sh

      source /home/bastienf/.local.bashrc

  Perhaps even put this in your ``.bashrc``.

* Set ``config.blas.ldflags`` to ``'-lmkl -lguide -fopenmp'``
  (see :mod:`config` to know how).

Note: the ``-lguide`` flag works; however, the fix should probably be
considered temporary. Intel has deprecated libguide.so in favor of the newer
library libiomp5.so. However, both libraries are mutually exclusive, and one
component (theano, numpy or scipy?) already seems to be using libguide.so
(hence ``-liomp5`` causes a linking error when compiling thunks).
doc/internal/metadocumentation.txt
@@ -110,9 +110,6 @@ pylint output is not autogenerated anymore.
Pylint documentation is generated using the pylintrc file:
``Theano/doc/pylintrc``. You can see a list of all `pylint messages
<http://www.logilab.org/card/pylintfeatures>`__.
.. _metadocumentation_nightly_build:
...
...
doc/library/sandbox/gpuarray/dnn.txt (new file)

.. _libdoc_gpuarray_dnn:

===========================================
:mod:`theano.sandbox.gpuarray.dnn` -- cuDNN
===========================================

.. moduleauthor:: LISA

`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library
with functionality used by deep neural networks. It provides optimized
versions of some operations, such as convolution. cuDNN is not
currently installed with CUDA; you must download and install it
yourself.

To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:

- The easiest is to include them in your CUDA installation. Copy the
  ``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
  ``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
  on Linux).

- Alternatively, on Linux, you can set the environment variables
  ``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
  extracted from the download. If needed, separate multiple directories
  with ``:`` as in the ``PATH`` environment variable. Example::

      export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
      export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
      export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH

- As a third option, also on Linux, you can copy the ``*.h`` files
  to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.

By default, Theano will detect whether it can use cuDNN. If so, it will
use it; if not, Theano optimizations will not introduce cuDNN ops, so
Theano will still work if the user did not introduce them manually.
To get an error if Theano cannot use cuDNN, use this Theano flag:
``optimizer_including=cudnn``.

.. note::

    cuDNN v3 has now been released. cuDNN v2 remains supported, but cuDNN v3
    is faster and offers many more options. We recommend that everybody
    update to v3.

.. note::

    Starting in cuDNN v3, multiple convolution implementations are offered,
    and it is possible to use heuristics to automatically choose a
    convolution implementation well suited to the parameters of the
    convolution.

    The Theano flag ``dnn.conv.algo_fwd`` allows you to specify the cuDNN
    convolution implementation that Theano should use for forward
    convolutions. Possible values include:

    * ``small`` (default): use a convolution implementation with small
      memory usage.
    * ``none``: use a slower implementation with minimal memory usage.
    * ``large``: use a sometimes faster implementation with large memory
      usage.
    * ``fft``: use the Fast Fourier Transform implementation of convolution
      (very high memory usage).
    * ``guess_once``: the first time a convolution is executed, the
      implementation to use is chosen according to cuDNN's heuristics and
      reused for every subsequent execution of the convolution.
    * ``guess_on_shape_change``: like ``guess_once``, but a new convolution
      implementation is selected every time the shapes of the inputs and
      kernels don't match the shapes from the last execution.
    * ``time_once``: the first time a convolution is executed, every
      convolution implementation offered by cuDNN is executed and timed.
      The fastest is reused for every subsequent execution of the
      convolution.
    * ``time_on_shape_change``: like ``time_once``, but a new convolution
      implementation is selected every time the shapes of the inputs and
      kernels don't match the shapes from the last execution.

    The Theano flag ``dnn.conv.algo_bwd`` allows you to specify the cuDNN
    convolution implementation that Theano should use for gradient
    convolutions. Possible values include:

    * ``none`` (default): use the default non-deterministic convolution
      implementation.
    * ``deterministic``: use a slower but deterministic implementation.
    * ``fft``: use the Fast Fourier Transform implementation of convolution
      (very high memory usage).
    * ``guess_once``: the first time a convolution is executed, the
      implementation to use is chosen according to cuDNN's heuristics and
      reused for every subsequent execution of the convolution.
    * ``guess_on_shape_change``: like ``guess_once``, but a new convolution
      implementation is selected every time the shapes of the inputs and
      kernels don't match the shapes from the last execution.
    * ``time_once``: the first time a convolution is executed, every
      convolution implementation offered by cuDNN is executed and timed.
      The fastest is reused for every subsequent execution of the
      convolution.
    * ``time_on_shape_change``: like ``time_once``, but a new convolution
      implementation is selected every time the shapes of the inputs and
      kernels don't match the shapes from the last execution.

    ``guess_*`` and ``time_*`` flag values take into account the amount of
    available memory when selecting an implementation. This means that
    slower implementations might be selected if not enough memory is
    available for the faster implementations.

.. note::

    Normally you should not call GPU Ops directly, but the CPU interface
    currently does not expose all options supported by cuDNN ops, so you
    may need to call them manually.

.. note::

    The cuDNN documentation states that, for the two following operations,
    reproducibility is not guaranteed with the default implementation:
    `cudnnConvolutionBackwardFilter` and `cudnnConvolutionBackwardData`.
    Those correspond to the gradient wrt the weights and the gradient wrt
    the input of the convolution. They are also sometimes used in the
    forward pass, when they give a speed-up.

    The Theano flag ``dnn.conv.algo_bwd`` can be used to force the use of a
    slower but deterministic convolution implementation.

.. note::

    There is a problem we do not understand yet when cuDNN paths are
    used with symbolic links, so avoid using them.

.. note::

    ``cudnn.so*`` must be readable and executable by everybody.
    ``cudnn.h`` must be readable by everybody.

Functions
=========

.. automodule:: theano.sandbox.gpuarray.dnn
    :noindex:
    :members: dnn_conv, dnn_pool

Convolution Ops
===============

.. automodule:: theano.sandbox.gpuarray.dnn
    :noindex:
    :members: GpuDnnConvDesc, GpuDnnConv, GpuDnnConvGradW, GpuDnnConvGradI

Pooling Ops
===========

.. automodule:: theano.sandbox.gpuarray.dnn
    :noindex:
    :members: GpuDnnPoolDesc, GpuDnnPool, GpuDnnPoolGrad

Softmax Ops
===========

.. automodule:: theano.sandbox.gpuarray.dnn
    :noindex:
    :members: GpuDnnSoftmax, GpuDnnSoftmaxGrad
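The two flag value lists above are easy to get wrong by hand. The helper below is a hypothetical sketch (it is not part of Theano): it only transcribes the documented value sets and formats a ``THEANO_FLAGS`` fragment, rejecting anything undocumented.

```python
# Hypothetical helper, not part of Theano: validate dnn.conv.algo_* values
# against the sets documented above and render a THEANO_FLAGS fragment.
GUESS_TIME = {"guess_once", "guess_on_shape_change",
              "time_once", "time_on_shape_change"}
ALGO_FWD = {"small", "none", "large", "fft"} | GUESS_TIME
ALGO_BWD = {"none", "deterministic", "fft"} | GUESS_TIME


def format_dnn_flags(algo_fwd="small", algo_bwd="none"):
    """Return a THEANO_FLAGS fragment, refusing undocumented values."""
    if algo_fwd not in ALGO_FWD:
        raise ValueError("unknown dnn.conv.algo_fwd: %r" % (algo_fwd,))
    if algo_bwd not in ALGO_BWD:
        raise ValueError("unknown dnn.conv.algo_bwd: %r" % (algo_bwd,))
    return "dnn.conv.algo_fwd=%s,dnn.conv.algo_bwd=%s" % (algo_fwd, algo_bwd)
```

The resulting string can be appended to whatever else you already pass in ``THEANO_FLAGS``.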
doc/library/sandbox/gpuarray/extra.txt (new file)
.. _libdoc_gpuarray_extra:

=================
Utility functions
=================

Optimisation
------------

.. automodule:: theano.sandbox.gpuarray.opt_util
    :members:

Kernel generation
-----------------

.. automodule:: theano.sandbox.gpuarray.kernel_codegen
    :members:
doc/library/sandbox/gpuarray/index.txt (new file)
.. _libdoc_gpuarray:

=======================================================
:mod:`theano.sandbox.gpuarray` -- The (new) GPU backend
=======================================================

.. module:: theano.sandbox.gpuarray
    :platform: Unix, Windows
    :synopsis: Code for GPU programming (new)

.. moduleauthor:: MILA

.. toctree::
    :maxdepth: 1

    op
    dnn
    type
    extra
doc/library/sandbox/gpuarray/op.txt (new file)
.. _libdoc_gpuarray_op:

================================
List of gpuarray Ops implemented
================================

.. moduleauthor:: LISA

Normally you should not call these Ops directly! Theano should
automatically transform CPU ops to their GPU equivalents, so this list
is mainly useful to let people know what is implemented on the GPU.
Basic Op
========
.. automodule:: theano.sandbox.gpuarray.basic_ops
:members:
Blas Op
=======
.. automodule:: theano.sandbox.gpuarray.blas
:members:
.. automodule:: theano.sandbox.gpuarray.nerv
:members:
Elemwise Op
===========
.. automodule:: theano.sandbox.gpuarray.elemwise
:members:
Subtensor Op
============
.. automodule:: theano.sandbox.gpuarray.subtensor
:members:
Nnet Op
=======
.. automodule:: theano.sandbox.gpuarray.nnet
:members:
.. automodule:: theano.sandbox.gpuarray.neighbours
:members:
doc/library/sandbox/gpuarray/type.txt (new file)
.. _libdoc_gpuarray_type:
===================================================
:mod:`theano.sandbox.gpuarray.type` -- Type classes
===================================================
.. automodule:: theano.sandbox.gpuarray.type
:members:
doc/library/sandbox/index.txt
@@ -14,6 +14,8 @@
:maxdepth: 1
cuda/index
gpuarray/index
linalg
neighbours
rng_mrg
blocksparse
doc/tutorial/index.txt
@@ -37,6 +37,7 @@ you out.
loop
sparse
using_gpu
using_multi_gpu
gpu_data_convert
aliasing
shape_info
...
...
doc/tutorial/using_multi_gpu.txt (new file)
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Theano has a feature to allow the use of multiple GPUs at the same
time in one function. The multiple-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that backend works correctly.

In order to keep a reasonably high level of abstraction, you do not
refer to device names directly for multiple-GPU use. You instead
refer to what we call context names. These are then mapped to a
device using the Theano configuration. This allows portability of
models between machines.
.. warning::
The code is rather new and is still considered experimental at this
point. It has been tested and seems to perform correctly in all
cases observed, but make sure to double-check your results before
publishing a paper or anything of the sort.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::
dev0->cuda0;dev1->cuda1
Let's break it down. First, there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string that does not have any special meaning for Theano. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects, like 'cuda0' or 'opencl0:0'.
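The grammar just described is simple enough to sketch in a few lines of code. The parser below is purely illustrative and is not Theano's actual parsing code; it exists only to make the format concrete.

```python
# Illustrative parser for the `contexts` flag format described above.
# This is NOT Theano's real parser; it only demonstrates the grammar.
def parse_contexts(spec):
    """Turn 'name->device;name->device' into a {name: device} dict."""
    mapping = {}
    for entry in spec.split(';'):
        name, sep, device = entry.partition('->')
        if not sep:
            raise ValueError("missing '->' in mapping %r" % (entry,))
        if name.startswith(('cuda', 'opencl')):
            raise ValueError("context name %r would be confused "
                             "with a device name" % (name,))
        mapping[name] = device
    return mapping
```

For instance, ``parse_contexts("dev0->cuda0;dev1->cuda1")`` yields a dict mapping `dev0` to `cuda0` and `dev1` to `cuda1`, mirroring the mapping Theano reports when the map is defined.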
.. note::
Since there are a bunch of shell special characters in the syntax,
defining this on the command-line will require proper quoting, like this:
.. code-block:: shell
$ THEANO_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Theano will print the mappings as they are
defined, like this:
.. code-block:: bash
$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X
Mapped name dev1 to device cuda1: GeForce GTX TITAN X
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea, since Theano often
assumes that different context names will be on different devices and
will optimize accordingly. So you may get faster performance with a
single name and a single device.
.. note::

    It is often the case that multi-gpu operation requires or assumes
    that all the GPUs involved are equivalent. This is not the case
    for this implementation. Since the user has the task of
    distributing the jobs across the different devices, a model can be
    built on the assumption that one of the GPUs is slower or has
    smaller memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
that performs two dot products on two different GPUs.
.. code-block:: python

    import numpy
    import theano

    v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')
    v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')

    f = theano.function([], [theano.tensor.dot(v01, v02),
                             theano.tensor.dot(v11, v12)])

    f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input which is not on the right device.
However, you may want some explicit control over where and how these
transfers are done at some points. This is done by using the new
:meth:`transfer` method that is present on variables. It works for
moving data between GPUs and also between the host and the GPUs. Here
is an example.
.. code-block:: python

    import theano

    v = theano.tensor.fmatrix()

    # Move to the device associated with 'gpudev'
    gv = v.transfer('gpudev')

    # Move back to the cpu
    cv = gv.transfer('cpu')
Of course you can mix transfers and operations in any order you
choose. However, you should try to minimize transfer operations
because they introduce overhead and may reduce performance.
theano/misc/check_multi_gpu.py

@@ -12,7 +12,6 @@
import numpy
import theano
from theano.sandbox.gpuarray import init_dev
from theano.sandbox.gpuarray.type import gpuarray_shared_constructor as shared
from theano.sandbox.gpuarray.blas import gpu_dot22

@@ -22,13 +21,13 @@ def main(dev1, dev2):
    size = 1024 * 16
    data = numpy.random.randn(size, size).astype('float32')
-   val1a = shared(data, target='ctx1')
-   val1b = shared(data, target='ctx1')
-   val1c = shared(data, target='ctx1')
-   val1d = shared(data, target='ctx1')
+   val1a = theano.shared(data, target='ctx1')
+   val1b = theano.shared(data, target='ctx1')
+   val1c = theano.shared(data, target='ctx1')
+   val1d = theano.shared(data, target='ctx1')
-   val2a = shared(data, target='ctx2')
-   val2b = shared(data, target='ctx2')
+   val2a = theano.shared(data, target='ctx2')
+   val2b = theano.shared(data, target='ctx2')
    f1 = theano.function([], [gpu_dot22(val1a, val1b), gpu_dot22(val1c, val1d)])
theano/sandbox/gpuarray/basic_ops.py

@@ -27,6 +27,20 @@ from .fp16_help import write_w
def as_gpuarray_variable(x, context_name):
    """
    This will attempt to convert `x` into a variable on the GPU.

    It can take either a value or another variable. If `x` is already
    suitable, it will be returned as-is.

    Parameters
    ----------
    x
        Object to convert.
    context_name : str or None
        Target context name for the result.

    """
    # If this is already some form of variable, try to avoid an extra transfer
    if isinstance(x, Variable):
        while True:

@@ -174,6 +188,13 @@ class Kernel(object):
class GpuKernelBase(object):
    """
    Base class for operations that need to compile kernels.

    It is not mandatory to use this class, but it helps with a lot of
    the small things that you have to pay attention to.

    """
    params_type = gpu_context_type

    def gpu_kernels(self, node, name):

@@ -274,10 +295,25 @@ class GpuKernelBase(object):
        return (self.c_code_cache_version(), self.kernel_version(node))

    def kernel_version(self, node):
        """
        If you override :meth:`c_code_cache_version_apply`, call this
        method to have the version of the kernel support code and
        device.

        Parameters
        ----------
        node : apply node
            The node that we need the cache version for.

        """
        return (3, self.get_params(node).bin_id)

class HostFromGpu(Op):
    """
    Transfer data to CPU.

    """
    __props__ = ()
    _f16_ok = True

@@ -356,6 +392,10 @@ host_from_gpu = HostFromGpu()
class GpuFromHost(Op):
    """
    Transfer data to GPU.

    """
    __props__ = ('context_name',)
    _f16_ok = True
    params_type = gpu_context_type

@@ -443,6 +483,10 @@ class GpuFromHost(Op):
class GpuToGpu(Op):
    """
    Transfer data between GPUs.

    """
    __props__ = ('context_name',)
    _f16_ok = True
    params_type = gpu_context_type

@@ -494,6 +538,7 @@ class GpuToGpu(Op):
class GpuAlloc(HideC, Alloc):
    """
    Allocate initialized memory on the GPU.

    Parameters
    ----------

@@ -654,6 +699,10 @@ class GpuAlloc(HideC, Alloc):
class GpuAllocEmpty(HideC, Alloc):
    """
    Allocate uninitialized memory on the GPU.

    """
    __props__ = ('dtype', 'context_name')
    _f16_ok = True
    params_type = gpu_context_type

@@ -732,8 +781,10 @@ def empty_like(var):
class GpuContiguous(Op):
    """
-   Always return a c contiguous output. Copy the input only if it is
-   not already c contiguous.
+   Return a C contiguous version of the input.
+
+   This may either pass the object as-is (if already C contiguous) or
+   make a copy.

    """
    __props__ = ()

@@ -793,7 +844,7 @@ gpu_contiguous = GpuContiguous()
class GpuReshape(HideC, tensor.Reshape):
    """
-   Implement Reshape on the gpu.
+   Reshape for GPU variables.

    """

@@ -914,6 +965,10 @@ class GpuReshape(HideC, tensor.Reshape):
class GpuJoin(HideC, Join):
    """
    Join for GPU.

    """
    _f16_ok = True
    params_type = gpu_context_type

@@ -991,6 +1046,10 @@ gpu_join = GpuJoin()
class GpuSplit(HideC, Split):
    """
    Split for GPU.

    """
    def make_node(self, x, axis, splits):
        node = Split.make_node(self, x, axis, splits)
        x = as_gpuarray_variable(x, infer_context_name(x))

@@ -1002,6 +1061,10 @@ class GpuSplit(HideC, Split):
class GpuEye(GpuKernelBase, Op):
    """
    Eye for GPU.

    """
    __props__ = ('dtype', 'context_name')
    _f16_ok = True
theano/sandbox/gpuarray/blas.py

@@ -31,6 +31,10 @@ class BlasOp(Op):
class GpuGemv(BlasOp):
    """
    Gemv on the GPU.

    """
    __props__ = ('inplace',)

    def __init__(self, inplace=False):

@@ -107,6 +111,10 @@ gpugemv_inplace = GpuGemv(inplace=True)
class GpuGemm(BlasOp):
    """
    Gemm on the GPU.

    """
    __props__ = ('inplace',)
    _f16_ok = True

@@ -184,6 +192,10 @@ gpugemm_inplace = GpuGemm(inplace=True)
class GpuGer(BlasOp):
    """
    Ger on the GPU.

    """
    __props__ = ('inplace',)

    def __init__(self, inplace=False):

@@ -256,6 +268,10 @@ gpuger_inplace = GpuGer(inplace=True)
class GpuDot22(BlasOp):
    """
    Dot22 on the GPU.

    """
    __props__ = ()

    def make_node(self, x, y):
theano/sandbox/gpuarray/elemwise.py

@@ -57,6 +57,10 @@ def as_C_string_const(s):
class GpuElemwise(GpuKernelBase, HideC, Elemwise):
    """
    Elemwise on the GPU.

    """
    nin = property(lambda self: self.scalar_op.nin)
    nout = property(lambda self: self.scalar_op.nout)
    _f16_ok = True

@@ -445,6 +449,10 @@ class SupportCodeError(Exception):
class GpuDimShuffle(HideC, DimShuffle):
    """
    DimShuffle on the GPU.

    """
    _f16_ok = True

    def make_node(self, input):

@@ -548,7 +556,7 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
    Parameters
    ----------
-   reduce-mask
+   reduce_mask
        The dimensions along which to reduce. The `reduce_mask` is a tuple of
        booleans (actually integers 0 or 1) that specify for each input
        dimension, whether to reduce it (1) or not (0).

@@ -1279,14 +1287,6 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
        """ % locals()

    def c_code_reduce_ccontig(self, sio, node, name, x, z, fail):
        """
        WRITEME

        IG: I believe, based on how this is called in c_code, that it
        is for the case where we are reducing on all axes and x is
        C contiguous.
        """
        in_dtype = "npy_" + node.inputs[0].dtype
        out_dtype = "npy_" + node.outputs[0].dtype
        if getattr(self.scalar_op, 'identity', None) == 0:

@@ -2666,8 -2666,6 @@ class GpuCAReduceCPY(GpuKernelBase, HideC, CAReduceDtype):
    """
    CAReduce that reuses the python code from gpuarray.

    Too slow for now as it only has a python interface.

    """
    def __init__(self, scalar_op, axis=None, dtype=None, acc_dtype=None):
        if not hasattr(scalar_op, 'identity'):
theano/sandbox/gpuarray/kernel_codegen.py

@@ -71,17 +71,19 @@ def inline_reduce(N, buf, pos, count, manner_fn):
    count
        Number of executing threads.
    manner_fn
-       A function that accepts strings of arguments a and b, and returns c code
-       for their reduction. Example: return "%(a)s + %(b)s" for a sum reduction.
-
-   :postcondition: This function leaves the answer in position 0 of the
-       buffer. The rest of the buffer is trashed by this function.
+       A function that accepts strings of arguments a and b, and
+       returns c code for their reduction.
+
+       Example: return "%(a)s + %(b)s" for a sum reduction.

    Notes
    -----
-   buf should be in gpu shared memory, we access it many times.
+   `buf` should be in gpu shared memory, we access it many times.
+
+   This function leaves the answer in position 0 of the buffer. The
+   rest of the buffer is trashed by this function.

    """
    loop_line = manner_fn("%s[%s]" % (buf, pos), "%s[i]" % (buf))

@@ -149,6 +151,13 @@ def inline_reduce_prod(N, buf, pos, count):
    inline_reduce_sum.code_version)

def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
    """
    Generate code for a softmax.

    On entry, `buf` and `buf2` must contain two identical copies of
    the input to softmax.

    After the code returns `buf` contains the softmax, `buf2` contains
    un-normalized softmax.

    Parameters
    ----------

@@ -161,14 +170,10 @@ def inline_softmax(N, buf, buf2, threadPos, threadCount, dtype="float32"):
    dtype
        Dtype of the softmax's output.
-
-   :Precondition: buf and buf2 contain two identical copies of the input
-       to softmax
-   :Postcondition: buf contains the softmax, buf2 contains un-normalized
-       softmax

    Notes
    -----
-   buf and buf2 should be in gpu shared memory, we access it many times.
+   `buf` and `buf2` should be in gpu shared memory, we access it many
+   times.

    We use __i as an int variable in a loop.

@@ -205,6 +210,9 @@ def inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count,
    """
    Return C++ code for a function that reduces a contiguous buffer.

    This function leaves the answer in position 0 of the buffer. The
    rest of the buffer is trashed by this function.

    Parameters
    ----------
    N

@@ -230,20 +238,19 @@
    dtype
        Optional, the dtype of the output.
    manner_fn
-       A function that accepts strings of arguments a and b, and returns c code
-       for their reduction. Example: return "%(a)s + %(b)s" for a sum reduction.
-   manner_init
-       A function that accepts strings of arguments a and return c code for its
-       initialization.
-
-   :postcondition: This function leaves the answer in position 0 of the buffer.
-       The rest of the buffer is trashed by this function.
+       A function that accepts strings of arguments a and b, and
+       returns c code for their reduction.
+
+       Example: return "%(a)s + %(b)s" for a sum reduction.
+   manner_init
+       A function that accepts strings of arguments a and return c
+       code for its initialization.

    Notes
    -----
-   buf should be in gpu shared memory, we access it many times.
+   `buf` should be in gpu shared memory, we access it many times.

    """
    if b:

@@ -320,6 +327,13 @@ def inline_softmax_fixed_shared(N, buf, x, stride_x, load_x,
                                b='', stride_b='', load_b='', dtype="float32"):
    """
    Generate code to perform softmax with a fixed amount of shared
    memory.

    On entry, `buf` is assumed to be empty.

    On exit, `buf[0]` contains the softmax, `buf2` contains
    un-normalized softmax.

    Parameters
    ----------

@@ -352,13 +366,9 @@
    dtype
        Optional, the dtype of the softmax's output if not float32.
-
-   :Precondition: buf is empty
-   :Postcondition: buf[0] contains the softmax, buf2 contains un-normalized
-       softmax

    Notes
    -----
-   buf should be in gpu shared memory, we access it many times.
+   `buf` should be in gpu shared memory, we access it many times.

    We use tx as an int variable in a loop.
theano/sandbox/gpuarray/neighbours.py

@@ -17,6 +17,10 @@ from .type import GpuArrayType
class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
    """
    Images2Neibs for the GPU.

    """
    def __init__(self, mode='valid'):
        if mode not in ['valid', 'ignore_borders', 'wrap_centered']:
            raise NotImplementedError("Only the mode valid, ignore_borders"
theano/sandbox/gpuarray/nerv.py

@@ -41,6 +41,9 @@ def ensure_float(val, name):
class Gemm16(COp):
    """
    Gemm for float16 using the nervana kernels.

    """
    __props__ = ('relu', 'inplace')
    _f16_ok = True
    params_type = gpu_context_type
theano/sandbox/gpuarray/opt_util.py
浏览文件 @
54e96754
...
...
@@ -22,7 +22,7 @@ def grab_cpu_scalar(v, nd):
Parameters
----------
v
: variable
v
Theano variable to extract the constant value from.
nd : int
Expected number of dimensions for the variable (for
...
...
@@ -55,7 +55,7 @@ def find_node(v, cls, ignore_clients=False):
Parameters
----------
v
: variable
v
The variable to dig through
cls : Op class
The type of the node we are looking for
...
...
@@ -84,9 +84,9 @@ def is_equal(var, val):
Parameters
----------
var
: variable
var
Variable to compare
val
: value
val
Python value
"""
...
...
@@ -101,11 +101,11 @@ def alpha_merge(cls, alpha_in, beta_in):
"""
Decorator to merge multiplication by a scalar on the output.
This will find a pattern of scal * <yourop>(some, params, alpha,
beta) and update it so that the scalar multiplication happens as
This will find a pattern of
`
scal * <yourop>(some, params, alpha,
beta)
`
and update it so that the scalar multiplication happens as
part of your op.
The op needs to accept an alpha and a beta scalar which act this way:
The op needs to accept an alpha and a beta scalar which act this way:
:
out = Op() * alpha + out_like * beta
...
...
@@ -113,7 +113,7 @@ def alpha_merge(cls, alpha_in, beta_in):
and gets added to the "real" output of the operation. An example
of an operation that respects this pattern is GEMM from blas.
The decorated function must have this signature:
The decorated function must have this signature:
:
maker(node, *inputs)
...
...
@@ -122,7 +122,7 @@ def alpha_merge(cls, alpha_in, beta_in):
for your op so that the new version performs the same computation.
The `*inputs` parameters contains the new inputs for your op. You
MUST use those inputs instead of the ones on `node`. Note that
this function can be as simple as:
this function can be as simple as:
:
def maker(node, *inputs):
return node.op(*inputs)
...
...
@@ -138,8 +138,9 @@ def alpha_merge(cls, alpha_in, beta_in):
Returns
-------
This returns an unregistered local optimizer that has the same
name as the decorated function.
local optimizer
an unregistered local optimizer that has the same name as the
decorated function.
Notes
-----
...
...
@@ -191,11 +192,11 @@ def output_merge(cls, alpha_in, beta_in, out_in):
"""
Decorator to merge addition by a value on the output.
This will find a pattern of val * <yourop>(some, params, alpha,
beta, out_like) and update it so that the addtition happens as
This will find a pattern of
`
val * <yourop>(some, params, alpha,
beta, out_like)
`
and update it so that the addtition happens as
part of your op.
The op needs to accept an alpha and a beta scalar which act this way:
The op needs to accept an alpha and a beta scalar which act this way:
:
out = Op() * alpha + out_like * beta
...
...
@@ -203,7 +204,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     and gets added to the "real" output of the operation. An example
     of an operation that respects this pattern is GEMM from blas.

-    The decorated function must have this signature:
+    The decorated function must have this signature::

         maker(node, *inputs)
...
...
@@ -212,7 +213,7 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     for your op so that the new version performs the same computation.
     The `*inputs` parameters contains the new inputs for your op. You
     MUST use those inputs instead of the ones on `node`. Note that
-    this function can be as simple as:
+    this function can be as simple as::

         def maker(node, *inputs):
             return node.op(*inputs)
...
...
@@ -230,8 +231,9 @@ def output_merge(cls, alpha_in, beta_in, out_in):
     Returns
     -------
-    This returns an unregistered local optimizer that has the same
-    name as the decorated function.
+    local optimizer
+        an unregistered local optimizer that has the same name as the
+        decorated function.
Notes
-----
...
...
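As with `alpha_merge`, the algebra behind `output_merge` can be sketched in plain Python. This is illustrative only, not the Theano rewrite itself: `fused_op` is again a hypothetical op following the `out = Op() * alpha + out_like * beta` contract, and one way a value added on the output can be absorbed is into the `out_like` input, assuming `beta != 0`.

```python
# Illustrative sketch only: folding an addition on the output into the
# op's out_like input. Valid when beta != 0; the real optimizer works
# on graph nodes, not numbers.
def fused_op(core, out_like, alpha, beta):
    return core * alpha + out_like * beta

val = 4.0
# Pattern matched by the optimizer: val added to <yourop>(...)
unfused = val + fused_op(3.0, 5.0, alpha=0.5, beta=0.25)
# After output_merge: the added value is absorbed into out_like.
fused = fused_op(3.0, 5.0 + val / 0.25, alpha=0.5, beta=0.25)
assert abs(unfused - fused) < 1e-12
```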
@@ -281,7 +283,7 @@ def inplace_allocempty(op, idx):
     This will duplicate the alloc input if it has more than one client
     to allow the op to work on it inplace.

-    The decorated function must have this signature:
+    The decorated function must have this signature::

         maker(node, inputs)
...
...
@@ -291,7 +293,7 @@ def inplace_allocempty(op, idx):
     You should also switch the op to work inplace. The `*inputs`
     parameters contains the new inputs for your op. You MUST use
     those inputs instead of the ones on `node`. Note that this
-    function can be as simple as:
+    function can be as simple as::

         def maker(node, inputs):
             return [node.op.__class__(inplace=True)(*inputs)]
...
...
@@ -305,8 +307,9 @@ def inplace_allocempty(op, idx):
     Returns
     -------
-    This returns an unregistered inplace local optimizer that has the
-    same name as the decorated function.
+    local optimizer
+        an unregistered inplace local optimizer that has the same name
+        as the decorated function.

     """
     def wrapper(maker):
...
...
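Why `inplace_allocempty` duplicates an alloc input that has more than one client can be seen with a plain-numpy illustration (this is an analogy, not Theano itself): overwriting a buffer in place is only safe once no other consumer shares that allocation.

```python
import numpy as np

# Plain-numpy analogy for the duplication step: an allocation with a
# second "client" must be copied before an op may overwrite it in place.
buf = np.array([1.0, 2.0, 3.0])
other_client = buf           # a second consumer of the same allocation
work = buf.copy()            # the duplicated input handed to the inplace op
work += 10.0                 # the op now overwrites its input safely

assert other_client.tolist() == [1.0, 2.0, 3.0]
assert work.tolist() == [11.0, 12.0, 13.0]
```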
theano/sandbox/gpuarray/subtensor.py
View file @ 54e96754
...
...
@@ -24,6 +24,9 @@ from .elemwise import GpuElemwise
 class GpuSubtensor(HideC, Subtensor):
+    """
+    Subtensor on the GPU.
+    """
     _f16_ok = True

     def make_node(self, x, *inputs):
...
...
@@ -173,8 +176,8 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
     The optimization to make this inplace is in tensor/opt.
     The same optimization handles IncSubtensor and GpuIncSubtensor.
     This Op has c_code too; it inherits tensor.IncSubtensor's c_code.
-    The helper methods like do_type_checking, copy_of_x, etc. specialize
-    the c_code for this Op.
+    The helper methods like :meth:`do_type_checking`,
+    :meth:`copy_of_x`, etc. specialize the c_code for this Op.

     """
...
...
@@ -405,6 +408,9 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
 class GpuAdvancedSubtensor1(HideC, tensor.AdvancedSubtensor1):
+    """
+    AdvancedSubrensor1 on the GPU.
+    """
     def make_node(self, x, ilist):
         ctx_name = infer_context_name(x, ilist)
         x_ = as_gpuarray_variable(x, ctx_name)
...
...
@@ -580,8 +586,10 @@ class GpuAdvancedIncSubtensor1_dev20(GpuKernelBase, GpuAdvancedIncSubtensor1):
     _f16_ok = True

     def make_node(self, x, y, ilist):
-        """It defer from GpuAdvancedIncSubtensor1 in that it make sure
-        the index are of type long.
-        """
+        """
+        It differs from GpuAdvancedIncSubtensor1 in that it makes sure
+        the indexes are of type long.
+
+        """
         ctx_name = infer_context_name(x, y, ilist)
         x_ = as_gpuarray_variable(x, ctx_name)
...
...
theano/sandbox/gpuarray/type.py
View file @ 54e96754
...
...
@@ -67,6 +67,7 @@ def get_context(name):
 def list_contexts():
     """
+    Return an iterable of all the registered context names.
     """
     return _context_reg.keys()
...
...
@@ -85,6 +86,54 @@ def _unreg_context(name):
 class GpuArrayType(Type):
+    """
+    The type that represents an array on a gpu.
+
+    The `dtype` indicates what scalar data type the elements of
+    variables of this type will be.
+
+    `broadcastable` indicates whether each dimension is broadcastable
+    or not (to be broadcastable a dimension must always be of length
+    1).
+
+    The `context_name` is the name of the context on which values of
+    variables of this type will be stored.
+
+    Parameters
+    ----------
+    dtype : str
+        The name of a numpy dtype
+    broadcastable : tuple of bools
+        A tuple that indicates both the number of dimensions (by its
+        length) and whether those dimensions are broadcastable or not
+        (by the boolean values).
+    context_name : str
+        The name of the context that this type is attached to
+        (default: None, which is the context specified by
+        config.device).
+    name : string, optional
+        A name for the type that will be used in printouts.
+
+    Attributes
+    ----------
+    dtype : str
+        Data type used for scalar elements of variables.
+    broadcastable : tuple of bools
+        Indicates whether the dimensions are broadcastable or not.
+    ndim : int
+        The number of dimensions
+    context_name : str
+        The name of a gpu context on which variables will have their values.
+    name : str
+        A string used to print the type if given.
+    typecode : int
+        The gpuarray typecode for `dtype`
+
+    See Also
+    --------
+    theano.gof.type.PureType
+
+    """
     def __init__(self, dtype, broadcastable, context_name=None, name=None):
         # In case this was not provided and no global value is available
         self.dtype = str(dtype)
...
...
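The `broadcastable` tuple documented above can be illustrated with numpy (constructing a `GpuArrayType` itself requires a GPU context, so this is only an analogy): each `True` entry marks a dimension that is always of length 1, and the tuple's length gives `ndim`.

```python
import numpy as np

# Numpy analogy for the broadcastable tuple of a 2-d type.
row = np.zeros((1, 3))   # would match broadcastable=(True, False)
mat = np.zeros((4, 3))   # would match broadcastable=(False, False)

# A length-1 dimension broadcasts against the other operand's dimension.
assert (row + mat).shape == (4, 3)
# ndim is simply the length of the broadcastable tuple.
assert len((True, False)) == row.ndim
```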
@@ -111,6 +160,11 @@ class GpuArrayType(Type):
     # This is a property to keep the type pickleable
     @property
     def context(self):
+        """
+        The context object mapped to the type's :attr:`context_name`.
+        This is a property.
+
+        """
         return get_context(self.context_name)

     def __repr__(self):
...
...
@@ -306,8 +360,6 @@ class GpuArrayType(Type):
         This function is used internally as part of C code generation.

         """
-        # TODO: add more type correspondances for e.g. int32, int64, float32,
-        # complex64, etc.
         try:
             return {'float16': (float, 'npy_float16', 'NPY_FLOAT16'),
...
...
@@ -321,8 +373,8 @@ class GpuArrayType(Type):
                     'int32': (int, 'npy_int32', 'NPY_INT32'),
                     'uint64': (int, 'npy_uint64', 'NPY_UINT64'),
                     'int64': (int, 'npy_int64', 'NPY_INT64'),
-                    'complex128': (complex, 'theano_complex128', 'NPY_COMPLEX128'),
-                    'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
+                    # 'complex128': (complex, 'theano_complex128', 'NPY_COMPLEX128'),
+                    # 'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
                     }[self.dtype]
         except KeyError:
             raise TypeError("Unsupported dtype for %s: %s" %
...
...
@@ -420,10 +472,21 @@ class _operators(_tensor_py_operators):
 class GpuArrayVariable(_operators, Variable):
+    """
+    A variable representing a computation on a certain GPU.
+
+    This supports all the operations that :class:`TensorType`
+    supports.
+
+    See Also
+    --------
+    Variable
+
+    """
+
     # override the default
     def __repr_test_value__(self):
         return repr(numpy.array(theano.gof.op.get_test_value(self)))
-    pass


 GpuArrayType.Variable = GpuArrayVariable
...
...
@@ -436,6 +499,17 @@ class GpuArraySignature(tensor.TensorConstantSignature):
 class GpuArrayConstant(_operators, Constant):
+    """
+    A constant representing a value on a certain GPU.
+
+    This supports all the operations that :class:`TensorType`
+    supports.
+
+    See Also
+    --------
+    Constant
+
+    """
+
     def signature(self):
         return GpuArraySignature((self.type, numpy.asarray(self.data)))
...
...
@@ -453,6 +527,17 @@ GpuArrayType.Constant = GpuArrayConstant
 class GpuArraySharedVariable(_operators, SharedVariable):
+    """
+    A variable representing a shared value on a certain GPU.
+
+    This supports all the operations that :class:`TensorType`
+    supports.
+
+    See Also
+    --------
+    SharedVariable
+
+    """
+
     def get_value(self, borrow=False, return_internal_type=False):
         if return_internal_type:
             if borrow:
...
...
@@ -481,6 +566,8 @@ def gpuarray_shared_constructor(value, name=None, strict=False,
"""
SharedVariable constructor for GpuArrayType.
See :func:`theano.shared`.
"""
if
target
==
'gpu'
or
target
==
'cpu'
:
raise
TypeError
(
'not for me'
)
...
...
@@ -596,6 +683,13 @@ theano.compile.register_specify_shape_c_code(
 class GpuContextType(Type):
+    """
+    Minimal type used for passing contexts to nodes.
+
+    This Type is not a complete type and should never be used for
+    regular graph operations.
+
+    """
     def filter(self, data, strict=False, allow_downcast=None):
         if not isinstance(data, gpuarray.GpuContext):
             raise TypeError('context is not a GpuContext')
...
...
@@ -652,4 +746,8 @@ Py_INCREF(%(name)s);
 # Variable, Contstant, ... not declared
+"""
+Instance of :class:`GpuContextType` to use for the context_type
+declaration of an operation.
+"""
 gpu_context_type = GpuContextType()