Commit cfc493d1
Authored Sep 01, 2014 by Frédéric Bastien
Merge pull request #2033 from f0k/corrmm-faster-fullconv
Faster algorithms and gradients for GpuCorrMM
Parents: a81b5cdc, 372bab54
Showing 6 changed files with 140 additions and 106 deletions
doc/library/tensor/nnet/conv.txt  +68 -45
theano/sandbox/cuda/blas.py  +0 -0
theano/sandbox/cuda/caffe_common.hpp  +0 -47
theano/sandbox/cuda/conv_gemm.cu  +0 -0
theano/sandbox/cuda/opt.py  +72 -14
theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py  +0 -0
doc/library/tensor/nnet/conv.txt
@@ -22,23 +22,28 @@
 .. moduleauthor:: LISA
-TODO: Give examples for how to use these things! They are pretty complicated.
+TODO: Give examples on how to use these things! They are pretty complicated.
-- Conv implemented
+- Convolution operators implemented:
-    - :func:`signal.conv2d <theano.tensor.signal.conv.conv2d>`.
+    - :func:`signal.conv2d <theano.tensor.signal.conv.conv2d>`. See note above.
     - :func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`.
+      This is the standard operator for convolutional neural networks working
+      with batches of multi-channel 2D images, available for CPU and GPU.
+      Most of the more efficient GPU implementations listed below can be used
+      as an automatic replacement for nnet.conv2d by enabling specific graph
+      optimizations.
     - :func:`conv2d_fft <theano.sandbox.cuda.fftconv.conv2d_fft>`
       This is a GPU-only version of nnet.conv2d that uses an FFT transform
-      to perform the work. conv2d_fft should not be used directly as it
-      does not implement a grad function. Instead, you should use
-      nnet.conv2d and enable the fft optimization by setting
-      'THEANO_FLAGS=optimizer_including=conv_fft_valid:conv_fft_full'
+      to perform the work. conv2d_fft should not be called directly as it
+      does not provide a gradient. Instead, use nnet.conv2d and allow
+      Theano's graph optimizer to replace it by the FFT version by setting
+      ``THEANO_FLAGS=optimizer_including=conv_fft_valid:conv_fft_full``
       in your environement. This is not enabled by default because it
-      has some restrictions on input and uses more memory. Also note
+      has some restrictions on input and uses a lot more memory. Also note
       that it requires CUDA >= 5.0, scikits.cuda >= 0.5.0 and PyCUDA to run.
-      To desactivate the fft optimization on a specific nnet.conv2d
-      while the optimization flags are active, you can set its parameters
-      version to 'no_fft'. To enable for just one Theano function:
+      To deactivate the FFT optimization on a specific nnet.conv2d
+      while the optimization flags are active, you can set its ``version``
+      parameter to ``'no_fft'``. To enable it for just one Theano function:
       .. code-block:: python
@@ -47,17 +52,58 @@ TODO: Give examples for how to use these things! They are pretty complicated.
           f = theano.function(..., mode=mode)
+    - `cuda-convnet wrapper for 2d correlation <http://deeplearning.net/software/pylearn2/library/alex.html>`_
+      Wrapper for an open-source GPU-only implementation of conv2d by Alex
+      Krizhevsky, very fast, but with several restrictions on input and kernel
+      shapes, and with a different memory layout for the input.
+      This is in Pylearn2, where it is normally called from the `linear transform
+      <http://deeplearning.net/software/pylearn2/library/linear.html>`_
+      implementation, but it can also be used `directly from within Theano
+      <http://benanne.github.io/2014/04/03/faster-convolutions-in-theano.html>`_
+      as a manual replacement for nnet.conv2d.
+    - :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`
+      This is a GPU-only 2d correlation implementation taken from
+      `caffe <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_
+      and also used by Torch.
+      For each element in a batch, it first creates a
+      `Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
+      Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters
+      (hence the name: MM is for matrix multiplication).
+      It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
+      ``(no of channels * filter width * filter height, output width * output height)``.
+      As it provides a gradient, you can use it as a replacement for nnet.conv2d.
+      Alternatively, you can use nnet.conv2d and allow Theano's graph optimizer
+      to replace it by the GEMM version by setting
+      ``THEANO_FLAGS=optimizer_including=conv_gemm`` in your environment.
+      This is not enabled by default because it uses some extra memory, but the
+      overhead is small compared to conv2d_fft, there are no restrictions on
+      input or kernel shapes and it is sometimes still faster than cuda-convnet.
+      If using it, please see the warning about a bug in CUDA 5.0 to 6.0 below.
+      To enable it for just one Theano function:
+      .. code-block:: python
+          mode = theano.compile.get_default_mode()
+          mode = mode.including('conv_gemm')
+          f = theano.function(..., mode=mode)
     - :func:`conv3D <theano.tensor.nnet.Conv3D.conv3D>`
-      3D Convolution. Doesn't work on the GPU.
+      3D Convolution applying multi-channel 3D filters to batches of
+      multi-channel 3D images.
     - :func:`conv3d_fft <theano.sandbox.cuda.fftconv.conv3d_fft>`
       GPU-only version of conv3D using FFT transform. conv3d_fft should
-      not be call directly as it does not implement a grad function.
-      You can enable it by setting THEANO_FLAGS to
-      'optimizer_including=conv3d_fft:convgrad3d_fft:convtransp3d_fft'
-      It does not support strides.
-      This is not enabled by default because it uses more memory.
-      Also note that it requires CUDA >= 5.0,
-      scikits.cuda >= 0.5.0 and PyCUDA to run.
+      not be called directly as it does not provide a gradient.
+      Instead, use conv3D and allow Theano's graph optimizer to replace it by
+      the FFT version by setting
+      ``THEANO_FLAGS=optimizer_including=conv3d_fft:convgrad3d_fft:convtransp3d_fft``
+      in your environment. This is not enabled by default because it does not
+      support strides and uses more memory. Also note that it requires
+      CUDA >= 5.0, scikits.cuda >= 0.5.0 and PyCUDA to run.
       To enable for just one Theano function:
       .. code-block:: python
@@ -70,33 +116,10 @@ TODO: Give examples for how to use these things! They are pretty complicated.
     - :func:`conv3d2d <theano.tensor.nnet.conv3d2d.conv3d>`
       Another conv3d implementation that uses the conv2d with data reshaping.
       It is faster in some cases than conv3d, specifically on the GPU.
-    - `Faster conv2d <http://deeplearning.net/software/pylearn2/library/alex.html>`_
-      This is in Pylearn2, not very documented and uses a different
-      memory layout for the input. It is important to have the input
-      in the native memory layout, and not use dimshuffle on the
-      inputs, otherwise you lose most of the speed up. So this is not
-      a drop in replacement of conv2d.
-      Normally those are called from the `linear transform
-      <http://deeplearning.net/software/pylearn2/library/linear.html>`_
-      implementation.
-      Also, there is restrictions on which shape are supported.
-    - :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`
-      This is a GPU-only version of a correlation that computes correlations
-      as `caffe <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_.
-      For each element in a batch, it first creates a
-      `Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a cuda kernel.
-      Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the kernel.
-      It need extra memory equal to the size of the Toeplitz matrix. Precisely,
-      the dimensions of this 2D Toeplitz matrix is equal to
-      ``(no of channels * filter width * filter height, output width * output height)``.
-      You can enable it for call to conv2d 2d by setting ``THEANO_FLAGS=optimizer_including=conv_gemm``
-      in your environment. This is not enabled by default because it
-      uses some extra memory. MM mean matrix multiply.
 .. autofunction:: theano.tensor.nnet.conv.conv2d
+.. autofunction:: theano.sandbox.cuda.fftconv.conv2d_fft
+.. autofunction:: theano.sandbox.cuda.blas.GpuCorrMM
 .. autofunction:: theano.tensor.nnet.Conv3D.conv3D
+.. autofunction:: theano.sandbox.cuda.fftconv.conv3d_fft
 .. autofunction:: theano.tensor.nnet.conv3d2d.conv3d
-.. autofunction:: theano.sandbox.cuda.fftconv.conv2d_fft
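The GpuCorrMM documentation in this diff describes building a Toeplitz matrix of shape ``(no of channels * filter width * filter height, output width * output height)`` and multiplying it with the filters. The following is a minimal pure-Python sketch of that idea for a single image, valid mode and unit stride. The names ``im2col`` and ``corr_mm`` are illustrative and are not part of Theano; the real op does this in CUDA with a ``gemm`` call.

```python
def im2col(image, kh, kw):
    """Unroll a (channels, height, width) nested list into a 2D matrix.

    Rows index (channel, filter row, filter column); columns index output
    positions, so the result has shape (C * kh * kw, out_h * out_w).
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = height - kh + 1, width - kw + 1
    cols = []
    for c in range(channels):
        for i in range(kh):
            for j in range(kw):
                cols.append([image[c][y + i][x + j]
                             for y in range(out_h) for x in range(out_w)])
    return cols, (out_h, out_w)


def corr_mm(image, filters):
    """Valid 2D correlation of one image with a filter bank via one matrix product."""
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    cols, (out_h, out_w) = im2col(image, kh, kw)
    # Flatten each filter into one row of length C * kh * kw.
    flat = [[v for ch in f for row in ch for v in row] for f in filters]
    # The "MM" step: (n_filters, C*kh*kw) times (C*kh*kw, out_h*out_w).
    out = [[sum(a * b for a, b in zip(frow, col)) for col in zip(*cols)]
           for frow in flat]
    # Reshape each output row back to (out_h, out_w).
    return [[r[y * out_w:(y + 1) * out_w] for y in range(out_h)] for r in out]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
filters = [[[[1, 0], [0, 1]]]]                # 1 filter, 1 channel, 2x2
cols, out_shape = im2col(image, 2, 2)
result = corr_mm(image, filters)
```

Here the unrolled matrix has 1 * 2 * 2 = 4 rows and 2 * 2 = 4 columns, which is the extra memory the documentation refers to, per batch element.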
theano/sandbox/cuda/blas.py
Diff collapsed.
theano/sandbox/cuda/caffe_common.hpp
deleted
100644 → 0
/*
Copyright (c) 2014, The Regents of the University of California (Regents)
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef CAFFE_COMMON_HPP_
#define CAFFE_COMMON_HPP_
#include <cublas_v2.h>
#include <cuda.h>
#include <driver_types.h> // cuda driver types
// CUDA: thread number configuration.
// Use 1024 threads per block, which requires cuda sm_2x or above,
// or fall back to attempt compatibility (best of luck to you).
#if __CUDA_ARCH__ >= 200
const int CAFFE_CUDA_NUM_THREADS = 1024;
#else
const int CAFFE_CUDA_NUM_THREADS = 512;
#endif
// CUDA: number of blocks for threads.
inline int CAFFE_GET_BLOCKS(const int N) {
  return (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
}
#endif // CAFFE_COMMON_HPP_
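The deleted ``CAFFE_GET_BLOCKS`` helper is a ceiling division: it returns the smallest number of blocks of ``CAFFE_CUDA_NUM_THREADS`` threads that covers ``N`` work items. A small Python sketch of the same arithmetic, with the function name adapted for illustration:

```python
# Value the header uses for sm_2x and above; older devices used 512.
CAFFE_CUDA_NUM_THREADS = 1024


def caffe_get_blocks(n, threads=CAFFE_CUDA_NUM_THREADS):
    """Number of thread blocks needed so that blocks * threads >= n."""
    return (n + threads - 1) // threads


blocks_small = caffe_get_blocks(1)      # a single item still needs one block
blocks_exact = caffe_get_blocks(1024)   # exactly one full block
blocks_over = caffe_get_blocks(1025)    # one extra item spills into a second block
```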
theano/sandbox/cuda/conv_gemm.cu
Diff collapsed.
theano/sandbox/cuda/opt.py
@@ -25,7 +25,8 @@ from theano.sandbox.cuda.basic_ops import (
     GpuIncSubtensor, gpu_alloc, GpuAlloc, gpu_shape)
 from theano.sandbox.cuda.type import CudaNdarrayType
 from theano.sandbox.cuda.blas import (gpu_dot22, gpu_dot22scalar,
-        gpu_gemm_inplace, gpu_gemm_no_inplace, GpuConv, GpuCorrMM)
+        gpu_gemm_inplace, gpu_gemm_no_inplace, GpuConv,
+        GpuCorrMM, GpuCorrMM_gradInputs, GpuCorrMM_gradWeights)
 from theano.sandbox.cuda.blas import gpu_gemv_inplace
 from theano.sandbox.cuda.blas import gpu_gemv_no_inplace
 from theano.sandbox.cuda.blas import gpu_ger_inplace
@@ -1121,6 +1122,8 @@ def local_gpu_conv(node):
                       version=op.version,
                       verbose=op.verbose,
                       imshp=op.imshp,
+                      nkern=op.nkern,
+                      bsize=op.bsize,
                       fft_opt=op.fft_opt
                       )
         if op.imshp_logical is not None:
@@ -1206,15 +1209,25 @@ def _gpu_conv_to_fftconv(node):
             node.op.imshp[-1] is not None and
             node.op.imshp[-1] % 2 == 1):
         kwargs['pad_last_dim'] = True
-    # TODO: If the user supplied the full nonsymbolic image_shape and
-    # filter_shape in conv2d(), we could pass it on to conv2d_fft(). However,
-    # information on batch size and channel counts is currently discarded
-    # when a ConvOp is replaced by a GpuConv, so this would need more changes.
-    #if (node.op.imshp is not None) and (None not in node.op.imshp):
-    #    kwargs['image_shape'] = (bsize, inchannels) + node.op.imshp
-    #if (node.op.kshp is not None) and (None not in node.op.kshp):
-    #    kwargs['filter_shape'] = (outchannels, inchannels) + node.op.kshp
-    return conv2d_fft(node.inputs[0], node.inputs[1], **kwargs)
+    # If the user supplied the full nonsymbolic image_shape and
+    # filter_shape in conv2d(), we can pass it on to conv2d_fft().
+    if ((node.op.imshp is not None) and
+            (len(node.op.imshp) == 3) and
+            (None not in node.op.imshp) and
+            (node.op.bsize is not None)):
+        kwargs['image_shape'] = (node.op.bsize,) + node.op.imshp
+    if ((node.op.kshp is not None) and
+            (None not in node.op.kshp) and
+            (node.op.nkern is not None) and
+            (len(node.op.imshp) == 3) and
+            (node.op.imshp[0] is not None)):
+        kwargs['filter_shape'] = (node.op.nkern, node.op.imshp[0]) + node.op.kshp
+    rval = conv2d_fft(node.inputs[0], node.inputs[1], **kwargs)
+    if ('image_shape' in kwargs) or ('filter_shape' in kwargs):
+        # With given shape information, conv2d_fft may return a different
+        # broadcast pattern than GpuConv. This is forbidden, so we fix it.
+        rval = tensor.patternbroadcast(rval, node.outputs[0].type.broadcastable)
+    return rval
 @local_optimizer([GpuConv])
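To see which shape information the new ``_gpu_conv_to_fftconv`` code forwards, the sketch below replays the two conditions on a stand-in object. ``FakeConvOp`` and ``fft_shape_kwargs`` are hypothetical names used only for this illustration, and the extra ``op.imshp is not None`` guard in the second condition is an addition of the sketch, not something the diff contains.

```python
from collections import namedtuple

# Hypothetical stand-in carrying the attributes the optimizer reads from GpuConv.
FakeConvOp = namedtuple("FakeConvOp", "imshp kshp nkern bsize")


def fft_shape_kwargs(op):
    """Collect image_shape / filter_shape only when every entry is known."""
    kwargs = {}
    if (op.imshp is not None and len(op.imshp) == 3 and
            None not in op.imshp and op.bsize is not None):
        # (batch size, input channels, rows, columns)
        kwargs['image_shape'] = (op.bsize,) + tuple(op.imshp)
    if (op.kshp is not None and None not in op.kshp and
            op.nkern is not None and
            op.imshp is not None and  # guard added in this sketch only
            len(op.imshp) == 3 and op.imshp[0] is not None):
        # (output channels, input channels, filter rows, filter columns)
        kwargs['filter_shape'] = (op.nkern, op.imshp[0]) + tuple(op.kshp)
    return kwargs


full = fft_shape_kwargs(FakeConvOp(imshp=(3, 32, 32), kshp=(5, 5), nkern=16, bsize=64))
no_batch = fft_shape_kwargs(FakeConvOp(imshp=(3, 32, 32), kshp=(5, 5), nkern=16, bsize=None))
```

With the batch size unknown, only ``filter_shape`` is passed on, since it depends on the channel count but not on the batch size.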
@@ -1351,10 +1364,55 @@ def local_conv_gemm(node):
     if (isinstance(node.op, GpuConv) and
             node.op.border_mode in ['full', 'valid']):
         img, kern = node.inputs
-        img = gpu_contiguous(img)
-        kern = kern[:, :, ::-1, ::-1]
-        kern = gpu_contiguous(kern)
-        return [GpuCorrMM(node.op.border_mode, node.op.subsample)(img, kern)]
+        border_mode = node.op.border_mode
+        subsample = node.op.subsample
+        pad = (0, 0)
+        if (border_mode == 'full') and (subsample != (1, 1)):
+            # need to simulate this via a padded valid convolution
+            pad = 'full'
+            border_mode = 'valid'
+        if (border_mode == 'valid'):
+            # need to flip the kernel for valid convolution
+            kern = kern[:, :, ::-1, ::-1]
+            # call GpuCorrMM or GpuCorrMM_gradWeights
+            # (the latter is faster if batchsize * kernelHeight * kernelWidth
+            # is larger than inputChannels * outputHeight * outputWidth.
+            # GpuConv does not always store information on the batchsize and
+            # channels, though, so we only use what information we have.)
+            if ((subsample == (1, 1)) and
+                    (node.op.imshp is not None) and
+                    (None not in node.op.imshp[-2:]) and
+                    (node.op.kshp is not None) and
+                    (None not in node.op.kshp)):
+                # we know the kernel and output size
+                prod1 = node.op.kshp[0] * node.op.kshp[1]
+                prod2 = ((node.op.imshp[-2] - node.op.kshp[0] + 1) *
+                         (node.op.imshp[-1] - node.op.kshp[1] + 1))
+                if ((node.op.bsize is not None) and
+                        (len(node.op.imshp) == 3) and
+                        (node.op.imshp[0] is not None)):
+                    # we also know batchsize and input channels
+                    prod1 *= node.op.bsize
+                    prod2 *= node.op.imshp[0]
+                # compare to decide
+                if prod1 > prod2:
+                    # (we need to wrap the result in as_cuda_ndarray_variable,
+                    # because we are not allowed to replace a CudaNdarray with
+                    # a DimShuffle instance in a graph optimization)
+                    return [theano.sandbox.cuda.as_cuda_ndarray_variable(
+                            GpuCorrMM_gradWeights('valid', subsample, pad)(
+                                gpu_contiguous(img.dimshuffle(1, 0, 2, 3)),
+                                gpu_contiguous(kern.dimshuffle(1, 0, 2, 3))
+                            ).dimshuffle(1, 0, 2, 3))]
+            # use GpuCorrMM if we did not choose GpuCorrMM_gradWeights above
+            return [GpuCorrMM('valid', subsample, pad)(
+                    gpu_contiguous(img), gpu_contiguous(kern))]
+        elif (border_mode == 'full'):
+            # need to dimshuffle the kernel for full convolution
+            kern = kern.dimshuffle(1, 0, 2, 3)
+            # call GpuCorrMM_gradInputs
+            return [GpuCorrMM_gradInputs('valid', subsample, pad)(
+                    gpu_contiguous(kern), gpu_contiguous(img))]
 gpu_optimizer.register("conv_gemm", local_conv_gemm)
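The comment in ``local_conv_gemm`` says GpuCorrMM_gradWeights is preferred when batchsize * kernelHeight * kernelWidth exceeds inputChannels * outputHeight * outputWidth, using only the shape information that is available. A minimal sketch of that comparison as a standalone function; ``prefers_grad_weights`` is a hypothetical name, and the arguments mirror the ``kshp``, ``imshp`` and ``bsize`` attributes read in the diff.

```python
def prefers_grad_weights(kshp, imshp, bsize=None):
    """Return True when the heuristic would pick GpuCorrMM_gradWeights.

    kshp  -- (kernel rows, kernel columns)
    imshp -- (rows, columns) or (input channels, rows, columns)
    bsize -- batch size, or None if unknown
    """
    # Kernel area versus output area of a valid correlation with unit stride.
    prod1 = kshp[0] * kshp[1]
    prod2 = (imshp[-2] - kshp[0] + 1) * (imshp[-1] - kshp[1] + 1)
    if bsize is not None and len(imshp) == 3 and imshp[0] is not None:
        # Batch size and input channels are also known, so include them.
        prod1 *= bsize
        prod2 *= imshp[0]
    return prod1 > prod2


small_kernel = prefers_grad_weights((5, 5), (3, 32, 32), bsize=64)    # 1600 vs 2352
large_kernel = prefers_grad_weights((30, 30), (3, 32, 32), bsize=64)  # 57600 vs 27
```

Small kernels on large images stay with GpuCorrMM, while kernels nearly as large as the image switch to the gradWeights formulation.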
...
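The strided 'full' case in ``local_conv_gemm`` is handled as a padded valid convolution with a flipped kernel. The pure-Python sketch below checks the underlying identity for one channel and unit stride: a full convolution equals a valid correlation of the zero-padded image with the flipped kernel. All function names are illustrative and none of them belong to Theano.

```python
def correlate_valid(img, kern):
    """Valid-mode 2D correlation of nested lists (no kernel flip)."""
    kh, kw = len(kern), len(kern[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[y + i][x + j] * kern[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)] for y in range(out_h)]


def flip(kern):
    """Reverse both spatial axes, like kern[::-1, ::-1] on an array."""
    return [row[::-1] for row in kern[::-1]]


def zero_pad(img, ph, pw):
    """Surround img with ph zero rows and pw zero columns on every side."""
    width = len(img[0]) + 2 * pw
    top = [[0] * width for _ in range(ph)]
    body = [[0] * pw + list(row) + [0] * pw for row in img]
    bottom = [[0] * width for _ in range(ph)]
    return top + body + bottom


def convolve_full_direct(img, kern):
    """Reference full convolution by scattering each input pixel."""
    h, w, kh, kw = len(img), len(img[0]), len(kern), len(kern[0])
    out = [[0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for y in range(h):
        for x in range(w):
            for i in range(kh):
                for j in range(kw):
                    out[y + i][x + j] += img[y][x] * kern[i][j]
    return out


def convolve_full_via_valid(img, kern):
    """Full convolution as padded valid correlation with a flipped kernel."""
    kh, kw = len(kern), len(kern[0])
    return correlate_valid(zero_pad(img, kh - 1, kw - 1), flip(kern))


img = [[1, 2, 3], [4, 5, 6]]
kern = [[1, 2], [3, 4]]
direct = convolve_full_direct(img, kern)
via_valid = convolve_full_via_valid(img, kern)
```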
theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py
Diff collapsed.