testgroup / pytensor / Commits / 72a7214a

Commit 72a7214a, authored Aug 21, 2012 by lamblin

Merge pull request #863 from nouiz/mixed2

Mixed2

Parents: 7ebae191, 43b81a93

Showing 9 changed files with 227 additions and 261 deletions (+227, -261)
NEWS.txt (+1, -142)
bin/theano-nose (+14, -0)
theano/gof/compiledir.py (+1, -1)
theano/sandbox/cuda/basic_ops.py (+141, -79)
theano/sandbox/cuda/cuda_ndarray.cu (+35, -23)
theano/sandbox/cuda/nvcc_compiler.py (+15, -1)
theano/scan_module/tests/test_scan.py (+13, -12)
theano/tensor/__init__.py (+2, -0)
theano/tensor/extra_ops.py (+5, -3)
NEWS.txt — View file @ 72a7214a

@@ -2,148 +2,7 @@
Updates in the Trunk since the last release:
Bug fixes
* Outputs of Scan nodes could contain corrupted values: some parts of the
output would be repeated a second time, instead of the correct values.
It happened randomly, and quite infrequently, but the bug has been present
(both in Python and Cython) since April 2011. (Pascal L.)
* In Sparse sandbox, fix the grad of theano.sparse.sandbox.sp.row_scale.
It did not return the right number of elements. (Frederic B.)
* set_subtensor(x[int vector], new_value) when moved to the GPU
was transformed into inc_subtensor on the GPU. Now we have a correct
(but slow) GPU implementation.
Note 1: set_subtensor(x[slice[,...]], new_value) was working correctly
in all cases as well as inc_subtensor(*, *).
Note 2: If your code was affected by the incorrect behavior, we now print
a warning by default (Frederic B.)
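In NumPy terms the distinction between the two operations can be sketched as follows (this is an illustration of the semantics, not Theano's actual implementation):

```python
import numpy as np

x = np.zeros(5, dtype="float32")
idx = np.array([1, 3, 3])  # integer vector, with a repeated index

# set_subtensor(x[idx], 1): plain assignment; a repeated index just overwrites
set_result = x.copy()
set_result[idx] = 1.0

# inc_subtensor(x[idx], 1): accumulation; repeated indices must add up,
# so `+=` is not enough -- np.add.at handles the duplicates correctly
inc_result = x.copy()
np.add.at(inc_result, idx, 1.0)
```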
* Fixed an issue whereby config values were used as default arguments,
with those defaults then stuck at old values if the config variables were
changed during program execution. (David W-F)
* Fixed many subtle bugs involving mutable default arguments which may have
led to unexpected behaviour, such as objects sharing instance variables
they were not supposed to share. (David W-F)
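The class of bug described in the two bullets above is the standard Python mutable-default pitfall; a minimal illustration (not Theano code):

```python
def broken(item, bucket=[]):      # the SAME list object is reused across calls
    bucket.append(item)
    return bucket

def fixed(item, bucket=None):     # create a fresh list on each call instead
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

a = broken(1)
b = broken(2)      # a and b are the same object, now [1, 2]
c = fixed(1)
d = fixed(2)       # independent lists
```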
* Correctly record the GPU device number used when we let the driver select it.
(Frederic B.)
Documentation
* Added documentation in the tutorial on how to extend Theano.
This explains how to make a Theano Op from a Python function.
http://deeplearning.net/software/theano/tutorial/extending_theano.html
(Frédéric B.)
* New installation instructions for Windows using EPD (Pascal L.)
Interface changes
* In 0.5, we removed the deprecated sharedvar.value property.
Now we raise an error if you access it. (Frederic B.)
* theano.function does not accept duplicate inputs, so function([x, x], ...)
does not work anymore. (Pascal L.)
* theano.function now raises an error if some of the provided inputs are
not part of the computational graph needed to compute the output, for
instance, function([x, y], [y]). You can use the kwarg
``on_unused_input={'raise', 'warn', 'ignore'}`` to control this.
(Pascal L.)
* New Theano flag "on_unused_input" that defines the default value of the
previous point. (Frederic B.)
* tensor.alloc() now raises an error at graph build time
when we try to create fewer dimensions than the number of dimensions
the provided value has. In the past, the error was raised at run time.
(Frederic B.)
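The rule being enforced is the usual broadcasting rule: a value may be broadcast into more dimensions, never fewer. A NumPy sketch of the same check (an analogy, not the Theano code):

```python
import numpy as np

value = np.ones((2, 3))

# Broadcasting a value into MORE dimensions is fine...
out = np.broadcast_to(value, (4, 2, 3))

# ...but asking for FEWER dimensions than the value has is an error,
# and it is better to detect this at graph-build time than at run time.
try:
    np.broadcast_to(value, (3,))
    failed = False
except ValueError:
    failed = True
```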
Speed up
* Convolution on the GPU now checks the generation of the card to make
it faster in some cases (especially medium/big output images) (Frédéric B.)
(We hardcoded 512 as the maximum number of threads per block. Newer cards
support up to 1024 threads per block.)
* CPU convolutions are now parallelized (Frédéric B.)
By default, all cores/hyper-threads are used.
To control this, use the OMP_NUM_THREADS=N environment variable.
New Features
* debugprint new param ids=["CHAR", "id", "int", ""]
This makes the identifier printed to be the python id, a unique char, a
unique int, or not have it printed. We changed the default to be "CHAR"
as this is more readable. (Frederic B.)
* debugprint new param stop_on_name=[False, True]. If True, we don't print
anything below an intermediate variable that has a name. Defaults to False.
(Frederic B.)
* debugprint no longer prints the "|" symbol in a column after the last input. (Frederic B.)
* If you use Enthought Python Distribution (EPD) now we use its blas
implementation by default (tested on Linux and Windows)
(Frederic B., Simon McGregor)
* MRG random now raises an error with a clear message when the passed shape
contains a dimension with a bad value, such as 0. (Frédéric B., reported by Ian G.)
* "CudaNdarray[*] = ndarray" works in more cases (Frederic B.)
* "CudaNdarray[*] += ndarray" works in more cases (Frederic B.)
* We add dimensions to CudaNdarray to automatically broadcast more frequently.
(Frederic B.)
* theano.tensor.argsort that wraps numpy.argsort (Hani Almousli).
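For reference, the NumPy behavior that the new op wraps:

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0])
order = np.argsort(x)   # indices that would sort x
sorted_x = x[order]     # applying them yields the sorted array
```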
* New theano flag cmodule.warn_no_version. Default False. If True,
will print a warning when compiling one or more Op with C code that
can't be cached because there is no c_code_cache_version() function
associated to at least one of those Ops. (Frederic B.)
* CPU alloc now always generates C code (Pascal L.)
* New Theano flag cmodule.warn_no_version=False. When True, warn when an Op
with C code is not versioned (which forces recompiling it every time).
(Frédéric B.)
* Made a few Ops with C code versioned to reduce compilation time.
(Frédéric B, Pascal L.)
* C code reuses preallocated outputs (only done by Scan) (Pascal L.)
* Garbage collection of intermediate results during Theano function calls
for Ops with C code (Pascal L.)
* The Theano flag compiledir_format now supports the parameter numpy_version.
* Theano GPU variables, shared variables and constants now support <, <=,
> and >=, like those not on the GPU.
Sparse
* Implement theano.sparse.mul(sparse1, sparse2) when both inputs don't
have the same sparsity pattern. (Frederic B.)
Sparse Sandbox graduate
* Remove0 op: it removes stored elements with value 0. (Frederic B.)
Sparse Sandbox Additions (not reviewed/documented/tested, but used by some people)
* They are all in the theano.sparse.sandbox.sp2 module
* Op class: Cast, Poisson, Multinomial, EliminateZeros, Sum, Binomial
* Op class: SamplingDot, SamplingDotCsr (inserted automatically)
* Op function: structured_sigmoid, structured_exp, structured_pow, structured_minimum
* Op class: StructuredAddSV, StrucutedAddSVCSR (inserted automatically)
* opt: local_sampling_dot_csr, local_structured_add_s_v
Internal changes
* Define new exceptions MissingInputError and UnusedInputError, and use them
in theano.function, instead of TypeError and ValueError. (Pascal L.)
* Better handling of bitwidth and max values of integers and pointers
across platforms (Pascal L.)
Crash Fix
* Do not try to use the BLAS library when blas.ldflags is manually set to an
empty string (Frederic B.)
* When importing theano on a computer without GPU with the Theano
flags 'device' or 'init_gpu_device' set to gpu* (Frederic B., reported by Luo Heng)
* Optimization printed a useless error when scipy was not available. (Frederic B.)
* GPU conv crash/slowdown on newer hardware (James B.)
* Better error handling in GPU conv (Frederic B.)
* GPU optimization that moves element-wise Ops to the GPU. Crash happened in
a particular execution order of this optimization and the
element-wise fusion optimization when upcasting some inputs to
float32 (to compute them on the GPU).
(Frederic B., reported by Sander Dieleman)
* GpuReshape in some particular case when the input is not contiguous
(Frederic B., reported by Sander Dieleman)
* GpuSoftmaxWithBias with shape (0, N) with N > 1.
(Frédéric B., reported by Razvan P.)
* Fix crash under 64-bit Windows, when taking subtensors of the form a[n:]
(Pascal L., reported by Simon McGregor)
* Fixed issue with the MaxAndArgmax Op not properly preserving broadcastable
dimensions, which could typically result in optimization crashes (Olivier D.)
* Fixed crash when concatenating some arrays with specific broadcasting
patterns (Olivier D.)
* Work around a known issue with nvcc 4.1 on MacOS X. (Graham Taylon)
* In advanced indexing, if some inputs are constant, no need to call constant(...)
on their value any more. (Pascal L., reported by John Salvatier)
* Fix crash on GPU when GpuSubtensor didn't set the right stride
when the result tensor had a dimension of size 1. (Pascal L.,
reported by Graham T.)
https://github.com/Theano/Theano/wiki/Devnews
=============
Release Notes
...
...
bin/theano-nose — View file @ 72a7214a
@@ -26,6 +26,9 @@
 with the option time_profile=True to conduct time-profiling of the tests.
 option will be interpreted as an indication of the number of tests to be run
 between notifications of progress to standard output.
+If the '--theano' option is used, it is replaced with the path to theano.
+Useful if you don't know where it was installed.
+`run_tests_in_batch.py` will in turn call back this script in another process.
 """
...
...
@@ -39,6 +42,12 @@
 import sys
 from nose.plugins import Plugin


 def main():
+    # Handle the --theano arguments
+    if "--theano" in sys.argv:
+        i = sys.argv.index("--theano")
+        import theano
+        sys.argv[i] = theano.__path__[0]
+
     # Handle --batch[=n] arguments
     batch_args = [arg for arg in sys.argv if arg.startswith('--batch')]
     for arg in batch_args:
...
...
@@ -137,6 +146,11 @@ def help():
     --without-knownfailure: Do not load the KnownFailure plugin.
+    --theano: This parameter is replaced with the path to the theano library.
+              As theano-nose is a wrapper to nosetests, it expects a path
+              to the tests to run.
+              If you don't know where theano is installed, use this option
+              to have it inserted automatically.

     The other options will be passed to nosetests, see ``nosetests -h``.
     """
...
...
theano/gof/compiledir.py — View file @ 72a7214a
@@ -37,7 +37,7 @@
 compiledir_format_dict = {"platform": platform.platform(),
                           "python_version": platform.python_version(),
                           "theano_version": theano.__version__,
                           "numpy_version": numpy.__version__,
-                          "g++": gcc_version_str.replace(" ", "_"),
+                          "gxx_version": gcc_version_str.replace(" ", "_"),
                           }
 compiledir_format_keys = ", ".join(compiledir_format_dict.keys())
 default_compiledir_format = \
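The compiledir_format mechanism above is plain %-style string interpolation over that dict. A self-contained sketch (the version strings below are made-up stand-ins for the real gcc/numpy values):

```python
import platform

# Hypothetical values standing in for the real detected versions
compiledir_format_dict = {
    "platform": platform.platform(),
    "python_version": platform.python_version(),
    "numpy_version": "1.6.2",
    "gxx_version": "gcc 4.6.3".replace(" ", "_"),  # spaces are not path-safe
}

fmt = "compiledir_%(platform)s-%(python_version)s-%(numpy_version)s"
compiledir_name = fmt % compiledir_format_dict
```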
theano/sandbox/cuda/basic_ops.py — View file @ 72a7214a
@@ -11,7 +11,7 @@
 from theano import tensor, scalar, config
 from theano.gof.python25 import all, any
-from theano.sandbox.cuda import GpuOp
+from theano.sandbox.cuda import GpuOp, device_properties
 from theano.sandbox.cuda.type import CudaNdarrayType
 from theano.sandbox.cuda import filter as type_support_filter

@@ -641,7 +641,9 @@ class GpuSum(GpuOp):
                 printf("running kernel_reduce_sum_%(pattern)s_%(name)s\\n");
             int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
             if (verbose>1)
-                printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d, nb_block=%%d, n_shared=%%d\\n",
+                printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d,"
+                       " nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d,"
+                       " nb_block=%%d, n_shared=%%d\\n",
                        n_threads.x,n_threads.y,n_threads.z,
                        n_threads.x*n_threads.y*n_threads.z,
                        n_blocks.x,n_blocks.y,
...
...
@@ -673,7 +675,8 @@ class GpuSum(GpuOp):
             if (cudaSuccess != sts)
             {
                 PyErr_Format(PyExc_RuntimeError,
-                    "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_%(pattern)s_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,

@@ -876,7 +879,8 @@ class GpuSum(GpuOp):
                          std::min(CudaNdarray_SIZE(%(x)s),
                                   NUM_VECTOR_OP_THREADS_PER_BLOCK));
             dim3 n_blocks(1);
-            if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s n_threads.x=%%d, size=%%d, ndim=%%d\\n",
+            if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s"
+                                " n_threads.x=%%d, size=%%d, ndim=%%d\\n",
                                 n_threads.x,CudaNdarray_SIZE(%(x)s),%(x)s->nd);
             int n_shared = sizeof(float) * n_threads.x;
             kernel_reduce_sum_ccontig_%(name)s<<<n_blocks, n_threads, n_shared>>>(

@@ -887,7 +891,9 @@ class GpuSum(GpuOp):
             cudaError_t sts = cudaGetLastError();
             if (cudaSuccess != sts)
             {
-                PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                PyErr_Format(PyExc_RuntimeError,
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_ccontig_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,
...
...
@@ -937,11 +943,13 @@ class GpuSum(GpuOp):
         :param N: the number of 1 in the pattern N=1 -> 01, N=2 -> 011 N=3 ->0111
         Work for N=1,2,3
         """
-        assert N in [1,2,3]
+        assert N in [1, 2, 3]
         makecall = self._makecall(node, name, x, z, fail)
-        N_pattern = ''.join(['1'] * N)
-        param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals() for i in xrange(N + 1)])
-        strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]" % locals() for i in xrange(N + 1)])
+        N_pattern = ''.join(['1'] * N)
+        param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals()
+                              for i in xrange(N + 1)])
+        strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]"
+                                % locals() for i in xrange(N + 1)])
         threads_y = """
         //get as many y threads as we can fit
         while (n_threads.x * (n_threads.y+1) <= NUM_VECTOR_OP_THREADS_PER_BLOCK)
...
...
@@ -962,10 +970,10 @@ class GpuSum(GpuOp):
                 break;
             }
         """ % locals()
-        if len(self.reduce_mask)==2:
+        if len(self.reduce_mask) == 2:
             threads_y = ''
             threads_z = ''
-        if len(self.reduce_mask)==3:
+        if len(self.reduce_mask) == 3:
             threads_z = ''
         print >> sio, """
         {

@@ -975,15 +983,18 @@ class GpuSum(GpuOp):
                 NUM_VECTOR_OP_THREADS_PER_BLOCK));
             %(threads_y)s
             %(threads_z)s
-            dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS));
+            dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
+                                   NUM_VECTOR_OP_BLOCKS));
             %(makecall)s
         }
         """ % locals()

     def c_code_reduce_01(self, sio, node, name, x, z, fail):
         self.c_code_reduce_01X(sio, node, name, x, z, fail, 1)

     def c_code_reduce_011(self, sio, node, name, x, z, fail):
         self.c_code_reduce_01X(sio, node, name, x, z, fail, 2)

     def c_code_reduce_0111(self, sio, node, name, x, z, fail):
         self.c_code_reduce_01X(sio, node, name, x, z, fail, 3)
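c_code_reduce_01X covers the patterns 01, 011 and 0111: every axis marked 1 in reduce_mask is summed away. In NumPy terms the semantics can be sketched as (an analogy for the CUDA kernels above, not their implementation):

```python
import numpy as np

def reduce_by_mask(x, reduce_mask):
    """Sum x over every axis whose reduce_mask entry is 1 (GpuSum semantics)."""
    axes = tuple(i for i, m in enumerate(reduce_mask) if m)
    return x.sum(axis=axes)

x = np.arange(24.0).reshape(2, 3, 4)
out = reduce_by_mask(x, (0, 1, 1))   # pattern 011: only axis 0 survives
```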
...
...
@@ -1021,7 +1032,9 @@ class GpuSum(GpuOp):
             cudaError_t sts = cudaGetLastError();
             if (cudaSuccess != sts)
             {
-                PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                PyErr_Format(PyExc_RuntimeError,
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_010_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,

@@ -1033,9 +1046,11 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()

     def c_code_reduce_010(self, sio, node, name, x, z, fail):
         makecall = self._makecall(node, name, x, z, fail)
-        makecall_inner = self._makecall(node, name, x, z, fail, pattern="010_inner")
+        makecall_inner = self._makecall(node, name, x, z, fail,
+                                        pattern="010_inner")
         pattern = ''.join(str(i) for i in self.reduce_mask)
         print >> sio, """
         {
...
...
@@ -1085,7 +1100,9 @@ class GpuSum(GpuOp):
             cudaError_t sts = cudaGetLastError();
             if (cudaSuccess != sts)
             {
-                PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                PyErr_Format(PyExc_RuntimeError,
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_010_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,

@@ -1233,6 +1250,7 @@ class GpuSum(GpuOp):
             %(makecall)s
         }
         """ % locals()

     def c_code_reduce_111(self, sio, node, name, x, z, fail):
         makecall = self._makecall(node, name, x, z, fail)
         print >> sio, """

@@ -1275,7 +1293,8 @@ class GpuSum(GpuOp):
                 std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
                          NUM_VECTOR_OP_BLOCKS));
-            while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS && n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
+            while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS &&
+                   n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
             {
                 n_blocks.y += 1;
             }
...
...
@@ -1356,7 +1375,7 @@ class GpuSum(GpuOp):
     def c_support_code_apply(self, node, nodename):
         sio = StringIO.StringIO()
         nd_in = len(self.reduce_mask)
-        if all(i==1 for i in self.reduce_mask):
+        if all(i == 1 for i in self.reduce_mask):
             #this kernel is ok for up to a few thousand elements, but
             # it only runs on ONE multiprocessor
             reducebuf = self._k_reduce_buf('Z[0]')

@@ -1411,7 +1430,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (1,1):
+        if self.reduce_mask == (1, 1):
             #this kernel is ok for up to a few thousand elements, but
             # it only runs on ONE multiprocessor
             reducebuf = self._k_reduce_buf('Z[0]')
...
...
@@ -1444,29 +1463,33 @@ class GpuSum(GpuOp):
         }
         """ % locals()
         #01, 011, 0111
-        if 0 == self.reduce_mask[0] and all(self.reduce_mask[1:]) and nd_in in [2, 3, 4]:
+        if (0 == self.reduce_mask[0] and
+            all(self.reduce_mask[1:]) and
+            nd_in in [2, 3, 4]):
             # this kernel uses one block for each row.
             # threads per block for each element per row.
-            N_pattern = ''.join(['1'] * (nd_in - 1))
-            if nd_in == 2:
+            N_pattern = ''.join(['1'] * (nd_in - 1))
+            if nd_in == 2:
                 for_i1 = "for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)"
-                for_i2 = "int i2=0, sA2=0;"
-                for_i3 = "int i3=0, sA3=0;"
-            if nd_in == 3:
+                for_i2 = "int i2=0, sA2=0;"
+                for_i3 = "int i3=0, sA3=0;"
+            if nd_in == 3:
                 for_i1 = "for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)"
                 for_i2 = "for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x)"
-                for_i3 = "int i3=0, sA3=0;"
-            if nd_in == 4:
+                for_i3 = "int i3=0, sA3=0;"
+            if nd_in == 4:
                 for_i1 = "for (int i1 = threadIdx.z; i1 < d1; i1 += blockDim.z)"
                 for_i2 = "for (int i2 = threadIdx.y; i2 < d2; i2 += blockDim.y)"
                 for_i3 = "for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)"

             reducebuf = self._k_reduce_buf('Z[i0 * sZ0]')
-            param_dim = ",".join(["const int d%(i)s" % locals() for i in xrange(nd_in)])
-            param_strides = ",".join(["const int sA%(i)s" % locals() for i in xrange(nd_in)])
-            decl = self._k_decl(node, nodename)
-            init = self._k_init(node, nodename)
+            param_dim = ",".join(["const int d%(i)s" % locals()
+                                  for i in xrange(nd_in)])
+            param_strides = ",".join(["const int sA%(i)s" % locals()
+                                      for i in xrange(nd_in)])
+            decl = self._k_decl(node, nodename)
+            init = self._k_init(node, nodename)
             print >> sio, """
             %(decl)s{
                 %(init)s
...
...
@@ -1484,7 +1507,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (0,1,0) or self.reduce_mask == (1,0):
+        if self.reduce_mask == (0, 1, 0) or self.reduce_mask == (1, 0):
             # this kernel uses one block for each column,
             # threads per block for each element per column.

@@ -1497,7 +1520,8 @@ class GpuSum(GpuOp):
                 const int d0,
                 const int d1,
                 const int d2,
-                const float *A, const int sA0, const int sA1, const int sA2,
+                const float *A, const int sA0,
+                const int sA1, const int sA2,
                 float * Z, const int sZ0, const int sZ1)
             {
                 const int threadCount = blockDim.x;

@@ -1525,7 +1549,7 @@ class GpuSum(GpuOp):
             }
         """ % locals()
-        if self.reduce_mask == (0,1,0):
+        if self.reduce_mask == (0, 1, 0):
             print >> sio, """
             static __global__ void kernel_reduce_sum_010_AD_%(nodename)s(
                 const int A,

@@ -1533,7 +1557,8 @@ class GpuSum(GpuOp):
                 const int C,
                 const int D,
                 //const int E, // THIS is 32
-                const float *X, const int sX0, const int sX1, const int sX2,
+                const float *X, const int sX0,
+                const int sX1, const int sX2,
                 float * Z, const int sZ0, const int sZ1)
             {
                 const int threadCount = blockDim.x;

@@ -1564,9 +1589,10 @@ class GpuSum(GpuOp):
             }
         """ % locals()
-        if self.reduce_mask == (0,1,0):
+        if self.reduce_mask == (0, 1, 0):
             #
-            # This kernel is optimized when the inner most dimensions have the smallest stride.
+            # This kernel is optimized when the inner most dimensions
+            # have the smallest stride.
             # this kernel uses one block for multiple column(up to 32TODO),
             # threads per block for each element per column.
...
...
@@ -1575,10 +1601,12 @@ class GpuSum(GpuOp):
             #thread.y = dim 1
             #block.x = dim 0
             #block.y = dim 1 rest
-            init = self._k_init(node, nodename)
+            init = self._k_init(node, nodename)
             decl = self._k_decl(node, nodename, pattern="010_inner")
-            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]', 'blockDim.x')
-            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]', 'blockDim.x')
+            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
+                                                    'blockDim.x')
+            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
+                                                    'blockDim.x')
             print >> sio, """
             %(decl)s
             {

@@ -1602,7 +1630,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (1,1,0):
+        if self.reduce_mask == (1, 1, 0):
             # this kernel uses one block for each column,
             # threads per block for each element per column.
...
...
@@ -1615,7 +1643,8 @@ class GpuSum(GpuOp):
                 const int d0,
                 const int d1,
                 const int d2,
-                const float *A, const int sA0, const int sA1, const int sA2,
+                const float *A, const int sA0,
+                const int sA1, const int sA2,
                 float * Z, const int sZ0)
             {
                 const int threadCount = blockDim.x * blockDim.y;

@@ -1642,7 +1671,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (1,0,0):
+        if self.reduce_mask == (1, 0, 0):
             reducebuf = self._k_reduce_buf('Z[i1 * sZ0 + i2 * sZ1]')
             decl = self._k_decl(node, nodename)
             init = self._k_init(node, nodename)

@@ -1664,7 +1693,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (1,1,1):
+        if self.reduce_mask == (1, 1, 1):
             reducebuf = self._k_reduce_buf('Z[0]')
             decl = self._k_decl(node, nodename)
             init = self._k_init(node, nodename)
...
...
@@ -1686,7 +1715,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (0,0,1):
+        if self.reduce_mask == (0, 0, 1):
             # this kernel uses one block for each row,
             # threads per block for each element per row.
             reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')

@@ -1695,7 +1724,8 @@ class GpuSum(GpuOp):
                 const int d0,
                 const int d1,
                 const int d2,
-                const float *A, const int sA0, const int sA1, const int sA2,
+                const float *A, const int sA0,
+                const int sA1, const int sA2,
                 float * Z, const int sZ0, const int sZ1)
             {
                 const int threadCount = blockDim.x;

@@ -1721,7 +1751,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (0,0,1,1):
+        if self.reduce_mask == (0, 0, 1, 1):
             # this kernel uses one block for each row,
             # threads per block for each element per row.
             reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')
...
...
@@ -1749,7 +1779,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (0,1,0,1):
+        if self.reduce_mask == (0, 1, 0, 1):
             # this kernel uses one block for each row,
             # threads per block for each element per row.
             reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]')

@@ -1777,7 +1807,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (1,1,1,1):
+        if self.reduce_mask == (1, 1, 1, 1):
             reducebuf = self._k_reduce_buf('Z[0]')
             decl = self._k_decl(node, nodename)
             init = self._k_init(node, nodename)

@@ -1800,7 +1830,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (1,0,1,1):
+        if self.reduce_mask == (1, 0, 1, 1):
             reducebuf = self._k_reduce_buf('Z[blockIdx.x*sZ0]')
             print >> sio, """
             static __global__ void kernel_reduce_sum_1011_%(nodename)s(
...
...
@@ -1808,7 +1838,8 @@ class GpuSum(GpuOp):
                 const unsigned int d1,
                 const unsigned int d2,
                 const unsigned int d3,
-                const float *A, const int sA0, const int sA1, const int sA2, const int sA3,
+                const float *A, const int sA0, const int sA1,
+                const int sA2, const int sA3,
                 float * Z, const int sZ0)
             {
                 const int threadCount = blockDim.x * blockDim.y * blockDim.z;

@@ -1867,7 +1898,7 @@ class GpuSubtensor(tensor.Subtensor, GpuOp):
         assert isinstance(x.type, CudaNdarrayType)
         rval = tensor.Subtensor.make_node(self, x, *inputs)
         otype = CudaNdarrayType(rval.outputs[0].type.broadcastable)
-        return Apply(self, [x] + rval.inputs[1:], [otype()])
+        return Apply(self, [x] + rval.inputs[1:], [otype()])

     def perform(self, node, inputs, out_):
         out, = out_
...
...
@@ -1907,6 +1938,7 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
     #If True or False, we assert that we use the take version or not
     #If None, we choose the best one applicable
     perform_using_take = None
+    max_threads = 0

     def make_node(self, x, ilist):
         x_ = as_cuda_ndarray_variable(x)

@@ -1946,9 +1978,18 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
             idx = idx.view("float32")
             idx = cuda_ndarray.cuda_ndarray.CudaNdarray(idx)
+            if self.max_threads == 0:
+                num = theano.sandbox.cuda.use.device_number
+                if device_properties(num)['regsPerBlock'] < (8192 * 2):
+                    self.max_threads = 256
+                else:
+                    self.max_threads = 512
             o = x.take(idx,
                        0,  # axis
-                       out_[0][0])  # return
+                       out_[0][0],  # return
+                       "raise",
+                       self.max_threads)
         if x is not x_orig:
             o = o.reshape(out_shape)
         out[0] = o
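The take-based path above mirrors numpy.take along axis 0; a CPU sketch of the same semantics (not the GPU code):

```python
import numpy as np

x = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
idx = np.array([2, 0, 0])

# Gather rows along axis 0; mode="raise" errors out on out-of-bounds
# indices, the behaviour requested by the GPU code above
rows = x.take(idx, axis=0, mode="raise")
```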
...
...
@@ -2033,14 +2074,14 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp):
         assert isinstance(x.type, CudaNdarrayType)
         assert isinstance(y.type, CudaNdarrayType)
         rval = tensor.IncSubtensor.make_node(self, x, y, *inputs)
-        return Apply(self, [x, y] + rval.inputs[2:], [x.type()])
+        return Apply(self, [x, y] + rval.inputs[2:], [x.type()])


 class GpuFlatten(tensor.Flatten, GpuOp):
     """
     Implement Flatten on the gpu.
     """
-    def make_node(self, x):
+    def make_node(self, x):
         assert isinstance(x.type, CudaNdarrayType)
         rval = tensor.Flatten.make_node(self, x)
         host_out_broadcastable = rval.outputs[0].type.broadcastable

@@ -2096,10 +2137,12 @@ class GpuJoin(tensor.Join, GpuOp):
             # dimension in "axis" can be different, so make equal for ==
             tmp_shape[axis] = template_shape[axis]
             if tuple(tmp_shape) != template_shape:
-                raise ValueError, "Shape of input CudaNdarrays must agree except for the 'axis' dimension"
+                raise ValueError("Shape of input CudaNdarrays must"
+                                 " agree except for the 'axis' dimension")
         if len(template_shape) != node.outputs[0].type.ndim:
-            raise ValueError, "Number of dimension of input tensors disagree with dimensions passed at graph creation time."
+            raise ValueError("Number of dimension of input tensors disagree"
+                             " with dimensions passed at graph creation time.")
         # final shape must be the same as all input tensors
         # except for the "axis" dimension, so we can simply
...
...
@@ -2110,7 +2153,8 @@ class GpuJoin(tensor.Join, GpuOp):
         # just to be explicit, check that dim=1 for broadcastable
         # dimensions
         for i, bcastable in enumerate(node.outputs[0].type.broadcastable):
-            assert not bcastable or final_shape[i] == 1, "Broadcastable dimension but dim != 1, this is invalid"
+            assert not bcastable or final_shape[i] == 1, (
+                "Broadcastable dimension but dim != 1, this is invalid")
         rval = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(final_shape)

@@ -2120,9 +2164,9 @@ class GpuJoin(tensor.Join, GpuOp):
         # except for 'axis'
         def construct_slices(curlen):
-            slices = [slice(None, None, None) for i in \
+            slices = [slice(None, None, None) for i in \
                       range(len(template_shape))]
-            slices[axis] = slice(curpos, curpos + curlen, None)
+            slices[axis] = slice(curpos, curpos + curlen, None)
             return tuple(slices)

         for i, cnda in enumerate(cndas):
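construct_slices builds, for each input, the slice of the output it occupies along `axis`. A pure-NumPy sketch of the same allocate-then-slice-assign join loop (the `join` helper is illustrative, not the Theano implementation):

```python
import numpy as np

def join(axis, *arrays):
    """Concatenate by allocating the output and slice-assigning each input."""
    final_shape = list(arrays[0].shape)
    final_shape[axis] = sum(a.shape[axis] for a in arrays)
    out = np.zeros(final_shape, dtype=arrays[0].dtype)
    curpos = 0
    for a in arrays:
        curlen = a.shape[axis]
        slices = [slice(None)] * a.ndim          # full slice on every axis...
        slices[axis] = slice(curpos, curpos + curlen)  # ...except 'axis'
        out[tuple(slices)] = a
        curpos += curlen
    return out

a = np.ones((2, 2))
b = np.zeros((2, 3))
joined = join(1, a, b)
```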
...
...
@@ -2157,7 +2201,9 @@ class GpuAlloc(GpuOp):
         v = as_cuda_ndarray_variable(value)
         sh = [tensor.as_tensor_variable(s) for s in shape]
         if v.ndim != len(shape):
-            raise TypeError('GpuAlloc requires value of same dimensions as shape', value, len(shape))
+            raise TypeError('GpuAlloc requires value of same dimensions as '
+                            'shape', value,
+                            len(shape))
         bcast = []
         for s in sh:

@@ -2170,7 +2216,7 @@ class GpuAlloc(GpuOp):
                 const_shp = None
             bcast.append(numpy.all(1 == const_shp))
         otype = CudaNdarrayType(dtype='float32', broadcastable=bcast)
-        return Apply(self, [v] + sh, [otype()])
+        return Apply(self, [v] + sh, [otype()])

     def perform(self, node, inputs, out_):
         out, = out_
...
...
@@ -2178,7 +2224,7 @@ class GpuAlloc(GpuOp):
         sh = tuple([int(i) for i in inputs[1:]])
         if out[0] is None or out[0].shape != sh:
             out[0] = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(sh)
-        out[0][...] = v  # broadcast v to fill us up
+        out[0][...] = v  # broadcast v to fill us up

     def c_code(self, node, name, inputs, out_, sub):
         out, = out_
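The perform path above does three things: reuse the output buffer when its shape already matches, otherwise allocate zeros, then broadcast-fill. A NumPy sketch of the same logic (`alloc_perform` is an illustrative stand-in, not the Op's code):

```python
import numpy as np

def alloc_perform(v, shape, out_buffer=None):
    """Allocate (or reuse) an output of `shape` and broadcast v into it."""
    if out_buffer is None or out_buffer.shape != shape:
        out_buffer = np.zeros(shape, dtype="float32")
    out_buffer[...] = v   # broadcast v to fill us up
    return out_buffer

out = alloc_perform(np.float32(7.0), (2, 3))
reused = alloc_perform(np.float32(1.0), (2, 3), out)  # same buffer reused
```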
...
...
@@ -2186,12 +2232,12 @@ class GpuAlloc(GpuOp):
         value = inputs[0]
         shps = inputs[1:]
         nd = len(shps)
-        str = "int dims[%(nd)s];\n" % locals()
-        for idx, sh in enumerate(shps):
+        str = "int dims[%(nd)s];\n" % locals()
+        for idx, sh in enumerate(shps):
             str += "dims[%(idx)s] = PyInt_AsLong((PyObject*)%(sh)s);\n" % locals()
         str += "if(%(out)s==NULL\n" % locals()
-        for idx, sh in enumerate(shps):
+        for idx, sh in enumerate(shps):
             str += "||CudaNdarray_HOST_DIMS(%(out)s)[%(idx)s]!=dims[%(idx)s]" % locals()
         str += """){
             Py_XDECREF(%(out)s);
...
...
@@ -2350,10 +2396,9 @@ def tensordot(a, b, axes=2):
                 "Axes should be scalar valued or a list/tuple of len 2.",
                 axes)

 # Those are predifined CudaNdarrayType as done in tensor.basic
 # Useful mostly for test as the gpu op are inserted automatically...
-fscalar = CudaNdarrayType(dtype='float32', broadcastable=())

 def scalar(name=None, dtype=None):
     """Return a symbolic scalar variable.
     :param dtype: numeric type (None means to use theano.config.floatX)

@@ -2363,8 +2408,9 @@ def scalar(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=())
     return type(name)
+fscalar = CudaNdarrayType(dtype='float32', broadcastable=())

-fvector = CudaNdarrayType(dtype='float32', broadcastable=(False,))

 def vector(name=None, dtype=None):
     """Return a symbolic vector variable.
     :param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2374,8 +2420,9 @@ def vector(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(False,))
     return type(name)
+fvector = CudaNdarrayType(dtype='float32', broadcastable=(False,))

-fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))

 def matrix(name=None, dtype=None):
     """Return a symbolic matrix variable.
     :param dtype: numeric type (None means to use theano.config.floatX)

@@ -2385,8 +2432,9 @@ def matrix(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(False, False))
     return type(name)
+fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))

-frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))

 def row(name=None, dtype=None):
     """Return a symbolic row variable (ndim=2, broadcastable=[True,False]).
     :param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2396,8 +2444,9 @@ def row(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(True, False))
     return type(name)
+frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))

-fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))

 def col(name=None, dtype=None):
     """Return a symbolic column variable (ndim=2, broadcastable=[False,True]).
     :param dtype: numeric type (None means to use theano.config.floatX)

@@ -2407,8 +2456,9 @@ def col(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(False, True))
     return type(name)
+fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))

-ftensor3 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 3)

 def tensor3(name=None, dtype=None):
     """Return a symbolic 3-D variable.
     :param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2418,8 +2468,9 @@ def tensor3(name=None, dtype=None):
dtype
=
config
.
floatX
type
=
CudaNdarrayType
(
dtype
=
dtype
,
broadcastable
=
(
False
,
False
,
False
))
return
type
(
name
)
ftensor3
=
CudaNdarrayType
(
dtype
=
'float32'
,
broadcastable
=
(
False
,)
*
3
)
ftensor4
=
CudaNdarrayType
(
dtype
=
'float32'
,
broadcastable
=
(
False
,)
*
4
)
def
tensor4
(
name
=
None
,
dtype
=
None
):
"""Return a symbolic 4-D variable.
:param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2430,6 +2481,7 @@ def tensor4(name=None, dtype=None):
type
=
CudaNdarrayType
(
dtype
=
dtype
,
broadcastable
=
(
False
,
False
,
False
,
False
))
return
type
(
name
)
ftensor4
=
CudaNdarrayType
(
dtype
=
'float32'
,
broadcastable
=
(
False
,)
*
4
)
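The factory functions added above all follow one pattern: resolve a default dtype, build a `CudaNdarrayType` with the right `broadcastable` pattern, and call the type to make a named variable, with a predefined float32 alias alongside each factory. Below is a minimal pure-Python sketch of that pattern; `FakeCudaNdarrayType` is a hypothetical stand-in (not Theano code) so the sketch runs without a GPU.

```python
# Sketch of the factory pattern used in the diff above. FakeCudaNdarrayType
# is a made-up stand-in for CudaNdarrayType: one broadcastable flag per
# dimension, and calling the type instance builds a (type, name) "variable".
class FakeCudaNdarrayType(object):
    def __init__(self, dtype='float32', broadcastable=()):
        self.dtype = dtype
        self.broadcastable = tuple(broadcastable)
        self.ndim = len(self.broadcastable)

    def __call__(self, name=None):
        # In Theano, calling a type builds a Variable; here we just record it.
        return (self, name)


def matrix(name=None, dtype=None):
    """Return a symbolic matrix variable (ndim=2, no broadcastable dims)."""
    if dtype is None:
        dtype = 'float32'  # stand-in for theano.config.floatX
    typ = FakeCudaNdarrayType(dtype=dtype, broadcastable=(False, False))
    return typ(name)


# Predefined alias, as the diff does for fscalar, fvector, fmatrix, ...
fmatrix = FakeCudaNdarrayType(dtype='float32', broadcastable=(False, False))

typ, name = matrix('W')
print(typ.ndim, name)  # 2 W
```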
@theano.compile.profilemode.register_profiler_printer
...
...
@@ -2446,22 +2498,24 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
     gpu = 0
     trans = 0
     for (_, node), t in apply_time.items():
         if isinstance(node.op.__class__.__name__, (HostFromGpu, GpuFromHost)):
             trans += t
         elif node.op.__class__.__name__.lower().startswith("gpu"):
             gpu += t
         else:
             cpu += t
     print
     print "    Spent %.3fs(%.3f%%) in cpu Op, %.3fs(%.3f%%) in gpu Op and %.3fs(%.3f%%) transfert Op" % (
         cpu, cpu / local_time * 100, gpu, gpu / local_time * 100,
         trans, trans / local_time * 100)
     print
     print "    Theano function input that are float64"
     print "    <fct name> <input name> <input type> <str input>"
     for fct in fct_call.keys():
         for i in fct.input_storage:
             if hasattr(i.type, 'dtype') and i.type.dtype == 'float64':
                 print '        ', fct.name, i.name, i.type, i
     print
...
...
@@ -2470,5 +2524,13 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
     print '    <Apply> <Apply position> <fct name> <inputs type> <outputs type>'
     for fct in fct_call.keys():
         for idx, node in enumerate(fct.maker.fgraph.toposort()):
-            if any(hasattr(i, 'dtype') and i.dtype == 'float64'
-                   for i in node.outputs) and not any(hasattr(i, 'dtype') and i.dtype == 'float64' for i in node.inputs):
-                print '        ', str(node), idx, fct.name, str([getattr(i, 'dtype', None) for i in node.inputs]), str([getattr(i, 'dtype', None) for i in node.outputs])
+            if (any(hasattr(i, 'dtype') and i.dtype == 'float64'
+                    for i in node.outputs) and
+                not any(hasattr(i, 'dtype') and i.dtype == 'float64'
+                        for i in node.inputs)):
+                print '        ', str(node), idx, fct.name,
+                print str([getattr(i, 'dtype', None) for i in node.inputs]),
+                print str([getattr(i, 'dtype', None) for i in node.outputs])
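The profiler loop above buckets each Apply node's runtime into cpu, gpu, or transfer totals by looking at the op's class name. Here is a hedged sketch of that bookkeeping; the real code iterates a `{(fgraph, node): time}` dict and inspects `node.op`, while this toy version just maps made-up op class names to seconds.

```python
# Simplified sketch of profile_printer's time bucketing: transfer ops
# (HostFromGpu / GpuFromHost) are counted separately, anything whose class
# name starts with "gpu" ran on the device, the rest ran on the host.
def classify(apply_time):
    cpu = gpu = trans = 0.0
    for op_name, t in apply_time.items():
        if op_name in ('HostFromGpu', 'GpuFromHost'):
            trans += t   # host <-> device transfer ops
        elif op_name.lower().startswith('gpu'):
            gpu += t     # ops that ran on the device
        else:
            cpu += t     # everything else ran on the host
    return cpu, gpu, trans

times = {'Elemwise': 1.0, 'GpuDot22': 2.0, 'HostFromGpu': 0.5}
print(classify(times))  # (1.0, 2.0, 0.5)
```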
theano/sandbox/cuda/cuda_ndarray.cu
...
...
@@ -758,8 +758,10 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     PyObject * axis_obj = Py_None;
     PyObject * out_obj = Py_None;
     PyObject * clipmode_obj = NULL;
-    if (! PyArg_ParseTuple(args, "O|OOO", &indices_obj, &axis_obj,
-                           &out_obj, &clipmode_obj))
+    int max_threads = 1; // max threads per blocks
+    if (! PyArg_ParseTuple(args, "O|OOOi", &indices_obj, &axis_obj,
+                           &out_obj, &clipmode_obj, &max_threads))
         return NULL;

     //Check argument indices
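The signature change above extends the `PyArg_ParseTuple` format from `"O|OOO"` to `"O|OOOi"`: one required object plus four optionals, the new last one an int, so existing callers keep working while new callers can pass `max_threads`. A hypothetical Python rendering of the same calling convention:

```python
# Python analogue of the extended C signature: everything after the "|" in
# "O|OOOi" is optional, and max_threads defaults to 1 when omitted.
def take_from(indices, axis=None, out=None, clipmode=None, max_threads=1):
    return (indices, axis, out, clipmode, max_threads)

print(take_from([0, 2]))                             # ([0, 2], None, None, None, 1)
print(take_from([0, 2], 0, None, 'raise', 256)[4])   # 256
```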
...
...
@@ -839,14 +841,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     PyObject * axis_iobj = PyNumber_Long(axis_obj);
     if (!axis_iobj) {
         PyErr_SetString(PyExc_NotImplementedError,
                         "CudaNdarray_TakeFrom: axis must be convertable to a long");
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         return NULL;
     }
     long axis = PyInt_AsLong(axis_iobj);
     Py_DECREF(axis_iobj); axis_iobj = NULL;
     if (axis != 0) {
         PyErr_SetString(PyExc_NotImplementedError,
                         "CudaNdarray_TakeFrom: only axis=0 is currently supported");
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         return NULL;
     }
...
...
@@ -869,13 +871,13 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     if (!out)
     {
         out = (CudaNdarray*) CudaNdarray_New();
         if (!out){
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             free(dims);
             return NULL;
         }
         if (CudaNdarray_alloc_contiguous(out, self->nd, dims))
         {
             Py_DECREF(out);
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             free(dims);
             return NULL;
         }
...
...
@@ -887,19 +889,20 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     if (clipmode_obj)
     {
         char * clipmode = PyString_AsString(clipmode_obj);
         if (!clipmode){
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
         }
         if (strcmp(clipmode, "raise") != 0)
         {
-            PyErr_SetString(PyExc_NotImplementedError,
-                            "CudaNdarray_TakeFrom: only the raise mode is currently supported");
-            Py_DECREF(indices_obj);
+            PyErr_Format(PyExc_NotImplementedError,
+                         "CudaNdarray_TakeFrom: only the raise mode is currently supported. Got '%s'",
+                         clipmode);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
         }
         Py_DECREF(clipmode_obj);
     }
     void (*k3)(const int, const int, const int,
                const npy_int64*,
...
...
@@ -913,7 +916,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     if (err_var == NULL)
     {
         err_var = (int*) device_malloc(sizeof(int));
         if (!err_var)
         {
             // PyErr set by device_malloc
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
...
...
@@ -928,7 +931,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
             PyErr_Format(PyExc_RuntimeError,
                          "Error setting device error code to 0. %s",
                          cudaGetErrorString(err));
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
...
...
@@ -936,13 +939,16 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     }
     dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(out)[0], 65535), 1, 1);
     switch (self->nd) {
         case 1:
             {
                 dim3 n_threads(1, 1, 1);
                 if (verbose)
-                    printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
+                    printf("cudaGetLastError=%d, nd=%d"
+                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                            " n_threads.x=%i, n_threads.y=%i)\n",
+                           self->nd, cudaGetLastError(),
                            n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                 k3<<<n_blocks, n_threads>>>(
                     dims[0],
...
...
@@ -963,11 +969,15 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
             break;
         case 2:
             {
-                dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], 512), 1, 1);
+                dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads), 1, 1);
                 if (verbose)
-                    printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
+                    printf("cudaGetLastError=%d, nd=%d"
+                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                            " n_threads.x=%i, n_threads.y=%i)\n",
+                           cudaGetLastError(), self->nd,
                            n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                 k3<<<n_blocks, n_threads>>>(
                     dims[0], //dimensions
                     dims[1],
...
@@ -987,12 +997,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
             break;
         case 3:
             {
-                int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], 512);
-                int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], 512 / ty);
+                int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], max_threads);
+                int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads / ty);
                 dim3 n_threads(tx, ty, 1);
                 if (verbose)
-                    printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
+                    printf("cudaGetLastError=%d, nd=%d"
+                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                            " n_threads.x=%i, n_threads.y=%i)\n",
+                           self->nd, cudaGetLastError(),
                            n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                 k3<<<n_blocks, n_threads>>>(
                     dims[0], //dimensions
...
...
@@ -1025,7 +1037,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
                      "Cuda error: %s: %s.\n",
                      "CudaNdarray_TakeFrom",
                      cudaGetErrorString(err));
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         Py_DECREF(out);
         return NULL;
     }
...
...
@@ -1040,7 +1052,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
                      "Cuda error: %s: %s when trying to get the error value.\n",
                      "CudaNdarray_TakeFrom",
                      cudaGetErrorString(err));
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         Py_DECREF(out);
         return NULL;
     }
...
...
@@ -1055,17 +1067,17 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
         err = cudaMemset((void*) err_var, 0, sizeof(int));
         if (cudaSuccess != err) {
             PyErr_Format(PyExc_MemoryError,
                          "Error setting device error code to 0 after having an index error. %s",
                          cudaGetErrorString(err));
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             return NULL;
         }
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         Py_DECREF(out);
         return NULL;
     }
-    Py_DECREF(indices_obj);
+    Py_DECREF(indices);
     if (verbose) printf("TAKE SUCCEDED\n");
     return (PyObject*) out;
...
...
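The 3-d case above sizes the thread block from the new `max_threads` argument instead of the hardcoded 512: `ty` is capped by the budget, then `tx` by what remains, so `tx * ty` never exceeds `max_threads`. A hedged sketch of that arithmetic, with illustrative dimension values:

```python
# Sketch of the 3-d launch-shape computation in the diff above:
#   ty = min(dims[2], max_threads)
#   tx = min(dims[1], max_threads / ty)
# which keeps the total threads per block within the device limit.
def thread_shape(dims, max_threads):
    ty = min(dims[2], max_threads)
    tx = min(dims[1], max_threads // ty)  # integer division, as in the C code
    return tx, ty

tx, ty = thread_shape((8, 100, 16), 512)
print(tx, ty, tx * ty <= 512)  # 32 16 True
```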
theano/sandbox/cuda/nvcc_compiler.py
...
...
@@ -7,6 +7,7 @@ import subprocess
 import sys
 import warnings

+import theano
 from theano.gof.cc import hash_from_file
 from theano.gof.cmodule import (std_libs, std_lib_dirs, std_include_dirs, dlimport,
...
...
@@ -119,6 +120,16 @@ class NVCC_compiler(object):
         cuda_ndarray_cuh_hash = hash_from_file(
             os.path.join(os.path.split(__file__)[0], 'cuda_ndarray.cuh'))
         flags.append('-DCUDA_NDARRAY_CUH=' + cuda_ndarray_cuh_hash)

+        # We compile cuda_ndarray.cu during import.
+        # We should not add device properties at that time.
+        # As the device is not selected yet!
+        # TODO: compile cuda_ndarray when we bind to a GPU?
+        import theano.sandbox.cuda
+        if hasattr(theano.sandbox, 'cuda'):
+            n = theano.sandbox.cuda.use.device_number
+            p = theano.sandbox.cuda.device_properties(n)
+            flags.append('-arch=sm_' + str(p['major']) + str(p['minor']))

         return flags

     @staticmethod
...
...
@@ -217,7 +228,9 @@ class NVCC_compiler(object):
         # '--gpu-code=compute_13',

         #nvcc argument
         preargs1 = [pa for pa in preargs
-                    if pa.startswith('-O') or pa.startswith('--maxrregcount=')]
+                    if pa.startswith('-O') or pa.startswith('--maxrregcount=')
+                    or pa.startswith('-arch=')]
         preargs2 = [pa for pa in preargs if pa not in preargs1]
         # other arguments
...
...
@@ -337,6 +350,7 @@ class NVCC_compiler(object):
                 pass
             print >> sys.stderr, l
         print nvcc_stdout
+        print cmd
         raise Exception('nvcc return status', p.returncode,
                         'for cmd', ' '.join(cmd))
     elif config.cmodule.compilation_warning and nvcc_stdout:
...
...
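The `preargs1`/`preargs2` change above extends the set of flags kept for nvcc itself (`-O*`, `--maxrregcount=`, and now `-arch=`); everything else ends up in `preargs2`, labeled "other arguments" in the source. A hedged sketch of that split, with made-up flag values:

```python
# Sketch of the flag-splitting list comprehensions from the diff above.
def split_preargs(preargs):
    preargs1 = [pa for pa in preargs
                if pa.startswith('-O') or pa.startswith('--maxrregcount=')
                or pa.startswith('-arch=')]
    # other arguments
    preargs2 = [pa for pa in preargs if pa not in preargs1]
    return preargs1, preargs2

p1, p2 = split_preargs(['-O3', '-arch=sm_20', '-fPIC', '--maxrregcount=32'])
print(p1)  # ['-O3', '-arch=sm_20', '--maxrregcount=32']
print(p2)  # ['-fPIC']
```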
theano/scan_module/tests/test_scan.py
...
...
@@ -410,7 +410,8 @@ class T_Scan(unittest.TestCase):
         for step in xrange(1, 4):
             v_out[step] = v_u[step] * W_in + v_out[step - 1] * W

         theano_values = f2(v_u, v_x0, W_in, W)
-        assert numpy.allclose(theano_values, v_out)
+        assert numpy.allclose(theano_values, v_out), \
+            (theano_values, v_out, theano_values - v_out)

         # TO DEL
         topo = f2.maker.fgraph.toposort()
...
...
@@ -591,8 +592,8 @@ class T_Scan(unittest.TestCase):
             v_y[i] = numpy.dot(v_x[i - 1], vWout)

         (theano_x, theano_y) = f4(v_u1, v_u2, v_x0, v_y0, vW_in1)
-        assert numpy.allclose(theano_x, v_x)
-        assert numpy.allclose(theano_y, v_y)
+        assert numpy.allclose(theano_x, v_x), (theano_x, v_x, theano_x - v_x)
+        assert numpy.allclose(theano_y, v_y), (theano_y, v_y, theano_y - v_y)

     def test_multiple_outs_taps(self):
         l = 5
...
...
@@ -683,14 +684,13 @@ class T_Scan(unittest.TestCase):
         ny1[4] = (ny1[3] + ny1[1]) * numpy.dot(ny0[3], vWout)
         ny2[4] = numpy.dot(v_u1[4], vW_in1)

     def test_using_taps_sequence(self):
         # this test refers to a bug reported by Nicolas
         # Boulanger-Lewandowski June 6th
         x = theano.tensor.dvector()
         y, updates = theano.scan(lambda x: [x],
                                  sequences=dict(input=x, taps=[-1]),
                                  outputs_info=[None])
         inp = numpy.arange(5).astype('float64')
         rval = theano.function([x], y, updates=updates)(inp)
         assert numpy.all(rval == inp[:-1])
...
...
@@ -840,8 +840,10 @@ class T_Scan(unittest.TestCase):
         # equivalent is done
         (theano_x0, theano_x1) = f9(vu0, vu1, vu2, vx0, vx1)

         # assert that theano does what it should
-        assert numpy.allclose(theano_x0, numpy_x0)
-        assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1, theano_x1 - numpy_x1)
+        assert numpy.allclose(theano_x0, numpy_x0), \
+            (theano_x0, numpy_x0, theano_x0 - numpy_x0)
+        assert numpy.allclose(theano_x1, numpy_x1), \
+            (theano_x1, numpy_x1, theano_x1 - numpy_x1)
         # assert that it was done in place
         # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
...
...
@@ -940,11 +942,11 @@ class T_Scan(unittest.TestCase):
         vx1 = asarrayX(rng.uniform())
         x0 = theano.shared(vx0)
         x1 = theano.shared(vx1)
         outputs, updates = theano.scan(lambda x, y: (x + asarrayX(1),
                                                      y + asarrayX(1)),
                                        [],
                                        [x0, x1],
                                        n_steps=3)
         x0 = asarrayX(numpy.zeros((3,)))
         x0[0] = vx0
         x0 = theano.tensor.constant(x0)
...
...
@@ -2447,7 +2449,6 @@ class T_Scan(unittest.TestCase):
         v_eW = numpy.array(rng.uniform(size=(5, 5)) - .5, dtype=floatX)
         v_eh0 = numpy.array(rng.uniform(size=(5,)) - .5, dtype=floatX)

         def rnn_fn(_u, _y, _W):
             srng = theano.tensor.shared_randomstreams.RandomStreams(seed)
...
...
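The test changes above all apply one pattern: attach a tuple message to each `assert numpy.allclose(...)` so that a failure reports the two arrays and their difference instead of a bare `AssertionError`. A hedged sketch of the pattern, using a tiny `allclose` stand-in so it runs without numpy:

```python
# Sketch of the assert-with-message pattern from the test diff above.
# allclose here is a simplified stand-in for numpy.allclose.
def allclose(a, b, tol=1e-8):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

a = [1.0, 2.0]
b = [1.0, 2.5]
try:
    # On failure, the tuple becomes the AssertionError payload, so the
    # values and their element-wise difference show up in the traceback.
    assert allclose(a, b), (a, b, [x - y for x, y in zip(a, b)])
except AssertionError as e:
    values, expected, diff = e.args[0]
    print(diff)  # [0.0, -0.5]
```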
theano/tensor/__init__.py
...
...
@@ -55,3 +55,5 @@ from theano.gradient import Rop, Lop, grad, numeric_grad, verify_grad, \
     jacobian, hessian
 from theano.tensor.sort import sort
+from extra_ops import (DiffOp, bincount, squeeze, repeat, bartlett,
+                       fill_diagonal)
theano/tensor/extra_ops.py
...
...
@@ -3,8 +3,8 @@ import numpy

 import theano
-import basic
-from theano import gof, tensor, scalar
-from theano.sandbox.linalg.ops import diag
+from theano import gof, scalar
+import basic as tensor


 class DiffOp(theano.Op):
...
...
@@ -446,7 +446,9 @@ class FillDiagonal(gof.Op):
             raise NotImplementedError('%s: gradient is currently implemented'
                                       ' for matrices only' %
                                       self.__class__.__name__)
         wr_a = fill_diagonal(grad, 0)  # valid for any number of dimensions
-        wr_val = diag(grad).sum()  # diag is only valid for matrices
+        # diag is only valid for matrices
+        import theano.sandbox.linalg
+        wr_val = theano.sandbox.linalg.ops.diag(grad).sum()
         return [wr_a, wr_val]


 fill_diagonal_ = FillDiagonal()
...
...
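The extra_ops.py change above replaces a module-level `from theano.sandbox.linalg.ops import diag` with an import inside the `grad` method: deferring the import means it only resolves when a gradient is actually computed, by which time both modules have finished loading, which is the usual way to sidestep a circular import. A hedged toy sketch of the technique (`json` below merely stands in for the late-imported module; it is not the real dependency):

```python
# Sketch of the deferred-import technique: the import statement lives in the
# function body, so it runs on first call rather than at module load time.
def grad_sketch(values):
    import json  # stand-in for theano.sandbox.linalg, imported lazily
    return json.dumps(values)

print(grad_sketch([1, 2]))  # [1, 2]
```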