Commit 72a7214a authored by lamblin

Merge pull request #863 from nouiz/mixed2

Mixed2
@@ -2,148 +2,7 @@
Updates in the Trunk since the last release:
Bug fixes
https://github.com/Theano/Theano/wiki/Devnews
* Outputs of Scan nodes could contain corrupted values: some parts of the
output would be repeated a second time, instead of the correct values.
It happened randomly, and quite infrequently, but the bug has been present
(both in Python and Cython) since April 2011. (Pascal L.)
* In Sparse sandbox, fix the grad of theano.sparse.sandbox.sp.row_scale.
It did not return the right number of elements. (Frederic B.)
* set_subtensor(x[int vector], new_value) when moved to the GPU
was transformed into inc_subtensor on the GPU. Now we have a correct
(but slow) GPU implementation.
Note 1: set_subtensor(x[slice[,...]], new_value) was working correctly
in all cases as well as inc_subtensor(*, *).
Note 2: If your code was affected by the incorrect behavior, we now print
a warning by default (Frederic B.)
* Fixed an issue whereby config values were used as default arguments,
with those defaults then stuck at old values if the config variables were
changed during program execution. (David W-F)
* Fixed many subtle bugs involving mutable default arguments which may have
led to unexpected behaviour, such as objects sharing instance variables
they were not supposed to share. (David W-F)
* Correctly record the GPU device number used when we let the driver select it.
(Frederic B.)
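The mutable-default-argument bugs fixed above come from a well-known Python pitfall; a minimal, Theano-independent sketch of the problem and the usual fix:

```python
# Mutable default arguments are evaluated once, at function definition
# time, so every call without that argument shares the same list object.
def broken_append(item, acc=[]):
    acc.append(item)
    return acc

# The usual fix: a None sentinel, with the list allocated per call.
def fixed_append(item, acc=None):
    if acc is None:
        acc = []
    acc.append(item)
    return acc

print(broken_append(1), broken_append(2))  # both show [1, 2]: one shared list
print(fixed_append(1), fixed_append(2))    # [1] [2]: each call gets a fresh list
```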
Documentation
* Added documentation in the tutorial on how to extend Theano.
This explains how to make a Theano Op from a Python function.
http://deeplearning.net/software/theano/tutorial/extending_theano.html
(Frédéric B.)
* New installation instructions for Windows using EPD (Pascal L.)
Interface changes
* In 0.5, we removed the deprecated sharedvar.value property.
Now we raise an error if you access it. (Frederic B.)
* theano.function does not accept duplicate inputs, so function([x, x], ...)
does not work anymore. (Pascal L.)
* theano.function now raises an error if some of the provided inputs are
not part of the computational graph needed to compute the output, for
instance, function([x, y], [y]). You can use the kwarg
``on_unused_input={'raise', 'warn', 'ignore'}`` to control this.
(Pascal L.)
* New Theano flag "on_unused_input" that defines the default value for the
previous point. (Frederic B.)
* tensor.alloc() now raises an error at graph-build time
when we try to create fewer dimensions than the provided value has.
In the past, the error occurred at run time.
(Frederic B.)
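For example, the default behaviour can be set from the environment (my_script.py is a placeholder name):

```shell
# Warn, rather than raise, on unused inputs by default; the per-call
# kwarg on_unused_input='ignore' still overrides this.
THEANO_FLAGS='on_unused_input=warn' python my_script.py
```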
Speed up
* Convolution on the GPU now checks the generation of the card to make
it faster in some cases (especially for medium/big output images). (Frédéric B.)
(We had hardcoded 512 as the maximum number of threads per block.
Newer cards support up to 1024 threads per block.)
* CPU convolutions are now parallelized. (Frédéric B.)
By default, all cores/hyper-threads are used.
To control this, use the OMP_NUM_THREADS=N environment variable.
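A hedged usage sketch (train.py is a placeholder for your own script):

```shell
# Restrict the parallelized CPU convolution to 4 threads.
OMP_NUM_THREADS=4 python train.py
```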
New Features
* debugprint new param ids=["CHAR", "id", "int", ""].
This controls the printed identifier: a unique char, the Python id,
a unique int, or no identifier at all. We changed the default to "CHAR"
as this is more readable. (Frederic B.)
* debugprint new param stop_on_name=[False, True]. If True, we don't print
anything below an intermediate variable that has a name. Defaults to False.
(Frederic B.)
* debugprint no longer prints the "|" symbol in a column after the last input. (Frederic B.)
* If you use the Enthought Python Distribution (EPD), we now use its BLAS
implementation by default (tested on Linux and Windows).
(Frederic B., Simon McGregor)
* MRG random now raises an error with a clear message when the passed shape
contains dimensions with an invalid value such as 0. (Frédéric B., reported by Ian G.)
* "CudaNdarray[*] = ndarray" works in more cases (Frederic B.)
* "CudaNdarray[*] += ndarray" works in more cases (Frederic B.)
* We now add dimensions to CudaNdarray so it broadcasts automatically in more cases.
(Frederic B.)
* theano.tensor.argsort that wraps numpy.argsort (Hani Almousli).
* New Theano flag cmodule.warn_no_version. Default False. If True,
will print a warning when compiling one or more Ops with C code that
can't be cached because no c_code_cache_version() function is
associated with at least one of those Ops, which forces recompiling
them every time. (Frederic B.)
* CPU alloc now always generates C code (Pascal L.)
* Made a few Ops with C code versioned to reduce compilation time.
(Frédéric B, Pascal L.)
* C code reuses preallocated outputs (only done by Scan) (Pascal L.)
* Garbage collection of intermediate results during Theano function calls
for Ops with C code (Pascal L.)
* The Theano flag compiledir_format now supports the parameter numpy_version.
* Theano GPU variables, shared variables and constants now support <, <=,
> and >=, as do those not on the GPU.
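As an illustration of the numpy_version parameter, a hedged config fragment (the format string is an assumption based on the keys in compiledir_format_dict; my_script.py is a placeholder):

```shell
# Include the numpy version in the compilation-cache directory name so
# caches built against different numpy versions do not collide.
THEANO_FLAGS='compiledir_format=compiledir_%(platform)s-%(numpy_version)s' python my_script.py
```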
Sparse
* Implement theano.sparse.mul(sparse1, sparse2) when both inputs don't
have the same sparsity pattern. (Frederic B.)
Sparse Sandbox graduate
* Remove0 op: it removes stored elements with value 0. (Frederic B.)
Sparse Sandbox Additions (not reviewed/documented/tested, but used by some people)
* They are all in the theano.sparse.sandbox.sp2 module
* Op class: Cast, Poisson, Multinomial, EliminateZeros, Sum, Binomial
* Op class: SamplingDot, SamplingDotCsr (inserted automatically)
* Op function: structured_sigmoid, structured_exp, structured_pow, structured_minimum
* Op class: StructuredAddSV, StrucutedAddSVCSR (inserted automatically)
* opt: local_sampling_dot_csr, local_structured_add_s_v
Internal changes
* Define new exceptions MissingInputError and UnusedInputError, and use them
in theano.function, instead of TypeError and ValueError. (Pascal L.)
* Better handling of bitwidth and max values of integers and pointers
across platforms (Pascal L.)
Crash Fix
* Do not try to use the BLAS library when blas.ldflags is manually set to an
empty string (Frederic B.)
* When importing theano on a computer without GPU with the Theano
flags 'device' or 'init_gpu_device' set to gpu* (Frederic B., reported by Luo Heng)
* Optimization printed a useless error when scipy was not available. (Frederic B.)
* GPU conv crash/slowdown on newer hardware (James B.)
* Better error handling in GPU conv (Frederic B.)
* GPU optimization that moves element-wise Ops to the GPU. Crash happened in
a particular execution order of this optimization and the
element-wise fusion optimization when upcasting some inputs to
float32 (to compute them on the GPU).
(Frederic B., reported by Sander Dieleman)
* GpuReshape in some particular case when the input is not contiguous
(Frederic B., reported by Sander Dieleman)
* GpuSoftmaxWithBias with shape (0, N) with N > 1.
(Frédéric B., reported by Razvan P.)
* Fix crash under 64-bit Windows, when taking subtensors of the form a[n:]
(Pascal L., reported by Simon McGregor)
* Fixed issue with the MaxAndArgmax Op not properly preserving broadcastable
dimensions, which could typically result in optimization crashes (Olivier D.)
* Fixed crash when concatenating some arrays with specific broadcasting
patterns (Olivier D.)
* Work around a known issue with nvcc 4.1 on MacOS X. (Graham Taylor)
* In advanced indexing, if some inputs are constant, no need to call constant(...)
on their value any more. (Pascal L., reported by John Salvatier)
* Fix crash on GPU when GpuSubtensor didn't set the right stride
when the result tensor had a dimension with size 1. (Pascal L.,
reported by Graham T.)
=============
Release Notes
...
@@ -26,6 +26,9 @@ with the option time_profile=True to conduct time-profiling of the tests.
option will be interpreted as an indication of the number of tests to be run
between notifications of progress to standard output.
If the '--theano' option is used, it is replaced with the path to theano.
Useful if you don't know where it was installed.
`run_tests_in_batch.py` will in turn call back this script in another process.
"""
@@ -39,6 +42,12 @@ import sys
from nose.plugins import Plugin
def main():
# Handle the --theano arguments
if "--theano" in sys.argv:
i = sys.argv.index("--theano")
import theano
sys.argv[i] = theano.__path__[0]
# Handle --batch[=n] arguments
batch_args = [arg for arg in sys.argv if arg.startswith('--batch')]
for arg in batch_args:
@@ -137,6 +146,11 @@ def help():
--without-knownfailure: Do not load the KnownFailure plugin.
--theano: This parameter is replaced with the path to the theano library.
As theano-nose is a wrapper to nosetests, it expects a path to the tests to run.
If you don't know where theano is installed, use this option
to have it inserted automatically.
The other options will be passed to nosetests, see ``nosetests -h``.
"""
...
@@ -37,7 +37,7 @@ compiledir_format_dict = {"platform": platform.platform(),
"python_version": platform.python_version(),
"theano_version": theano.__version__,
"numpy_version": numpy.__version__,
"gxx_version": gcc_version_str.replace(" ", "_"),
}
compiledir_format_keys = ", ".join(compiledir_format_dict.keys())
default_compiledir_format =\
...
@@ -11,7 +11,7 @@ from theano import tensor, scalar, config
from theano.gof.python25 import all, any
from theano.sandbox.cuda import GpuOp, device_properties
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda import filter as type_support_filter
@@ -641,7 +641,9 @@ class GpuSum(GpuOp):
printf("running kernel_reduce_sum_%(pattern)s_%(name)s\\n");
int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
if (verbose>1)
printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d,"
" nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d,"
" nb_block=%%d, n_shared=%%d\\n",
n_threads.x,n_threads.y,n_threads.z,
n_threads.x*n_threads.y*n_threads.z,
n_blocks.x,n_blocks.y,
@@ -673,7 +675,8 @@ class GpuSum(GpuOp):
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_%(pattern)s_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
@@ -876,7 +879,8 @@ class GpuSum(GpuOp):
std::min(CudaNdarray_SIZE(%(x)s),
NUM_VECTOR_OP_THREADS_PER_BLOCK));
dim3 n_blocks(1);
if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s"
" n_threads.x=%%d, size=%%d, ndim=%%d\\n",
n_threads.x,CudaNdarray_SIZE(%(x)s),%(x)s->nd);
int n_shared = sizeof(float) * n_threads.x;
kernel_reduce_sum_ccontig_%(name)s<<<n_blocks, n_threads, n_shared>>>(
@@ -887,7 +891,9 @@ class GpuSum(GpuOp):
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_ccontig_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
@@ -937,11 +943,13 @@ class GpuSum(GpuOp):
:param N: the number of 1 in the pattern N=1 -> 01, N=2 -> 011 N=3 ->0111
Work for N=1,2,3
"""
assert N in [1, 2, 3]
makecall = self._makecall(node, name, x, z, fail)
N_pattern = ''.join(['1'] * N)
param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals()
for i in xrange(N + 1)])
strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]"
% locals() for i in xrange(N + 1)])
threads_y = """
//get as many y threads as we can fit
while (n_threads.x * (n_threads.y+1) <= NUM_VECTOR_OP_THREADS_PER_BLOCK)
@@ -962,10 +970,10 @@ class GpuSum(GpuOp):
break;
}
""" % locals()
if len(self.reduce_mask) == 2:
threads_y = ''
threads_z = ''
if len(self.reduce_mask) == 3:
threads_z = ''
print >> sio, """
{
@@ -975,15 +983,18 @@ class GpuSum(GpuOp):
NUM_VECTOR_OP_THREADS_PER_BLOCK));
%(threads_y)s
%(threads_z)s
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_BLOCKS));
%(makecall)s
}
""" % locals()
def c_code_reduce_01(self, sio, node, name, x, z, fail):
self.c_code_reduce_01X(sio, node, name, x, z, fail, 1)
def c_code_reduce_011(self, sio, node, name, x, z, fail):
self.c_code_reduce_01X(sio, node, name, x, z, fail, 2)
def c_code_reduce_0111(self, sio, node, name, x, z, fail):
self.c_code_reduce_01X(sio, node, name, x, z, fail, 3)
@@ -1021,7 +1032,9 @@ class GpuSum(GpuOp):
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
@@ -1033,9 +1046,11 @@ class GpuSum(GpuOp):
}
}
""" % locals()
def c_code_reduce_010(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
makecall_inner = self._makecall(node, name, x, z, fail,
pattern="010_inner")
pattern = ''.join(str(i) for i in self.reduce_mask)
print >> sio, """
{
@@ -1085,7 +1100,9 @@ class GpuSum(GpuOp):
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
@@ -1233,6 +1250,7 @@ class GpuSum(GpuOp):
%(makecall)s
}
""" % locals()
def c_code_reduce_111(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
@@ -1275,7 +1293,8 @@ class GpuSum(GpuOp):
std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_BLOCKS));
while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS &&
n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
{
n_blocks.y += 1;
}
@@ -1356,7 +1375,7 @@ class GpuSum(GpuOp):
def c_support_code_apply(self, node, nodename):
sio = StringIO.StringIO()
nd_in = len(self.reduce_mask)
if all(i == 1 for i in self.reduce_mask):
#this kernel is ok for up to a few thousand elements, but
# it only runs on ONE multiprocessor
reducebuf = self._k_reduce_buf('Z[0]')
@@ -1411,7 +1430,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (1, 1):
#this kernel is ok for up to a few thousand elements, but
# it only runs on ONE multiprocessor
reducebuf = self._k_reduce_buf('Z[0]')
@@ -1444,29 +1463,33 @@ class GpuSum(GpuOp):
}
""" % locals()
#01, 011, 0111
if (0 == self.reduce_mask[0] and
all(self.reduce_mask[1:]) and
nd_in in[2, 3, 4]):
# this kernel uses one block for each row.
# threads per block for each element per row.
N_pattern = ''.join(['1'] * (nd_in - 1))
if nd_in == 2:
for_i1 = "for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)"
for_i2 = "int i2=0, sA2=0;"
for_i3 = "int i3=0, sA3=0;"
if nd_in == 3:
for_i1 = "for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)"
for_i2 = "for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x)"
for_i3 = "int i3=0, sA3=0;"
if nd_in == 4:
for_i1 = "for (int i1 = threadIdx.z; i1 < d1; i1 += blockDim.z)"
for_i2 = "for (int i2 = threadIdx.y; i2 < d2; i2 += blockDim.y)"
for_i3 = "for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)"
reducebuf = self._k_reduce_buf('Z[i0 * sZ0]')
param_dim = ",".join(["const int d%(i)s" % locals()
for i in xrange(nd_in)])
param_strides = ",".join(["const int sA%(i)s" % locals()
for i in xrange(nd_in)])
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
print >> sio, """
%(decl)s{
%(init)s
@@ -1484,7 +1507,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (0, 1, 0) or self.reduce_mask == (1, 0):
# this kernel uses one block for each column,
# threads per block for each element per column.
@@ -1497,7 +1520,8 @@ class GpuSum(GpuOp):
const int d0,
const int d1,
const int d2,
const float *A, const int sA0,
const int sA1, const int sA2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
@@ -1525,7 +1549,7 @@ class GpuSum(GpuOp):
}
""" % locals()
if self.reduce_mask == (0, 1, 0):
print >> sio, """
static __global__ void kernel_reduce_sum_010_AD_%(nodename)s(
const int A,
@@ -1533,7 +1557,8 @@ class GpuSum(GpuOp):
const int C,
const int D,
//const int E, // THIS is 32
const float *X, const int sX0,
const int sX1, const int sX2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
@@ -1564,9 +1589,10 @@ class GpuSum(GpuOp):
}
""" % locals()
if self.reduce_mask == (0, 1, 0):
#
# This kernel is optimized when the inner most dimensions
# have the smallest stride.
# this kernel uses one block for multiple column(up to 32TODO),
# threads per block for each element per column.
@@ -1575,10 +1601,12 @@ class GpuSum(GpuOp):
#thread.y = dim 1
#block.x = dim 0
#block.y = dim 1 rest
init = self._k_init(node, nodename)
decl = self._k_decl(node, nodename, pattern="010_inner")
reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
'blockDim.x')
reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
'blockDim.x')
print >> sio, """
%(decl)s
{
@@ -1602,7 +1630,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (1, 1, 0):
# this kernel uses one block for each column,
# threads per block for each element per column.
@@ -1615,7 +1643,8 @@ class GpuSum(GpuOp):
const int d0,
const int d1,
const int d2,
const float *A, const int sA0,
const int sA1, const int sA2,
float * Z, const int sZ0)
{
const int threadCount = blockDim.x * blockDim.y;
@@ -1642,7 +1671,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (1, 0, 0):
reducebuf = self._k_reduce_buf('Z[i1 * sZ0 + i2 * sZ1]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
@@ -1664,7 +1693,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (1, 1, 1):
reducebuf = self._k_reduce_buf('Z[0]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
@@ -1686,7 +1715,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (0, 0, 1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')
@@ -1695,7 +1724,8 @@ class GpuSum(GpuOp):
const int d0,
const int d1,
const int d2,
const float *A, const int sA0,
const int sA1, const int sA2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
@@ -1721,7 +1751,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (0, 0, 1, 1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')
@@ -1749,7 +1779,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (0, 1, 0, 1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]')
@@ -1777,7 +1807,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (1, 1, 1, 1):
reducebuf = self._k_reduce_buf('Z[0]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
@@ -1800,7 +1830,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (1, 0, 1, 1):
reducebuf = self._k_reduce_buf('Z[blockIdx.x*sZ0]')
print >> sio, """
static __global__ void kernel_reduce_sum_1011_%(nodename)s(
@@ -1808,7 +1838,8 @@ class GpuSum(GpuOp):
const unsigned int d1,
const unsigned int d2,
const unsigned int d3,
const float *A, const int sA0, const int sA1,
const int sA2, const int sA3,
float * Z, const int sZ0)
{
const int threadCount = blockDim.x * blockDim.y * blockDim.z;
@@ -1867,7 +1898,7 @@ class GpuSubtensor(tensor.Subtensor, GpuOp):
assert isinstance(x.type, CudaNdarrayType)
rval = tensor.Subtensor.make_node(self, x, *inputs)
otype = CudaNdarrayType(rval.outputs[0].type.broadcastable)
return Apply(self, [x] + rval.inputs[1:], [otype()])
def perform(self, node, inputs, out_):
out, = out_
@@ -1907,6 +1938,7 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
#If True or False, we assert that we use the take version or not
#If None, we choose the best one applicable
perform_using_take = None
max_threads = 0
def make_node(self, x, ilist):
x_ = as_cuda_ndarray_variable(x)
@@ -1946,9 +1978,18 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
idx = idx.view("float32")
idx = cuda_ndarray.cuda_ndarray.CudaNdarray(idx)
if self.max_threads == 0:
num = theano.sandbox.cuda.use.device_number
if device_properties(num)['regsPerBlock'] < (8192 * 2):
self.max_threads = 256
else:
self.max_threads = 512
o = x.take(idx, o = x.take(idx,
0, # axis 0, # axis
out_[0][0]) # return out_[0][0], # return
"raise",
self.max_threads)
if x is not x_orig: if x is not x_orig:
o = o.reshape(out_shape) o = o.reshape(out_shape)
out[0] = o out[0] = o
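For context, the register-count heuristic added above can be sketched in plain Python. This is an illustrative sketch only: `regs_per_block` here stands in for `device_properties(num)['regsPerBlock']`, and the `8192 * 2` threshold and 256/512 fallbacks mirror the values in the diff.

```python
def pick_max_threads(regs_per_block):
    """Choose a CUDA block size from the device's register count.

    Devices with fewer than 16384 registers per block get 256 threads
    per block so each thread keeps enough registers; larger devices
    get 512.
    """
    if regs_per_block < 8192 * 2:
        return 256
    return 512

# Example: an older device reporting 8192 registers per block
assert pick_max_threads(8192) == 256
assert pick_max_threads(16384) == 512
```

Caching the result on the Op (`self.max_threads`) means the device query runs only on the first `perform` call.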
@@ -2033,14 +2074,14 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp):
        assert isinstance(x.type, CudaNdarrayType)
        assert isinstance(y.type, CudaNdarrayType)
        rval = tensor.IncSubtensor.make_node(self, x, y, *inputs)
        return Apply(self, [x, y] + rval.inputs[2:], [x.type()])


class GpuFlatten(tensor.Flatten, GpuOp):
    """
    Implement Flatten on the gpu.
    """
    def make_node(self, x):
        assert isinstance(x.type, CudaNdarrayType)
        rval = tensor.Flatten.make_node(self, x)
        host_out_broadcastable = rval.outputs[0].type.broadcastable
@@ -2096,10 +2137,12 @@ class GpuJoin(tensor.Join, GpuOp):
            # dimension in "axis" can be different, so make equal for ==
            tmp_shape[axis] = template_shape[axis]
            if tuple(tmp_shape) != template_shape:
                raise ValueError("Shape of input CudaNdarrays must"
                                 " agree except for the 'axis' dimension")

        if len(template_shape) != node.outputs[0].type.ndim:
            raise ValueError("Number of dimension of input tensors disagree"
                             " with dimensions passed at graph creation time.")

        # final shape must be the same as all input tensors
        # except for the "axis" dimension, so we can simply
@@ -2110,7 +2153,8 @@ class GpuJoin(tensor.Join, GpuOp):
        # just to be explicit, check that dim=1 for broadcastable
        # dimensions
        for i, bcastable in enumerate(node.outputs[0].type.broadcastable):
            assert not bcastable or final_shape[i] == 1, (
                "Broadcastable dimension but dim != 1, this is invalid")

        rval = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(final_shape)
@@ -2120,9 +2164,9 @@ class GpuJoin(tensor.Join, GpuOp):
        # except for 'axis'
        def construct_slices(curlen):
            slices = [slice(None, None, None) for i in \
                      range(len(template_shape))]
            slices[axis] = slice(curpos, curpos + curlen, None)
            return tuple(slices)

        for i, cnda in enumerate(cndas):
@@ -2157,7 +2201,9 @@ class GpuAlloc(GpuOp):
        v = as_cuda_ndarray_variable(value)
        sh = [tensor.as_tensor_variable(s) for s in shape]
        if v.ndim != len(shape):
            raise TypeError(
                'GpuAlloc requires value of same dimensions as shape',
                value, len(shape))

        bcast = []
        for s in sh:
@@ -2170,7 +2216,7 @@ class GpuAlloc(GpuOp):
                const_shp = None
            bcast.append(numpy.all(1 == const_shp))
        otype = CudaNdarrayType(dtype='float32', broadcastable=bcast)
        return Apply(self, [v] + sh, [otype()])

    def perform(self, node, inputs, out_):
        out, = out_
@@ -2178,7 +2224,7 @@ class GpuAlloc(GpuOp):
        sh = tuple([int(i) for i in inputs[1:]])
        if out[0] is None or out[0].shape != sh:
            out[0] = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(sh)
        out[0][...] = v  # broadcast v to fill us up

    def c_code(self, node, name, inputs, out_, sub):
        out, = out_
@@ -2186,12 +2232,12 @@ class GpuAlloc(GpuOp):
        value = inputs[0]
        shps = inputs[1:]
        nd = len(shps)
        str = "int dims[%(nd)s];\n" % locals()
        for idx, sh in enumerate(shps):
            str += "dims[%(idx)s] = PyInt_AsLong((PyObject*)%(sh)s);\n" % locals()
        str += "if(%(out)s==NULL\n" % locals()
        for idx, sh in enumerate(shps):
            str += "||CudaNdarray_HOST_DIMS(%(out)s)[%(idx)s]!=dims[%(idx)s]" % locals()
        str += """){
            Py_XDECREF(%(out)s);
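The `perform` method of GpuAlloc relies on NumPy-style broadcasting to fill the freshly zeroed output (`out[0][...] = v`). A minimal NumPy sketch of that fill pattern, with a hypothetical helper name (`alloc_like_gpu_alloc` is not part of Theano):

```python
import numpy

def alloc_like_gpu_alloc(value, *shape):
    # Allocate a zeroed float32 buffer, then broadcast `value` into it,
    # mirroring GpuAlloc.perform's `out[0][...] = v`.
    out = numpy.zeros(shape, dtype='float32')
    out[...] = value  # `value` must broadcast to `shape`
    return out

# A (1, 3) row broadcasts along the first axis to fill a (2, 3) output,
# matching GpuAlloc's requirement that value.ndim == len(shape).
row = numpy.ones((1, 3), dtype='float32')
result = alloc_like_gpu_alloc(row, 2, 3)
```

The `ndim` check in `make_node` above exists precisely because broadcasting only lines up dimensions of equal rank here.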
@@ -2350,10 +2396,9 @@ def tensordot(a, b, axes=2):
            "Axes should be scalar valued or a list/tuple of len 2.",
            axes)

# Those are predefined CudaNdarrayType as done in tensor.basic.
# Useful mostly for tests, as the gpu ops are inserted automatically...


def scalar(name=None, dtype=None):
    """Return a symbolic scalar variable.

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2363,8 +2408,9 @@ def scalar(name=None, dtype=None):
        dtype = config.floatX
    type = CudaNdarrayType(dtype=dtype, broadcastable=())
    return type(name)
fscalar = CudaNdarrayType(dtype='float32', broadcastable=())


def vector(name=None, dtype=None):
    """Return a symbolic vector variable.

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2374,8 +2420,9 @@ def vector(name=None, dtype=None):
        dtype = config.floatX
    type = CudaNdarrayType(dtype=dtype, broadcastable=(False, ))
    return type(name)
fvector = CudaNdarrayType(dtype='float32', broadcastable=(False, ))


def matrix(name=None, dtype=None):
    """Return a symbolic matrix variable.

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2385,8 +2432,9 @@ def matrix(name=None, dtype=None):
        dtype = config.floatX
    type = CudaNdarrayType(dtype=dtype, broadcastable=(False, False))
    return type(name)
fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))


def row(name=None, dtype=None):
    """Return a symbolic row variable (ndim=2, broadcastable=[True,False]).

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2396,8 +2444,9 @@ def row(name=None, dtype=None):
        dtype = config.floatX
    type = CudaNdarrayType(dtype=dtype, broadcastable=(True, False))
    return type(name)
frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))


def col(name=None, dtype=None):
    """Return a symbolic column variable (ndim=2, broadcastable=[False,True]).

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2407,8 +2456,9 @@ def col(name=None, dtype=None):
        dtype = config.floatX
    type = CudaNdarrayType(dtype=dtype, broadcastable=(False, True))
    return type(name)
fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))


def tensor3(name=None, dtype=None):
    """Return a symbolic 3-D variable.

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2418,8 +2468,9 @@ def tensor3(name=None, dtype=None):
        dtype = config.floatX
    type = CudaNdarrayType(dtype=dtype, broadcastable=(False, False, False))
    return type(name)
ftensor3 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 3)


def tensor4(name=None, dtype=None):
    """Return a symbolic 4-D variable.

    :param dtype: numeric type (None means to use theano.config.floatX)
@@ -2430,6 +2481,7 @@ def tensor4(name=None, dtype=None):
    type = CudaNdarrayType(dtype=dtype,
                           broadcastable=(False, False, False, False))
    return type(name)
ftensor4 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 4)
@theano.compile.profilemode.register_profiler_printer
@@ -2446,22 +2498,24 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
    gpu = 0
    trans = 0
    for (_, node), t in apply_time.items():
        if isinstance(node.op.__class__.__name__,
                      (HostFromGpu, GpuFromHost)):
            trans += t
        elif node.op.__class__.__name__.lower().startswith("gpu"):
            gpu += t
        else:
            cpu += t
    print
    print " Spent %.3fs(%.3f%%) in cpu Op, %.3fs(%.3f%%) in gpu Op and %.3fs(%.3f%%) transfert Op" % (
        cpu, cpu / local_time * 100, gpu, gpu / local_time * 100,
        trans, trans / local_time * 100)

    print
    print " Theano function input that are float64"
    print " <fct name> <input name> <input type> <str input>"
    for fct in fct_call.keys():
        for i in fct.input_storage:
            if hasattr(i.type, 'dtype') and i.type.dtype == 'float64':
                print '    ', fct.name, i.name, i.type, i
    print
@@ -2470,5 +2524,13 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
    print ' <Apply> <Apply position> <fct name> <inputs type> <outputs type>'
    for fct in fct_call.keys():
        for idx, node in enumerate(fct.maker.fgraph.toposort()):
            if (any(hasattr(i, 'dtype') and i.dtype == 'float64'
                    for i in node.outputs) and
                not any(hasattr(i, 'dtype') and i.dtype == 'float64'
                        for i in node.inputs)):
                print '    ', str(node), idx, fct.name,
                print str([getattr(i, 'dtype', None)
                           for i in node.inputs]),
                print str([getattr(i, 'dtype', None)
                           for i in node.outputs])
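The reflowed multi-line condition keeps the old semantics: flag an Apply node whose outputs include float64 while none of its inputs do. A small self-contained sketch of that predicate, using stand-in objects with a `dtype` attribute (`Var` and `introduces_float64` are illustrative names, not Theano API):

```python
class Var(object):
    def __init__(self, dtype):
        self.dtype = dtype

def introduces_float64(inputs, outputs):
    # True when some output is float64 but no input is: the node
    # silently upcasts, which is what the profiler wants to report.
    def has64(vs):
        return any(hasattr(v, 'dtype') and v.dtype == 'float64'
                   for v in vs)
    return has64(outputs) and not has64(inputs)

assert introduces_float64([Var('float32')], [Var('float64')])
assert not introduces_float64([Var('float64')], [Var('float64')])
```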
@@ -758,8 +758,10 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
    PyObject * axis_obj = Py_None;
    PyObject * out_obj = Py_None;
    PyObject * clipmode_obj = NULL;
    int max_threads = 1; // max threads per blocks

    if (! PyArg_ParseTuple(args, "O|OOOi", &indices_obj, &axis_obj,
                           &out_obj, &clipmode_obj, &max_threads))
        return NULL;

    //Check argument indices
@@ -839,14 +841,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
    PyObject * axis_iobj = PyNumber_Long(axis_obj);
    if (!axis_iobj) {
        PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: axis must be convertable to a long");
        Py_DECREF(indices);
        return NULL;
    }
    long axis = PyInt_AsLong(axis_iobj);
    Py_DECREF(axis_iobj); axis_iobj=NULL;
    if (axis != 0) {
        PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: only axis=0 is currently supported");
        Py_DECREF(indices);
        return NULL;
    }
@@ -869,13 +871,13 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
    if (!out) {
        out = (CudaNdarray*)CudaNdarray_New();
        if (!out){
            Py_DECREF(indices);
            free(dims);
            return NULL;
        }
        if (CudaNdarray_alloc_contiguous(out, self->nd, dims)) {
            Py_DECREF(out);
            Py_DECREF(indices);
            free(dims);
            return NULL;
        }
...@@ -887,19 +889,20 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){ ...@@ -887,19 +889,20 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (clipmode_obj) { if (clipmode_obj) {
char * clipmode = PyString_AsString(clipmode_obj); char * clipmode = PyString_AsString(clipmode_obj);
if (! clipmode){ if (! clipmode){
Py_DECREF(indices_obj); Py_DECREF(indices);
Py_DECREF(out); Py_DECREF(out);
free(dims); free(dims);
return NULL; return NULL;
} }
if (strcmp(clipmode, "raise") != 0) { if (strcmp(clipmode, "raise") != 0) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: only the raise mode is currently supported"); PyErr_Format(PyExc_NotImplementedError,
Py_DECREF(indices_obj); "CudaNdarray_TakeFrom: only the raise mode is currently supported. Got '%s'",
clipmode);
Py_DECREF(indices);
Py_DECREF(out); Py_DECREF(out);
free(dims); free(dims);
return NULL; return NULL;
} }
Py_DECREF(clipmode_obj);
} }
void (*k3)(const int, const int, const int, void (*k3)(const int, const int, const int,
const npy_int64*, const npy_int64*,
@@ -913,7 +916,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
    if (err_var == NULL) {
        err_var = (int*)device_malloc(sizeof(int));
        if (!err_var) { // PyErr set by device_malloc
            Py_DECREF(indices);
            Py_DECREF(out);
            free(dims);
            return NULL;
@@ -928,7 +931,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
            PyErr_Format(PyExc_RuntimeError,
                         "Error setting device error code to 0. %s",
                         cudaGetErrorString(err));
            Py_DECREF(indices);
            Py_DECREF(out);
            free(dims);
            return NULL;
@@ -936,13 +939,16 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
    }
    dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(out)[0],65535),1,1);
    switch (self->nd) {
        case 1:
            {
                dim3 n_threads(1, 1, 1);
                if (verbose)
                    printf("cudaGetLastError=%d, nd=%d"
                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                           " n_threads.x=%i, n_threads.y=%i)\n",
                           self->nd, cudaGetLastError(),
                           n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                k3<<<n_blocks, n_threads>>>(
                    dims[0],
@@ -963,11 +969,15 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
            break;
        case 2:
            {
                dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads), 1, 1);
                if (verbose)
                    printf("cudaGetLastError=%d, nd=%d"
                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                           " n_threads.x=%i, n_threads.y=%i)\n",
                           cudaGetLastError(), self->nd,
                           n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                k3<<<n_blocks, n_threads>>>(
                    dims[0], //dimensions
                    dims[1],
@@ -987,12 +997,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
            break;
        case 3:
            {
                int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], max_threads);
                int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads / ty);
                dim3 n_threads(tx, ty, 1);
                if (verbose)
                    printf("cudaGetLastError=%d, nd=%d"
                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                           " n_threads.x=%i, n_threads.y=%i)\n",
                           self->nd, cudaGetLastError(),
                           n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                k3<<<n_blocks, n_threads>>>(
                    dims[0], //dimensions
@@ -1025,7 +1037,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
                     "Cuda error: %s: %s.\n",
                     "CudaNdarray_TakeFrom",
                     cudaGetErrorString(err));
        Py_DECREF(indices);
        Py_DECREF(out);
        return NULL;
    }
@@ -1040,7 +1052,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
                     "Cuda error: %s: %s when trying to get the error value.\n",
                     "CudaNdarray_TakeFrom",
                     cudaGetErrorString(err));
        Py_DECREF(indices);
        Py_DECREF(out);
        return NULL;
    }
@@ -1055,17 +1067,17 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
        err = cudaMemset((void*)err_var, 0, sizeof(int));
        if (cudaSuccess != err) {
            PyErr_Format(PyExc_MemoryError, "Error setting device error code to 0 after having an index error. %s", cudaGetErrorString(err));
            Py_DECREF(indices);
            Py_DECREF(out);
            return NULL;
        }
        Py_DECREF(indices);
        Py_DECREF(out);
        return NULL;
    }
    Py_DECREF(indices);
    if (verbose) printf("TAKE SUCCEDED\n");
    return (PyObject *)out;
@@ -7,6 +7,7 @@ import subprocess
import sys
import warnings

import theano
from theano.gof.cc import hash_from_file
from theano.gof.cmodule import (std_libs, std_lib_dirs,
                                std_include_dirs, dlimport,
@@ -119,6 +120,16 @@ class NVCC_compiler(object):
        cuda_ndarray_cuh_hash = hash_from_file(
            os.path.join(os.path.split(__file__)[0], 'cuda_ndarray.cuh'))
        flags.append('-DCUDA_NDARRAY_CUH=' + cuda_ndarray_cuh_hash)
        # We compile cuda_ndarray.cu during import, before a device has
        # been selected, so we must not query device properties at that
        # time.
        # TODO: compile cuda_ndarray when we bind to a GPU?
        import theano.sandbox.cuda
        if hasattr(theano.sandbox, 'cuda'):
            n = theano.sandbox.cuda.use.device_number
            p = theano.sandbox.cuda.device_properties(n)
            flags.append('-arch=sm_' + str(p['major']) + str(p['minor']))
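The new flag logic amounts to formatting the device's compute capability into an nvcc `-arch=sm_XY` flag. A hedged standalone sketch (`arch_flag` is a hypothetical helper; the `major`/`minor` keys follow the `device_properties` dict used above):

```python
def arch_flag(props):
    # Build the nvcc arch flag from a device-properties dict,
    # e.g. {'major': 2, 'minor': 0} -> '-arch=sm_20'
    return '-arch=sm_' + str(props['major']) + str(props['minor'])

assert arch_flag({'major': 2, 'minor': 0}) == '-arch=sm_20'
assert arch_flag({'major': 1, 'minor': 3}) == '-arch=sm_13'
```

Compiling for the bound device's own architecture lets nvcc emit native SASS instead of falling back to the default (older) target.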
        return flags

    @staticmethod
@@ -217,7 +228,9 @@ class NVCC_compiler(object):
        # '--gpu-code=compute_13',
        #nvcc argument
        preargs1 = [pa for pa in preargs
                    if pa.startswith('-O') or
                    pa.startswith('--maxrregcount=') or
                    pa.startswith('-arch=')]
        preargs2 = [pa for pa in preargs
                    if pa not in preargs1]  # other arguments
@@ -337,6 +350,7 @@ class NVCC_compiler(object):
                pass
            print >> sys.stderr, l
        print nvcc_stdout
        print cmd
        raise Exception('nvcc return status', p.returncode,
                        'for cmd', ' '.join(cmd))
    elif config.cmodule.compilation_warning and nvcc_stdout:
@@ -410,7 +410,8 @@ class T_Scan(unittest.TestCase):
        for step in xrange(1, 4):
            v_out[step] = v_u[step] * W_in + v_out[step - 1] * W
        theano_values = f2(v_u, v_x0, W_in, W)
        assert numpy.allclose(theano_values, v_out), (theano_values, v_out,
                                                      theano_values - v_out)

        # TO DEL
        topo = f2.maker.fgraph.toposort()
@@ -591,8 +592,8 @@ class T_Scan(unittest.TestCase):
            v_y[i] = numpy.dot(v_x[i - 1], vWout)
        (theano_x, theano_y) = f4(v_u1, v_u2, v_x0, v_y0, vW_in1)
        assert numpy.allclose(theano_x, v_x), (theano_x, v_x, theano_x - v_x)
        assert numpy.allclose(theano_y, v_y), (theano_y, v_y, theano_y - v_y)
    def test_multiple_outs_taps(self):
        l = 5
@@ -683,14 +684,13 @@ class T_Scan(unittest.TestCase):
        ny1[4] = (ny1[3] + ny1[1]) * numpy.dot(ny0[3], vWout)
        ny2[4] = numpy.dot(v_u1[4], vW_in1)

    def test_using_taps_sequence(self):
        # this test refers to a bug reported by Nicolas
        # Boulanger-Lewandowski June 6th
        x = theano.tensor.dvector()
        y, updates = theano.scan(lambda x: [x],
                                 sequences=dict(input=x, taps=[-1]),
                                 outputs_info=[None])
        inp = numpy.arange(5).astype('float64')
        rval = theano.function([x], y, updates=updates)(inp)
        assert numpy.all(rval == inp[:-1])
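The test's expectation makes sense once `taps=[-1]` is unrolled: step t sees `x[t-1]`, so over an input of length n the scan emits `x[0] .. x[n-2]`. A NumPy sketch of that equivalence (`scan_with_lag_one` is an illustrative name, not a Theano function):

```python
import numpy

def scan_with_lag_one(x):
    # Plain-Python equivalent of
    # theano.scan(lambda x: [x], sequences=dict(input=x, taps=[-1])):
    # each step returns the previous element of the sequence.
    return numpy.array([x[t - 1] for t in range(1, len(x))])

inp = numpy.arange(5).astype('float64')
assert numpy.all(scan_with_lag_one(inp) == inp[:-1])
```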
@@ -840,8 +840,10 @@ class T_Scan(unittest.TestCase):
        # equivalent is done
        (theano_x0, theano_x1) = f9(vu0, vu1, vu2, vx0, vx1)
        # assert that theano does what it should
        assert numpy.allclose(theano_x0, numpy_x0), (theano_x0, numpy_x0,
                                                     theano_x0 - numpy_x0)
        assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1,
                                                     theano_x1 - numpy_x1)

        # assert that it was done in place
        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@@ -940,11 +942,11 @@ class T_Scan(unittest.TestCase):
        vx1 = asarrayX(rng.uniform())
        x0 = theano.shared(vx0)
        x1 = theano.shared(vx1)
        outputs, updates = theano.scan(lambda x, y: (x + asarrayX(1),
                                                     y + asarrayX(1)),
                                       [],
                                       [x0, x1],
                                       n_steps=3)
        x0 = asarrayX(numpy.zeros((3,)))
        x0[0] = vx0
        x0 = theano.tensor.constant(x0)
@@ -2447,7 +2449,6 @@ class T_Scan(unittest.TestCase):
        v_eW = numpy.array(rng.uniform(size=(5, 5)) - .5, dtype=floatX)
        v_eh0 = numpy.array(rng.uniform(size=(5,)) - .5, dtype=floatX)

        def rnn_fn(_u, _y, _W):
            srng = theano.tensor.shared_randomstreams.RandomStreams(seed)
@@ -55,3 +55,5 @@ from theano.gradient import Rop, Lop, grad, numeric_grad, verify_grad, \
    jacobian, hessian
from theano.tensor.sort import sort
from extra_ops import (DiffOp, bincount, squeeze,
                       repeat, bartlett, fill_diagonal)
@@ -3,8 +3,8 @@ import numpy
import theano
import basic
from theano import gof, scalar
import basic as tensor


class DiffOp(theano.Op):
@@ -446,7 +446,9 @@ class FillDiagonal(gof.Op):
            raise NotImplementedError('%s: gradient is currently implemented'
                                      ' for matrices only' % self.__class__.__name__)
        wr_a = fill_diagonal(grad, 0)  # valid for any number of dimensions
        # diag is only valid for matrices
        import theano.sandbox.linalg
        wr_val = theano.sandbox.linalg.ops.diag(grad).sum()
        return [wr_a, wr_val]

fill_diagonal_ = FillDiagonal()
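The `wr_val` term above follows from the chain rule: the scalar fill value lands on every diagonal cell of the output, so its gradient is the sum of the output gradient's diagonal. A NumPy sketch of that reasoning (`fill_diagonal_grad_val` is a hypothetical helper, not Theano API):

```python
import numpy

def fill_diagonal_grad_val(grad):
    # Gradient of FillDiagonal w.r.t. the scalar fill value: summing the
    # diagonal of the output gradient collects every cell the value
    # touched (matrices only, matching the NotImplementedError above).
    return numpy.diagonal(grad).sum()

g = numpy.arange(9.0).reshape(3, 3)
assert fill_diagonal_grad_val(g) == 0.0 + 4.0 + 8.0
```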