Commit 72a7214a authored by lamblin

Merge pull request #863 from nouiz/mixed2

Mixed2
......@@ -2,148 +2,7 @@
Updates in the Trunk since the last release:
Bug fixes
* Outputs of Scan nodes could contain corrupted values: some parts of the
output would be repeated a second time, instead of the correct values.
It happened randomly, and quite infrequently, but the bug has been present
(both in Python and Cython) since April 2011. (Pascal L.)
* In Sparse sandbox, fix the grad of theano.sparse.sandbox.sp.row_scale.
It did not return the right number of elements. (Frederic B.)
* set_subtensor(x[int vector], new_value) when moved to the GPU
was transformed into inc_subtensor on the GPU. Now we have a correct
(but slow) GPU implementation.
Note 1: set_subtensor(x[slice[,...]], new_value) was working correctly
in all cases as well as inc_subtensor(*, *).
Note 2: If your code was affected by the incorrect behavior, we now print
a warning by default (Frederic B.)
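The set/inc distinction above can be illustrated with a minimal plain-Python sketch (this is not the Theano API, just the semantics of writing vs. accumulating through an index vector):

```python
def set_subtensor_sketch(x, idx, values):
    """Return a copy of x with x[i] = v for each (i, v) pair."""
    out = list(x)
    for i, v in zip(idx, values):
        out[i] = v          # overwrite: a repeated index keeps the last value
    return out

def inc_subtensor_sketch(x, idx, values):
    """Return a copy of x with x[i] += v for each (i, v) pair."""
    out = list(x)
    for i, v in zip(idx, values):
        out[i] += v         # accumulate: repeated indices add up
    return out

x = [0.0, 0.0, 0.0]
# With a repeated index, set keeps the last write while inc sums both,
# which is why silently turning one into the other gave wrong results:
assert set_subtensor_sketch(x, [1, 1], [2.0, 3.0]) == [0.0, 3.0, 0.0]
assert inc_subtensor_sketch(x, [1, 1], [2.0, 3.0]) == [0.0, 5.0, 0.0]
```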
* Fixed an issue whereby config values were used as default arguments,
with those defaults then stuck at old values if the config variables were
changed during program execution. (David W-F)
* Fixed many subtle bugs involving mutable default arguments which may have
led to unexpected behaviour, such as objects sharing instance variables
they were not supposed to share. (David W-F)
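The pitfall behind these fixes is the classic Python one: a mutable default argument is created once, at function definition time, and is then shared by every call. A minimal illustration (not Theano code):

```python
def buggy_append(item, acc=[]):      # one shared list for ALL calls
    acc.append(item)
    return acc

def fixed_append(item, acc=None):    # the usual fix: default to None
    if acc is None:
        acc = []                     # fresh list on each call
    acc.append(item)
    return acc

assert buggy_append(1) == [1]
assert buggy_append(2) == [1, 2]     # state leaked from the first call
assert fixed_append(1) == [1]
assert fixed_append(2) == [2]        # independent calls, as intended
```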
* Correctly record the GPU device number used when we let the driver select it.
(Frederic B.)
Documentation
* Added in the tutorial documentation on how to extend Theano.
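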
This explains how to make a Theano Op from a Python function.
http://deeplearning.net/software/theano/tutorial/extending_theano.html
(Frédéric B.)
* New installation instructions for Windows using EPD (Pascal L.)
Interface changes
* In 0.5, we removed the deprecated sharedvar.value property.
Now we raise an error if you access it. (Frederic B.)
* theano.function does not accept duplicate inputs, so function([x, x], ...)
does not work anymore. (Pascal L.)
* theano.function now raises an error if some of the provided inputs are
not part of the computational graph needed to compute the output, for
instance, function([x, y], [y]). You can use the kwarg
``on_unused_input={'raise', 'warn', 'ignore'}`` to control this.
(Pascal L.)
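The three-way policy can be sketched as follows (a simplified stand-in for what theano.function does internally, not its actual code; names here are illustrative):

```python
import warnings

def check_unused_inputs(inputs, used, on_unused_input='raise'):
    """Sketch: decide what to do about declared inputs that are not part
    of the graph needed to compute the outputs."""
    unused = [name for name in inputs if name not in used]
    if unused and on_unused_input == 'raise':
        raise ValueError("unused inputs: %s" % unused)
    if unused and on_unused_input == 'warn':
        warnings.warn("unused inputs: %s" % unused)
    return unused  # 'ignore': accept silently

# function([x, y], [y]) declares x but never uses it:
assert check_unused_inputs(['x', 'y'], ['y'], 'ignore') == ['x']
try:
    check_unused_inputs(['x', 'y'], ['y'], 'raise')
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```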
* New Theano flag "on_unused_input" that defines the default value of the
previous point. (Frederic B.)
* tensor.alloc() now raises an error at graph build time
when we try to create fewer dimensions than the provided value has.
In the past, the error was raised at run time.
(Frederic B.)
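Moving the check to graph build time amounts to validating ndim against the requested shape before any data exists. A hedged sketch of that validation (illustrative only, not the tensor.alloc implementation):

```python
def alloc_ndim_check(value_ndim, shape):
    """Sketch of the build-time check: alloc cannot broadcast a value
    into FEWER dimensions than it already has."""
    if value_ndim > len(shape):
        raise TypeError("alloc: value has %d dimensions but shape has "
                        "only %d" % (value_ndim, len(shape)))
    return tuple(shape)

assert alloc_ndim_check(2, (3, 4, 5)) == (3, 4, 5)
try:
    alloc_ndim_check(3, (3, 4))     # 3-d value into a 2-d shape: error
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
```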
Speed up
* Convolution on the GPU now checks the generation of the card to make
it faster in some cases (especially for medium/big output images). (Frédéric B.)
(We had hardcoded 512 as the maximum number of threads per block; newer cards
support up to 1024 threads per block.)
* CPU convolutions are now parallelized. (Frédéric B.)
By default, all cores/hyper-threads are used.
To control this, use the OMP_NUM_THREADS=N environment variable.
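How OMP_NUM_THREADS interacts with the core count can be sketched in a few lines (this mimics the usual OpenMP convention; it is not Theano's own thread-selection code):

```python
import multiprocessing
import os

def cpu_thread_count(default=None):
    """Sketch: OMP_NUM_THREADS, when set, overrides the default of
    using all cores/hyper-threads."""
    env = os.environ.get("OMP_NUM_THREADS")
    if env is not None:
        return int(env)
    if default is not None:
        return default
    return multiprocessing.cpu_count()

os.environ["OMP_NUM_THREADS"] = "2"
assert cpu_thread_count() == 2          # env var wins
del os.environ["OMP_NUM_THREADS"]
assert cpu_thread_count(default=4) == 4  # fall back when unset
```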
New Features
* debugprint new param ids=["CHAR", "id", "int", ""].
This controls whether the printed identifier is a unique char, the Python
id, a unique int, or omitted entirely. We changed the default to "CHAR"
as this is more readable. (Frederic B.)
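The "CHAR" mode boils down to assigning each distinct node a short stable letter instead of its long Python id. A minimal sketch of that bookkeeping (illustrative, not the debugprint implementation):

```python
import string

def assign_char_ids(nodes):
    """Sketch: give each distinct node a unique one-letter identifier,
    reusing the same letter when the node reappears in the printout."""
    table = {}
    for node in nodes:
        if node not in table:
            table[node] = string.ascii_uppercase[len(table)]
    return [table[n] for n in nodes]

# A node that appears twice keeps the same letter, making shared
# subgraphs easy to spot:
assert assign_char_ids(["add", "mul", "add"]) == ["A", "B", "A"]
```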
* debugprint new param stop_on_name=[False, True]. If True, we don't print
anything below an intermediate variable that has a name. Defaults to False.
(Frederic B.)
* debugprint no longer prints the "|" symbol in a column after the last input. (Frederic B.)
* If you use Enthought Python Distribution (EPD) now we use its blas
implementation by default (tested on Linux and Windows)
(Frederic B., Simon McGregor)
* MRG random now raises an error with a clear message when the passed shape
contains dimensions with bad values, such as 0. (Frédéric B., reported by Ian G.)
* "CudaNdarray[*] = ndarray" works in more cases (Frederic B.)
* "CudaNdarray[*] += ndarray" works in more cases (Frederic B.)
* We add dimensions to CudaNdarray to automatically broadcast more frequently.
(Frederic B.)
* theano.tensor.argsort that wraps numpy.argsort (Hani Almousli).
* New Theano flag cmodule.warn_no_version. Default False. If True,
a warning is printed when compiling one or more Ops with C code that
can't be cached because no c_code_cache_version() function is
associated with at least one of those Ops. (Frederic B.)
* CPU alloc now always generates C code (Pascal L.)
* New Theano flag cmodule.warn_no_version=False. When True, warn when an Op
with C code is not versioned (which forces recompiling it every time).
(Frédéric B.)
* Made a few Ops with C code versioned to reduce compilation time.
(Frédéric B, Pascal L.)
* C code reuses preallocated outputs (only done by Scan) (Pascal L.)
* Garbage collection of intermediate results during Theano function calls
for Ops with C code (Pascal L.)
* The Theano flag compiledir_format now supports the parameter numpy_version.
* Theano GPU variables, shared variables and constants now support <, <=,
> and >=, like those not on the GPU.
Sparse
* Implement theano.sparse.mul(sparse1, sparse2) when both inputs don't
have the same sparsity pattern. (Frederic B.)
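Elementwise multiplication of two sparse operands with different sparsity patterns is well defined because the product is nonzero only where both are stored. A toy sketch using dict-of-coordinates storage (not scipy/Theano sparse types):

```python
def sparse_mul(a, b):
    """Sketch: elementwise product of two sparse 'matrices' stored as
    {(row, col): value} dicts with possibly different patterns.  The
    result pattern is the intersection of the two input patterns."""
    return dict((k, a[k] * b[k]) for k in a if k in b)

a = {(0, 0): 2.0, (1, 1): 3.0}
b = {(0, 0): 4.0, (0, 1): 5.0}
# Only (0, 0) is stored in both, so it is the only output entry:
assert sparse_mul(a, b) == {(0, 0): 8.0}
```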
Sparse Sandbox graduate
* Remove0 op: it removes stored elements with value 0. (Frederic B.)
Sparse Sandbox Additions (not reviewed/documented/tested, but used by some people)
* They are all in the theano.sparse.sandbox.sp2 module
* Op class: Cast, Poisson, Multinomial, EliminateZeros, Sum, Binomial
* Op class: SamplingDot, SamplingDotCsr (inserted automatically)
* Op function: structured_sigmoid, structured_exp, structured_pow, structured_minimum
* Op class: StructuredAddSV, StrucutedAddSVCSR (inserted automatically)
* opt: local_sampling_dot_csr, local_structured_add_s_v
Internal changes
* Define new exceptions MissingInputError and UnusedInputError, and use them
in theano.function, instead of TypeError and ValueError. (Pascal L.)
* Better handling of bitwidth and max values of integers and pointers
across platforms (Pascal L.)
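"Handling bitwidth across platforms" means querying the sizes of C types rather than assuming them, e.g. a C long is 32 bits on 64-bit Windows but 64 bits on 64-bit Linux. A small stdlib sketch of such queries:

```python
import struct

# Query C type sizes instead of hardcoding them:
pointer_bits = struct.calcsize("P") * 8   # pointer width: 32 or 64
long_bits = struct.calcsize("l") * 8      # C long: differs on Win64
int_max = 2 ** (struct.calcsize("i") * 8 - 1) - 1  # max signed C int

assert pointer_bits in (32, 64)
assert int_max >= 2 ** 31 - 1
```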
Crash Fix
* Do not try to use the BLAS library when blas.ldflags is manually set to an
empty string (Frederic B.)
* When importing theano on a computer without a GPU, with the Theano
flags 'device' or 'init_gpu_device' set to gpu* (Frederic B., reported by Luo Heng)
* Optimization printed a useless error when scipy was not available. (Frederic B.)
* GPU conv crash/slowdown on newer hardware (James B.)
* Better error handling in GPU conv (Frederic B.)
* GPU optimization that moves element-wise Ops to the GPU. Crash happened in
a particular execution order of this optimization and the
element-wise fusion optimization when upcasting some inputs to
float32 (to compute them on the GPU).
(Frederic B., reported by Sander Dieleman)
* GpuReshape in some particular case when the input is not contiguous
(Frederic B., reported by Sander Dieleman)
* GpuSoftmaxWithBias with shape (0, N) with N > 1.
(Frédéric B., reported by Razvan P.)
* Fix crash under 64-bit Windows, when taking subtensors of the form a[n:]
(Pascal L., reported by Simon McGregor)
* Fixed issue with the MaxAndArgmax Op not properly preserving broadcastable
dimensions, which could typically result in optimization crashes (Olivier D.)
* Fixed crash when concatenating some arrays with specific broadcasting
patterns (Olivier D.)
* Work around a known issue with nvcc 4.1 on Mac OS X. (Graham Taylor)
* In advanced indexing, if some inputs are constant, no need to call constant(...)
on their value any more. (Pascal L., reported by John Salvatier)
* Fix crash on GPU when GpuSubtensor did not set the right stride
when the result tensor had a dimension with size of 1. (Pascal L.,
reported by Graham T.)
https://github.com/Theano/Theano/wiki/Devnews
=============
Release Notes
......
......@@ -26,6 +26,9 @@ with the option time_profile=True to conduct time-profiling of the tests.
option will be interpreted as an indication of the number of tests to be run
between notifications of progress to standard output.
If the '--theano' option is used, it is replaced with the path to theano.
Useful if you don't know where it was installed.
`run_tests_in_batch.py` will in turn call back this script in another process.
"""
......@@ -39,6 +42,12 @@ import sys
from nose.plugins import Plugin
def main():
# Handle the --theano arguments
if "--theano" in sys.argv:
i = sys.argv.index("--theano")
import theano
sys.argv[i] = theano.__path__[0]
# Handle --batch[=n] arguments
batch_args = [arg for arg in sys.argv if arg.startswith('--batch')]
for arg in batch_args:
......@@ -137,6 +146,11 @@ def help():
--without-knownfailure: Do not load the KnownFailure plugin.
--theano: This parameter is replaced with the path to the theano library.
As theano-nose is a wrapper around nosetests, it expects a path to the tests to run.
If you don't know where theano is installed, use this option
to have it inserted automatically.
The other options will be passed to nosetests, see ``nosetests -h``.
"""
......
......@@ -37,7 +37,7 @@ compiledir_format_dict = {"platform": platform.platform(),
"python_version": platform.python_version(),
"theano_version": theano.__version__,
"numpy_version": numpy.__version__,
"g++": gcc_version_str.replace(" ", "_"),
"gxx_version": gcc_version_str.replace(" ", "_"),
}
compiledir_format_keys = ", ".join(compiledir_format_dict.keys())
default_compiledir_format =\
......
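The hunk above renames the "g++" key to "gxx_version" and the dict feeds a %-style format string. A hedged sketch of how such a compiledir format string is expanded (the values here are illustrative, not read from a real installation):

```python
import platform

# The compiledir_format flag is a Python %-format string filled in from
# a dict of keys like the ones in the hunk above:
compiledir_format_dict = {
    "platform": platform.platform(),
    "numpy_version": "1.6.2",      # illustrative value
    "gxx_version": "4.6.3",        # illustrative value
}
fmt = "compiledir_%(platform)s-%(numpy_version)s"
compiledir = fmt % compiledir_format_dict
assert compiledir.endswith("-1.6.2")
```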
......@@ -11,7 +11,7 @@ from theano import tensor, scalar, config
from theano.gof.python25 import all, any
from theano.sandbox.cuda import GpuOp
from theano.sandbox.cuda import GpuOp, device_properties
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda import filter as type_support_filter
......@@ -641,7 +641,9 @@ class GpuSum(GpuOp):
printf("running kernel_reduce_sum_%(pattern)s_%(name)s\\n");
int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
if (verbose>1)
printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d, nb_block=%%d, n_shared=%%d\\n",
printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d,"
" nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d,"
" nb_block=%%d, n_shared=%%d\\n",
n_threads.x,n_threads.y,n_threads.z,
n_threads.x*n_threads.y*n_threads.z,
n_blocks.x,n_blocks.y,
......@@ -673,7 +675,8 @@ class GpuSum(GpuOp):
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_%(pattern)s_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
......@@ -876,7 +879,8 @@ class GpuSum(GpuOp):
std::min(CudaNdarray_SIZE(%(x)s),
NUM_VECTOR_OP_THREADS_PER_BLOCK));
dim3 n_blocks(1);
if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s n_threads.x=%%d, size=%%d, ndim=%%d\\n",
if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s"
" n_threads.x=%%d, size=%%d, ndim=%%d\\n",
n_threads.x,CudaNdarray_SIZE(%(x)s),%(x)s->nd);
int n_shared = sizeof(float) * n_threads.x;
kernel_reduce_sum_ccontig_%(name)s<<<n_blocks, n_threads, n_shared>>>(
......@@ -887,7 +891,9 @@ class GpuSum(GpuOp):
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_ccontig_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
......@@ -937,11 +943,13 @@ class GpuSum(GpuOp):
:param N: the number of 1s in the pattern: N=1 -> 01, N=2 -> 011, N=3 -> 0111
Works for N=1, 2, 3
"""
assert N in [1,2,3]
assert N in [1, 2, 3]
makecall = self._makecall(node, name, x, z, fail)
N_pattern = ''.join(['1']*N)
param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals() for i in xrange(N+1)])
strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]" % locals() for i in xrange(N+1)])
N_pattern = ''.join(['1'] * N)
param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals()
for i in xrange(N + 1)])
strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]"
% locals() for i in xrange(N + 1)])
threads_y = """
//get as many y threads as we can fit
while (n_threads.x * (n_threads.y+1) <= NUM_VECTOR_OP_THREADS_PER_BLOCK)
......@@ -962,10 +970,10 @@ class GpuSum(GpuOp):
break;
}
""" % locals()
if len(self.reduce_mask)==2:
if len(self.reduce_mask) == 2:
threads_y = ''
threads_z = ''
if len(self.reduce_mask)==3:
if len(self.reduce_mask) == 3:
threads_z = ''
print >> sio, """
{
......@@ -975,15 +983,18 @@ class GpuSum(GpuOp):
NUM_VECTOR_OP_THREADS_PER_BLOCK));
%(threads_y)s
%(threads_z)s
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS));
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_BLOCKS));
%(makecall)s
}
""" % locals()
def c_code_reduce_01(self, sio, node, name, x, z, fail):
self.c_code_reduce_01X(sio, node, name, x, z, fail, 1)
def c_code_reduce_011(self, sio, node, name, x, z, fail):
self.c_code_reduce_01X(sio, node, name, x, z, fail, 2)
def c_code_reduce_0111(self, sio, node, name, x, z, fail):
self.c_code_reduce_01X(sio, node, name, x, z, fail, 3)
......@@ -1021,7 +1032,9 @@ class GpuSum(GpuOp):
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
......@@ -1033,9 +1046,11 @@ class GpuSum(GpuOp):
}
}
""" % locals()
def c_code_reduce_010(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
makecall_inner = self._makecall(node, name, x, z, fail, pattern="010_inner")
makecall_inner = self._makecall(node, name, x, z, fail,
pattern="010_inner")
pattern = ''.join(str(i) for i in self.reduce_mask)
print >> sio, """
{
......@@ -1085,7 +1100,9 @@ class GpuSum(GpuOp):
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
PyErr_Format(PyExc_RuntimeError,
"Cuda error: %%s: %%s."
" (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
......@@ -1233,6 +1250,7 @@ class GpuSum(GpuOp):
%(makecall)s
}
""" % locals()
def c_code_reduce_111(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
......@@ -1275,7 +1293,8 @@ class GpuSum(GpuOp):
std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_BLOCKS));
while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS && n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS &&
n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
{
n_blocks.y += 1;
}
......@@ -1356,7 +1375,7 @@ class GpuSum(GpuOp):
def c_support_code_apply(self, node, nodename):
sio = StringIO.StringIO()
nd_in = len(self.reduce_mask)
if all(i==1 for i in self.reduce_mask):
if all(i == 1 for i in self.reduce_mask):
#this kernel is ok for up to a few thousand elements, but
# it only runs on ONE multiprocessor
reducebuf = self._k_reduce_buf('Z[0]')
......@@ -1411,7 +1430,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (1,1):
if self.reduce_mask == (1, 1):
#this kernel is ok for up to a few thousand elements, but
# it only runs on ONE multiprocessor
reducebuf = self._k_reduce_buf('Z[0]')
......@@ -1444,29 +1463,33 @@ class GpuSum(GpuOp):
}
""" % locals()
#01, 011, 0111
if 0 == self.reduce_mask[0] and all(self.reduce_mask[1:]) and nd_in in[2,3,4]:
if (0 == self.reduce_mask[0] and
all(self.reduce_mask[1:]) and
nd_in in[2, 3, 4]):
# this kernel uses one block for each row.
# threads per block for each element per row.
N_pattern = ''.join(['1']*(nd_in-1))
if nd_in==2:
N_pattern = ''.join(['1'] * (nd_in - 1))
if nd_in == 2:
for_i1 = "for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)"
for_i2="int i2=0, sA2=0;"
for_i3="int i3=0, sA3=0;"
if nd_in==3:
for_i2 = "int i2=0, sA2=0;"
for_i3 = "int i3=0, sA3=0;"
if nd_in == 3:
for_i1 = "for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)"
for_i2 = "for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x)"
for_i3="int i3=0, sA3=0;"
if nd_in==4:
for_i3 = "int i3=0, sA3=0;"
if nd_in == 4:
for_i1 = "for (int i1 = threadIdx.z; i1 < d1; i1 += blockDim.z)"
for_i2 = "for (int i2 = threadIdx.y; i2 < d2; i2 += blockDim.y)"
for_i3 = "for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)"
reducebuf = self._k_reduce_buf('Z[i0 * sZ0]')
param_dim = ",".join(["const int d%(i)s" % locals() for i in xrange(nd_in)])
param_strides = ",".join(["const int sA%(i)s" % locals() for i in xrange(nd_in)])
decl = self._k_decl(node,nodename)
init = self._k_init(node,nodename)
param_dim = ",".join(["const int d%(i)s" % locals()
for i in xrange(nd_in)])
param_strides = ",".join(["const int sA%(i)s" % locals()
for i in xrange(nd_in)])
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
print >> sio, """
%(decl)s{
%(init)s
......@@ -1484,7 +1507,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (0,1,0) or self.reduce_mask == (1,0):
if self.reduce_mask == (0, 1, 0) or self.reduce_mask == (1, 0):
# this kernel uses one block for each column,
# threads per block for each element per column.
......@@ -1497,7 +1520,8 @@ class GpuSum(GpuOp):
const int d0,
const int d1,
const int d2,
const float *A, const int sA0, const int sA1, const int sA2,
const float *A, const int sA0,
const int sA1, const int sA2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
......@@ -1525,7 +1549,7 @@ class GpuSum(GpuOp):
}
""" % locals()
if self.reduce_mask == (0,1,0):
if self.reduce_mask == (0, 1, 0):
print >> sio, """
static __global__ void kernel_reduce_sum_010_AD_%(nodename)s(
const int A,
......@@ -1533,7 +1557,8 @@ class GpuSum(GpuOp):
const int C,
const int D,
//const int E, // THIS is 32
const float *X, const int sX0, const int sX1, const int sX2,
const float *X, const int sX0,
const int sX1, const int sX2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
......@@ -1564,9 +1589,10 @@ class GpuSum(GpuOp):
}
""" % locals()
if self.reduce_mask == (0,1,0):
if self.reduce_mask == (0, 1, 0):
#
# This kernel is optimized when the inner most dimensions have the smallest stride.
# This kernel is optimized when the inner most dimensions
# have the smallest stride.
# this kernel uses one block for multiple column(up to 32TODO),
# threads per block for each element per column.
......@@ -1575,10 +1601,12 @@ class GpuSum(GpuOp):
#thread.y = dim 1
#block.x = dim 0
#block.y = dim 1 rest
init = self._k_init(node,nodename)
init = self._k_init(node, nodename)
decl = self._k_decl(node, nodename, pattern="010_inner")
reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]','blockDim.x')
reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
'blockDim.x')
print >> sio, """
%(decl)s
{
......@@ -1602,7 +1630,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (1,1,0):
if self.reduce_mask == (1, 1, 0):
# this kernel uses one block for each column,
# threads per block for each element per column.
......@@ -1615,7 +1643,8 @@ class GpuSum(GpuOp):
const int d0,
const int d1,
const int d2,
const float *A, const int sA0, const int sA1, const int sA2,
const float *A, const int sA0,
const int sA1, const int sA2,
float * Z, const int sZ0)
{
const int threadCount = blockDim.x * blockDim.y;
......@@ -1642,7 +1671,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (1,0,0):
if self.reduce_mask == (1, 0, 0):
reducebuf = self._k_reduce_buf('Z[i1 * sZ0 + i2 * sZ1]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
......@@ -1664,7 +1693,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (1,1,1):
if self.reduce_mask == (1, 1, 1):
reducebuf = self._k_reduce_buf('Z[0]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
......@@ -1686,7 +1715,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (0,0,1):
if self.reduce_mask == (0, 0, 1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')
......@@ -1695,7 +1724,8 @@ class GpuSum(GpuOp):
const int d0,
const int d1,
const int d2,
const float *A, const int sA0, const int sA1, const int sA2,
const float *A, const int sA0,
const int sA1, const int sA2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
......@@ -1721,7 +1751,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (0,0,1,1):
if self.reduce_mask == (0, 0, 1, 1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')
......@@ -1749,7 +1779,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (0,1,0,1):
if self.reduce_mask == (0, 1, 0, 1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]')
......@@ -1777,7 +1807,7 @@ class GpuSum(GpuOp):
}
}
""" % locals()
if self.reduce_mask == (1,1,1,1):
if self.reduce_mask == (1, 1, 1, 1):
reducebuf = self._k_reduce_buf('Z[0]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
......@@ -1800,7 +1830,7 @@ class GpuSum(GpuOp):
%(reducebuf)s
}
""" % locals()
if self.reduce_mask == (1,0,1,1):
if self.reduce_mask == (1, 0, 1, 1):
reducebuf = self._k_reduce_buf('Z[blockIdx.x*sZ0]')
print >> sio, """
static __global__ void kernel_reduce_sum_1011_%(nodename)s(
......@@ -1808,7 +1838,8 @@ class GpuSum(GpuOp):
const unsigned int d1,
const unsigned int d2,
const unsigned int d3,
const float *A, const int sA0, const int sA1, const int sA2, const int sA3,
const float *A, const int sA0, const int sA1,
const int sA2, const int sA3,
float * Z, const int sZ0)
{
const int threadCount = blockDim.x * blockDim.y * blockDim.z;
......@@ -1867,7 +1898,7 @@ class GpuSubtensor(tensor.Subtensor, GpuOp):
assert isinstance(x.type, CudaNdarrayType)
rval = tensor.Subtensor.make_node(self, x, *inputs)
otype = CudaNdarrayType(rval.outputs[0].type.broadcastable)
return Apply(self, [x]+rval.inputs[1:], [otype()])
return Apply(self, [x] + rval.inputs[1:], [otype()])
def perform(self, node, inputs, out_):
out, = out_
......@@ -1907,6 +1938,7 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
#If True or False, we assert that we use the take version or not
#If None, we choose the best one applicable
perform_using_take = None
max_threads = 0
def make_node(self, x, ilist):
x_ = as_cuda_ndarray_variable(x)
......@@ -1946,9 +1978,18 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
idx = idx.view("float32")
idx = cuda_ndarray.cuda_ndarray.CudaNdarray(idx)
if self.max_threads == 0:
num = theano.sandbox.cuda.use.device_number
if device_properties(num)['regsPerBlock'] < (8192 * 2):
self.max_threads = 256
else:
self.max_threads = 512
o = x.take(idx,
0, # axis
out_[0][0]) # return
out_[0][0], # return
"raise",
self.max_threads)
if x is not x_orig:
o = o.reshape(out_shape)
out[0] = o
......@@ -2033,14 +2074,14 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp):
assert isinstance(x.type, CudaNdarrayType)
assert isinstance(y.type, CudaNdarrayType)
rval = tensor.IncSubtensor.make_node(self, x, y, *inputs)
return Apply(self, [x,y]+rval.inputs[2:], [x.type()])
return Apply(self, [x, y] + rval.inputs[2:], [x.type()])
class GpuFlatten(tensor.Flatten, GpuOp):
"""
Implement Flatten on the gpu.
"""
def make_node(self, x ):
def make_node(self, x):
assert isinstance(x.type, CudaNdarrayType)
rval = tensor.Flatten.make_node(self, x)
host_out_broadcastable = rval.outputs[0].type.broadcastable
......@@ -2096,10 +2137,12 @@ class GpuJoin(tensor.Join, GpuOp):
# dimension in "axis" can be different, so make equal for ==
tmp_shape[axis] = template_shape[axis]
if tuple(tmp_shape) != template_shape:
raise ValueError, "Shape of input CudaNdarrays must agree except for the 'axis' dimension"
raise ValueError("Shape of input CudaNdarrays must"
" agree except for the 'axis' dimension")
if len(template_shape) != node.outputs[0].type.ndim:
raise ValueError, "Number of dimension of input tensors disagree with dimensions passed at graph creation time."
raise ValueError("Number of dimension of input tensors disagree"
" with dimensions passed at graph creation time.")
# final shape must be the same as all input tensors
# except for the "axis" dimension, so we can simply
......@@ -2110,7 +2153,8 @@ class GpuJoin(tensor.Join, GpuOp):
# just to be explicit, check that dim=1 for broadcastable
# dimensions
for i, bcastable in enumerate(node.outputs[0].type.broadcastable):
assert not bcastable or final_shape[i] == 1, "Broadcastable dimension but dim != 1, this is invalid"
assert not bcastable or final_shape[i] == 1, (
"Broadcastable dimension but dim != 1, this is invalid")
rval = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(final_shape)
......@@ -2120,9 +2164,9 @@ class GpuJoin(tensor.Join, GpuOp):
# except for 'axis'
def construct_slices(curlen):
slices = [slice(None,None,None) for i in \
slices = [slice(None, None, None) for i in \
range(len(template_shape))]
slices[axis] = slice(curpos,curpos+curlen,None)
slices[axis] = slice(curpos, curpos + curlen, None)
return tuple(slices)
for i, cnda in enumerate(cndas):
......@@ -2157,7 +2201,9 @@ class GpuAlloc(GpuOp):
v = as_cuda_ndarray_variable(value)
sh = [tensor.as_tensor_variable(s) for s in shape]
if v.ndim != len(shape):
raise TypeError('GpuAlloc requires value of same dimensions as shape', value, len(shape))
raise TypeError(
'GpuAlloc requires value of same dimensions as shape',
value, len(shape))
bcast = []
for s in sh:
......@@ -2170,7 +2216,7 @@ class GpuAlloc(GpuOp):
const_shp = None
bcast.append(numpy.all(1 == const_shp))
otype = CudaNdarrayType(dtype='float32', broadcastable=bcast)
return Apply(self, [v]+sh, [otype()])
return Apply(self, [v] + sh, [otype()])
def perform(self, node, inputs, out_):
out, = out_
......@@ -2178,7 +2224,7 @@ class GpuAlloc(GpuOp):
sh = tuple([int(i) for i in inputs[1:]])
if out[0] is None or out[0].shape != sh:
out[0] = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(sh)
out[0][...] = v # broadcast v to fill us up
out[0][...] = v # broadcast v to fill us up
def c_code(self, node, name, inputs, out_, sub):
out, = out_
......@@ -2186,12 +2232,12 @@ class GpuAlloc(GpuOp):
value = inputs[0]
shps = inputs[1:]
nd = len(shps)
str = "int dims[%(nd)s];\n" % locals()
for idx,sh in enumerate(shps):
str = "int dims[%(nd)s];\n" % locals()
for idx, sh in enumerate(shps):
str += "dims[%(idx)s] = PyInt_AsLong((PyObject*)%(sh)s);\n" % locals()
str += "if(%(out)s==NULL\n" % locals()
for idx,sh in enumerate(shps):
for idx, sh in enumerate(shps):
str += "||CudaNdarray_HOST_DIMS(%(out)s)[%(idx)s]!=dims[%(idx)s]" % locals()
str += """){
Py_XDECREF(%(out)s);
......@@ -2350,10 +2396,9 @@ def tensordot(a, b, axes=2):
"Axes should be scalar valued or a list/tuple of len 2.",
axes)
# Those are predifined CudaNdarrayType as done in tensor.basic
# Useful mostly for test as the gpu op are inserted automatically...
fscalar = CudaNdarrayType(dtype='float32', broadcastable=())
def scalar(name=None, dtype=None):
"""Return a symbolic scalar variable.
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2363,8 +2408,9 @@ def scalar(name=None, dtype=None):
dtype = config.floatX
type = CudaNdarrayType(dtype=dtype, broadcastable=())
return type(name)
fscalar = CudaNdarrayType(dtype='float32', broadcastable=())
fvector = CudaNdarrayType(dtype='float32', broadcastable=(False, ))
def vector(name=None, dtype=None):
"""Return a symbolic vector variable.
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2374,8 +2420,9 @@ def vector(name=None, dtype=None):
dtype = config.floatX
type = CudaNdarrayType(dtype=dtype, broadcastable=(False, ))
return type(name)
fvector = CudaNdarrayType(dtype='float32', broadcastable=(False, ))
fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))
def matrix(name=None, dtype=None):
"""Return a symbolic matrix variable.
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2385,8 +2432,9 @@ def matrix(name=None, dtype=None):
dtype = config.floatX
type = CudaNdarrayType(dtype=dtype, broadcastable=(False, False))
return type(name)
fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))
frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))
def row(name=None, dtype=None):
"""Return a symbolic row variable (ndim=2, broadcastable=[True,False]).
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2396,8 +2444,9 @@ def row(name=None, dtype=None):
dtype = config.floatX
type = CudaNdarrayType(dtype=dtype, broadcastable=(True, False))
return type(name)
frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))
fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))
def col(name=None, dtype=None):
"""Return a symbolic column variable (ndim=2, broadcastable=[False,True]).
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2407,8 +2456,9 @@ def col(name=None, dtype=None):
dtype = config.floatX
type = CudaNdarrayType(dtype=dtype, broadcastable=(False, True))
return type(name)
fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))
ftensor3 = CudaNdarrayType(dtype='float32', broadcastable=(False,)*3)
def tensor3(name=None, dtype=None):
"""Return a symbolic 3-D variable.
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2418,8 +2468,9 @@ def tensor3(name=None, dtype=None):
dtype = config.floatX
type = CudaNdarrayType(dtype=dtype, broadcastable=(False, False, False))
return type(name)
ftensor3 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 3)
ftensor4 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 4)
def tensor4(name=None, dtype=None):
"""Return a symbolic 4-D variable.
:param dtype: numeric type (None means to use theano.config.floatX)
......@@ -2430,6 +2481,7 @@ def tensor4(name=None, dtype=None):
type = CudaNdarrayType(dtype=dtype,
broadcastable=(False, False, False, False))
return type(name)
ftensor4 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 4)
@theano.compile.profilemode.register_profiler_printer
......@@ -2446,22 +2498,24 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
gpu = 0
trans = 0
for (_, node), t in apply_time.items():
if isinstance(node.op.__class__.__name__, (HostFromGpu, GpuFromHost)):
if isinstance(node.op.__class__.__name__,
(HostFromGpu, GpuFromHost)):
trans += t
elif node.op.__class__.__name__.lower().startswith("gpu"):
gpu += t
else:
cpu += t
print
print " Spent %.3fs(%.3f%%) in cpu Op, %.3fs(%.3f%%) in gpu Op and %.3fs(%.3f%%) transfert Op"%(
cpu, cpu/local_time*100, gpu, gpu/local_time*100, trans, trans/local_time*100)
print " Spent %.3fs(%.3f%%) in cpu Op, %.3fs(%.3f%%) in gpu Op and %.3fs(%.3f%%) transfert Op" % (
cpu, cpu / local_time * 100, gpu, gpu / local_time * 100,
trans, trans / local_time * 100)
print
print " Theano function input that are float64"
print " <fct name> <input name> <input type> <str input>"
for fct in fct_call.keys():
for i in fct.input_storage:
if hasattr(i.type, 'dtype') and i.type.dtype=='float64':
if hasattr(i.type, 'dtype') and i.type.dtype == 'float64':
print ' ', fct.name, i.name, i.type, i
print
......@@ -2470,5 +2524,13 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
print ' <Apply> <Apply position> <fct name> <inputs type> <outputs type>'
for fct in fct_call.keys():
for idx, node in enumerate(fct.maker.fgraph.toposort()):
if any(hasattr(i,'dtype') and i.dtype=='float64' for i in node.outputs) and not any(hasattr(i,'dtype') and i.dtype=='float64' for i in node.inputs):
print ' ', str(node), idx, fct.name, str([getattr(i,'dtype',None) for i in node.inputs]),str([getattr(i,'dtype',None) for i in node.outputs])
if (any(hasattr(i, 'dtype') and i.dtype == 'float64'
for i in node.outputs) and
not any(hasattr(i, 'dtype') and i.dtype == 'float64'
for i in node.inputs)):
print ' ', str(node), idx, fct.name,
print str([getattr(i, 'dtype', None)
for i in node.inputs]),
print str([getattr(i, 'dtype', None)
for i in node.outputs])
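The hunks above bucket each apply node's runtime into cpu, gpu, or transfer time based on the op's class, then flag float64 inputs and nodes. A minimal plain-Python sketch of that bucketing logic (the dict of class-name-to-seconds below is a stand-in for Theano's `apply_time` mapping, and the class names stand in for `HostFromGpu`/`GpuFromHost`):

```python
# Sketch of profile_printer's time bucketing: host<->GPU transfer ops are
# counted as "transfer" time, ops whose class name starts with "gpu" as
# "gpu" time, and everything else as "cpu" time.

def bucket_times(apply_time):
    """apply_time maps op class names to seconds spent in that op."""
    cpu = gpu = trans = 0.0
    for op_name, t in apply_time.items():
        if op_name in ("HostFromGpu", "GpuFromHost"):
            trans += t
        elif op_name.lower().startswith("gpu"):
            gpu += t
        else:
            cpu += t
    return cpu, gpu, trans

cpu, gpu, trans = bucket_times(
    {"GpuElemwise": 2.0, "HostFromGpu": 0.5, "Elemwise": 1.5})
```

The transfer check must run before the `"gpu"` prefix check, since `GpuFromHost` would otherwise be miscounted as plain GPU compute time.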
......@@ -758,8 +758,10 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
PyObject * axis_obj = Py_None;
PyObject * out_obj = Py_None;
PyObject * clipmode_obj = NULL;
if (! PyArg_ParseTuple(args, "O|OOO", &indices_obj, &axis_obj,
&out_obj, &clipmode_obj))
int max_threads = 1; // max threads per blocks
if (! PyArg_ParseTuple(args, "O|OOOi", &indices_obj, &axis_obj,
&out_obj, &clipmode_obj, &max_threads))
return NULL;
//Check argument indices
......@@ -839,14 +841,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
PyObject * axis_iobj = PyNumber_Long(axis_obj);
if (!axis_iobj) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: axis must be convertible to a long");
Py_DECREF(indices_obj);
Py_DECREF(indices);
return NULL;
}
long axis = PyInt_AsLong(axis_iobj);
Py_DECREF(axis_iobj); axis_iobj=NULL;
if (axis != 0) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: only axis=0 is currently supported");
Py_DECREF(indices_obj);
Py_DECREF(indices);
return NULL;
}
......@@ -869,13 +871,13 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (!out) {
out = (CudaNdarray*)CudaNdarray_New();
if (!out){
Py_DECREF(indices_obj);
Py_DECREF(indices);
free(dims);
return NULL;
}
if (CudaNdarray_alloc_contiguous(out, self->nd, dims)) {
Py_DECREF(out);
Py_DECREF(indices_obj);
Py_DECREF(indices);
free(dims);
return NULL;
}
......@@ -887,19 +889,20 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (clipmode_obj) {
char * clipmode = PyString_AsString(clipmode_obj);
if (! clipmode){
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
}
if (strcmp(clipmode, "raise") != 0) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: only the raise mode is currently supported");
Py_DECREF(indices_obj);
PyErr_Format(PyExc_NotImplementedError,
"CudaNdarray_TakeFrom: only the raise mode is currently supported. Got '%s'",
clipmode);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
}
Py_DECREF(clipmode_obj);
}
void (*k3)(const int, const int, const int,
const npy_int64*,
......@@ -913,7 +916,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (err_var == NULL) {
err_var = (int*)device_malloc(sizeof(int));
if (!err_var) { // PyErr set by device_malloc
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
......@@ -928,7 +931,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
PyErr_Format(PyExc_RuntimeError,
"Error setting device error code to 0. %s",
cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
......@@ -936,13 +939,16 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
}
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(out)[0],65535),1,1);
switch (self->nd) {
case 1:
{
dim3 n_threads(1, 1, 1);
if (verbose)
printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
printf("cudaGetLastError=%d, nd=%d"
" kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
" n_threads.x=%i, n_threads.y=%i)\n",
cudaGetLastError(), self->nd,
n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
k3<<<n_blocks, n_threads>>>(
dims[0],
......@@ -963,11 +969,15 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
break;
case 2:
{
dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], 512), 1, 1);
dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads), 1, 1);
if (verbose)
printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
printf("cudaGetLastError=%d, nd=%d"
" kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
" n_threads.x=%i, n_threads.y=%i)\n",
cudaGetLastError(), self->nd,
n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
k3<<<n_blocks, n_threads>>>(
dims[0], //dimensions
dims[1],
......@@ -987,12 +997,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
break;
case 3:
{
int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], 512);
int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], 512 / ty);
int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], max_threads);
int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads / ty);
dim3 n_threads(tx, ty, 1);
if (verbose)
printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
printf("cudaGetLastError=%d, nd=%d"
" kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
" n_threads.x=%i, n_threads.y=%i)\n",
cudaGetLastError(), self->nd,
n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
k3<<<n_blocks, n_threads>>>(
dims[0], //dimensions
......@@ -1025,7 +1037,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
"Cuda error: %s: %s.\n",
"CudaNdarray_TakeFrom",
cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
......@@ -1040,7 +1052,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
"Cuda error: %s: %s when trying to get the error value.\n",
"CudaNdarray_TakeFrom",
cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
......@@ -1055,17 +1067,17 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
err = cudaMemset((void*)err_var, 0, sizeof(int));
if (cudaSuccess != err) {
PyErr_Format(PyExc_MemoryError, "Error setting device error code to 0 after having an index error. %s", cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
Py_DECREF(indices_obj);
Py_DECREF(indices);
if (verbose) printf("TAKE SUCCEEDED\n");
return (PyObject *)out;
......
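The kernel-launch hunks above replace the hard-coded 512-thread limit with the new `max_threads` argument when sizing the thread block for each dimensionality. A Python sketch (for illustration only; `out_dims` stands for `CudaNdarray_HOST_DIMS(out)` and `max_threads` for the per-block limit passed in from the caller):

```python
# Sketch of the thread-block sizing in CudaNdarray_TakeFrom after the
# change: 1-D uses a single thread, 2-D caps threads.x at max_threads,
# and 3-D splits the budget so tx * ty stays within max_threads.

def thread_config(out_dims, max_threads):
    nd = len(out_dims)
    if nd == 1:
        return (1, 1)
    if nd == 2:
        return (min(out_dims[1], max_threads), 1)
    if nd == 3:
        ty = min(out_dims[2], max_threads)
        # divide the remaining budget among the second dimension
        tx = min(out_dims[1], max_threads // ty)
        return (tx, ty)
    raise NotImplementedError("only nd <= 3 is supported")

# e.g. a (100, 30, 70)-shaped output with a 512-thread budget:
tx, ty = thread_config((100, 30, 70), 512)
```

Using the device's real per-block limit instead of a fixed 512 matters on hardware whose limit differs, which is why the hunk threads `max_threads` through `PyArg_ParseTuple`.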
......@@ -7,6 +7,7 @@ import subprocess
import sys
import warnings
import theano
from theano.gof.cc import hash_from_file
from theano.gof.cmodule import (std_libs, std_lib_dirs,
std_include_dirs, dlimport,
......@@ -119,6 +120,16 @@ class NVCC_compiler(object):
cuda_ndarray_cuh_hash = hash_from_file(
os.path.join(os.path.split(__file__)[0], 'cuda_ndarray.cuh'))
flags.append('-DCUDA_NDARRAY_CUH=' + cuda_ndarray_cuh_hash)
# We compile cuda_ndarray.cu during import.
# We should not add device properties at that time,
# as the device is not selected yet!
# TODO: compile cuda_ndarray when we bind to a GPU?
import theano.sandbox.cuda
if hasattr(theano.sandbox, 'cuda'):
n = theano.sandbox.cuda.use.device_number
p = theano.sandbox.cuda.device_properties(n)
flags.append('-arch=sm_' + str(p['major']) + str(p['minor']))
return flags
@staticmethod
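The hunk above derives an `-arch=sm_XY` flag for nvcc from the compute capability of the device already bound by Theano. A minimal sketch of that flag construction (the properties dict is a stand-in for what `theano.sandbox.cuda.device_properties(n)` returns):

```python
# Sketch: build nvcc's architecture flag from a device's compute
# capability, e.g. major=2, minor=0 -> "-arch=sm_20".

def arch_flag(device_properties):
    p = device_properties  # assumed to expose 'major' and 'minor'
    return '-arch=sm_' + str(p['major']) + str(p['minor'])
```

This is why the flag can only be added for modules compiled after a device is selected, and not for `cuda_ndarray.cu` itself, which is compiled at import time.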
......@@ -217,7 +228,9 @@ class NVCC_compiler(object):
# '--gpu-code=compute_13',
#nvcc argument
preargs1 = [pa for pa in preargs
if pa.startswith('-O') or pa.startswith('--maxrregcount=')]
if pa.startswith('-O') or
pa.startswith('--maxrregcount=') or
pa.startswith('-arch=')]
preargs2 = [pa for pa in preargs
if pa not in preargs1] # other arguments
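The hunk above extends the set of pre-arguments handed straight to nvcc to include the new `-arch=` flag. A sketch of the split (assuming, as the source comments suggest, that the remaining arguments are handled separately as "other arguments"):

```python
# Sketch: separate the flags nvcc understands natively (-O*,
# --maxrregcount=, -arch=) from the rest of the preargs list.

def split_preargs(preargs):
    nvcc_own = [pa for pa in preargs
                if pa.startswith('-O') or
                pa.startswith('--maxrregcount=') or
                pa.startswith('-arch=')]
    other = [pa for pa in preargs if pa not in nvcc_own]
    return nvcc_own, other
```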
......@@ -337,6 +350,7 @@ class NVCC_compiler(object):
pass
print >> sys.stderr, l
print nvcc_stdout
print cmd
raise Exception('nvcc return status', p.returncode,
'for cmd', ' '.join(cmd))
elif config.cmodule.compilation_warning and nvcc_stdout:
......
......@@ -410,7 +410,8 @@ class T_Scan(unittest.TestCase):
for step in xrange(1, 4):
v_out[step] = v_u[step] * W_in + v_out[step - 1] * W
theano_values = f2(v_u, v_x0, W_in, W)
assert numpy.allclose(theano_values, v_out)
assert numpy.allclose(theano_values, v_out), (theano_values, v_out,
theano_values - v_out)
# TO DEL
topo = f2.maker.fgraph.toposort()
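The reference values this test compares against come from the recurrence `v_out[step] = v_u[step] * W_in + v_out[step - 1] * W`. A plain-Python version of that reference computation, with scalar weights for simplicity (the test applies the same update with NumPy arrays):

```python
# Reference recurrence checked by the scan test:
# out[t] = u[t] * W_in + out[t - 1] * W, seeded with out[0].

def rnn_reference(u, out0, W_in, W):
    out = [out0]
    for t in range(1, len(u)):
        out.append(u[t] * W_in + out[t - 1] * W)
    return out

out = rnn_reference([1.0, 2.0, 3.0, 4.0], 0.5, 2.0, 0.1)
```

Computing the expected output independently like this is what makes the enriched `assert numpy.allclose(...)` messages in the hunk useful: on failure they show both sequences and their difference.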
......@@ -591,8 +592,8 @@ class T_Scan(unittest.TestCase):
v_y[i] = numpy.dot(v_x[i - 1], vWout)
(theano_x, theano_y) = f4(v_u1, v_u2, v_x0, v_y0, vW_in1)
assert numpy.allclose(theano_x, v_x)
assert numpy.allclose(theano_y, v_y)
assert numpy.allclose(theano_x, v_x), (theano_x, v_x, theano_x - v_x)
assert numpy.allclose(theano_y, v_y), (theano_y, v_y, theano_y - v_y)
def test_multiple_outs_taps(self):
l = 5
......@@ -683,14 +684,13 @@ class T_Scan(unittest.TestCase):
ny1[4] = (ny1[3] + ny1[1]) * numpy.dot(ny0[3], vWout)
ny2[4] = numpy.dot(v_u1[4], vW_in1)
def test_using_taps_sequence(self):
# this test refers to a bug reported by Nicolas
# Boulanger-Lewandowski June 6th
x = theano.tensor.dvector()
y, updates = theano.scan(lambda x: [x],
sequences=dict(input=x, taps=[-1]),
outputs_info = [None])
outputs_info=[None])
inp = numpy.arange(5).astype('float64')
rval = theano.function([x], y, updates=updates)(inp)
assert numpy.all(rval == inp[:-1])
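The behavior this test pins down — a single `-1` tap on a sequence — can be sketched in plain Python (a simplified model of scan's sequence-tap indexing, not Theano's implementation):

```python
# With a single negative tap, scan starts at index -tap and feeds
# seq[t + tap] to the step function at step t; with an identity step
# this reproduces the input minus its last element.

def scan_identity_with_tap(seq, tap=-1):
    start = -tap
    return [seq[t + tap] for t in range(start, len(seq))]

out = scan_identity_with_tap([0.0, 1.0, 2.0, 3.0, 4.0])
```

This matches the test's assertion that the result equals `inp[:-1]`.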
......@@ -840,8 +840,10 @@ class T_Scan(unittest.TestCase):
# equivalent is done
(theano_x0, theano_x1) = f9(vu0, vu1, vu2, vx0, vx1)
# assert that theano does what it should
assert numpy.allclose(theano_x0, numpy_x0)
assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1, theano_x1 - numpy_x1)
assert numpy.allclose(theano_x0, numpy_x0), (theano_x0, numpy_x0,
theano_x0 - numpy_x0)
assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1,
theano_x1 - numpy_x1)
# assert that it was done in place
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
......@@ -940,11 +942,11 @@ class T_Scan(unittest.TestCase):
vx1 = asarrayX(rng.uniform())
x0 = theano.shared(vx0)
x1 = theano.shared(vx1)
outputs, updates = theano.scan(lambda x,y: (x + asarrayX(1),
y + asarrayX(1)),
outputs, updates = theano.scan(lambda x, y: (x + asarrayX(1),
y + asarrayX(1)),
[],
[x0,x1],
n_steps = 3)
[x0, x1],
n_steps=3)
x0 = asarrayX(numpy.zeros((3,)))
x0[0] = vx0
x0 = theano.tensor.constant(x0)
......@@ -2447,7 +2449,6 @@ class T_Scan(unittest.TestCase):
v_eW = numpy.array(rng.uniform(size=(5, 5)) - .5, dtype=floatX)
v_eh0 = numpy.array(rng.uniform(size=(5,)) - .5, dtype=floatX)
def rnn_fn(_u, _y, _W):
srng = theano.tensor.shared_randomstreams.RandomStreams(seed)
......
......@@ -55,3 +55,5 @@ from theano.gradient import Rop, Lop, grad, numeric_grad, verify_grad, \
jacobian, hessian
from theano.tensor.sort import sort
from extra_ops import (DiffOp, bincount, squeeze,
repeat, bartlett, fill_diagonal)
......@@ -3,8 +3,8 @@ import numpy
import theano
import basic
from theano import gof, tensor, scalar
from theano.sandbox.linalg.ops import diag
from theano import gof, scalar
import basic as tensor
class DiffOp(theano.Op):
......@@ -446,7 +446,9 @@ class FillDiagonal(gof.Op):
raise NotImplementedError('%s: gradient is currently implemented'
' for matrices only' % self.__class__.__name__)
wr_a = fill_diagonal(grad, 0) # valid for any number of dimensions
wr_val = diag(grad).sum() # diag is only valid for matrices
# diag is only valid for matrices
import theano.sandbox.linalg
wr_val = theano.sandbox.linalg.ops.diag(grad).sum()
return [wr_a, wr_val]
fill_diagonal_ = FillDiagonal()
......
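The FillDiagonal gradient in the hunk above zeroes the diagonal of the output gradient for `wr_a` and sums that diagonal for `wr_val`. A numerical sketch in plain Python, assuming a square matrix represented as nested lists (the real op works on Theano tensors):

```python
# Numerical sketch of FillDiagonal's gradient: for out = fill_diagonal(a, val),
# the diagonal entries of `out` come from `val` (so their gradients sum into
# wr_val) and all other entries come from `a` (so wr_a is the output gradient
# with its diagonal zeroed).

def fill_diagonal_grads(g_out):
    n = len(g_out)
    wr_a = [row[:] for row in g_out]
    for i in range(n):
        wr_a[i][i] = 0.0                         # diagonal did not come from `a`
    wr_val = sum(g_out[i][i] for i in range(n))  # every diagonal entry came from `val`
    return wr_a, wr_val

g = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
wr_a, wr_val = fill_diagonal_grads(g)
```

This also shows why the hunk restricts the gradient to matrices: `diag` is only valid there, exactly as the moved comment says.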