Commit f70d4149 authored by James Bergstra

merge

...@@ -150,4 +150,11 @@ Config Attributes
A directory with bin/, lib/, include/ folders containing cuda utilities.
.. attribute:: gcc.cxxflags
Default: ""
Extra parameters to pass to gcc when compiling. Extra include paths,
library paths, configuration options, etc.
==================
Advanced Indexing
==================
Continue the Advanced Indexing project that is on either github or bitbucket.
=============================================================================
Proposal for New Linking Strategy supporting Lazy Evaluation: Op.make_thunk
=============================================================================
.. note::
Proposal made June 2010.
Motivation
===========
Conditional evaluation is useful to describe many optimization algorithms where
the update expressions depend on internal state.
True conditional evaluation requires lazy graph evaluation.
Without lazy graph evaluation, the runtime of a graph can be exponential in the
number of conditionals instead of linear. No one waits an exponential amount of
time, so people currently work around this problem in various ways, but it
would be better if Theano had an 'if-then-else' expression (call it cond).
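To make the exponential-vs-linear point concrete, here is a toy Python sketch (not Theano code; all names are invented for illustration): an eager cond must evaluate both branches before choosing one, so n nested conditionals force on the order of 2**n branch evaluations, while a lazy cond touches only the branch it needs.

```python
calls = [0]

def expensive(x):
    # stands in for evaluating any costly sub-graph
    calls[0] += 1
    return x + 1

def nest_eager(n, x):
    # eager: both arguments are evaluated before the "cond" picks one,
    # just as a non-lazy linker computes every node in the graph
    if n == 0:
        return x
    then_val = nest_eager(n - 1, expensive(x))
    else_val = nest_eager(n - 1, expensive(-x))
    return then_val if x > 0 else else_val

def nest_lazy(n, x):
    # lazy: only the chosen branch is ever evaluated
    if n == 0:
        return x
    return nest_lazy(n - 1, expensive(x) if x > 0 else expensive(-x))

calls[0] = 0
nest_eager(4, 1.0)
eager_calls = calls[0]   # 2 + 2*2 + 2*6 + 2*14 = 2**(n+1) - 2 = 30 for n == 4

calls[0] = 0
nest_lazy(4, 1.0)
lazy_calls = calls[0]    # one call per level: 4
```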
A lazily-evaluated 'cond' requires a linker to use a different method for
interacting with Ops. Neither the current perform() nor c_code() approaches
support lazy evaluation.
Why do perform (and c_code) not handle lazy evaluation?
The syntax of the current perform() could be extended to be compatible with lazy
evaluation. For example, the linker could set all inputs to None, and use the
return value from perform() to see which inputs are required. But all the Ops
that currently implement a perform() function would be broken because their
perform implementations do not ask for inputs before using them. I don't see a
way around this. The same restriction applies to c_code.
The way around this is to introduce a new interface for the linker to talk to
Ops. I propose that we add an Op.make_thunk() that returns an object satisfying
this interface.
At the same time, it appears that as we try to integrate PyCUDA Ops another
problem arises. We would like to use Op.perform() to drive the GPU, but it is
natural to move compilation of the CUDA kernel to a point after make_node() and
before perform(). The point where the linker makes a thunk from the Op
seems like a natural choice.
A third motivation for introducing an Op.make_thunk function is to clarify the
relationship between Ops (the classes you implement in Python) and mathematical
operations (the more abstract things in terms of which you think when using
Theano).
I propose that *technically* an Op, when conditioned by particular inputs,
generates *at most one implementation* that defines the behaviour of that Op.
In *intuitive terms*, the abstract mathematical steps that we sometimes talk about regarding Theano
still correspond to Ops -- it's just that these Ops have relatively generic
implementations.
The process of optimization is to specialize those generic implementations
by using information from the rest of the graph.
If we accept that an Op corresponds to at most one implementation,
then it makes sense to ask an Op instance to expose that implementation via a
standard interface (Op.make_thunk).
It does not make sense to pass arguments to Op.make_thunk such as 'py' or "c|py"
to tell the Op which implementation to use. The Op instance represents just one
implementation, and flags such as 'py' or 'c|py' should be passed to the Op's
constructor.
Proposal: Op.make_thunk
==========================
There are two interface items I propose to add. The first is a Thunk object
(which we have never had before), and the second is a new function (make_thunk)
in the PureOp class (a superclass of Op) that will return a Thunk.
.. code-block:: python

    class Thunk(object):
        """Abstract class / interface

        It describes the interface used by a Theano linker to execute the
        nodes in a graph.  Thunk instances are in correspondence with the
        Apply instances that remain in the final form of the graph after
        optimization.
        """

        lazy = property(...,
            doc="""True means the thunk may trigger lazy evaluation.
            False means the thunk always requires all inputs and computes
            all outputs.  Consequently False implies that __call__ always
            returns None.
            """)

        def __call__(self):
            """Compute some number (possibly zero) of outputs.  In the case
            that the thunk cannot compute all its outputs for lack of inputs,
            this function will return a list of the input indexes that are
            required.  The linker will typically compute those required
            inputs and then call this __call__ function again.

            The thunk is considered to be finished when it returns an empty
            list or None.
            """
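As a sanity check of this interface, here is a hypothetical sketch (the retry loop and IfThenElseThunk are inventions for illustration, not part of the proposal) of a lazy thunk and the loop a LazyLinker might run around it:

```python
def run_thunk(thunk, compute_input):
    """Call `thunk` until it is finished, computing inputs on demand.

    compute_input(i) stands in for the linker recursively running the
    thunk that produces input i.
    """
    while True:
        required = thunk()
        if not required:              # None or [] means the thunk is done
            return
        for i in required:
            compute_input(i)

class IfThenElseThunk(object):
    """Toy lazy thunk: inputs are [cond, then_val, else_val], one output."""
    lazy = True

    def __init__(self, input_computed, input_registers,
                 output_computed, output_registers):
        self.ic, self.ir = input_computed, input_registers
        self.oc, self.our = output_computed, output_registers

    def __call__(self):
        if not self.ic[0][0]:
            return [0]                # ask for the condition first
        branch = 1 if self.ir[0][0] else 2
        if not self.ic[branch][0]:
            return [branch]           # ask only for the chosen branch
        self.our[0][0] = self.ir[branch][0]
        self.oc[0][0] = 1
        return None                   # finished

# drive it: the "linker" computes inputs only when asked
ic, ir = [[0], [0], [0]], [[None], [None], [None]]
oc, out = [[0]], [[None]]
asked = []
def compute_input(i):
    asked.append(i)
    ir[i][0] = {0: True, 1: 42.0, 2: -1.0}[i]
    ic[i][0] = 1

run_thunk(IfThenElseThunk(ic, ir, oc, out), compute_input)
# asked == [0, 1]: input 2 (the untaken branch) is never computed
```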
.. code-block:: python

    class PureOp(object):  # recall: Op inherits from PureOp

        def make_node(self, *inputs):  # leave alone
            ...

        def perform(self, node,
                inputs, output_storage):  # move to `Op` class
            ...

        def make_thunk(self, node,  # new function
                input_computed, output_computed,
                input_registers, output_registers,
                ):
            """
            :type node: Apply instance
            :param node: previous rval from make_node(self, *inputs)

            :type input_computed: list of len-1 lists, with values in (0,1).
            :param input_computed: at runtime, input_computed[i][0]==1 implies
                that the i'th input has been computed and stored at
                input_registers[i][0], and is available for use.
                Otherwise the content of input_registers[i][0] is undefined.

            :type output_computed: list of len-1 lists, with values in (0,1).
            :param output_computed: at runtime, output_computed[i][0]==1 implies
                that the i'th output has already been computed and stored at
                output_registers[i][0].
                Otherwise, output_registers[i][0] will contain either None, or
                a value that was previously computed by this thunk.

            :type input_registers: list of len-1 lists
            :type output_registers: list of len-1 lists
            :param input_registers: the i'th input can be read from
                input_registers[i][0] when input_computed[i][0] == 1.
            :param output_registers: the i'th output must be stored to
                output_registers[i][0], at which point the thunk must set
                output_computed[i][0] == 1.

            :returns: a Thunk (subclass) instance
            """
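One way to see how existing Ops could satisfy this interface unchanged: a default make_thunk could wrap Op.perform in a non-lazy Thunk. The wrapper below is a sketch under that assumption (PerformThunk and Add are invented names); it uses the len-1-list register convention from the docstring above.

```python
class PerformThunk(object):
    """Non-lazy Thunk wrapping an existing Op.perform (a sketch)."""
    lazy = False   # always requires all inputs, computes all outputs

    def __init__(self, op, node, input_registers,
                 output_computed, output_registers):
        self.op, self.node = op, node
        self.ir = input_registers
        self.oc, self.our = output_computed, output_registers

    def __call__(self):
        inputs = [r[0] for r in self.ir]   # lazy==False: all inputs ready
        self.op.perform(self.node, inputs, self.our)
        for c in self.oc:
            c[0] = 1
        return None                        # non-lazy thunks finish in one call

class Add(object):
    """Toy Op with the current perform() signature."""
    def perform(self, node, inputs, output_storage):
        output_storage[0][0] = inputs[0] + inputs[1]

out_computed, out_regs = [[0]], [[None]]
PerformThunk(Add(), None, [[3], [4]], out_computed, out_regs)()
# out_regs[0][0] == 7 and out_computed[0][0] == 1
```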
The Thunk class can have subclasses that use Op.perform and Op.c_code as we use
them now. The interface of Thunk is backward-compatible with the thunks built
by the CLinker and PerformLinker. If a graph contains zero Thunks with
lazy==True, then the current Linkers will continue to work.
The new Thunk interface will support a new LazyLinker that can run programs for
which some thunks have lazy==True.
The Thunk class can have subclasses that are implemented in C, which might help
performance.
========================
DP Instruction Selection
========================
Read Ch 9 of Modern Compiler Implementation about instruction selection.
We should probably be doing graph optimization totally differently:
optimizations would *only add* new ways of implementing something, never
replacing the old way. Every graph node (apply) has a cost, and Dynamic
Programming (DP) is used to select the minimum-cost graph.
The advantage of this approach is that optimizations do not have to run in such
a careful order, and graph selection would be much faster.
Think about how aliasing and destructive operations (the destroy-handler) would
fit in this approach.
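A minimal sketch of the DP selection (names and costs are invented): each node keeps every implementation contributed by optimizations as an alternative, and the cheapest graph is chosen bottom-up with memoization.

```python
def min_cost(node, alternatives, memo=None):
    """alternatives maps node -> list of (cost, input_nodes) implementations."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = min(
            cost + sum(min_cost(i, alternatives, memo) for i in inputs)
            for cost, inputs in alternatives[node])
    return memo[node]

# 'mul_by_2' keeps both its generic form (cost 5) and a specialized
# rewrite (cost 1); DP picks the cheaper one instead of an optimizer
# destructively replacing the node.
alts = {
    'x':        [(0, [])],
    'mul_by_2': [(5, ['x']), (1, ['x'])],
    'out':      [(2, ['mul_by_2'])],
}
best = min_cost('out', alts)   # 2 + min(5, 1) + 0 == 3
```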
=====================
Intermediate Language
=====================
It would be nice to be able to use Theano from other languages.
This requires two things: a way to communicate the expression to the Theano
compiler, and a way to pass data to and from the compiled function.
One way to do this would be to define a textual representation of Theano
graphs. A Scheme-like language seems appropriate. Perhaps just Scheme itself
would be appropriate.
How to pass shared variables?
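For illustration only (the syntax is invented, not a proposal detail): a graph like add(mul(x, x), y) could be written as the s-expression (add (mul x x) y), and a reader for such a format is tiny.

```python
def parse_sexpr(text):
    """Parse one s-expression into nested Python lists of tokens."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(pos):
        if tokens[pos] == '(':
            lst, pos = [], pos + 1
            while tokens[pos] != ')':
                item, pos = read(pos)
                lst.append(item)
            return lst, pos + 1       # skip the closing paren
        return tokens[pos], pos + 1

    return read(0)[0]

tree = parse_sexpr("(add (mul x x) y)")
# tree == ['add', ['mul', 'x', 'x'], 'y']
```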
=================
MongoDB DLL Cache
=================
In network environments (like at DIRO on NFS3), a distributed DB like mongo or
couch would be faster and more robust to concurrency than the $HOME/.theano
cache. Also, a single cache could be shared by multiple users. This would mean
less compilation time for everyone, and fewer stale-cache problems.
...@@ -4,6 +4,7 @@ Automatic updates
.. note:
Proposed 2010 01 13
Done 2010 04 ??
The Module version of RandomStreams could arrange for the automatic update of
......
=======
OpenCL
=======
Migrate the GPU code-generators to the PyCUDA style, and eventually to OpenCL.
This mainly means using a different code-generation strategy: the kernel itself
is compiled, but the calling code remains in Python or Cython. We would no
longer generate entire C files this way, and no longer use the CLinker for GPU
code.
Proactive Merging
=================
Merge is currently done as an optimization.
But if merging were done at graph construction time, things like #476 would
work. Additionally, memo-izing at graph construction time would make it
possible to define recursive formulas with recursive Python functions (e.g.
Fibonacci). Currently the merge optimization makes the Fibonacci graph linear,
but the size of the expression used to build that graph is exponential.
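The Fibonacci point can be sketched with a toy hash-consing constructor (invented names): merging identical applies at construction time keeps the number of distinct nodes linear in n even though the naive expression tree is exponential.

```python
node_cache = {}
nodes_built = [0]

def apply_merged(op, *inputs):
    # hash-cons: identical (op, inputs) applications share one node
    key = (op,) + inputs
    if key not in node_cache:
        nodes_built[0] += 1
        node_cache[key] = key      # the key itself stands in for an Apply node
    return node_cache[key]

def fib(n):
    if n < 2:
        return apply_merged('const', n)
    return apply_merged('add', fib(n - 1), fib(n - 2))

fib(20)
# 2 constant nodes + one 'add' node for each n in 2..20: 21 distinct nodes,
# while the unmerged expression tree has on the order of 2**20 applies
```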
========================
Add tensor attributes
========================
Size, shape, psd, symmetric, triangular, contiguous.
Add these attributes to TensorType, always with the option that they be
'unknown'.
Add attributes that are useful for optimizations, or useful for code generation.
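A sketch (invented API, not Theano's) of what "always optionally unknown" could look like: None encodes 'unknown', so existing code keeps working, and an optimization specializes only when a fact is definitely known.

```python
class TensorTypeSketch(object):
    def __init__(self, dtype, broadcastable,
                 shape=None, symmetric=None, contiguous=None):
        self.dtype = dtype
        self.broadcastable = broadcastable
        self.shape = shape              # None means unknown
        self.symmetric = symmetric      # None unknown, else True/False
        self.contiguous = contiguous    # None means unknown

t = TensorTypeSketch('float64', (False, False), symmetric=True)
u = TensorTypeSketch('float64', (False, False))

# code generation must distinguish "known False" from "unknown":
can_specialize_t = (t.symmetric is True)   # safe to use a symmetric solver
can_specialize_u = (u.symmetric is True)   # must stay generic
```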
...@@ -104,6 +104,47 @@ class BadCLinkerOutput(DebugModeError):
print >> sio, " val_py :", self.val_py
print >> sio, " val_c :", self.val_c
print >> sio, " op :", self.offending_op()
try:
ssio = StringIO()
print >> ssio, " PyValue shape, dtype, strides, min, max:",
print >> ssio, self.val_py.shape,
print >> ssio, self.val_py.dtype,
print >> ssio, self.val_py.strides
print >> ssio, self.val_py.min(),
print >> ssio, self.val_py.max(),
# only if all of the above succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
try:
ssio = StringIO()
print >> ssio, " CValue shape, dtype, strides, min, max:",
print >> ssio, self.val_c.shape,
print >> ssio, self.val_c.dtype,
print >> ssio, self.val_c.strides,
print >> ssio, self.val_c.min(),
print >> ssio, self.val_c.max(),
# only if all of the above succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
try:
ov=numpy.asarray(self.val_c)
nv=numpy.asarray(self.val_py)
ssio = StringIO()
print >> ssio, " Max Abs Diff: ", numpy.max(numpy.absolute(nv-ov))
print >> ssio, " Mean Abs Diff: ", numpy.mean(numpy.absolute(nv-ov))
print >> ssio, " Median Abs Diff: ", numpy.median(numpy.absolute(nv-ov))
print >> ssio, " Std Abs Diff: ", numpy.std(numpy.absolute(nv-ov))
reldiff = numpy.absolute(nv-ov) / (numpy.absolute(nv)+numpy.absolute(ov))
print >> ssio, " Max Rel Diff: ", numpy.max(reldiff)
print >> ssio, " Mean Rel Diff: ", numpy.mean(reldiff)
print >> ssio, " Median Rel Diff: ", numpy.median(reldiff)
print >> ssio, " Std Rel Diff: ", numpy.std(reldiff)
# only if all of the above succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
return sio.getvalue()
class BadOptimization(DebugModeError):
......
...@@ -858,6 +858,7 @@ class CLinker(link.Linker):
if config.gcc.cxxflags:
sig.append(config.gcc.cxxflags)
error_on_play = [False]
def in_sig(i, topological_pos, i_idx):
# assert that every input to every node is one of:
# - an env input
...@@ -868,7 +869,14 @@ class CLinker(link.Linker):
# yield a 'position' that reflects its role in code_gen()
if isinstance(i, graph.Constant): #orphans
if id(i) not in constant_ids:
-constant_ids[id(i)] = (i.signature(), topological_pos, i_idx)
+isig = (i.signature(), topological_pos, i_idx)
try:
hash(isig)
except: #generic constants don't have a hashable signature
error_on_play[0] = True
return None
constant_ids[id(i)] = isig
else:
isig = constant_ids[id(i)]
#print 'SIGNATURE', i.signature()
#return i.signature()
...@@ -903,6 +911,12 @@ class CLinker(link.Linker):
for ipos,i in enumerate(node.inputs)),
tuple(o in no_recycling for o in node.outputs)))
if error_on_play[0]:
# if one of the signatures is not hashable
# then bypass the cache mechanism and
# compile fresh every time
return None
op_pos[node] = node_pos
env_computed_set.update(node.outputs)
......
...@@ -242,6 +242,7 @@ def refresh_lock(lock_file):
lock_write = open(lock_file, 'w')
lock_write.write(unique_id + '\n')
lock_write.close()
print lock_file
return unique_id
class Unlocker(object):
......
...@@ -7,6 +7,7 @@ if sys.version_info[:2] >= (2,5):
import toolbox
import graph
from theano.gof import deque
from env import InconsistencyError
...@@ -44,6 +45,92 @@ class DestroyHandler(object):
return self.map[env].orderings(env)
def _dfs_toposort(i, r_out, orderings):
"""
i - list of inputs
r_out - list of outputs
orderings - dict of additions to the normal inputs and outputs
Returns nothing. Raises ValueError for a graph with cycles.
"""
# This is a hard-coded reimplementation of functions from graph.py.
# Reason: go faster, prepare for a port to C.
assert isinstance(r_out, (tuple, list, deque))
# TODO: For more speed - use a defaultdict for the orderings
iset = set(i)
if 0:
def expand(obj):
rval = []
if obj not in iset:
if isinstance(obj, graph.Variable):
if obj.owner:
rval = [obj.owner]
if isinstance(obj, graph.Apply):
rval = list(obj.inputs)
rval.extend(orderings.get(obj, []))
else:
assert not orderings.get(obj, [])
return rval
expand_cache = {}
# reachable, clients = stack_search( deque(r_out), deps, 'dfs', True)
start=deque(r_out)
rval_set = set()
rval_set.add(id(None))
rval_list = list()
expand_inv = {}
sources = deque()
while start:
l = start.pop()  # this makes the search dfs
if id(l) not in rval_set:
rval_list.append(l)
rval_set.add(id(l))
if l in iset:
assert not orderings.get(l, [])
expand_l = []
else:
try:
if l.owner:
expand_l = [l.owner]
else:
expand_l = []
except AttributeError:
expand_l = list(l.inputs)
expand_l.extend(orderings.get(l, []))
if expand_l:
for r in expand_l:
expand_inv.setdefault(r, []).append(l)
start.extend(expand_l)
else:
sources.append(l)
expand_cache[l] = expand_l
assert len(rval_list) == len(rval_set)-1
rset = set()
rlist = []
while sources:
node = sources.popleft()
if node not in rset:
rlist.append(node)
rset.add(node)
for client in expand_inv.get(node, []):
expand_cache[client] = [a for a in expand_cache[client] if a is not node]
if not expand_cache[client]:
sources.append(client)
if len(rlist) != len(rval_list):
raise ValueError('graph contains cycles')
#return [o for o in rlist if isinstance(o, graph.Apply)]
def getroot(r, view_i):
"""
For views: Return the non-view variable which is ultimately viewed by r.
...@@ -303,7 +390,8 @@ class DestroyHandlerHelper2(toolbox.Bookkeeper):
raise
#print 'orderings:', ords
try:
-graph.io_toposort(env.inputs, env.outputs, ords)
+### graph.io_toposort(env.inputs, env.outputs, ords)
+_dfs_toposort(env.inputs, env.outputs, ords)
except ValueError, e:
#print 'not passing.', ords
if 'cycles' in str(e):
......
...@@ -9,9 +9,10 @@ compatible with `gof`'s :doc:`graph` routines.
__docformat__ = "restructuredtext en"
import utils
import traceback
from theano import config
class CLinkerObject(object):
"""Standard elements of an Op or Type used with the CLinker
"""
...@@ -320,9 +321,7 @@ class PureOp(object):
"""
node = self.make_node(*inputs, **kwargs)
-limit = config.traceback.limit
-if limit == -1: limit = None
-node.tag.trace = traceback.extract_stack(limit=limit)[:-1]
+self.add_tag_trace(node)
if self.default_output is not None:
return node.outputs[self.default_output]
else:
...@@ -331,6 +330,9 @@ class PureOp(object):
else:
return node.outputs
# Convenience so that subclass implementers don't have to import utils
# just to call add_tag_trace
add_tag_trace = staticmethod(utils.add_tag_trace)
#########################
# Python implementation #
......
...@@ -127,6 +127,8 @@ class SeqOptimizer(Optimizer, list):
for optimizer in self:
try:
optimizer.optimize(env)
except AssertionError: # do not catch Assertion failures
raise
except Exception, e:
if self.failure_callback:
self.failure_callback(e, self, optimizer)
......
...@@ -5,8 +5,7 @@ __docformat__ = "restructuredtext en"
import copy
import utils
from utils import MethodNotDefined, object2
-from graph import Variable
-import traceback
+import graph
from theano import config
########
...@@ -202,6 +201,9 @@ class PureType(object):
"""
Variable = graph.Variable  # the class of Variable created by make_variable
Constant = graph.Constant  # the class of Constant created by make_constant
def filter(self, data, strict = False):
"""Required: Return data or an appropriately wrapped/converted data.
...@@ -233,8 +235,11 @@ class PureType(object):
A pretty string for printing and debugging.
"""
-r = Variable(self, name = name)
-return r
+return self.Variable(self, name = name)
def make_constant(self, value, name=None):
return self.Constant(type=self, data=value, name=name)
def __call__(self, name = None):
"""Return a new `Variable` instance of Type `self`.
...@@ -244,11 +249,7 @@
A pretty string for printing and debugging.
"""
-r = self.make_variable(name)
-limit = config.traceback.limit
-if limit == -1: limit = None
-r.tag.trace = traceback.extract_stack(limit=limit)[:-1]
-return r
+return utils.add_tag_trace(self.make_variable(name))
def values_eq(self, a, b):
"""
...@@ -319,9 +320,11 @@ class Type(object2, PureType, CLinkerType):
"""
## DELETEME ##
class SingletonType(Type):
-"""WRITEME"""
+"""Convenient base class for a Type subclass with no attributes.
+It saves having to implement __eq__ and __hash__.
+"""
__instance = None
def __new__(cls):
if cls.__instance is None:
...@@ -378,14 +381,5 @@ class Generic(SingletonType):
Py_INCREF(py_%(name)s);
""" % locals()
generic = Generic()
## DELETEME ##
class PropertiedType(Type):
"""WRITEME"""
def __eq__(self, other):
return type(self) == type(other) and self.__dict__ == other.__dict__
...@@ -2,7 +2,18 @@
# import op
# import variable
-import re, os
+from theano import config
+import re, os, traceback
def add_tag_trace(thing):
"""Add tag.trace to a node or variable.
The argument is returned after being modified in place.
"""
limit = config.traceback.limit
if limit == -1: limit = None
thing.tag.trace = traceback.extract_stack(limit=limit)[:-1]
return thing
def hashgen():
hashgen.next += 1
......
...@@ -53,6 +53,10 @@ def debugprint(obj, depth=-1, print_type=False, file=None):
results_to_print.extend(obj.outputs)
elif isinstance(obj, Function):
results_to_print.extend(obj.maker.env.outputs)
elif isinstance(obj, (list, tuple)):
results_to_print.extend(obj)
else:
raise TypeError("debugprint cannot print an object of this type", obj)
for r in results_to_print:
debugmode.debugprint(r, depth=depth, done=done, print_type=print_type, file=_file)
if file is _file:
......
...@@ -478,9 +478,16 @@ class GpuSum(Op):
PyErr_Format(PyExc_RuntimeError, "Failed to allocate output");
%(fail)s;
}
}
""" %locals()
# \begin bracket the reduction in a check that there is actually work to do
print >> sio, """
if (CudaNdarray_SIZE(%(z)s))
{
""" % locals()
# #
# Now perform the reduction # Now perform the reduction
# #
...@@ -499,9 +506,15 @@ class GpuSum(Op):
else:
getattr(self, 'c_code_reduce_%s'%(''.join(str(i) for i in self.reduce_mask)))(sio, node, name, x, z, fail)
# \end bracket the reduction ...
print >> sio, """
}
""" % locals()
return sio.getvalue()
-def _makecall(self, node, name, x, z, fail):
+def _makecall(self, node, name, x, z, fail, pattern=None):
"""Return a string for making a kernel call.
The return value looks something like:
...@@ -527,14 +540,18 @@ class GpuSum(Op):
}
"""
sio = StringIO.StringIO()
if pattern is None:
pattern = ''.join(str(c) for c in self.reduce_mask)
-ndim = len(pattern)
+ndim = len(self.reduce_mask)
nd_out = ndim - sum(self.reduce_mask)
print >> sio, """
if (verbose) printf("running kernel_reduce_sum_%(pattern)s_%(name)s\\n");
int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
-if (verbose>1) printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, n_blocks.x=%%d, n_blocks.y=%%d n_shared=%%d\\n",
-n_threads.x,n_threads.y,n_threads.z,n_blocks.x,n_blocks.y,n_shared);
+if (verbose>1) printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d, nb_block=%%d, n_shared=%%d\\n",
+n_threads.x,n_threads.y,n_threads.z,
n_threads.x*n_threads.y*n_threads.z,
n_blocks.x,n_blocks.y,
n_blocks.x*n_blocks.y, n_shared);
kernel_reduce_sum_%(pattern)s_%(name)s<<<n_blocks, n_threads, n_shared>>>(
""" %locals()
for i in xrange(ndim):
...@@ -574,7 +591,7 @@ class GpuSum(Op):
""" %locals()
return sio.getvalue()
-def _k_decl(self, node, nodename):
+def _k_decl(self, node, nodename, pattern = None, ndim = None, reduce_mask = None):
"""Return a string to declare a kernel function
.. code-block:: c
...@@ -591,27 +608,32 @@ class GpuSum(Op):
const int sZ0)
""" %locals()
-pattern = ''.join(str(i) for i in self.reduce_mask)
+if reduce_mask is None:
reduce_mask = self.reduce_mask
if ndim is None:
ndim = len(reduce_mask)
if pattern is None:
pattern = ''.join(str(i) for i in reduce_mask)
sio = StringIO.StringIO()
print >> sio, """
static __global__ void kernel_reduce_sum_%(pattern)s_%(nodename)s(
""" %locals()
-for i in xrange(len(self.reduce_mask)):
+for i in xrange(ndim):
print >> sio, """
const int d%(i)s,
""" %locals()
print >> sio, """
const float *A,
""" %locals()
-for i in xrange(len(self.reduce_mask)):
+for i in xrange(ndim):
print >> sio, """
const int sA%(i)s,
""" %locals()
print >> sio, """
float * Z
""" %locals()
-for i in xrange(len(self.reduce_mask) - sum(self.reduce_mask)):
+for i in xrange(ndim - sum(reduce_mask)):
print >> sio, """
, const int sZ%(i)s
""" %locals()
...@@ -681,6 +703,25 @@ class GpuSum(Op):
}
""" %locals()
# Threads must be organized so that threads with the same threadNum % nb_reduce
# contribute to the same sum; nb_reduce <= warpSize
def _k_reduce_buf_multiple(self, z_pos, nb_reduce):
return """
buf[threadNum] = mysum;
__syncthreads();
// rest of function is handled by one warp
if (threadNum < %(nb_reduce)s)
{
//gather all the partial sums into the first `nb_reduce` elements
for (int i = threadNum + %(nb_reduce)s; i < threadCount; i += %(nb_reduce)s)
{
mysum += buf[i];
}
%(z_pos)s = mysum;
}
""" %locals()
def c_code_reduce_ccontig(self, sio, node, name, x, z, fail):
print >> sio, """
{
...@@ -718,6 +759,7 @@ class GpuSum(Op):
""" %locals()
def c_code_reduce_1(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
{
int verbose = 0;
...@@ -725,31 +767,12 @@ class GpuSum(Op):
std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_THREADS_PER_BLOCK));
dim3 n_blocks(1);
-if (verbose) printf("running kernel_reduce_sum_1_%(name)s\\n");
-int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
-kernel_reduce_sum_1_%(name)s<<<n_blocks, n_threads, n_shared>>>(
-CudaNdarray_HOST_DIMS(%(x)s)[0],
-CudaNdarray_DEV_DATA(%(x)s),
-CudaNdarray_HOST_STRIDES(%(x)s)[0],
-CudaNdarray_DEV_DATA(%(z)s));
-CNDA_THREAD_SYNC;
-cudaError_t sts = cudaGetLastError();
-if (cudaSuccess != sts)
-{
-PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
-"kernel_reduce_sum_1_%(name)s",
-cudaGetErrorString(sts),
-n_blocks.x,
-n_blocks.y,
-n_threads.x,
-n_threads.y,
-n_threads.z);
-%(fail)s;
-}
+%(makecall)s
}
""" %locals()
def c_code_reduce_11(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
{
int verbose = 0;
...@@ -762,30 +785,7 @@ class GpuSum(Op):
n_threads.y = CudaNdarray_HOST_DIMS(%(x)s)[0];
dim3 n_blocks(1);
-if (verbose) fprintf(stdout, "running kernel_reduce_sum_11_%(name)s\\n");
-if (verbose) fprint_CudaNdarray(stdout, %(x)s);
-int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
-kernel_reduce_sum_11_%(name)s<<<n_blocks, n_threads, n_shared>>>(
-CudaNdarray_HOST_DIMS(%(x)s)[0],
-CudaNdarray_HOST_DIMS(%(x)s)[1],
-CudaNdarray_DEV_DATA(%(x)s),
-CudaNdarray_HOST_STRIDES(%(x)s)[0],
-CudaNdarray_HOST_STRIDES(%(x)s)[1],
-CudaNdarray_DEV_DATA(%(z)s));
-CNDA_THREAD_SYNC;
-cudaError_t sts = cudaGetLastError();
-if (cudaSuccess != sts)
-{
-PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
-"kernel_reduce_sum_11_%(name)s",
-cudaGetErrorString(sts),
-n_blocks.x,
-n_blocks.y,
-n_threads.x,
-n_threads.y,
-n_threads.z);
-%(fail)s;
-}
+%(makecall)s
}
""" %locals()
...@@ -824,16 +824,6 @@ class GpuSum(Op):
threads_z = ''
if len(self.reduce_mask)==3:
threads_z = ''
if config.warn.gpusum_01_011_0111_bug:
pattern = '0'+N_pattern
warn = '''
static bool warn_gpusum_01_011_0111_bug = true;
if(warn_gpusum_01_011_0111_bug && CudaNdarray_HOST_DIMS(%(x)s)[%(N)s]>4096){
printf("WARNING: old version of Theano had a silent bug with GpuSum pattern %(pattern)s when the first dimensions was bigger then 4096. Was fixed 31 may 2010. To disable this warning set the Theano flags warn.gpusum_01_011_0111_bug to False. Won't repeat the warning before we exit.\\n");
warn_gpusum_01_011_0111_bug = false;
}
'''%locals()
else: warn = ""
print >> sio, """
{
int verbose = 0;
...@@ -843,7 +833,6 @@ class GpuSum(Op):
%(threads_y)s
%(threads_z)s
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS));
%(warn)s
%(makecall)s
}
""" %locals()
...@@ -862,16 +851,19 @@ class GpuSum(Op):
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_THREADS_PER_BLOCK));
-dim3 n_blocks(CudaNdarray_HOST_DIMS(%(x)s)[1]);
+dim3 n_blocks(1,CudaNdarray_HOST_DIMS(%(x)s)[1]);
if (verbose) printf("running kernel_reduce_sum_10_%(name)s\\n");
int n_shared = sizeof(float) * n_threads.x;
-kernel_reduce_sum_10_%(name)s<<<n_blocks, n_threads, n_shared>>>(
+kernel_reduce_sum_010_%(name)s<<<n_blocks, n_threads, n_shared>>>(
1,
CudaNdarray_HOST_DIMS(%(x)s)[0],
CudaNdarray_HOST_DIMS(%(x)s)[1],
CudaNdarray_DEV_DATA(%(x)s),
1,
CudaNdarray_HOST_STRIDES(%(x)s)[0],
CudaNdarray_HOST_STRIDES(%(x)s)[1],
CudaNdarray_DEV_DATA(%(z)s),
1,
CudaNdarray_HOST_STRIDES(%(z)s)[0]
);
CNDA_THREAD_SYNC;
...@@ -879,7 +871,7 @@ class GpuSum(Op):
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
n_blocks.y,
...@@ -890,7 +882,143 @@ class GpuSum(Op):
}
}
""" %locals()
def c_code_reduce_010(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
makecall_inner = self._makecall(node, name, x, z, fail, pattern="010_inner")
pattern = ''.join(str(i) for i in self.reduce_mask)
print >> sio, """
{
//int n_summations = CudaNdarray_HOST_DIMS(%(x)s)[0] * CudaNdarray_HOST_DIMS(%(x)s)[2];
//if ((n_summations >= 15 * 32) && (CudaNdarray_HOST_DIMS(%(x)s)[2]>=16))
if (1) // if the alternative is less buggy, consider not using this branch
{
// If there are a lot of summations to do, then we can use simple parallelization -
// use each thread to do one sum.
// we might as well launch blocks of 32 threads because that's the warp size.
// we could schedule more threads if we were maxing out the gridsize below, but
// the gridsize is way more than the physical hardware and I think 32 threads
// on a huge grid is enough to fully use the hardware.
dim3 n_threads(32,1,1);
// We kind of reshape the input implicitly to something 4D:
// the shape A,B,C -> A, B, D, E
// where C <= D*E < C+32
// where E==32
int A = CudaNdarray_HOST_DIMS(%(x)s)[0];
int B = CudaNdarray_HOST_DIMS(%(x)s)[1];
int C = CudaNdarray_HOST_DIMS(%(x)s)[2];
int D = C/32;
if (32*D < C) D+= 1;
assert ((C <= 32*D) && (32*D < C+32));
// The gridsize would ideally be (A, D). But we do the following logic to make
// sure we don't ask for a grid that is too big.
dim3 n_blocks(A,D);
if (n_blocks.x > NUM_VECTOR_OP_BLOCKS) n_blocks.x = NUM_VECTOR_OP_BLOCKS;
if (n_blocks.x*n_blocks.y > NUM_VECTOR_OP_BLOCKS) n_blocks.y = NUM_VECTOR_OP_BLOCKS/n_blocks.x;
int n_shared = 0;
kernel_reduce_sum_010_AD_%(name)s<<<n_blocks, n_threads, n_shared>>>(
A,B,C,D,
CudaNdarray_DEV_DATA(%(x)s),
CudaNdarray_HOST_STRIDES(%(x)s)[0],
CudaNdarray_HOST_STRIDES(%(x)s)[1],
CudaNdarray_HOST_STRIDES(%(x)s)[2],
CudaNdarray_DEV_DATA(%(z)s),
CudaNdarray_HOST_STRIDES(%(z)s)[0],
CudaNdarray_HOST_STRIDES(%(z)s)[1]
);
CNDA_THREAD_SYNC;
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_AD_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
n_blocks.y,
n_threads.x,
n_threads.y,
n_threads.z);
%(fail)s;
}
}
else
{
int verbose = 2;
dim3 n_threads(std::min(32,CudaNdarray_HOST_DIMS(%(x)s)[2]));
while( (n_threads.x*(n_threads.y+1)<=NUM_VECTOR_OP_THREADS_PER_BLOCK)
&& (n_threads.y<CudaNdarray_HOST_DIMS(%(x)s)[1])){
n_threads.y++;
}
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
(int)NUM_VECTOR_OP_BLOCKS));
n_blocks.y = std::min(
ceil_intdiv(CudaNdarray_HOST_DIMS(%(x)s)[2],(int)n_threads.x),
(int)(NUM_VECTOR_OP_BLOCKS / n_blocks.x)
);
if(std::min(std::min(CudaNdarray_HOST_STRIDES(%(x)s)[0],
CudaNdarray_HOST_STRIDES(%(x)s)[1]),
CudaNdarray_HOST_STRIDES(%(x)s)[2])
==CudaNdarray_HOST_STRIDES(%(x)s)[2]
&& n_blocks.y==ceil_intdiv(CudaNdarray_HOST_DIMS(%(x)s)[2],(int)n_threads.x)){
if(verbose>1)
printf("n_block.x.1=%%d, n_block.x.2=%%d, n_block.y.1=%%d, n_block.y.2=%%d,\\n",
CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS,
ceil_intdiv(CudaNdarray_HOST_DIMS(%(x)s)[2],(int)n_threads.x),
(int)(NUM_VECTOR_OP_BLOCKS / n_blocks.x));
assert(n_threads.x<=32);
%(makecall_inner)s
}else{
n_threads.x = std::min(CudaNdarray_HOST_DIMS(%(x)s)[1],
(int)NUM_VECTOR_OP_THREADS_PER_BLOCK);
n_blocks.x = std::min(CudaNdarray_HOST_DIMS(%(x)s)[0], (int)NUM_VECTOR_OP_BLOCKS);
n_blocks.y = std::min(
CudaNdarray_HOST_DIMS(%(x)s)[2],
(int)(NUM_VECTOR_OP_BLOCKS / n_blocks.x)
);
%(makecall)s
}
CNDA_THREAD_SYNC;
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_%(pattern)s_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
n_blocks.y,
n_threads.x,
n_threads.y,
n_threads.z);
%(fail)s;
}
}
}
""" %locals()
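The tiling arithmetic in the comments of the branch above (the shape A,B,C is implicitly treated as A,B,D,E with E==32 and C <= D*E < C+32) can be checked with a small Python sketch; `tile_shape` is a hypothetical helper name, not part of Theano:

```python
def tile_shape(A, B, C, warp=32):
    """Mimic the grid-sizing arithmetic of kernel_reduce_sum_010_AD:
    the last dimension C is covered by D tiles of `warp` threads each."""
    D = C // warp
    if warp * D < C:
        D += 1  # ceiling division: one extra partial tile for the remainder
    # same invariant the C code asserts: C <= 32*D < C+32
    assert C <= warp * D < C + warp
    return (A, D)  # ideal grid dimensions before clipping to NUM_VECTOR_OP_BLOCKS
```

The returned grid is then clipped against the block limit exactly as the C code does with `NUM_VECTOR_OP_BLOCKS`.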
def c_code_reduce_0101(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
{
int verbose = 0;
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(%(x)s)[3],
NUM_VECTOR_OP_THREADS_PER_BLOCK));
while (n_threads.x * n_threads.y <= NUM_VECTOR_OP_THREADS_PER_BLOCK)
{
if (n_threads.y > CudaNdarray_HOST_DIMS(%(x)s)[1]) break;
n_threads.y += 1;
}
n_threads.y -= 1;
dim3 n_blocks(CudaNdarray_HOST_DIMS(%(x)s)[0], CudaNdarray_HOST_DIMS(%(x)s)[2]);
%(makecall)s
}
""" %locals()
def c_code_reduce_100(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
...@@ -1067,13 +1195,13 @@ class GpuSum(Op):
n_threads.z = CudaNdarray_HOST_DIMS(%(x)s)[0];
dim3 n_blocks(CudaNdarray_HOST_DIMS(%(x)s)[1]);
%(makecall)s
}
""" %locals()
def c_code_cache_version(self):
#return ()
return (19,)
def c_support_code_apply(self, node, nodename):
...@@ -1207,24 +1335,60 @@ class GpuSum(Op):
}
}
""" %locals()
if self.reduce_mask == (0,1,0) or self.reduce_mask == (1,0):
# this kernel uses one block for each column,
# threads per block for each element per column.
#TODO: This kernel is pretty inefficient in terms of reading, because if A is
# c_contiguous (typical case) then each warp is accessing non-contiguous
# memory (a segment of a column).
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2*sZ1]')
print >> sio, """
static __global__ void kernel_reduce_sum_010_%(nodename)s(
const int d0,
const int d1,
const int d2,
const float *A, const int sA0, const int sA1, const int sA2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
const int threadNum = threadIdx.x;
extern __shared__ float buf[];
if (warpSize != 32)
{
return; //TODO: set error code
}
for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x)
{
for (int i2 = blockIdx.y; i2 < d2; i2 += gridDim.y)
{
float mysum = 0.0f;
for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)
{
mysum += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
}
%(reducebuf)s
}
}
}
""" %locals()
if self.reduce_mask == (0,1,0):
print >> sio, """
static __global__ void kernel_reduce_sum_010_AD_%(nodename)s(
const int A,
const int B,
const int C,
const int D,
//const int E, // THIS is 32
const float *X, const int sX0, const int sX1, const int sX2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
const int threadNum = threadIdx.x;
float mysum = 0.0f;
if (warpSize != 32)
...@@ -1232,13 +1396,62 @@ class GpuSum(Op):
return; //TODO: set error code
}
for (int a = blockIdx.x; a < A; a += gridDim.x)
{
for (int i2_D = blockIdx.y; i2_D < D; i2_D += gridDim.y)
{
int c = i2_D * 32 + threadIdx.x;
if (c < C)
{
mysum = 0;
for (int b = 0; b < B; ++b)
{
mysum += X[a * sX0 + b * sX1 + c * sX2];
}
Z[a * sZ0 + c * sZ1] = mysum;
}
}
}
}
""" %locals()
if self.reduce_mask == (0,1,0):
#
# This kernel is optimized for the case where the innermost dimension has the smallest stride.
# It uses one block for multiple columns (up to 32, TODO),
# and threads per block for each element per column.
#thread.x = dim 2 contiguous
#thread.y = dim 1
#block.x = dim 0
#block.y = dim 2 rest
init = self._k_init(node,nodename)
decl = self._k_decl(node, nodename, pattern="010_inner")
reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]','blockDim.x')
print >> sio, """
%(decl)s
{
if(warpSize<blockDim.x){
//TODO: set error code
Z[0] = -666;
return;
}
%(init)s
for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x)
{
for (int i2 = blockIdx.y*blockDim.x+threadIdx.x; i2 < d2; i2 += gridDim.y*blockDim.x)
{
for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
{
mysum += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
}
%(reducebuf)s
}
}
}
""" %locals()
if self.reduce_mask == (1,1,0):
# this kernel uses one block for each column,
...@@ -1387,6 +1600,34 @@ class GpuSum(Op):
}
}
""" %locals()
if self.reduce_mask == (0,1,0,1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
print >> sio, """
%(decl)s
{
%(init)s
for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x)
{
for (int i2 = blockIdx.y; i2 < d2; i2 += gridDim.y)
{
float mysum = 0.0f;
for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
{
for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)
{
mysum += A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3];
}
}
%(reducebuf)s
}
}
}
""" %locals()
if self.reduce_mask == (1,1,1,1):
reducebuf = self._k_reduce_buf('Z[0]')
decl = self._k_decl(node, nodename)
...
...@@ -149,17 +149,15 @@ class GpuGemm(Op):
"""
implement the gemm on the gpu.
..note: This probably don't work correctly for no_inplace gemm.
Need to check al least refcount.
"""
def __init__(self, inplace):
self.__setstate__({'inplace':inplace})
def __str__(self):
if self.inplace:
return 'GpuGemm{inplace}'
else:
return 'GpuGemm{no_inplace}'
def __eq__(self, other):
return (type(self) == type(other)\
...@@ -183,7 +181,7 @@ class GpuGemm(Op):
return Apply(self, [z, a, x, y, b], [z.type()])
def c_code_cache_version(self):
return (3,)
def c_code(self, node, name, inputs, outputs, sub):
#z_out = alpha * dot(x,y) + beta * z_in
...@@ -192,8 +190,9 @@ class GpuGemm(Op):
z_in, a, x, y, b = inputs
z_out, = outputs
fail = sub['fail']
sio = StringIO.StringIO()
return """
print >> sio, """
#define REAL float
float %(name)s_a = (%(a)s->descr->type_num == PyArray_FLOAT)
...@@ -205,51 +204,48 @@ class GpuGemm(Op):
: (REAL)(((double*)%(b)s->data)[0]);
#undef REAL
"""
if self.inplace:
print >> sio, """
}
Py_XDECREF(%(z_out)s);
%(z_out)s = %(z_in)s;
Py_INCREF(%(z_out)s);
"""
else:
print >> sio, """
if (!%(z_out)s
|| (%(z_out)s->nd != 2)
? (REAL)(((float*)%(a)s->data)[0])
: (REAL)(((double*)%(a)s->data)[0]);
float %(name)s_b = (%(b)s->descr->type_num == PyArray_FLOAT) ?
(REAL)(((float*)%(b)s->data)[0])
: (REAL)(((double*)%(b)s->data)[0]);
#undef REAL
if ((NULL == %(z_out)s)
|| (CudaNdarray_HOST_DIMS(%(z_out)s)[0] != CudaNdarray_HOST_DIMS(%(z_in)s)[0])
|| (CudaNdarray_HOST_DIMS(%(z_out)s)[1] != CudaNdarray_HOST_DIMS(%(z_in)s)[1])
)
{
Py_XDECREF(%(z_out)s);
%(z_out)s = (CudaNdarray*)CudaNdarray_Copy(%(z_in)s);
if (!%(z_out)s)
{
%(fail)s;
}
}
else
{
if (CudaNdarray_CopyFromCudaNdarray(%(z_out)s, %(z_in)s))
{
%(fail)s;
}
}
"""
print >> sio, """
if (CudaNdarray_gemm(%(name)s_a, %(x)s, %(y)s, %(name)s_b, %(z_out)s))
{
%(fail)s;
}
"""
gpu_gemm_inplace = GpuGemm(True)
return sio.getvalue() % locals()
gpu_gemm_no_inplace = GpuGemm(inplace=False)
gpu_gemm_inplace = GpuGemm(inplace=True)
##
# Not really a BLAS operation, but whatever.
...
...@@ -558,6 +558,10 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape)
free(rval_dims);
return NULL;
}
if (rval_size==0)
{
return CudaNdarray_NewDims(rval_nd, rval_dims);
}
if(CudaNdarray_is_c_contiguous(self))
{
...@@ -1019,6 +1023,44 @@ CudaNdarray_inplace_add(PyObject* py_self, PyObject * py_other)
Py_INCREF(py_self);
return py_self;
}
case 5:
{
dim3 n_blocks(
std::min(CudaNdarray_HOST_DIMS(self)[1], NUM_VECTOR_OP_BLOCKS),
CudaNdarray_HOST_DIMS(self)[2]
);
while (n_blocks.x * n_blocks.y > NUM_VECTOR_OP_BLOCKS) n_blocks.y /= 2;
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(self)[3], NUM_VECTOR_OP_THREADS_PER_BLOCK)
);
for (int i = 0; i < CudaNdarray_HOST_DIMS(self)[0]; ++i)
{
k_iAdd_4<<<n_blocks, n_threads>>>(
CudaNdarray_HOST_DIMS(self)[1],
CudaNdarray_HOST_DIMS(self)[2],
CudaNdarray_HOST_DIMS(self)[3],
CudaNdarray_HOST_DIMS(self)[4],
CudaNdarray_DEV_DATA(self) + i * CudaNdarray_HOST_STRIDES(self)[0],
CudaNdarray_HOST_STRIDES(self)[1],
CudaNdarray_HOST_STRIDES(self)[2],
CudaNdarray_HOST_STRIDES(self)[3],
CudaNdarray_HOST_STRIDES(self)[4],
CudaNdarray_DEV_DATA(other) + i * CudaNdarray_HOST_STRIDES(other)[0],
CudaNdarray_HOST_STRIDES(other)[1],
CudaNdarray_HOST_STRIDES(other)[2],
CudaNdarray_HOST_STRIDES(other)[3],
CudaNdarray_HOST_STRIDES(other)[4]);
CNDA_THREAD_SYNC;
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iAdd", cudaGetErrorString(err));
return NULL;
}
}
Py_INCREF(py_self);
return py_self;
}
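The `case 5` branch above handles 5-D arrays by looping over the outermost dimension on the host and launching the existing 4-D kernel once per slice. The same decomposition expressed in NumPy terms (a sketch; `inplace_add_5d` is a hypothetical name, not part of the codebase):

```python
import numpy as np

def inplace_add_5d(self_arr, other):
    """Loop over dim 0 and add 4-D slices in place, mirroring how
    case 5 calls k_iAdd_4 once per outermost index."""
    assert self_arr.ndim == other.ndim == 5
    for i in range(self_arr.shape[0]):
        self_arr[i] += other[i]  # each slice is a 4-D inplace add
    return self_arr
```

On the GPU side the per-slice offset is computed with `CudaNdarray_DEV_DATA(...) + i * CudaNdarray_HOST_STRIDES(...)[0]`, which is the pointer-arithmetic equivalent of the slicing above.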
}
PyErr_Format(PyExc_NotImplementedError, "inplace_add w nd=%i\n", self->nd);
...@@ -1370,7 +1412,7 @@ CudaNdarray_setitem(PyObject *o, PyObject *key, PyObject *v)
{
PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_setitem: syncing structure to device failed");
Py_DECREF(rval);
return -1;
}
CudaNdarray *viewCopyForComparison =
...
...@@ -2,12 +2,13 @@ import sys
import theano
import numpy
from theano import scalar as scal
from theano import tensor, compile, gof
from theano.gof import local_optimizer, EquilibriumDB, SequenceDB, Optimizer, toolbox, DestroyHandler
from theano.sandbox.cuda.basic_ops import *
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda.blas import (gpu_dot22, gpu_dot22scalar, gpu_gemm_inplace,
gpu_gemm_no_inplace, GpuConv)
from theano.sandbox.cuda.blas import GpuDownsampleFactorMax, GpuDownsampleFactorMaxGrad
from theano.sandbox.cuda.nnet import (
GpuCrossentropySoftmaxArgmax1HotWithBias,
...@@ -187,34 +188,25 @@ def local_gpu_dot22scalar(node):
@local_optimizer([])
def local_gpu_gemm(node):
"""
works for both inplace and non-inplace gemm
gpu_from_host(gemm) -> gpu_gemm(gpu_from_host)
gemm(host_from_gpu) -> host_from_gpu(gpu_gemm)
"""
gemms = {tensor.blas.gemm_inplace: gpu_gemm_inplace,
tensor.blas.gemm_no_inplace: gpu_gemm_no_inplace}
if node.op == gpu_from_host:
host_input = node.inputs[0]
if host_input.owner and host_input.owner.op in gemms:
op = host_input.owner.op
return [gpu_gemm_inplace(gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b)]
elif host_input.owner and host_input.owner.op == tensor.blas.gemm_no_inplace:
z, a, x, y, b = host_input.owner.inputs
return [gemms[op](gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b)]
if node.op in gemms:
z, a, x, y, b = node.inputs
x_on_gpu = (x.owner and x.owner.op == host_from_gpu)
y_on_gpu = (y.owner and y.owner.op == host_from_gpu)
z_on_gpu = (z.owner and z.owner.op == host_from_gpu)
if x_on_gpu or y_on_gpu or z_on_gpu:
return [host_from_gpu(gpu_gemm_inplace(gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b))]
elif node.op == tensor.blas.gemm_no_inplace:
z, a, x, y, b = node.inputs
x_on_gpu = (x.owner and x.owner.op == host_from_gpu)
y_on_gpu = (y.owner and y.owner.op == host_from_gpu)
z_on_gpu = (z.owner and z.owner.op == host_from_gpu)
if x_on_gpu or y_on_gpu or z_on_gpu:
return [host_from_gpu(gemms[node.op](gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b))]
return False
@register_opt()
...@@ -234,8 +226,44 @@ def local_gpu_sum(node):
gsum=GpuSum(reduce_mask)
pattern=(''.join(str(i) for i in reduce_mask))
if hasattr(gsum, 'c_code_reduce_%s'%pattern):
rval = host_from_gpu(gsum(gpu_from_host(x)))
if rval.type == node.outputs[0].type:
return [rval]
else:
print >> sys.stderr, "WARNING: local_gpu_sum got type wrong"
return None
else:
# Try to make a simpler pattern based on reshaping
# The principle is that if two adjacent dimensions have the same value in
# the reduce_mask, then we can reshape to make them a single dimension, do
# the sum, and then reshape to get them back.
shape_of = node.env.shape_feature.shape_of
x_shape = shape_of[x]
new_in_shp = [x_shape[0]]
new_mask = [reduce_mask[0]]
for i in range(1, x.type.ndim):
if reduce_mask[i] == reduce_mask[i-1]:
new_in_shp[-1] *= x_shape[i]
else:
new_mask.append(reduce_mask[i])
new_in_shp.append(x_shape[i])
pattern=(''.join(str(i) for i in new_mask))
new_gsum = GpuSum(new_mask)
if hasattr(new_gsum, 'c_code_reduce_%s'%pattern):
reshaped_x = x.reshape(tensor.stack(*new_in_shp))
sum_reshaped_x = host_from_gpu(new_gsum(gpu_from_host(reshaped_x)))
unreshaped_sum = sum_reshaped_x.reshape(tensor.stack(*shape_of[node.outputs[0]]))
if unreshaped_sum.type == node.outputs[0].type:
return [unreshaped_sum]
else:
print >> sys.stderr, "WARNING: local_gpu_sum got type wrong"
return None
raise Exception("GpuSum don't have implemented the pattern",pattern)
return False
...
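The reshape-based fallback in `local_gpu_sum` above merges adjacent dimensions that share a `reduce_mask` value before summing, so an unimplemented pattern can be mapped onto an implemented one. A standalone sketch of that merging step (`merge_mask` is a hypothetical name, not part of Theano):

```python
def merge_mask(reduce_mask, shape):
    """Collapse adjacent dimensions that share a reduce_mask value,
    multiplying their extents, as local_gpu_sum does before reshaping."""
    new_shape = [shape[0]]
    new_mask = [reduce_mask[0]]
    for m, s in zip(reduce_mask[1:], shape[1:]):
        if m == new_mask[-1]:
            new_shape[-1] *= s  # same mask value: fold into previous dim
        else:
            new_mask.append(m)
            new_shape.append(s)
    return new_mask, new_shape
```

For example, mask `[0,1,1,0]` on shape `(5,4,3,2)` becomes mask `[0,1,0]` on shape `(5,12,2)`, which matches the `c_code_reduce_010` implementation added in this commit.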
...@@ -34,11 +34,14 @@ def test_sum():
test sum pattern 1, 11, 10, 01, 100, 110, 011, 001, 111, 0011, 0111, 1011, 1111
TODO: test with broadcast
"""
for shape, pattern in [((100,3,1300),[1]),
((0,),[0]),((5,),[0]),
((0,0),[0,1]),((1,0),[0,1]),((5,4),[0,1]),((33,31),[0,1]),((5,4),[1]),((5,4),[0]),#need something bigger then 32 for some opt test.
((5,4,3),[0]),((5,4,3),[1]),((5,4,3),[0,1]),((5,4,3),[2]),((5,4,3),[1,2]),((5,4,3),[0,1,2]),
((0,0,0,0),[0,1,2,3]),
((5,4,3,20),[2,3]), ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3]),((5,4,3,2),[1,2,3]),
((5,4,3,10,11),[1,2]),
((5,4,3,20),[2,3]), ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3]),((5,4,3,2),[1,2,3]),
#test shape bigger then 4096 on each dimension to make sure that we work correctly when we don't have enought thread/block in each dimensions
((4100,3),[0]),((3,4101),[0]),#10
...@@ -47,7 +50,7 @@ def test_sum():
((4100,3),[0,1]),((3,4101),[0,1]),#11
((4100,4,3),[0]),((5,4100,3),[0]),((5,4,4100),[0]),#100
((4100,4,3),[1]),((5,4100,3),[1]),((5,4,4100),[1]),#010
((4100,4,3),[2]),((5,4100,3),[2]),((5,4,4100),[2]),#001
((4100,4,3),[0,1]),((5,4100,3),[0,1]),((5,4,4100),[0,1]),#110
((4100,4,3),[1,2]),((5,4100,3),[1,2]),((5,4,4100),[1,2]),#011
...@@ -70,9 +73,9 @@ def test_sum():
assert tcn.GpuSum in [x.op.__class__ for x in f.maker.env.toposort()]
assert T.Sum in [x.op.__class__ for x in f2.maker.env.toposort()]
if val.size==0:
assert f2(val)==f(val), ('shape', shape, 'pattern', pattern)
else:
assert numpy.allclose(f2(val),f(val)), ('shape', shape, 'pattern', pattern)
#test with dimshuffle
...
...@@ -191,25 +191,31 @@ class Scalar(Type):
typedef theano_complex%(nbits)s complex_type;
typedef npy_float%(half_nbits)s scalar_type;
complex_type operator +(const complex_type &y) const {
complex_type ret;
ret.real = this->real + y.real;
ret.imag = this->imag + y.imag;
return ret;
}
complex_type operator -() const {
complex_type ret;
ret.real = -this->real;
ret.imag = -this->imag;
return ret;
}
complex_type operator -(const complex_type &y) const {
complex_type ret;
ret.real = this->real - y.real;
ret.imag = this->imag - y.imag;
return ret;
}
complex_type operator *(const complex_type &y) const {
complex_type ret;
ret.real = this->real * y.real - this->imag * y.imag;
ret.imag = this->real * y.imag + this->imag * y.real;
return ret;
}
complex_type operator /(const complex_type &y) const {
complex_type ret;
scalar_type y_norm_square = y.real * y.real + y.imag * y.imag;
ret.real = (this->real * y.real + this->imag * y.imag) / y_norm_square;
...@@ -296,6 +302,7 @@ class Scalar(Type):
return ""
def c_code_cache_version(self):
return (8,) # put const around operators and added unary '-' operator
# no need to put lib.amdlibm here as c_compile_args() are put in the key.
return (7,) # make complex c code optional
return (6,) # added implementations of operators that work with scalar arguments
...@@ -459,7 +466,39 @@ def float_out_nocomplex(*types):
if type in complex_types:
raise TypeError('complex argument not supported')
return float64,
class unary_out_lookup(gof.utils.object2):
"""
get an output_types_preference object by passing a dictionary:
unary_out_lookup({int8:int32, float32:complex128})
The result is an op that maps int8 to int32 and float32 to complex128; other input
types raise a TypeError.
"""
def __init__(self, type_table):
self.tbl = type_table
def __call__(self, *types):
if len(types) == 1:
types = types[0]
try:
rval = self.tbl[types]
except:
raise TypeError(types)
if isinstance(types, (list, tuple)):
return rval
else:
return [rval]
def __eq__(self, other):
return type(self) == type(other) and self.tbl == other.tbl
def __hash__(self):
return hash(type(self)) # ignore hash of table
def real_out(type):
if type == complex64:
return float32,
if type == complex128:
return float64,
return type,
class ScalarOp(Op):
...@@ -810,6 +849,13 @@ class Mul(ScalarOp):
retval = []
for input in inputs:
if input.type in grad_types:
if input.type in complex_types:
# does casting from real to complex work?
dz_dinput = cast(mul(*(utils.difference(inputs, [input]))), input.type.dtype)
x = real(dz_dinput)
y = imag(dz_dinput)
retval += [complex(x*real(gz)+y*imag(gz), x*imag(gz)-y*real(gz))]
else:
retval += [cast(mul(*([gz] + utils.difference(inputs, [input]))), input.type.dtype)]
else:
retval += [None]
...@@ -1088,11 +1134,9 @@ class Abs(UnaryScalarOp):
return numpy.abs(x)
def grad(self, (x, ), (gz, )):
if x.type in grad_types:
return gz * x / abs(x), # formula works for complex and real
else:
return None,
#backport
#return gz * sgn(x) if x.type in grad_types else None,
def c_code(self, node, name, (x, ), (z, ), sub):
type = node.inputs[0].type
if type in int_types:
...@@ -1478,6 +1522,91 @@ class Tanh(UnaryScalarOp):
return "%(z)s = tanh(%(x)s);" % locals()
tanh = Tanh(upgrade_to_float, name = 'tanh')
class Real(UnaryScalarOp):
"""Extract the real coordinate of a complex number. """
def impl(self, x):
return numpy.real(x)
def grad(self, (x, ), (gz, )):
return [complex(gz, 0)]
real = Real(real_out, name='real')
class Imag(UnaryScalarOp):
def impl(self, x):
return numpy.imag(x)
def grad(self, (x, ), (gz, )):
return [complex(0, gz)]
imag = Imag(real_out, name='imag')
class Angle(UnaryScalarOp):
def impl(self, x):
return numpy.angle(x)
def grad(self, (c, ), (gtheta, )):
# y = x.imag
# r = sqrt(y**2 + x.real**2)
# g = y/r
# if x == 0 and y == 0:
# theta = 0
# elif x >= 0:
# theta = numpy.arcsin(g)
# else:
# theta = -numpy.arcsin(g)+numpy.pi
x = real(c)
y = imag(c)
r = abs(c)
gr = -gtheta * y / (r**2 * sqrt(1 - (y/r)**2))
gx = gr * x/r
gy = gr * y/r
return [complex(gx, gy)]
angle = Angle(specific_out(float64), name='angle')
class Complex(BinaryScalarOp):
@staticmethod
def output_types_preference(x,y):
if x in complex_types:
raise TypeError(x)
if y in complex_types:
raise TypeError(y)
up = Scalar.upcast(x, y)
if up in ('float64', 'int64', 'uint64', 'int32', 'uint32'):
return [complex128]
else:
return [complex64]
def impl(self, x, y):
return numpy.complex(x, y)
def grad(self, (x,y), (z,)):
return [real(z), imag(z)]
complex = Complex(name='complex')
class Conj(UnaryScalarOp):
def impl(self, x):
return numpy.conj(x)
def grad(self, (x, ), (gz, )):
return [conj(gz)]
conj = Conj(same_out, name='conj')
class ComplexFromPolar(BinaryScalarOp):
@staticmethod
def output_types_preference(x,y):
return Complex.output_types_preference(x,y)
def impl(self, r, theta):
if r < 0:
raise ValueError('polar radius must be non-negative', r)
x = r*numpy.cos(theta)
y = r*numpy.sin(theta)
if x.dtype == 'float32':
return numpy.complex64(numpy.complex(x,y))
else:
return numpy.complex128(numpy.complex(x,y))
def grad(self, (r,theta), (gz,)):
gr = cos(theta) * real(gz) + sin(theta) * imag(gz)
gtheta = -real(gz) * r * sin(theta) + imag(gz) * r * cos(theta)
return [gr, gtheta]
complex_from_polar = ComplexFromPolar(name='complex_from_polar')
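ComplexFromPolar.impl builds r*cos(theta) + i*r*sin(theta), which is the same number as r*exp(i*theta); a quick standalone numpy illustration (not from the diff):

```python
import numpy as np

r, theta = 2.0, 0.75
# same construction as ComplexFromPolar.impl: x = r*cos(theta), y = r*sin(theta)
z = r * np.cos(theta) + 1j * r * np.sin(theta)

assert np.allclose(z, r * np.exp(1j * theta))   # polar identity
assert np.allclose(abs(z), r)                   # radius recovered by abs
assert np.allclose(np.angle(z), theta)          # angle recovered by angle
```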
class Composite(ScalarOp):
...@@ -1642,3 +1771,4 @@ class Composite(ScalarOp):
#otherwise self.perform won't work.
self.__init__(self.inputs, self.outputs)
...@@ -10,7 +10,7 @@ if sys.version_info >= (2,5):
import functools
import numpy, theano
#from copy import copy as python_copy
from theano import gof
from theano.gof import Variable, Op, utils, Type, Constant, Value
...@@ -342,10 +342,10 @@ def get_constant_value(v):
# it is not a constant, but in some cases it *could* be replaced with one.
# Note that this would have an effect on the broadcasting of inputs and so on
try:
numpy.complex(v.data) #works for all numeric scalars
return v.data
except:
raise TypeError('v.data is non-numeric', v)
if v.owner:
if isinstance(v.owner.op, Alloc):
return get_constant_value(v.owner.inputs[0])
...@@ -484,12 +484,12 @@ class TensorType(Type):
return False
if 'int' in str(a.dtype):
return numpy.all(a==b)
#elif a.shape == (): #for comparing scalars, use broadcasting.
## Note: according to James B, there was a reason for the
## following two lines, that may seem weird at first glance.
## If someone can figure out what it is, please say it here!
#ones = numpy.ones(2)
#return _allclose(ones * a, ones*b) ### dtype handling is wrong here
else:
cmp = _allclose(a, b)
if cmp:
...@@ -1638,88 +1638,25 @@ def sinh(a):
def tanh(a):
"""hyperbolic tangent of a"""
class Real(Op):
"""Extract the real elements of a complex ndarray"""
view_map = {0:[0]}
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def make_node(self, x):
_x = as_tensor(x)
y_dtype = _x.type.dtype
if y_dtype == 'complex64':
y_dtype = 'float32'
if y_dtype == 'complex128':
y_dtype = 'float64'
_y = Tensor(y_dtype, _x.type.broadcastable)()
return Apply(self, [_x], [_y])
def perform(self, node, (x,), (y,)):
if str(x.dtype).startswith('complex'):
y[0] = x.real
else:
y[0] = x
def grad(self, inputs, (g_y,)):
#TODO: waiting on a Complex(real=, imag=) op that can merge
#things back into a complex tensor
raise NotImplementedError()
_real = Real()
@constructor
def real(x):
"""Return the real part of real or complex-valued `x`
For real-valued `x`, `x` itself is returned.
"""
_x = as_tensor_variable(x)
if _x.type.dtype.startswith('complex'):
return _real(x)
else:
return _x
class Imag(Op):
"""Extract the imaginary elements of a complex ndarray"""
view_map = {0:[0]}
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def make_node(self, x):
_x = as_tensor_variable(x)
if not _x.type.dtype.startswith('complex'):
raise TypeError('Imag(x) requires complex x', x)
if _x.type.dtype == 'complex64': y_dtype = 'float32'
elif _x.type.dtype == 'complex128': y_dtype = 'float64'
else:
raise NotImplementedError('what is this?', y_dtype)
_y = Tensor(y_dtype, _x.type.broadcastable)()
return Apply(self, [_x], [_y])
def perform(self, node, (x,), (y,)):
if str(x.dtype).startswith('complex'):
y[0] = x.imag
else:
y[0] = x * 0
def grad(self, inputs, (g_y,)):
# TODO: waiting on a complex(real=, imag=) op that can merge
# things back into a complex tensor
raise NotImplementedError()
_imag = Imag()
@constructor
def imag(x):
"""Return the imaginary part of real or complex-valued `x`
For real-valued `x` this returns `zeros_like(x)`.
"""
_x = as_tensor_variable(x)
if _x.type.dtype.startswith('complex'):
return _imag(x)
else:
return zeros_like(x)
@constructor
def angle(x):
"""Return the angular component of complex-valued `x`"""
raise NotImplementedError()
@_scal_elemwise
def real(z):
"""Return real component of complex-valued tensor `z`"""
@_scal_elemwise
def imag(z):
"""Return imaginary component of complex-valued tensor `z`"""
@_scal_elemwise
def angle(z):
"""Return polar-coordinate angle of complex-valued tensor `z`"""
@_scal_elemwise
def complex(real, imag):
"""Return complex-valued tensor with `real` and `imag` components"""
@_scal_elemwise
def complex_from_polar(abs, angle):
"""Return complex-valued tensor from polar coordinate specification"""
##########################
# Misc
...@@ -2150,6 +2087,7 @@ def pow(a, b):
def clip(x, min, max):
"""clip x to be between min and max"""
# see decorator for function body
# for grep: clamp, bound
pprint.assign(add, printing.OperatorPrinter('+', -2, 'either'))
pprint.assign(mul, printing.OperatorPrinter('*', -1, 'either'))
...@@ -2344,7 +2282,7 @@ class Subtensor(Op):
and (idx.step is None or idx.step == 1):
outshp.append(xl)
else:
# Not implemented yet
outshp.append(shape_i(i)(node.outputs[0]))
i += 1
else:
...@@ -2443,31 +2381,48 @@ class SubtensorPrinter:
pprint.assign(lambda pstate, r: r.owner and isinstance(r.owner.op, Subtensor), SubtensorPrinter())
def setsubtensor(x, y, idx_list, inplace=False):
"""
setsubtensor is meant to replicate the following numpy behaviour: x[i,j,k] = y
:param x: symbolic variable for the lvalue of = operation
:param y: symbolic variable for the rvalue of = operation
:param idx_list: tuple of length x.dim, containing indices with which to index x.
:param inplace: boolean to declare whether the operation is in place or not (False unless
called from within an optimization)
:Details: idx_list can be a tuple containing a mixture of numeric constants, symbolic
scalar values and standard numpy slice objects. i.e:
idx_list=(1,2,3), idx_list=(1,b,3) where b is an iscalar variable,
idx_list=(slice(start,stop,step),b,3) equivalent to x[start:stop:step, b, 3]
"""
the_op = IncSubtensor(idx_list, inplace, True)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def incsubtensor(x, y, idx_list, inplace=False):
"""
incsubtensor is meant to replicate the following numpy behaviour: x[i,j,k] += y
:see: theano.tensor.basic.setsubtensor
"""
the_op = IncSubtensor(idx_list, inplace, False)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def setsubtensor(x, y, idx_list, inplace=False):
print >> sys.stderr, "tensor.setsubtensor is deprecated - please use set_subtensor"
the_op = IncSubtensor(idx_list, inplace, set_instead_of_inc=True)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def incsubtensor(x, y, idx_list, inplace=False):
print >> sys.stderr, "tensor.incsubtensor is deprecated - please use inc_subtensor"
the_op = IncSubtensor(idx_list, inplace, set_instead_of_inc=False)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def set_subtensor(x, y):
"""Return x with the given subtensor overwritten by y.
Example: To replicate the numpy expression "r[10:] = 5", type
>>> new_r = set_subtensor(r[10:], 5)
:param x: symbolic variable for the lvalue of = operation
:param y: symbolic variable for the rvalue of = operation
:see: theano.tensor.basic.setsubtensor
"""
return inc_subtensor(x, y, set_instead_of_inc=True)
def inc_subtensor(x, y, set_instead_of_inc=False):
"""Return x with the given subtensor incremented by y.
:param x: the symbolic result of a Subtensor operation.
:param y: the amount by which to increment the subtensor in question
Example: To replicate the numpy expression "r[10:] += 5", type
>>> new_r = inc_subtensor(r[10:], 5)
:see: theano.tensor.basic.setsubtensor
"""
# retrieve idx_list from x.owner
if not isinstance(x.owner.op, Subtensor):
raise TypeError('x must be result of a subtensor operation')
the_op = IncSubtensor(x.owner.op.idx_list, False, set_instead_of_inc)
real_x = x.owner.inputs[0]
real_idxargs = x.owner.inputs[1:]
return the_op(real_x, y, *real_idxargs)
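The numpy behaviour that set_subtensor and inc_subtensor replicate functionally (returning a new tensor rather than mutating x) can be spelled out with plain numpy arrays; this is an illustrative sketch, not Theano code:

```python
import numpy as np

r = np.arange(20, dtype='float64')

# set_subtensor(r[10:], 5): like "r[10:] = 5", but without touching r
set_r = r.copy()
set_r[10:] = 5

# inc_subtensor(r[10:], 5): like "r[10:] += 5", but without touching r
inc_r = r.copy()
inc_r[10:] += 5

assert r[10] == 10.0                  # original is unchanged
assert np.all(set_r[10:] == 5)
assert np.all(inc_r[10:] == r[10:] + 5)
assert np.all(set_r[:10] == r[:10])   # untouched region preserved
```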
class IncSubtensor(Op):
"""Increment a subtensor.
...@@ -3125,6 +3080,8 @@ class Reshape(Op):
raise ValueError('Cannot reshape input of shape %s to shape %s' % (x.shape,shp))
def grad(self, (x, shp), (g_out,)):
return [reshape(g_out, shape(x), ndim=x.ndim), None]
def infer_shape(self, node, ishapes):
return [node.inputs[1]]
def reshape(x, newshape, ndim=None, name=None):
if ndim is None:
...@@ -3155,7 +3112,10 @@ class Flatten(Op):
def perform(self, node, (x,), (out,)):
outdim = self.outdim
if outdim == 1:
try:
out[0] = x.reshape(x.size)
except AttributeError:
out[0] = x.reshape((numpy.prod(x.shape),))
elif outdim == len(x.shape):
out[0] = x
else:
...@@ -3865,8 +3825,34 @@ def grad(cost, wrt, g_cost=None, consider_constant=[], warn_type=False):
class numeric_grad:
"""WRITEME"""
# Note on step sizes and tolerances:
#
# There is a relationship between the step size and the function value and the measurement
# error that is incurred due to rounding. The finite difference we measure is
# delta = f(x0) - f(x0+eps)
#
# For maximum precision, f should be close to zero.
# For every power of 2 that f departs from zero, we lose a bit of precision in delta.
#
# Even in this case of maximum accuracy, there is a tradeoff between stepsize and
# measurement error.
# Taking small steps allows us to measure large derivatives accurately, but longer steps
# are required to measure small derivatives accurately. However longer steps introduce
# bias into our measurement in general for non-linear functions.
#
# It would be interesting to have a version of numeric grad that used an adaptive stepsize.
#
# For now, we use a heuristic that catches very bad gradients, but is not perfectly
# accurate.
type_eps = {'float64': 1e-7,
'float32': 3e-3,
numpy.dtype('float64'):1e-7,
numpy.dtype('float32'):3e-3}
type_tol = {'float64': 1e-4,
numpy.dtype('float64'):1e-4,
'float32': 1e-1,
numpy.dtype('float32'):1e-1}
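The step-size tradeoff described in the comment above is easy to reproduce: with a forward difference, a large eps incurs truncation bias while a tiny eps loses bits to cancellation, so an intermediate eps (like the type_eps defaults) wins. A standalone numpy sketch, not part of the diff:

```python
import numpy as np

# Forward-difference estimate of d/dx exp(x) at x0 = 1 (exact answer: e).
x0 = 1.0
exact = np.exp(x0)

err = {}
for eps in (1e-2, 1e-7, 1e-13):
    approx = (np.exp(x0 + eps) - np.exp(x0)) / eps
    err[eps] = abs(approx - exact)

assert err[1e-7] < err[1e-2]    # large step: truncation bias dominates
assert err[1e-7] < err[1e-13]   # tiny step: floating-point cancellation dominates
```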
def __init__(self, f, pt, eps=None):
"""Return the gradient of f at pt.
...@@ -3942,12 +3928,29 @@ class numeric_grad:
self.gf = self.gf[0]
@staticmethod
def abs_rel_err(a,b,eps=1.0e-10):
"""Return a small number when a and b are close, relative to how big they are"""
return abs(a-b) / (abs(a)+abs(b)+eps)
def abs_rel_err(a,b,tol=None):
"""Return a small number when a and b are close, relative to how big they are.
Formula used: abs(a - b) / (abs(a) + abs(b) + tol)
`tol` defaults to a relatively generous value that is suitable for catching completely
wrong gradients (`self.type_tol`).
"""
if tol is None:
tol = __builtin__.max(numeric_grad.type_tol[a.dtype], numeric_grad.type_tol[b.dtype])
return abs(a-b) / (abs(a)+abs(b)+tol)
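Worked through on concrete numbers, the formula behaves as follows; this is a standalone restatement of abs_rel_err with a fixed default tol, not the class method itself:

```python
import numpy as np

def abs_rel_err(a, b, tol=1e-4):
    # same formula as numeric_grad.abs_rel_err, with a fixed default tol
    return abs(a - b) / (abs(a) + abs(b) + tol)

assert abs_rel_err(1.0, 1.0) == 0.0
assert abs_rel_err(0.0, 0.0) == 0.0    # tol prevents 0/0
assert np.isclose(abs_rel_err(1.0, 1.01, tol=0.0), 0.01 / 2.01)
```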
def abs_rel_errors(self, g_pt):
"""Return the abs rel error of gradient estimate `g_pt`
`g_pt` must be a list of ndarrays of the same length as self.gf, otherwise a ValueError
is raised.
Corresponding ndarrays in `g_pt` and `self.gf` must have the same shape or ValueError
is raised.
"""
if len(g_pt) != len(self.gf):
raise ValueError('argument has wrong number of elements', len(g_pt))
errs = []
...@@ -3955,7 +3958,12 @@ class numeric_grad:
if a.shape != b.shape:
raise ValueError('argument element %i has wrong shape %s' %(i,str((a.shape,
b.shape))))
errs.append(numeric_grad.abs_rel_err(a,b))
return errs
def max_err(self, g_pt):
"""Return the biggest relative error between g_pt and self.gf"""
errs = [numpy.max(e) for e in self.abs_rel_errors(g_pt)]
if numpy.all(numpy.isfinite(errs)):
return numpy.max(errs), numpy.argmax(errs)
else:
...@@ -4058,7 +4066,19 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps=None, tol=None, mode=None, cast
max_err, max_err_pos = num_grad.max_err(analytic_grad)
if max_err > tol:
raise verify_grad.E_grad(tol, num_grad, analytic_grad)
class GradientError(Exception):
"""This error is raised when a gradient is calculated, but incorrect."""
def __init__(self, tol, num_grad, analytic_grad):
self.num_grad = num_grad
self.analytic_grad = analytic_grad
self.tol = tol
def __str__(self):
max_errs = [numpy.max(e) for e in self.num_grad.abs_rel_errors(self.analytic_grad)]
return "GradientError: numeric gradient and analytic gradient differ by more than %f (%s)" %(
self.tol, max_errs)
def abs_rel_errors(self):
return self.num_grad.abs_rel_errors(self.analytic_grad)
verify_grad.E_grad = GradientError
"""This error is raised when a gradient is calculated, but incorrect."""
...@@ -618,7 +618,7 @@ class Elemwise(Op):
errormsg = 'Failed calling ufunc for op', self.scalar_op,\
'for params of shape', [arg.shape for arg in ufunc_args]
e.args = e.args + errormsg
raise
if ufunc.nout == 1: variables = [variables]
for variable, storage in zip(variables, output_storage):
if storage[0].shape:
......
...@@ -438,13 +438,13 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
y_idx = tensor.as_tensor_variable(y_idx)
if x.type.ndim != 2 \
or x.type.dtype not in ['float32', 'float64']:
raise ValueError('x must be 2-d tensor of floats', x.type)
if b.type.ndim != 1 \
or b.type.dtype not in ['float32', 'float64']:
raise ValueError('b must be 1-d tensor of floats', b.type)
if y_idx.type.ndim != 1 \
or y_idx.type.dtype not in ['int8', 'int16', 'int32', 'int64']:
raise ValueError('y_idx must be 1-d tensor of ints', y_idx.type)
# TODO: Is this correct? It used to be y, not y_idx
nll = tensor.TensorType(x.type.dtype,
......
...@@ -180,11 +180,12 @@ def local_dimshuffle_lift(node):
new_order = [x == 'x' and 'x' or inode.op.new_order[x] for x in op.new_order]
inplace = op.inplace and inode.op.inplace
iinput = inode.inputs[0]
if new_order == range(len(new_order)) and (len(new_order) == iinput.type.ndim):
return [iinput]
else:
return DimShuffle(iinput.type.broadcastable, new_order, inplace).make_node(iinput).outputs
register_canonicalize(local_dimshuffle_lift)
register_specialize(local_dimshuffle_lift)
...@@ -519,8 +520,8 @@ def local_fill_to_alloc(node):
# TODO: cut out un-necessary dimshuffles of v
rval = [T.alloc(T.cast(v, node.outputs[0].dtype), *shape_of[node.outputs[0]])]
#if rval[0].type != node.outputs[0].type:
#print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
assert rval[0].type == node.outputs[0].type, ('rval', rval[0].type,
'orig', node.outputs[0].type,
...@@ -1048,9 +1049,41 @@ def local_reshape_chain(node):
if not opt.check_chain(node, T.Reshape, T.Reshape):
return False
# TODO: this can permit a failing program to run by eliminating the lower
# reshape
return [node.op(node.inputs[0].owner.inputs[0], node.inputs[1])]
register_canonicalize(local_reshape_chain)
if 0:
# TODO: Test that this optimization works.
@register_canonicalize
@gof.local_optimizer([])
def local_scalar_reshape(node):
"""Eliminate reshape Ops whose inputs and outputs are scalars """
if isinstance(node.op, T.Reshape):
x, shp = node.inputs
if x.ndim == 0 and T.get_vector_length(shp)==0:
return [x]
if 0:
# TODO: Finish writing and testing this optimization.
# The idea is that if we can prove the output to this sum
# has a zero-size dimension, then it can be replaced by an appropriately typed and
# broadcasted zero.
@register_canonicalize
@gof.local_optimizer([])
def local_sum_over_empty(node):
if isinstance(node.op, T.Sum):
y, = node.outputs
y_shape = node.env.shape_feature.shape_of[y]
def tmp(thing):
try:
return T.get_constant_value(thing)
except (TypeError, ValueError), e:
print e, thing.owner.inputs[0]
return None
print 'LOCAL SUM EMPTY', [tmp(s) for s in y_shape]
##################
# Middleman cuts #
...@@ -1063,11 +1096,18 @@ def local_fill_cut(node):
If c.type == a.type.
"""
# this optimization is essentially for getting broadcasting to replace fill.
# This is always possible when using a Compound Elemwise operation,
# but it is not always possible without one (consider filling a large matrix with a scalar,
# and then adding another scalar. The only numbers that count are the two scalars, but we
# can't ignore the large matrix because it gives the shape of the result).
if not opt.check_chain(node, T.Elemwise):
return False
output = node.outputs[0]
try:
#reference is some input with the same type as the output but that is not produced by a fill
reference = [input
for input in node.inputs
if input.type == output.type and (not input.owner or input.owner.op != T.fill)][0]
...@@ -1086,7 +1126,12 @@ def local_fill_cut(node):
if new_inputs == node.inputs:
return False
return node.op.make_node(*new_inputs).outputs
rval = node.op(*new_inputs)
if isinstance(rval, gof.Variable):
return rval.owner.outputs
else:
return rval[0].owner.outputs
register_canonicalize(local_fill_cut)
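The comment in local_fill_cut can be restated with plain numpy: when a filled matrix only feeds an elemwise op, the matrix contributes nothing but its shape, which broadcasting supplies for free. A hypothetical numpy sketch of the reasoning:

```python
import numpy as np

big = np.ones((100, 100))
a, b = 2.0, 3.0

# "fill(big, a) + b": the values depend only on the two scalars...
filled = np.full(big.shape, a) + b

# ...so the fill can be cut, as long as something still supplies the shape.
cut = np.broadcast_to(a + b, big.shape)

assert filled.shape == cut.shape == (100, 100)
assert np.all(filled == cut)
```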
......
import unittest
import theano
from theano.tensor import *
from theano.tests import unittest_tools as utt
class TestRealImag(unittest.TestCase):
...@@ -15,9 +16,60 @@ class TestRealImag(unittest.TestCase):
x = dvector()
rng = numpy.random.RandomState(23)
xval = rng.randn(10)
try:
numpy.all( 0 == theano.function([x], imag(x))(xval))
assert 0
except TypeError:
pass
try:
numpy.all( xval == theano.function([x], real(x))(xval))
assert 0
except TypeError:
pass
def test_cast(self):
x = zvector()
self.failUnlessRaises(TypeError, cast, x, 'int32')
def test_complex(self):
rng = numpy.random.RandomState(2333)
m = fmatrix()
c = complex(m[0], m[1])
assert c.type == cvector
r,i = [real(c), imag(c)]
assert r.type == fvector
assert i.type == fvector
f = theano.function([m], [r,i] )
mval = numpy.asarray(rng.randn(2,5), dtype='float32')
rval, ival = f(mval)
assert numpy.all(rval == mval[0]), (rval,mval[0])
assert numpy.all(ival == mval[1]), (ival, mval[1])
def test_complex_grads(self):
def f(m):
c = complex(m[0], m[1])
return .5 * real(c) + .9 * imag(c)
rng = numpy.random.RandomState(9333)
mval = numpy.asarray(rng.randn(2,5))
utt.verify_grad(f, [mval])
def test_polar_grads(self):
def f(m):
c = complex_from_polar(abs(m[0]), m[1])
return .5 * real(c) + .9 * imag(c)
rng = numpy.random.RandomState(9333)
mval = numpy.asarray(rng.randn(2,5))
utt.verify_grad(f, [mval])
def test_abs_grad(self):
def f(m):
c = complex(m[0], m[1])
return .5 * abs(c)
rng = numpy.random.RandomState(9333)
mval = numpy.asarray(rng.randn(2,5))
utt.verify_grad(f, [mval])
...@@ -63,3 +63,15 @@ def verify_grad(op, pt, n_tests=2, rng=None, *args, **kwargs):
rng = numpy.random
T.verify_grad(op, pt, n_tests, rng, *args, **kwargs)
#
# This supports the following syntax:
#
# try:
# verify_grad(...)
# except verify_grad.E_grad, e:
# print e.num_grad.gf
# print e.analytic_grad
# ...
#
#
verify_grad.E_grad = T.verify_grad.E_grad