Commit 755188ae authored by James Bergstra

merge

...@@ -150,4 +150,11 @@ Config Attributes
A directory with bin/, lib/, include/ folders containing cuda utilities.
.. attribute:: gcc.cxxflags
Default: ""
Extra parameters to pass to gcc when compiling. Extra include paths,
library paths, configuration options, etc.
==================
Advanced Indexing
==================
Continue the Advanced Indexing project that is on either github or bitbucket.
=============================================================================
Proposal for New Linking Strategy supporting Lazy Evaluation: Op.make_thunk
=============================================================================
.. note::
Proposal made June 2010.
Motivation
===========
Conditional evaluation is useful to describe many optimization algorithms where
the update expressions depend on internal state.
True conditional evaluation requires lazy graph evaluation.
Without lazy graph evaluation, the runtime of a graph can be exponential in the
number of conditionals instead of linear. No one waits an exponential amount of
time, so people work around this problem in various other ways, but it would be
better if Theano had an 'if-then-else' expression (call it cond).
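To make the exponential-vs-linear claim concrete, here is a toy counting sketch (all names are illustrative, not Theano API): an eager evaluator computes both branches of every nested conditional, while a lazy evaluator computes only the taken branch.

```python
def make_nested_cond(depth):
    """Build a nested if-then-else expression as plain tuples."""
    if depth == 0:
        return ('leaf',)
    return ('cond', ('pred',),
            make_nested_cond(depth - 1),   # then-branch
            make_nested_cond(depth - 1))   # else-branch

def eval_eager(expr, counter):
    """Evaluate every node, including both branches of a cond."""
    counter[0] += 1
    if expr[0] == 'cond':
        _, pred, then_, else_ = expr
        eval_eager(pred, counter)
        then_val = eval_eager(then_, counter)
        eval_eager(else_, counter)         # wasted work on the untaken branch
        return then_val
    return 1

def eval_lazy(expr, counter):
    """Evaluate the predicate first, then only the taken branch."""
    counter[0] += 1
    if expr[0] == 'cond':
        _, pred, then_, else_ = expr
        taken = then_ if eval_lazy(pred, counter) else else_
        return eval_lazy(taken, counter)
    return 1
```

At nesting depth 10 the eager evaluator visits 3070 nodes and the lazy one visits 21.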
A lazily-evaluated 'cond' requires a linker to use a different method for
interacting with Ops. Neither the current perform() nor c_code() approaches
support lazy evaluation.
Why do perform (and c_code) not handle lazy evaluation?
The syntax of the current perform() could be extended to be compatible with lazy
evaluation. For example, the linker could set all inputs to None, and use the
return value from perform() to see which inputs are required. But all the Ops
that currently implement a perform() function would be broken because their
perform implementations do not ask for inputs before using them. I don't see a
way around this. The same restriction applies to c_code.
The way around this is to introduce a new interface for the linker to talk to
Ops. I propose that we add an Op.make_thunk() that returns an object satisfying
this interface.
At the same time, another problem arises as we try to integrate PyCUDA Ops.
We would like to use Op.perform() to drive the GPU, but it is natural to
compile the CUDA kernel at some point after make_node() and before perform().
The point where the linker makes a thunk from the Op seems like a natural
choice.
A third motivation for introducing an Op.make_thunk function is to clarify the
relationship between Ops (the classes you implement in Python) and mathematical
operations (the more abstract things in terms of which you think when using
Theano).
I propose that *technically* an Op, when conditioned by particular inputs,
generates *at most one implementation* that defines the behaviour of that Op.
In *intuitive terms*, the abstract mathematical steps that we sometimes talk about regarding Theano
still correspond to Ops -- it's just that these Ops have relatively generic
implementations.
The process of optimization is to specialize those generic implementations
by using information from the rest of the graph.
If we accept that an Op corresponds to at most one implementation,
then it makes sense to ask an Op instance to expose that implementation via a
standard interface (Op.make_thunk).
It does not make sense to pass arguments to Op.make_thunk such as 'py' or "c|py"
to tell the Op which implementation to use. The Op instance represents just one
implementation, and flags such as 'py' or 'c|py' should be passed to the Op's
constructor.
Proposal: Op.make_thunk
==========================
There are two interface items I propose to add. The first is a Thunk object
(which we have never had before), and the second is a new function (make_thunk)
in the PureOp class (a superclass of Op) that will return a Thunk.
.. code-block:: python

    class Thunk(object):
        """Abstract class / interface.

        It describes the interface used by a Theano linker to execute the
        nodes in a graph.  Thunk instances are in correspondence with the
        Apply instances that remain in the final form of the graph after
        optimization.
        """

        lazy = property(...,
            doc="""True means the thunk may trigger lazy evaluation.
            False means the thunk always requires all inputs and computes
            all outputs.  Consequently, False implies that __call__ always
            returns None.
            """)

        def __call__(self):
            """Compute some number (possibly zero) of outputs.  If the thunk
            cannot compute all of its outputs for lack of inputs, return the
            list of input indexes that are required.  The linker will
            typically compute those required inputs and then call __call__
            again.

            The thunk is finished when it returns an empty list or None.
            """
.. code-block:: python

    class PureOp(object):  # recall: Op inherits from PureOp

        def make_node(self, *inputs):  # leave alone
            ...

        def perform(self, node, inputs, output_storage):  # move to `Op` class
            ...

        def make_thunk(self, node,  # new function
                       input_computed, output_computed,
                       input_registers, output_registers):
            """
            :type node: Apply instance
            :param node: previous rval from make_node(self, *inputs)

            :type input_computed: list of len-1 lists, with values in (0, 1).
            :param input_computed: at runtime, input_computed[i][0] == 1
                implies that the i'th input has been computed, is stored at
                input_registers[i][0], and is available for use.
                Otherwise the content of input_registers[i][0] is undefined.

            :type output_computed: list of len-1 lists, with values in (0, 1).
            :param output_computed: at runtime, output_computed[i][0] == 1
                implies that the i'th output has already been computed and
                stored at output_registers[i][0].
                Otherwise, output_registers[i][0] will contain either None or
                a value that was previously computed by this thunk.

            :type input_registers: list of len-1 lists
            :type output_registers: list of len-1 lists
            :param input_registers: the i'th input can be read from
                input_registers[i][0] when input_computed[i][0] == 1.
            :param output_registers: the i'th output must be stored to
                output_registers[i][0], at which point the thunk must set
                output_computed[i][0] = 1.

            :returns: a Thunk (subclass) instance
            """
The Thunk class can have subclasses that use Op.perform and Op.c_code as we use
them now. The interface of Thunk is backward-compatible with the thunks built
by the CLinker and PerformLinker. If a graph contains zero Thunks with
lazy==True, then the current Linkers will continue to work.
The new Thunk interface will support a new LazyLinker that can run programs for
which some thunks have lazy==True.
The Thunk class can have subclasses that are implemented in C, which might help
performance.
========================
DP Instruction Selection
========================
Read Ch 9 of Modern Compiler Implementation about instruction selection.
We should probably be doing graph optimization totally differently:
Optimizations *only add* new ways of implementing something; they do not replace
the old way. Every graph node (apply) has a cost, and Dynamic Programming (DP)
is used to select the minimum-cost graph.
The advantage of this approach is that optimizations do not have to run in such
a careful order, and graph selection would be much faster.
Think about how aliasing and destructive operations (the destroy-handler) would
fit in this approach.
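The selection step described above can be sketched as a single bottom-up pass (illustrative names; this assumes a tree, so shared subexpressions are not deduplicated, and it ignores tiles that cover several nodes at once):

```python
def select_min_cost(nodes):
    """nodes: (name, input_names, alternatives) triples in topological order,
    where alternatives is a list of (impl_name, own_cost) pairs that
    optimizations have accumulated.  Returns a dict
    name -> (chosen_impl, total_subtree_cost)."""
    best = {}
    for name, input_names, alternatives in nodes:
        # inputs come earlier in topological order, so their costs are final
        input_cost = sum(best[i][1] for i in input_names)
        impl, own_cost = min(alternatives, key=lambda alt: alt[1])
        best[name] = (impl, own_cost + input_cost)
    return best
```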
=====================
Intermediate Language
=====================
It would be nice to be able to use Theano from other languages.
This requires two things: a way to communicate the expression to the Theano
compiler, and a way to pass data to and from the compiled function.
One way to do this would be to define a textual representation of Theano graphs.
A Scheme-like language seems appropriate. Perhaps plain Scheme itself would do.
How to pass shared variables?
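A minimal sketch of such a Scheme-like textual form (hypothetical syntax, not an existing Theano format), round-tripping a tuple-based expression graph:

```python
def to_sexpr(expr):
    """Serialize a tuple-based expression ('op', arg, ...) to an s-expression."""
    if isinstance(expr, tuple):
        op, args = expr[0], expr[1:]
        return '(%s %s)' % (op, ' '.join(to_sexpr(a) for a in args))
    return str(expr)

def parse_sexpr(text):
    """Parse an s-expression back into nested tuples with string leaves."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    def read(pos):
        if tokens[pos] == '(':
            op, args, pos = tokens[pos + 1], [], pos + 2
            while tokens[pos] != ')':
                arg, pos = read(pos)
                args.append(arg)
            return (op,) + tuple(args), pos + 1
        return tokens[pos], pos + 1
    return read(0)[0]
```

Shared variables would need additional syntax (e.g. a binding form) on top of this.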
=================
MongoDB DLL Cache
=================
In network environments (like at DIRO on NFS3), a distributed DB like mongo or
couch could be faster and more robust to concurrency than the $HOME/.theano
directory cache. Also, a single cache could be shared by multiple users. This
would mean less compilation time for everyone, and fewer stale-cache problems.
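The lookup itself is independent of the backing store. A sketch keyed by a hash of (source, compiler flags), with a dict standing in for the mongo/couch collection (all names hypothetical):

```python
import hashlib

class SharedDLLCache(object):
    """Content-addressed compile cache; `store` stands in for a DB collection."""

    def __init__(self, store=None):
        self.store = {} if store is None else store

    def key(self, c_source, cxxflags):
        h = hashlib.sha1()
        h.update(c_source.encode('utf-8'))
        h.update(cxxflags.encode('utf-8'))
        return h.hexdigest()

    def get_or_compile(self, c_source, cxxflags, compile_fn):
        k = self.key(c_source, cxxflags)
        if k not in self.store:
            self.store[k] = compile_fn(c_source, cxxflags)  # the expensive step
        return self.store[k]
```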
...@@ -4,6 +4,7 @@ Automatic updates
.. note::
Proposed 2010 01 13
Done 2010 04 ??
The Module version of RandomStreams could arrange for the automatic update of
......
=======
OpenCL
=======
Migrate the GPU code-generators to the PyCUDA style, and eventually to OpenCL.
This mainly means using a different kind of code-generation strategy: the
kernel itself is compiled, but the calling code remains in Python or Cython. We
would no longer generate entire C files this way, and would no longer use the
CLinker for GPU code.
Proactive Merging
=================
Merge is done now as an optimization.
But if merging were done at graph construction time, things like #476 would work.
Additionally, memoizing at graph construction time would make it possible to
define recursive formulas with recursive Python functions (e.g. Fibonacci).
Currently the merge optimization would reduce the Fibonacci graph to linear
size, but the program built to express it would be exponential.
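A sketch of construction-time merging (a hypothetical `MergingBuilder`, not Theano API): apply() memoizes on (op, inputs), so the recursive Fibonacci definition builds a graph whose size is linear in n, even though naive construction would build an exponential-size graph.

```python
class MergingBuilder(object):
    """Hypothetical graph builder that merges at construction time:
    apply() returns the existing node for an identical (op, inputs) pair."""

    def __init__(self):
        self.cache = {}

    def apply(self, op, *inputs):
        key = (op, inputs)
        if key not in self.cache:
            self.cache[key] = ('apply', op, inputs)
        return self.cache[key]

def fib_graph(builder, n):
    """Recursive Fibonacci expression; merging keeps the graph linear in n."""
    if n < 2:
        return builder.apply('const', n)
    return builder.apply('add',
                         fib_graph(builder, n - 1),
                         fib_graph(builder, n - 2))
```

The Python recursion itself could also be memoized to make construction time linear; the sketch only deduplicates the resulting graph nodes.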
========================
Add tensor attributes
========================
Size, shape, psd, symmetric, triangular, contiguous.
Add these attributes to the TensorType, with the option that any of them be
'unknown'.
Add attributes that are useful for optimizations, or useful for code generation.
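A sketch of what such attributes might look like (a hypothetical class, not a proposal for the actual TensorType API), with every attribute defaulting to 'unknown':

```python
UNKNOWN = 'unknown'

class TensorTypeWithAttrs(object):
    """Tensor type carrying optional structural attributes for optimizations."""

    def __init__(self, dtype, ndim, shape=UNKNOWN, symmetric=UNKNOWN,
                 psd=UNKNOWN, triangular=UNKNOWN, contiguous=UNKNOWN):
        self.dtype = dtype
        self.ndim = ndim
        self.shape = shape
        self.symmetric = symmetric
        self.psd = psd
        self.triangular = triangular
        self.contiguous = contiguous

    def know_more(self, **attrs):
        """Return a refined copy; refuse to contradict already-known info."""
        new = dict(self.__dict__)
        for k, v in attrs.items():
            if new[k] is not UNKNOWN and new[k] != v:
                raise ValueError('conflicting attribute: %s' % k)
            new[k] = v
        return TensorTypeWithAttrs(**new)
```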
...@@ -104,6 +104,47 @@ class BadCLinkerOutput(DebugModeError):
print >> sio, " val_py :", self.val_py
print >> sio, " val_c :", self.val_c
print >> sio, " op :", self.offending_op()
try:
ssio = StringIO()
print >> ssio, " PyValue shape, dtype, strides, min, max:",
print >> ssio, self.val_py.shape,
print >> ssio, self.val_py.dtype,
print >> ssio, self.val_py.strides
print >> ssio, self.val_py.min(),
print >> ssio, self.val_py.max(),
# only if all succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
try:
ssio = StringIO()
print >> ssio, " CValue shape, dtype, strides, min, max:",
print >> ssio, self.val_c.shape,
print >> ssio, self.val_c.dtype,
print >> ssio, self.val_c.strides,
print >> ssio, self.val_c.min(),
print >> ssio, self.val_c.max(),
# only if all succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
try:
ov=numpy.asarray(self.val_c)
nv=numpy.asarray(self.val_py)
ssio = StringIO()
print >> ssio, " Max Abs Diff: ", numpy.max(numpy.absolute(nv-ov))
print >> ssio, " Mean Abs Diff: ", numpy.mean(numpy.absolute(nv-ov))
print >> ssio, " Median Abs Diff: ", numpy.median(numpy.absolute(nv-ov))
print >> ssio, " Std Abs Diff: ", numpy.std(numpy.absolute(nv-ov))
reldiff = numpy.absolute(nv-ov) / (numpy.absolute(nv)+numpy.absolute(ov))
print >> ssio, " Max Rel Diff: ", numpy.max(reldiff)
print >> ssio, " Mean Rel Diff: ", numpy.mean(reldiff)
print >> ssio, " Median Rel Diff: ", numpy.median(reldiff)
print >> ssio, " Std Rel Diff: ", numpy.std(reldiff)
# only if all succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
return sio.getvalue()
class BadOptimization(DebugModeError):
......
...@@ -858,6 +858,7 @@ class CLinker(link.Linker):
if config.gcc.cxxflags:
sig.append(config.gcc.cxxflags)
error_on_play = [False]
def in_sig(i, topological_pos, i_idx):
# assert that every input to every node is one of:
# - an env input
...@@ -868,7 +869,14 @@ class CLinker(link.Linker):
# yield a 'position' that reflects its role in code_gen()
if isinstance(i, graph.Constant): #orphans
if id(i) not in constant_ids:
isig = (i.signature(), topological_pos, i_idx)
try:
hash(isig)
except: #generic constants don't have a hashable signature
error_on_play[0] = True
return None
constant_ids[id(i)] = isig
else:
isig = constant_ids[id(i)]
#print 'SIGNATURE', i.signature()
#return i.signature()
...@@ -903,6 +911,12 @@ class CLinker(link.Linker):
for ipos,i in enumerate(node.inputs)),
tuple(o in no_recycling for o in node.outputs)))
if error_on_play[0]:
# if one of the signatures is not hashable
# then bypass the cache mechanism and
# compile fresh every time
return None
op_pos[node] = node_pos
env_computed_set.update(node.outputs)
......
...@@ -238,6 +238,7 @@ def refresh_lock(lock_file):
lock_write = open(lock_file, 'w')
lock_write.write(unique_id + '\n')
lock_write.close()
print lock_file
return unique_id
class Unlocker(object):
......
...@@ -7,6 +7,7 @@ if sys.version_info[:2] >= (2,5):
import toolbox
import graph
from theano.gof import deque
from env import InconsistencyError
...@@ -44,6 +45,92 @@ class DestroyHandler(object):
return self.map[env].orderings(env)
def _dfs_toposort(i, r_out, orderings):
"""
i - list of inputs
o - list of outputs
orderings - dict of additions to the normal inputs and outputs
Returns nothing. Raises exception for graph with cycles
"""
#this is hard-coded reimplementation of functions from graph.py
# reason: go faster, prepare for port to C.
assert isinstance(r_out, (tuple, list, deque))
# TODO: For more speed - use a defaultdict for the orderings
iset = set(i)
if 0:
def expand(obj):
rval = []
if obj not in iset:
if isinstance(obj, graph.Variable):
if obj.owner:
rval = [obj.owner]
if isinstance(obj, graph.Apply):
rval = list(obj.inputs)
rval.extend(orderings.get(obj, []))
else:
assert not orderings.get(obj, [])
return rval
expand_cache = {}
# reachable, clients = stack_search( deque(r_out), deps, 'dfs', True)
start=deque(r_out)
rval_set = set()
rval_set.add(id(None))
rval_list = list()
expand_inv = {}
sources = deque()
while start:
l = start.pop()# this makes the search dfs
if id(l) not in rval_set:
rval_list.append(l)
rval_set.add(id(l))
if l in iset:
assert not orderings.get(l, [])
expand_l = []
else:
try:
if l.owner:
expand_l = [l.owner]
else:
expand_l = []
except AttributeError:
expand_l = list(l.inputs)
expand_l.extend(orderings.get(l, []))
if expand_l:
for r in expand_l:
expand_inv.setdefault(r, []).append(l)
start.extend(expand_l)
else:
sources.append(l)
expand_cache[l] = expand_l
assert len(rval_list) == len(rval_set)-1
rset = set()
rlist = []
while sources:
node = sources.popleft()
if node not in rset:
rlist.append(node)
rset.add(node)
for client in expand_inv.get(node, []):
expand_cache[client] = [a for a in expand_cache[client] if a is not node]
if not expand_cache[client]:
sources.append(client)
if len(rlist) != len(rval_list):
raise ValueError('graph contains cycles')
#return [o for o in rlist if isinstance(o, graph.Apply)]
def getroot(r, view_i):
"""
For views: Return the non-view variable which is ultimately viewed by r.
...@@ -303,7 +390,8 @@ class DestroyHandlerHelper2(toolbox.Bookkeeper):
raise
#print 'orderings:', ords
try:
### graph.io_toposort(env.inputs, env.outputs, ords)
_dfs_toposort(env.inputs, env.outputs, ords)
except ValueError, e:
#print 'not passing.', ords
if 'cycles' in str(e):
......
...@@ -9,9 +9,10 @@ compatible with `gof`'s :doc:`graph` routines.
__docformat__ = "restructuredtext en"
import utils
import traceback
from theano import config
class CLinkerObject(object):
"""Standard elements of an Op or Type used with the CLinker
"""
...@@ -320,9 +321,7 @@ class PureOp(object):
"""
node = self.make_node(*inputs, **kwargs)
self.add_tag_trace(node)
if self.default_output is not None:
return node.outputs[self.default_output]
else:
...@@ -331,6 +330,9 @@ class PureOp(object):
else:
return node.outputs
# Convenience so that subclass implementers don't have to import utils
# just to self.add_tag_trace
add_tag_trace = staticmethod(utils.add_tag_trace)
#########################
# Python implementation #
......
...@@ -127,6 +127,8 @@ class SeqOptimizer(Optimizer, list):
for optimizer in self:
try:
optimizer.optimize(env)
except AssertionError: # do not catch Assertion failures
raise
except Exception, e:
if self.failure_callback:
self.failure_callback(e, self, optimizer)
......
...@@ -5,8 +5,7 @@ __docformat__ = "restructuredtext en"
import copy
import utils
from utils import MethodNotDefined, object2
import graph
from theano import config
######## ########
...@@ -202,6 +201,9 @@ class PureType(object):
"""
Variable = graph.Variable #the type that will be created by call to make_variable.
Constant = graph.Constant #the type that will be created by call to make_constant
def filter(self, data, strict = False):
"""Required: Return data or an appropriately wrapped/converted data.
...@@ -233,8 +235,11 @@ class PureType(object):
A pretty string for printing and debugging.
"""
return self.Variable(self, name = name)
def make_constant(self, value, name=None):
return self.Constant(type=self, data=value, name=name)
def __call__(self, name = None):
"""Return a new `Variable` instance of Type `self`.
...@@ -244,11 +249,7 @@ class PureType(object):
A pretty string for printing and debugging.
"""
return utils.add_tag_trace(self.make_variable(name))
def values_eq(self, a, b):
"""
...@@ -319,9 +320,11 @@ class Type(object2, PureType, CLinkerType):
"""
class SingletonType(Type):
"""Convenient base class for a Type subclass with no attributes.
It saves having to implement __eq__ and __hash__.
"""
__instance = None
def __new__(cls):
if cls.__instance is None:
...@@ -378,14 +381,5 @@ class Generic(SingletonType):
Py_INCREF(py_%(name)s);
""" % locals()
generic = Generic()
...@@ -2,7 +2,18 @@
# import op
# import variable
from theano import config
import re, os, traceback
def add_tag_trace(thing):
"""Add tag.trace to a node or variable.
The argument is returned after being modified in place.
"""
limit = config.traceback.limit
if limit == -1: limit = None
thing.tag.trace = traceback.extract_stack(limit=limit)[:-1]
return thing
def hashgen():
hashgen.next += 1
......
...@@ -53,6 +53,10 @@ def debugprint(obj, depth=-1, print_type=False, file=None):
results_to_print.extend(obj.outputs)
elif isinstance(obj, Function):
results_to_print.extend(obj.maker.env.outputs)
elif isinstance(obj, (list, tuple)):
results_to_print.extend(obj)
else:
raise TypeError("debugprint cannot print an object of this type", obj)
for r in results_to_print:
debugmode.debugprint(r, depth=depth, done=done, print_type=print_type, file=_file)
if file is _file:
......
...@@ -149,17 +149,15 @@ class GpuGemm(Op):
"""
implement the gemm on the gpu.
"""
def __init__(self, inplace):
self.__setstate__({'inplace':inplace})
def __str__(self):
if self.inplace:
return 'GpuGemm{inplace}'
else:
return 'GpuGemm{no_inplace}'
def __eq__(self, other):
return (type(self) == type(other)\
...@@ -183,7 +181,7 @@ class GpuGemm(Op):
return Apply(self, [z, a, x, y, b], [z.type()])
def c_code_cache_version(self):
return (3,)
def c_code(self, node, name, inputs, outputs, sub):
#z_out = alpha * dot(x,y) + beta * z_in
...@@ -192,8 +190,9 @@ class GpuGemm(Op):
z_in, a, x, y, b = inputs
z_out, = outputs
fail = sub['fail']
sio = StringIO.StringIO()
print >> sio, """
#define REAL float
float %(name)s_a = (%(a)s->descr->type_num == PyArray_FLOAT)
...@@ -205,51 +204,48 @@ class GpuGemm(Op):
: (REAL)(((double*)%(b)s->data)[0]);
#undef REAL
"""
if self.inplace:
print >> sio, """
Py_XDECREF(%(z_out)s);
%(z_out)s = %(z_in)s;
Py_INCREF(%(z_out)s);
"""
else:
print >> sio, """
if (!%(z_out)s
|| (%(z_out)s->nd != 2)
|| (CudaNdarray_HOST_DIMS(%(z_out)s)[0] != CudaNdarray_HOST_DIMS(%(z_in)s)[0])
|| (CudaNdarray_HOST_DIMS(%(z_out)s)[1] != CudaNdarray_HOST_DIMS(%(z_in)s)[1])
)
{
Py_XDECREF(%(z_out)s);
%(z_out)s = (CudaNdarray*)CudaNdarray_Copy(%(z_in)s);
if (!%(z_out)s)
{
%(fail)s;
}
}
else
{
if (CudaNdarray_CopyFromCudaNdarray(%(z_out)s, %(z_in)s))
{
%(fail)s;
}
}
"""
print >> sio, """
if (CudaNdarray_gemm(%(name)s_a, %(x)s, %(y)s, %(name)s_b, %(z_out)s))
{
%(fail)s;
}
"""
return sio.getvalue() % locals()
gpu_gemm_no_inplace = GpuGemm(inplace=False)
gpu_gemm_inplace = GpuGemm(inplace=True)
##
# Not really a BLAS operation, but whatever.
......
...@@ -558,6 +558,10 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape)
free(rval_dims);
return NULL;
}
if (rval_size==0)
{
return CudaNdarray_NewDims(rval_nd, rval_dims);
}
if(CudaNdarray_is_c_contiguous(self))
{
...@@ -1019,6 +1023,44 @@ CudaNdarray_inplace_add(PyObject* py_self, PyObject * py_other)
Py_INCREF(py_self);
return py_self;
}
case 5:
{
dim3 n_blocks(
std::min(CudaNdarray_HOST_DIMS(self)[1], NUM_VECTOR_OP_BLOCKS),
CudaNdarray_HOST_DIMS(self)[2]
);
while (n_blocks.x * n_blocks.y > NUM_VECTOR_OP_BLOCKS) n_blocks.y /= 2;
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(self)[3], NUM_VECTOR_OP_THREADS_PER_BLOCK)
);
for (int i = 0; i < CudaNdarray_HOST_DIMS(self)[0]; ++i)
{
k_iAdd_4<<<n_blocks, n_threads>>>(
CudaNdarray_HOST_DIMS(self)[1],
CudaNdarray_HOST_DIMS(self)[2],
CudaNdarray_HOST_DIMS(self)[3],
CudaNdarray_HOST_DIMS(self)[4],
CudaNdarray_DEV_DATA(self) + i * CudaNdarray_HOST_STRIDES(self)[0],
CudaNdarray_HOST_STRIDES(self)[1],
CudaNdarray_HOST_STRIDES(self)[2],
CudaNdarray_HOST_STRIDES(self)[3],
CudaNdarray_HOST_STRIDES(self)[4],
CudaNdarray_DEV_DATA(other) + i * CudaNdarray_HOST_STRIDES(other)[0],
CudaNdarray_HOST_STRIDES(other)[1],
CudaNdarray_HOST_STRIDES(other)[2],
CudaNdarray_HOST_STRIDES(other)[3],
CudaNdarray_HOST_STRIDES(other)[4]);
CNDA_THREAD_SYNC;
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iAdd", cudaGetErrorString(err));
return NULL;
}
}
Py_INCREF(py_self);
return py_self;
}
}
PyErr_Format(PyExc_NotImplementedError, "inplace_add w nd=%i\n", self->nd);
...@@ -1370,7 +1412,7 @@ CudaNdarray_setitem(PyObject *o, PyObject *key, PyObject *v)
{
PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_setitem: syncing structure to device failed");
Py_DECREF(rval);
return -1;
}
CudaNdarray *viewCopyForComparison =
......
...@@ -2,12 +2,13 @@ import sys
import theano
import numpy
from theano import scalar as scal
from theano import tensor, compile, gof
from theano.gof import local_optimizer, EquilibriumDB, SequenceDB, Optimizer, toolbox, DestroyHandler
from theano.sandbox.cuda.basic_ops import *
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda.blas import (gpu_dot22, gpu_dot22scalar, gpu_gemm_inplace,
gpu_gemm_no_inplace, GpuConv)
from theano.sandbox.cuda.blas import GpuDownsampleFactorMax, GpuDownsampleFactorMaxGrad
from theano.sandbox.cuda.nnet import (
GpuCrossentropySoftmaxArgmax1HotWithBias,
...@@ -187,34 +188,25 @@ def local_gpu_dot22scalar(node):
@local_optimizer([])
def local_gpu_gemm(node):
"""
Works for both the inplace and the non-inplace gemm.
gpu_from_host(gemm) -> gpu_gemm(gpu_from_host)
gemm(host_from_gpu) -> host_from_gpu(gpu_gemm)
"""
gemms = {tensor.blas.gemm_inplace: gpu_gemm_inplace,
tensor.blas.gemm_no_inplace: gpu_gemm_no_inplace}
if node.op == gpu_from_host:
host_input = node.inputs[0]
if host_input.owner and host_input.owner.op in gemms:
op = host_input.owner.op
z, a, x, y, b = host_input.owner.inputs
return [gemms[op](gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b)]
if node.op in gemms:
z, a, x, y, b = node.inputs
x_on_gpu = (x.owner and x.owner.op == host_from_gpu)
y_on_gpu = (y.owner and y.owner.op == host_from_gpu)
z_on_gpu = (z.owner and z.owner.op == host_from_gpu)
if x_on_gpu or y_on_gpu or z_on_gpu:
return [host_from_gpu(gemms[node.op](gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b))]
return False return False
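The change above collapses two near-identical if/elif branches into a dict that maps each CPU gemm variant to its GPU counterpart, so a single code path handles both. The table-driven dispatch pattern can be sketched in isolation (the op names here are stand-in strings, not Theano's actual op objects):

```python
# Table-driven op substitution: map each "host" op to its "gpu" twin,
# so one membership test replaces a chain of if/elif comparisons.
def make_rewriter(table):
    def rewrite(op, args):
        if op in table:               # single lookup covers every variant in the table
            return (table[op], args)  # substitute the op, keep the arguments
        return None                   # not our op: decline, as local_gpu_gemm returns False
    return rewrite

gemms = {"gemm_inplace": "gpu_gemm_inplace",
         "gemm_no_inplace": "gpu_gemm_no_inplace"}
rewrite = make_rewriter(gemms)

print(rewrite("gemm_inplace", ("z", "a")))  # substituted with the GPU variant
print(rewrite("dot22", ("x", "y")))         # declined: not in the table
```

Adding support for a new gemm variant then means adding one dict entry instead of another elif branch.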
@register_opt()
@@ -234,8 +226,44 @@ def local_gpu_sum(node):
            gsum = GpuSum(reduce_mask)
            pattern = (''.join(str(i) for i in reduce_mask))
            if hasattr(gsum, 'c_code_reduce_%s' % pattern):
-               return [host_from_gpu(gsum(gpu_from_host(x)))]
+               rval = host_from_gpu(gsum(gpu_from_host(x)))
+               if rval.type == node.outputs[0].type:
+                   return [rval]
+               else:
+                   print >> sys.stderr, "WARNING: local_gpu_sum got type wrong"
+                   return None
            else:
+               # Try to make a simpler pattern based on reshaping.
+               # The principle is that if two adjacent dimensions have the same value in
+               # the reduce_mask, then we can reshape to make them a single dimension, do
+               # the sum, and then reshape to get them back.
+               shape_of = node.env.shape_feature.shape_of
+               x_shape = shape_of[x]
+               new_in_shp = [x_shape[0]]
+               new_mask = [reduce_mask[0]]
+               for i in range(1, x.type.ndim):
+                   if reduce_mask[i] == reduce_mask[i - 1]:
+                       new_in_shp[-1] *= x_shape[i]
+                   else:
+                       new_mask.append(reduce_mask[i])
+                       new_in_shp.append(x_shape[i])
+               pattern = (''.join(str(i) for i in new_mask))
+               new_gsum = GpuSum(new_mask)
+               if hasattr(new_gsum, 'c_code_reduce_%s' % pattern):
+                   reshaped_x = x.reshape(tensor.stack(*new_in_shp))
+                   sum_reshaped_x = host_from_gpu(new_gsum(gpu_from_host(reshaped_x)))
+                   unreshaped_sum = sum_reshaped_x.reshape(tensor.stack(*shape_of[node.outputs[0]]))
+                   if unreshaped_sum.type == node.outputs[0].type:
+                       return [unreshaped_sum]
+                   else:
+                       print >> sys.stderr, "WARNING: local_gpu_sum got type wrong"
+                       return None
                raise Exception("GpuSum don't have implemented the pattern", pattern)
        return False
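The reshape trick added above can be checked with plain NumPy: summing over adjacent axes that share a reduce-mask value is the same as collapsing them into one axis, summing, and (if needed) reshaping back. A minimal sketch:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(5, 4, 3)            # reduce_mask [0, 1, 1]: keep axis 0, sum axes 1 and 2

direct = x.sum(axis=(1, 2))      # the "011" pattern, summed directly

# Collapse the two adjacent reduced axes into one, giving the simpler "01" pattern,
# for which a specialized kernel is more likely to exist.
collapsed = x.reshape(5, 4 * 3).sum(axis=1)

assert np.allclose(direct, collapsed)
```

The optimization does exactly this, but symbolically: it builds the collapsed shape from the graph's shape feature and reshapes the GPU result back to the expected output shape.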
@@ -34,11 +34,14 @@ def test_sum():
    test sum pattern 1, 11, 10, 01, 100, 110, 011, 001, 111, 0011, 0111, 1011, 1111
    TODO: test with broadcast
    """
-   for shape, pattern in [((0,),[0]),((5,),[0]),
+   for shape, pattern in [((100,3,1300),[1]),
+                          ((0,),[0]),((5,),[0]),
                           ((0,0),[0,1]),((1,0),[0,1]),((5,4),[0,1]),((33,31),[0,1]),((5,4),[1]),((5,4),[0]),#need something bigger then 32 for some opt test.
-                          ((5,4,3),[0]),((5,4,3),[0,1]),((5,4,3),[2]),((5,4,3),[1,2]),((5,4,3),[0,1,2]),
+                          ((5,4,3),[0]),((5,4,3),[1]),((5,4,3),[0,1]),((5,4,3),[2]),((5,4,3),[1,2]),((5,4,3),[0,1,2]),
                           ((0,0,0,0),[0,1,2,3]),
                           ((5,4,3,20),[2,3]), ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3]),((5,4,3,2),[1,2,3]),
+                          ((5,4,3,10,11),[1,2]),
+                          ((5,4,3,20),[2,3]), ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3]),((5,4,3,2),[1,2,3]),
                           #test shape bigger then 4096 on each dimension to make sure that we work correctly when we don't have enought thread/block in each dimensions
                           ((4100,3),[0]),((3,4101),[0]),#10
@@ -47,7 +50,7 @@ def test_sum():
                           ((4100,3),[0,1]),((3,4101),[0,1]),#11
                           ((4100,4,3),[0]),((5,4100,3),[0]),((5,4,4100),[0]),#100
-                          #((4100,4,3),[1]),((5,4100,3),[1]),((5,4,4100),[1]),#010 ##not implemented
+                          ((4100,4,3),[1]),((5,4100,3),[1]),((5,4,4100),[1]),#010
                           ((4100,4,3),[2]),((5,4100,3),[2]),((5,4,4100),[2]),#001
                           ((4100,4,3),[0,1]),((5,4100,3),[0,1]),((5,4,4100),[0,1]),#110
                           ((4100,4,3),[1,2]),((5,4100,3),[1,2]),((5,4,4100),[1,2]),#011
@@ -70,9 +73,9 @@ def test_sum():
    assert tcn.GpuSum in [x.op.__class__ for x in f.maker.env.toposort()]
    assert T.Sum in [x.op.__class__ for x in f2.maker.env.toposort()]
    if val.size == 0:
-       assert f2(val) == f(val)
+       assert f2(val) == f(val), ('shape', shape, 'pattern', pattern)
    else:
-       assert numpy.allclose(f2(val), f(val))
+       assert numpy.allclose(f2(val), f(val)), ('shape', shape, 'pattern', pattern)
    #test with dimshuffle
@@ -191,25 +191,31 @@ class Scalar(Type):
        typedef theano_complex%(nbits)s complex_type;
        typedef npy_float%(half_nbits)s scalar_type;

-       complex_type operator +(complex_type y) {
+       complex_type operator +(const complex_type &y) const {
            complex_type ret;
            ret.real = this->real + y.real;
            ret.imag = this->imag + y.imag;
            return ret;
        }
+       complex_type operator -() const {
+           complex_type ret;
+           ret.real = -this->real;
+           ret.imag = -this->imag;
+           return ret;
+       }
-       complex_type operator -(complex_type y) {
+       complex_type operator -(const complex_type &y) const {
            complex_type ret;
            ret.real = this->real - y.real;
            ret.imag = this->imag - y.imag;
            return ret;
        }
-       complex_type operator *(complex_type y) {
+       complex_type operator *(const complex_type &y) const {
            complex_type ret;
            ret.real = this->real * y.real - this->imag * y.imag;
            ret.imag = this->real * y.imag + this->imag * y.real;
            return ret;
        }
-       complex_type operator /(complex_type y) {
+       complex_type operator /(const complex_type &y) const {
            complex_type ret;
            scalar_type y_norm_square = y.real * y.real + y.imag * y.imag;
            ret.real = (this->real * y.real + this->imag * y.imag) / y_norm_square;
@@ -296,6 +302,7 @@ class Scalar(Type):
        return ""
    def c_code_cache_version(self):
+       return (8,) # put const around operators and added unary '-' operator
        # no need to put lib.amdlibm here as c_compile_args() are put in the key.
        return (7,) # make complex c code optional
        return (6,) # added implemeentations of operators that work with scalar arguments
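The multiplication and division bodies in this C++ struct implement the textbook formulas (a+bi)(c+di) = (ac-bd) + (ad+bc)i and (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i)/(c²+d²). They can be sanity-checked against Python's built-in complex type:

```python
a, b = 1.5, -2.0    # this->real, this->imag
c, d = 0.5, 3.0     # y.real, y.imag

# operator *: real = ac - bd, imag = ad + bc
prod = complex(a * c - b * d, a * d + b * c)
assert abs(prod - complex(a, b) * complex(c, d)) < 1e-12

# operator /: both components divided by the squared norm of y
y_norm_square = c * c + d * d
quot = complex((a * c + b * d) / y_norm_square,
               (b * c - a * d) / y_norm_square)
assert abs(quot - complex(a, b) / complex(c, d)) < 1e-12
```

The division form comes from multiplying numerator and denominator by the conjugate of y, which is why `y_norm_square` appears in both components.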
@@ -459,7 +466,39 @@ def float_out_nocomplex(*types):
        if type in complex_types:
            raise TypeError('complex argument not supported')
    return float64,

class unary_out_lookup(gof.utils.object2):
    """
    Get an output_types_preference object by passing a dictionary:

        unary_out_lookup({int8: int32, float32: complex128})

    The result is an op that maps int8 to int32 and float32 to complex128;
    other input types lead to a TypeError.
    """
    def __init__(self, type_table):
        self.tbl = type_table
    def __call__(self, *types):
        if len(types) == 1:
            types = types[0]
        try:
            rval = self.tbl[types]
        except KeyError:
            raise TypeError(types)
        if isinstance(types, (list, tuple)):
            return rval
        else:
            return [rval]
    def __eq__(self, other):
        return type(self) == type(other) and self.tbl == other.tbl
    def __hash__(self):
        return hash(type(self))  # ignore hash of table

def real_out(type):
    if type == complex64:
        return float32,
    if type == complex128:
        return float64,
    return type,
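The lookup-table pattern above can be exercised standalone; a minimal re-implementation (without the Theano base classes, using dtype-name strings in place of Theano's scalar type objects) behaves like this:

```python
class UnaryOutLookup:
    """Map an input type (or tuple of types) to output types via a table."""
    def __init__(self, type_table):
        self.tbl = type_table
    def __call__(self, *types):
        if len(types) == 1:
            types = types[0]          # single argument: look up that type directly
        try:
            rval = self.tbl[types]
        except KeyError:
            raise TypeError(types)    # unknown input type
        # a tuple key is assumed to already map to a list of output types
        return rval if isinstance(types, (list, tuple)) else [rval]

lookup = UnaryOutLookup({'int8': 'int32', 'float32': 'complex128'})
print(lookup('int8'))      # the output type list for int8 input
```

Equality and hashing on the real class compare the tables, so two lookups built from equal dictionaries are interchangeable in optimization caches.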
class ScalarOp(Op):
@@ -810,6 +849,13 @@ class Mul(ScalarOp):
        retval = []
        for input in inputs:
            if input.type in grad_types:
+               if input.type in complex_types:
+                   # does casting from real to complex work?
+                   dz_dinput = cast(mul(*(utils.difference(inputs, [input]))), input.type.dtype)
+                   x = real(dz_dinput)
+                   y = imag(dz_dinput)
+                   retval += [complex(x*real(gz)+y*imag(gz), x*imag(gz)-y*real(gz))]
+               else:
                    retval += [cast(mul(*([gz] + utils.difference(inputs, [input]))), input.type.dtype)]
            else:
                retval += [None]
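The complex branch of Mul's grad computes, component-wise, conj(dz/dinput) * gz: writing dz/dinput = x + iy and expanding (x - iy)(Re gz + i Im gz) gives real part x·Re(gz) + y·Im(gz) and imaginary part x·Im(gz) - y·Re(gz), exactly the two arguments passed to `complex(...)`. A quick NumPy check of that identity:

```python
import numpy as np

rng = np.random.RandomState(42)
dz = rng.randn() + 1j * rng.randn()   # stands in for dz/dinput (product of the other inputs)
gz = rng.randn() + 1j * rng.randn()   # incoming gradient

x, y = dz.real, dz.imag
componentwise = complex(x * gz.real + y * gz.imag,   # real part, as in the op's grad
                        x * gz.imag - y * gz.real)   # imag part, as in the op's grad

assert np.allclose(componentwise, np.conj(dz) * gz)
```

Seeing the expression as conj(dz/dinput)·gz also explains why the real case degenerates to plain multiplication by gz.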
@@ -1088,11 +1134,9 @@ class Abs(UnaryScalarOp):
        return numpy.abs(x)
    def grad(self, (x, ), (gz, )):
        if x.type in grad_types:
-           return gz * sgn(x),
+           return gz * x / abs(x),  # formula works for complex and real
        else:
            return None,
-       #backport
-       #return gz * sgn(x) if x.type in grad_types else None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        type = node.inputs[0].type
        if type in int_types:
@@ -1478,6 +1522,91 @@ class Tanh(UnaryScalarOp):
        return "%(z)s = tanh(%(x)s);" % locals()
tanh = Tanh(upgrade_to_float, name = 'tanh')
class Real(UnaryScalarOp):
    """Extract the real coordinate of a complex number."""
    def impl(self, x):
        return numpy.real(x)
    def grad(self, (x, ), (gz, )):
        return [complex(gz, 0)]
real = Real(real_out, name='real')

class Imag(UnaryScalarOp):
    def impl(self, x):
        return numpy.imag(x)
    def grad(self, (x, ), (gz, )):
        return [complex(0, gz)]
imag = Imag(real_out, name='imag')

class Angle(UnaryScalarOp):
    def impl(self, x):
        return numpy.angle(x)
    def grad(self, (c, ), (gtheta, )):
        # forward pass, for reference:
        #   y = x.imag
        #   r = sqrt(y**2 + x.real**2)
        #   g = y/r
        #   if x == 0 and y == 0:
        #       theta = 0
        #   elif x >= 0:
        #       theta = numpy.arcsin(g)
        #   else:
        #       theta = -numpy.arcsin(g) + numpy.pi
        x = real(c)
        y = imag(c)
        r = abs(c)
        gr = -gtheta * y / (r**2 * sqrt(1 - (y/r)**2))
        gx = gr * x/r
        gy = gr * y/r
        return [complex(gx, gy)]
angle = Angle(specific_out(float64), name='angle')
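For reference when checking a grad like Angle's: the partial derivatives of θ = atan2(y, x) are ∂θ/∂x = -y/r² and ∂θ/∂y = x/r², with r² = x² + y². A finite-difference check of those identities against numpy.angle (away from the branch cut at x < 0, y = 0):

```python
import numpy as np

x, y = 0.7, -1.3
r2 = x**2 + y**2
eps = 1e-7

# central finite differences of numpy.angle in the real and imaginary directions
dtheta_dx = (np.angle(complex(x + eps, y)) - np.angle(complex(x - eps, y))) / (2 * eps)
dtheta_dy = (np.angle(complex(x, y + eps)) - np.angle(complex(x, y - eps))) / (2 * eps)

assert np.allclose(dtheta_dx, -y / r2)
assert np.allclose(dtheta_dy, x / r2)
```

This is the kind of numeric cross-check that `verify_grad` automates for the symbolic version.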
class Complex(BinaryScalarOp):
    @staticmethod
    def output_types_preference(x, y):
        if x in complex_types:
            raise TypeError(x)
        if y in complex_types:
            raise TypeError(y)
        up = Scalar.upcast(x, y)
        if up in ('float64', 'int64', 'uint64', 'int32', 'uint32'):
            return [complex128]
        else:
            return [complex64]
    def impl(self, x, y):
        return numpy.complex(x, y)
    def grad(self, (x, y), (z, )):
        return [real(z), imag(z)]
complex = Complex(name='complex')

class Conj(UnaryScalarOp):
    def impl(self, x):
        return numpy.conj(x)
    def grad(self, (x, ), (gz, )):
        return [conj(gz)]
conj = Conj(same_out, name='conj')

class ComplexFromPolar(BinaryScalarOp):
    @staticmethod
    def output_types_preference(x, y):
        return Complex.output_types_preference(x, y)
    def impl(self, r, theta):
        if r < 0:
            raise ValueError('polar radius must be non-negative', r)
        x = r * numpy.cos(theta)
        y = r * numpy.sin(theta)
        if x.dtype == 'float32':
            return numpy.complex64(numpy.complex(x, y))
        else:
            return numpy.complex128(numpy.complex(x, y))
    def grad(self, (r, theta), (gz, )):
        gr = cos(theta) * real(gz) + sin(theta) * imag(gz)
        gtheta = -real(gz) * r * sin(theta) + imag(gz) * r * cos(theta)
        return [gr, gtheta]
complex_from_polar = ComplexFromPolar(name='complex_from_polar')
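ComplexFromPolar computes z = r·e^(iθ), so ∂z/∂r = (cos θ, sin θ) and ∂z/∂θ = (-r sin θ, r cos θ); its grad pairs those components with the real and imaginary parts of gz. The formulas can be checked by finite differences on a real-valued objective, the same way verify_grad would:

```python
import numpy as np

r, theta = 1.7, 0.4
gz = 0.5 + 0.9j                       # incoming gradient on z = r * exp(1j * theta)
eps = 1e-7

def f(r, theta):
    # real-valued objective pairing z's components with gz's components
    z = r * np.exp(1j * theta)
    return gz.real * z.real + gz.imag * z.imag

gr_fd = (f(r + eps, theta) - f(r - eps, theta)) / (2 * eps)
gtheta_fd = (f(r, theta + eps) - f(r, theta - eps)) / (2 * eps)

# analytic gradients, transcribed from ComplexFromPolar.grad
gr = np.cos(theta) * gz.real + np.sin(theta) * gz.imag
gtheta = -gz.real * r * np.sin(theta) + gz.imag * r * np.cos(theta)

assert np.allclose(gr_fd, gr)
assert np.allclose(gtheta_fd, gtheta)
```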
class Composite(ScalarOp):
@@ -1642,3 +1771,4 @@ class Composite(ScalarOp):
        #otherwise self.perform won't work.
        self.__init__(self.inputs, self.outputs)
@@ -618,7 +618,7 @@ class Elemwise(Op):
                errormsg = 'Failed calling ufunc for op', self.scalar_op,\
                    'for params of shape', [arg.shape for arg in ufunc_args]
                e.args = e.args + errormsg
-               raise e
+               raise
        if ufunc.nout == 1: variables = [variables]
        for variable, storage in zip(variables, output_storage):
            if storage[0].shape:
@@ -438,13 +438,13 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
        y_idx = tensor.as_tensor_variable(y_idx)
        if x.type.ndim != 2 \
                or x.type.dtype not in ['float32', 'float64']:
-           raise ValueError('x must be 2-d tensor of floats')
+           raise ValueError('x must be 2-d tensor of floats', x.type)
        if b.type.ndim != 1 \
                or x.type.dtype not in ['float32', 'float64']:
-           raise ValueError('b must be 1-d tensor of floats')
+           raise ValueError('b must be 1-d tensor of floats', b.type)
        if y_idx.type.ndim != 1 \
                or y_idx.type.dtype not in ['int8', 'int16', 'int32', 'int64']:
-           raise ValueError('y_idx must be 1-d tensor of ints')
+           raise ValueError('y_idx must be 1-d tensor of ints', y_idx.type)
        # TODO: Is this correct? It used to be y, not y_idx
        nll = tensor.TensorType(x.type.dtype,
@@ -180,11 +180,12 @@ def local_dimshuffle_lift(node):
        new_order = [x == 'x' and 'x' or inode.op.new_order[x] for x in op.new_order]
        inplace = op.inplace and inode.op.inplace
        iinput = inode.inputs[0]
-       if new_order == range(len(new_order)):
+       if new_order == range(len(new_order)) and (len(new_order) == iinput.type.ndim):
            return [iinput]
        else:
            return DimShuffle(iinput.type.broadcastable, new_order, inplace).make_node(iinput).outputs
register_canonicalize(local_dimshuffle_lift)
register_specialize(local_dimshuffle_lift)
@@ -519,8 +520,8 @@ def local_fill_to_alloc(node):
    # TODO: cut out un-necessary dimshuffles of v
    rval = [T.alloc(T.cast(v, node.outputs[0].dtype), *shape_of[node.outputs[0]])]
-   if rval[0].type != node.outputs[0].type:
-       print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
+   #if rval[0].type != node.outputs[0].type:
+   #    print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
    assert rval[0].type == node.outputs[0].type, ('rval', rval[0].type,
                                                  'orig', node.outputs[0].type,
@@ -1048,9 +1049,41 @@ def local_reshape_chain(node):
    if not opt.check_chain(node, T.Reshape, T.Reshape):
        return False
+   # TODO: this can permit a failing program to run by eliminating the lower
+   # reshape
    return [node.op(node.inputs[0].owner.inputs[0], node.inputs[1])]
register_canonicalize(local_reshape_chain)
if 0:
    # TODO: Test that this optimization works.
    @register_canonicalize
    @gof.local_optimizer([])
    def local_scalar_reshape(node):
        """Eliminate reshape Ops whose inputs and outputs are scalars."""
        if isinstance(node.op, T.Reshape):
            x, shp = node.inputs
            if x.ndim == 0 and T.get_vector_length(shp) == 0:
                return [x]

if 0:
    # TODO: Finish writing and testing this optimization.
    # The idea is that if we can prove the output of this sum
    # has a zero-size dimension, then it can be replaced by an appropriately
    # typed and broadcasted zero.
    @register_canonicalize
    @gof.local_optimizer([])
    def local_sum_over_empty(node):
        if isinstance(node.op, T.Sum):
            y, = node.outputs
            y_shape = node.env.shape_feature.shape_of[y]
            def tmp(thing):
                try:
                    return T.get_constant_value(thing)
                except (TypeError, ValueError), e:
                    print e, thing.owner.inputs[0]
                    return None
            print 'LOCAL SUM EMPTY', [tmp(s) for s in y_shape]
##################
# Middleman cuts #
@@ -1063,11 +1096,18 @@ def local_fill_cut(node):
    If c.type == a.type.
    """
    # This optimization is essentially for getting broadcasting to replace fill.
    # That is always possible when using a compound Elemwise operation,
    # but it is not always possible without one (consider filling a large matrix
    # with a scalar, and then adding another scalar: the only numbers that count
    # are the two scalars, but we can't ignore the large matrix because it gives
    # the shape of the result).
    if not opt.check_chain(node, T.Elemwise):
        return False
    output = node.outputs[0]
    try:
        # reference is some input with the same type as the output but that is
        # not produced by a fill
        reference = [input
                     for input in node.inputs
                     if input.type == output.type and (not input.owner or input.owner.op != T.fill)][0]
@@ -1086,7 +1126,12 @@ def local_fill_cut(node):
    if new_inputs == node.inputs:
        return False
-   return node.op.make_node(*new_inputs).outputs
+   rval = node.op(*new_inputs)
+   if isinstance(rval, gof.Variable):
+       return rval.owner.outputs
+   else:
+       return rval[0].owner.outputs
register_canonicalize(local_fill_cut)
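The comment's scenario can be illustrated with NumPy broadcasting: filling a large matrix with a scalar and then adding a second scalar really computes a single number that is broadcast to the matrix's shape, which is what cutting the fill out of the Elemwise achieves. A minimal sketch:

```python
import numpy as np

big = np.empty((1000, 1000))   # only contributes its shape to the result
a, b = 2.0, 3.0

# "fill then add": materializes a full matrix of a's before adding b
filled = np.full(big.shape, a) + b

# fill cut: compute the scalar arithmetic once, broadcast only at the end
cut = np.broadcast_to(np.asarray(a + b), big.shape)

assert filled.shape == cut.shape == (1000, 1000)
assert np.allclose(filled, cut)
```

The symbolic version needs a `reference` input of the output's type precisely so the result keeps the correct shape after the fills are removed.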
import unittest
import theano
from theano.tensor import *
+from theano.tests import unittest_tools as utt

class TestRealImag(unittest.TestCase):
@@ -15,9 +16,60 @@ class TestRealImag(unittest.TestCase):
        x = dvector()
        rng = numpy.random.RandomState(23)
        xval = rng.randn(10)
-       assert numpy.all( 0 == theano.function([x], imag(x))(xval))
-       assert numpy.all( xval == theano.function([x], real(x))(xval))
+       try:
+           numpy.all( 0 == theano.function([x], imag(x))(xval))
+           assert 0
+       except TypeError:
+           pass
+       try:
+           numpy.all( xval == theano.function([x], real(x))(xval))
+           assert 0
+       except TypeError:
+           pass

    def test_cast(self):
        x = zvector()
        self.failUnlessRaises(TypeError, cast, x, 'int32')
    def test_complex(self):
        rng = numpy.random.RandomState(2333)
        m = fmatrix()
        c = complex(m[0], m[1])
        assert c.type == cvector
        r, i = [real(c), imag(c)]
        assert r.type == fvector
        assert i.type == fvector
        f = theano.function([m], [r, i])
        mval = numpy.asarray(rng.randn(2, 5), dtype='float32')
        rval, ival = f(mval)
        assert numpy.all(rval == mval[0]), (rval, mval[0])
        assert numpy.all(ival == mval[1]), (ival, mval[1])

    def test_complex_grads(self):
        def f(m):
            c = complex(m[0], m[1])
            return .5 * real(c) + .9 * imag(c)
        rng = numpy.random.RandomState(9333)
        mval = numpy.asarray(rng.randn(2, 5))
        utt.verify_grad(f, [mval])

    def test_polar_grads(self):
        def f(m):
            c = complex_from_polar(abs(m[0]), m[1])
            return .5 * real(c) + .9 * imag(c)
        rng = numpy.random.RandomState(9333)
        mval = numpy.asarray(rng.randn(2, 5))
        utt.verify_grad(f, [mval])

    def test_abs_grad(self):
        def f(m):
            c = complex(m[0], m[1])
            return .5 * abs(c)
        rng = numpy.random.RandomState(9333)
        mval = numpy.asarray(rng.randn(2, 5))
        utt.verify_grad(f, [mval])
@@ -63,3 +63,15 @@ def verify_grad(op, pt, n_tests=2, rng=None, *args, **kwargs):
        rng = numpy.random
    T.verify_grad(op, pt, n_tests, rng, *args, **kwargs)

# This supports the following syntax:
#
#     try:
#         verify_grad(...)
#     except verify_grad.E_grad, e:
#         print e.num_grad.gf
#         print e.analytic_grad
#         ...
#
verify_grad.E_grad = T.verify_grad.E_grad