Commit f70d4149 authored by James Bergstra

merge

...@@ -150,4 +150,11 @@ Config Attributes
A directory with bin/, lib/, include/ folders containing cuda utilities.
.. attribute:: gcc.cxxflags
Default: ""
Extra parameters to pass to gcc when compiling. Extra include paths,
library paths, configuration options, etc.
==================
Advanced Indexing
==================
Continue the Advanced Indexing project that is on either github or bitbucket.
=============================================================================
Proposal for New Linking Strategy supporting Lazy Evaluation: Op.make_thunk
=============================================================================
.. note::
Proposal made June 2010.
Motivation
===========
Conditional evaluation is useful to describe many optimization algorithms where
the update expressions depend on internal state.
True conditional evaluation requires lazy graph evaluation.
Without lazy graph evaluation, the runtime of a graph can be exponential in the
number of conditionals instead of linear. No one waits an exponential amount of
time, so people currently work around this problem in various ways, but it
would be better if Theano had an 'if-then-else' expression (call it cond).
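To make the exponential-vs-linear point concrete, here is a toy Python sketch (not Theano code; all names are invented for illustration): an eager cond must evaluate both branches before choosing one, so n nested conditionals force on the order of 2**n branch evaluations, while a lazy cond touches only the branch it needs.

```python
calls = [0]

def expensive(x):
    # stands in for evaluating any costly sub-graph
    calls[0] += 1
    return x + 1

def nest_eager(n, x):
    # eager: both arguments are evaluated before the "cond" picks one,
    # just as a non-lazy linker computes every node in the graph
    if n == 0:
        return x
    then_val = nest_eager(n - 1, expensive(x))
    else_val = nest_eager(n - 1, expensive(-x))
    return then_val if x > 0 else else_val

def nest_lazy(n, x):
    # lazy: only the chosen branch is ever evaluated
    if n == 0:
        return x
    return nest_lazy(n - 1, expensive(x) if x > 0 else expensive(-x))

calls[0] = 0
nest_eager(4, 1.0)
eager_calls = calls[0]   # 2 + 2*2 + 2*6 + 2*14 = 2**(n+1) - 2 = 30 for n == 4

calls[0] = 0
nest_lazy(4, 1.0)
lazy_calls = calls[0]    # one call per level: 4
```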
A lazily-evaluated 'cond' requires a linker to use a different method for
interacting with Ops. Neither the current perform() nor c_code() approaches
support lazy evaluation.
Why do perform (and c_code) not handle lazy evaluation?
The syntax of the current perform() could be extended to be compatible with lazy
evaluation. For example, the linker could set all inputs to None, and use the
return value from perform() to see which inputs are required. But all the Ops
that currently implement a perform() function would be broken because their
perform implementations do not ask for inputs before using them. I don't see a
way around this. The same restriction applies to c_code.
The way around this is to introduce a new interface for the linker to talk to
Ops. I propose that we add an Op.make_thunk() that returns an object satisfying
this interface.
At the same time, it appears that as we try to integrate PyCUDA Ops another
problem arises. We would like to use Op.perform() to drive the GPU, but it is
natural to move compilation of the CUDA kernel to a point after make_node() and
before perform(). The point where the linker makes a thunk from the Op
seems like a natural choice.
A third motivation for introducing an Op.make_thunk function is to clarify the
relationship between Ops (the classes you implement in Python) and mathematical
operations (the more abstract things in terms of which you think when using
Theano).
I propose that *technically* an Op, when conditioned by particular inputs,
generates *at most one implementation* that defines the behaviour of that Op.
In *intuitive terms*, the abstract mathematical steps that we sometimes talk about regarding Theano
still correspond to Ops -- it's just that these Ops have relatively generic
implementations.
The process of optimization is to specialize those generic implementations
by using information from the rest of the graph.
If we accept that an Op corresponds to at most one implementation,
then it makes sense to ask an Op instance to expose that implementation via a
standard interface (Op.make_thunk).
It does not make sense to pass arguments to Op.make_thunk such as 'py' or "c|py"
to tell the Op which implementation to use. The Op instance represents just one
implementation, and flags such as 'py' or 'c|py' should be passed to the Op's
constructor.
Proposal: Op.make_thunk
==========================
There are two interface items I propose to add. The first is a Thunk object
(which we have never had before), and the second is a new function (make_thunk)
in the PureOp class (a superclass of Op) that will return a Thunk.
.. code-block:: python

    class Thunk(object):
        """Abstract class / interface

        It describes the interface used by a Theano linker to execute the
        nodes in a graph.  Thunk instances are in correspondence with the
        Apply instances that remain in the final form of the graph after
        optimization.
        """

        lazy = property(...,
            doc="""True means the thunk may trigger lazy evaluation.
            False means the thunk always requires all inputs and computes
            all outputs.  Consequently False implies that __call__ always
            returns None.
            """)

        def __call__(self):
            """Compute some number (possibly zero) of outputs.  In the case
            that the thunk cannot compute all its outputs for lack of inputs,
            this function will return a list of the input indexes that are
            required.  The linker will typically compute those required
            inputs and then call this __call__ function again.

            The thunk is considered to be finished when it returns an empty
            list or None.
            """
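As a sanity check of this interface, here is a hypothetical sketch (the retry loop and IfThenElseThunk are inventions for illustration, not part of the proposal) of a lazy thunk and the loop a LazyLinker might run around it:

```python
def run_thunk(thunk, compute_input):
    """Call `thunk` until it is finished, computing inputs on demand.

    compute_input(i) stands in for the linker recursively running the
    thunk that produces input i.
    """
    while True:
        required = thunk()
        if not required:              # None or [] means the thunk is done
            return
        for i in required:
            compute_input(i)

class IfThenElseThunk(object):
    """Toy lazy thunk: inputs are [cond, then_val, else_val], one output."""
    lazy = True

    def __init__(self, input_computed, input_registers,
                 output_computed, output_registers):
        self.ic, self.ir = input_computed, input_registers
        self.oc, self.our = output_computed, output_registers

    def __call__(self):
        if not self.ic[0][0]:
            return [0]                # ask for the condition first
        branch = 1 if self.ir[0][0] else 2
        if not self.ic[branch][0]:
            return [branch]           # ask only for the chosen branch
        self.our[0][0] = self.ir[branch][0]
        self.oc[0][0] = 1
        return None                   # finished

# drive it: the "linker" computes inputs only when asked
ic, ir = [[0], [0], [0]], [[None], [None], [None]]
oc, out = [[0]], [[None]]
asked = []
def compute_input(i):
    asked.append(i)
    ir[i][0] = {0: True, 1: 42.0, 2: -1.0}[i]
    ic[i][0] = 1

run_thunk(IfThenElseThunk(ic, ir, oc, out), compute_input)
# asked == [0, 1]: input 2 (the untaken branch) is never computed
```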
.. code-block:: python

    class PureOp(object):  # recall: Op inherits from PureOp

        def make_node(self, *inputs):  # leave alone
            ...

        def perform(self, node,
                inputs, output_storage):  # move to `Op` class
            ...

        def make_thunk(self, node,  # new function
                input_computed, output_computed,
                input_registers, output_registers,
                ):
            """
            :type node: Apply instance
            :param node: previous rval from make_node(self, *inputs)

            :type input_computed: list of len-1 lists, with values in (0,1).
            :param input_computed: at runtime, input_computed[i][0]==1 implies
                that the i'th input has been computed and stored at
                input_registers[i][0], and is available for use.
                Otherwise the content of input_registers[i][0] is undefined.

            :type output_computed: list of len-1 lists, with values in (0,1).
            :param output_computed: at runtime, output_computed[i][0]==1 implies
                that the i'th output has already been computed and stored at
                output_registers[i][0].
                Otherwise, output_registers[i][0] will contain either None, or
                a value that was previously computed by this thunk.

            :type input_registers: list of len-1 lists
            :type output_registers: list of len-1 lists
            :param input_registers: the i'th input can be read from
                input_registers[i][0] when input_computed[i][0] == 1.
            :param output_registers: the i'th output must be stored to
                output_registers[i][0], at which point the thunk must set
                output_computed[i][0] == 1.

            :returns: a Thunk (subclass) instance
            """
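One way to see how existing Ops could satisfy this interface unchanged: a default make_thunk could wrap Op.perform in a non-lazy Thunk. The wrapper below is a sketch under that assumption (PerformThunk and Add are invented names); it uses the len-1-list register convention from the docstring above.

```python
class PerformThunk(object):
    """Non-lazy Thunk wrapping an existing Op.perform (a sketch)."""
    lazy = False   # always requires all inputs, computes all outputs

    def __init__(self, op, node, input_registers,
                 output_computed, output_registers):
        self.op, self.node = op, node
        self.ir = input_registers
        self.oc, self.our = output_computed, output_registers

    def __call__(self):
        inputs = [r[0] for r in self.ir]   # lazy==False: all inputs ready
        self.op.perform(self.node, inputs, self.our)
        for c in self.oc:
            c[0] = 1
        return None                        # non-lazy thunks finish in one call

class Add(object):
    """Toy Op with the current perform() signature."""
    def perform(self, node, inputs, output_storage):
        output_storage[0][0] = inputs[0] + inputs[1]

out_computed, out_regs = [[0]], [[None]]
PerformThunk(Add(), None, [[3], [4]], out_computed, out_regs)()
# out_regs[0][0] == 7 and out_computed[0][0] == 1
```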
The Thunk class can have subclasses that use Op.perform and Op.c_code as we use
them now. The interface of Thunk is backward-compatible with the thunks built
by the CLinker and PerformLinker. If a graph contains zero Thunks with
lazy==True, then the current Linkers will continue to work.
The new Thunk interface will support a new LazyLinker that can run programs for
which some thunks have lazy==True.
The Thunk class can have subclasses that are implemented in C, which might help
performance.
========================
DP Instruction Selection
========================
Read Ch 9 of Modern Compiler Implementation about instruction selection.
We should probably be doing graph optimization totally differently:
optimizations would *only add* new ways of implementing something, never
replacing the old way. Every graph node (apply) has a cost, and Dynamic
Programming (DP) is used to select the minimum-cost graph.
The advantage of this approach is that optimizations do not have to run in such
a careful order, and graph selection would be much faster.
Think about how aliasing and destructive operations (the destroy-handler) would
fit in this approach.
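A minimal sketch of the DP selection (names and costs are invented): each node keeps every implementation contributed by optimizations as an alternative, and the cheapest graph is chosen bottom-up with memoization.

```python
def min_cost(node, alternatives, memo=None):
    """alternatives maps node -> list of (cost, input_nodes) implementations."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = min(
            cost + sum(min_cost(i, alternatives, memo) for i in inputs)
            for cost, inputs in alternatives[node])
    return memo[node]

# 'mul_by_2' keeps both its generic form (cost 5) and a specialized
# rewrite (cost 1); DP picks the cheaper one instead of an optimizer
# destructively replacing the node.
alts = {
    'x':        [(0, [])],
    'mul_by_2': [(5, ['x']), (1, ['x'])],
    'out':      [(2, ['mul_by_2'])],
}
best = min_cost('out', alts)   # 2 + min(5, 1) + 0 == 3
```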
=====================
Intermediate Language
=====================
It would be nice to be able to use Theano from other languages.
This requires two things: a way to communicate the expression to the Theano
compiler, and a way to pass data to and from the compiled function.
One way to do this would be to define a textual representation of Theano
graphs. A Scheme-like language seems appropriate. Perhaps just Scheme itself
would be appropriate.
How to pass shared variables?
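For illustration only (the syntax is invented, not a proposal detail): a graph like add(mul(x, x), y) could be written as the s-expression (add (mul x x) y), and a reader for such a format is tiny.

```python
def parse_sexpr(text):
    """Parse one s-expression into nested Python lists of tokens."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(pos):
        if tokens[pos] == '(':
            lst, pos = [], pos + 1
            while tokens[pos] != ')':
                item, pos = read(pos)
                lst.append(item)
            return lst, pos + 1       # skip the closing paren
        return tokens[pos], pos + 1

    return read(0)[0]

tree = parse_sexpr("(add (mul x x) y)")
# tree == ['add', ['mul', 'x', 'x'], 'y']
```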
=================
MongoDB DLL Cache
=================
In network environments (like at DIRO on NFS3), a distributed DB like mongo or
couch would be faster and more robust to concurrency than the $HOME/.theano
cache. Also, a single cache could be shared by multiple users. This would mean
less compilation time for everyone, and fewer stale-cache problems.
...@@ -4,6 +4,7 @@ Automatic updates
.. note:
Proposed 2010 01 13
Done 2010 04 ??
The Module version of RandomStreams could arrange for the automatic update of
......
=======
OpenCL
=======
Migrate the GPU code-generators to the PyCUDA style, and eventually to OpenCL.
This mainly means using a different code-generation strategy: the kernel itself
is compiled, but the calling code remains in Python or Cython. We would no
longer generate entire C files this way, and no longer use the CLinker for GPU
code.
Proactive Merging
=================
Merge is currently done as an optimization.
But if merging were done at graph construction time, things like #476 would
work. Additionally, memo-izing at graph construction time would make it
possible to define recursive formulas with recursive Python functions (e.g.
Fibonacci). Currently the merge optimization makes the Fibonacci graph linear,
but the size of the expression used to build that graph is exponential.
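The Fibonacci point can be sketched with a toy hash-consing constructor (invented names): merging identical applies at construction time keeps the number of distinct nodes linear in n even though the naive expression tree is exponential.

```python
node_cache = {}
nodes_built = [0]

def apply_merged(op, *inputs):
    # hash-cons: identical (op, inputs) applications share one node
    key = (op,) + inputs
    if key not in node_cache:
        nodes_built[0] += 1
        node_cache[key] = key      # the key itself stands in for an Apply node
    return node_cache[key]

def fib(n):
    if n < 2:
        return apply_merged('const', n)
    return apply_merged('add', fib(n - 1), fib(n - 2))

fib(20)
# 2 constant nodes + one 'add' node for each n in 2..20: 21 distinct nodes,
# while the unmerged expression tree has on the order of 2**20 applies
```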
========================
Add tensor attributes
========================
Size, shape, psd, symmetric, triangular, contiguous.
Add these attributes to TensorType, always with the option that they be
'unknown'.
Add attributes that are useful for optimizations, or useful for code generation.
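A sketch (invented API, not Theano's) of what "always optionally unknown" could look like: None encodes 'unknown', so existing code keeps working, and an optimization specializes only when a fact is definitely known.

```python
class TensorTypeSketch(object):
    def __init__(self, dtype, broadcastable,
                 shape=None, symmetric=None, contiguous=None):
        self.dtype = dtype
        self.broadcastable = broadcastable
        self.shape = shape              # None means unknown
        self.symmetric = symmetric      # None unknown, else True/False
        self.contiguous = contiguous    # None means unknown

t = TensorTypeSketch('float64', (False, False), symmetric=True)
u = TensorTypeSketch('float64', (False, False))

# code generation must distinguish "known False" from "unknown":
can_specialize_t = (t.symmetric is True)   # safe to use a symmetric solver
can_specialize_u = (u.symmetric is True)   # must stay generic
```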
...@@ -104,6 +104,47 @@ class BadCLinkerOutput(DebugModeError):
print >> sio, " val_py :", self.val_py
print >> sio, " val_c :", self.val_c
print >> sio, " op :", self.offending_op()
try:
ssio = StringIO()
print >> ssio, " PyValue shape, dtype, strides, min, max:",
print >> ssio, self.val_py.shape,
print >> ssio, self.val_py.dtype,
print >> ssio, self.val_py.strides
print >> ssio, self.val_py.min(),
print >> ssio, self.val_py.max(),
# only if all of the above succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
try:
ssio = StringIO()
print >> ssio, " CValue shape, dtype, strides, min, max:",
print >> ssio, self.val_c.shape,
print >> ssio, self.val_c.dtype,
print >> ssio, self.val_c.strides,
print >> ssio, self.val_c.min(),
print >> ssio, self.val_c.max(),
# only if all of the above succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
try:
ov=numpy.asarray(self.val_c)
nv=numpy.asarray(self.val_py)
ssio = StringIO()
print >> ssio, " Max Abs Diff: ", numpy.max(numpy.absolute(nv-ov))
print >> ssio, " Mean Abs Diff: ", numpy.mean(numpy.absolute(nv-ov))
print >> ssio, " Median Abs Diff: ", numpy.median(numpy.absolute(nv-ov))
print >> ssio, " Std Abs Diff: ", numpy.std(numpy.absolute(nv-ov))
reldiff = numpy.absolute(nv-ov) / (numpy.absolute(nv)+numpy.absolute(ov))
print >> ssio, " Max Rel Diff: ", numpy.max(reldiff)
print >> ssio, " Mean Rel Diff: ", numpy.mean(reldiff)
print >> ssio, " Median Rel Diff: ", numpy.median(reldiff)
print >> ssio, " Std Rel Diff: ", numpy.std(reldiff)
# only if all of the above succeed do we add anything to sio
print >> sio, ssio.getvalue()
except:
pass
return sio.getvalue()
class BadOptimization(DebugModeError):
......
...@@ -858,6 +858,7 @@ class CLinker(link.Linker):
if config.gcc.cxxflags:
sig.append(config.gcc.cxxflags)
error_on_play = [False]
def in_sig(i, topological_pos, i_idx):
# assert that every input to every node is one of:
# - an env input
...@@ -868,7 +869,14 @@ class CLinker(link.Linker):
# yield a 'position' that reflects its role in code_gen()
if isinstance(i, graph.Constant): #orphans
if id(i) not in constant_ids:
-constant_ids[id(i)] = (i.signature(), topological_pos, i_idx)
+isig = (i.signature(), topological_pos, i_idx)
try:
hash(isig)
except: #generic constants don't have a hashable signature
error_on_play[0] = True
return None
constant_ids[id(i)] = isig
else:
isig = constant_ids[id(i)]
#print 'SIGNATURE', i.signature()
#return i.signature()
...@@ -903,6 +911,12 @@ class CLinker(link.Linker):
for ipos,i in enumerate(node.inputs)),
tuple(o in no_recycling for o in node.outputs)))
if error_on_play[0]:
# if one of the signatures is not hashable
# then bypass the cache mechanism and
# compile fresh every time
return None
op_pos[node] = node_pos
env_computed_set.update(node.outputs)
......
...@@ -242,6 +242,7 @@ def refresh_lock(lock_file):
lock_write = open(lock_file, 'w')
lock_write.write(unique_id + '\n')
lock_write.close()
print lock_file
return unique_id
class Unlocker(object):
......
...@@ -7,6 +7,7 @@ if sys.version_info[:2] >= (2,5):
import toolbox
import graph
from theano.gof import deque
from env import InconsistencyError
...@@ -44,6 +45,92 @@ class DestroyHandler(object):
return self.map[env].orderings(env)
def _dfs_toposort(i, r_out, orderings):
"""
i - list of inputs
r_out - list of outputs
orderings - dict of additions to the normal inputs and outputs
Returns nothing. Raises ValueError for a graph with cycles.
"""
# This is a hard-coded reimplementation of functions from graph.py.
# Reason: go faster, prepare for a port to C.
assert isinstance(r_out, (tuple, list, deque))
# TODO: For more speed - use a defaultdict for the orderings
iset = set(i)
if 0:
def expand(obj):
rval = []
if obj not in iset:
if isinstance(obj, graph.Variable):
if obj.owner:
rval = [obj.owner]
if isinstance(obj, graph.Apply):
rval = list(obj.inputs)
rval.extend(orderings.get(obj, []))
else:
assert not orderings.get(obj, [])
return rval
expand_cache = {}
# reachable, clients = stack_search( deque(r_out), deps, 'dfs', True)
start=deque(r_out)
rval_set = set()
rval_set.add(id(None))
rval_list = list()
expand_inv = {}
sources = deque()
while start:
l = start.pop()  # this makes the search dfs
if id(l) not in rval_set:
rval_list.append(l)
rval_set.add(id(l))
if l in iset:
assert not orderings.get(l, [])
expand_l = []
else:
try:
if l.owner:
expand_l = [l.owner]
else:
expand_l = []
except AttributeError:
expand_l = list(l.inputs)
expand_l.extend(orderings.get(l, []))
if expand_l:
for r in expand_l:
expand_inv.setdefault(r, []).append(l)
start.extend(expand_l)
else:
sources.append(l)
expand_cache[l] = expand_l
assert len(rval_list) == len(rval_set)-1
rset = set()
rlist = []
while sources:
node = sources.popleft()
if node not in rset:
rlist.append(node)
rset.add(node)
for client in expand_inv.get(node, []):
expand_cache[client] = [a for a in expand_cache[client] if a is not node]
if not expand_cache[client]:
sources.append(client)
if len(rlist) != len(rval_list):
raise ValueError('graph contains cycles')
#return [o for o in rlist if isinstance(o, graph.Apply)]
def getroot(r, view_i):
"""
For views: Return the non-view variable which is ultimately viewed by r.
...@@ -303,7 +390,8 @@ class DestroyHandlerHelper2(toolbox.Bookkeeper):
raise
#print 'orderings:', ords
try:
-graph.io_toposort(env.inputs, env.outputs, ords)
+### graph.io_toposort(env.inputs, env.outputs, ords)
+_dfs_toposort(env.inputs, env.outputs, ords)
except ValueError, e:
#print 'not passing.', ords
if 'cycles' in str(e):
......
...@@ -9,9 +9,10 @@ compatible with `gof`'s :doc:`graph` routines.
__docformat__ = "restructuredtext en"
import utils
import traceback
from theano import config
class CLinkerObject(object):
"""Standard elements of an Op or Type used with the CLinker
"""
...@@ -320,9 +321,7 @@ class PureOp(object):
"""
node = self.make_node(*inputs, **kwargs)
-limit = config.traceback.limit
-if limit == -1: limit = None
-node.tag.trace = traceback.extract_stack(limit=limit)[:-1]
+self.add_tag_trace(node)
if self.default_output is not None:
return node.outputs[self.default_output]
else:
...@@ -331,6 +330,9 @@ class PureOp(object):
else:
return node.outputs
# Convenience so that subclass implementers don't have to import utils
# just to call add_tag_trace
add_tag_trace = staticmethod(utils.add_tag_trace)
#########################
# Python implementation #
......
...@@ -127,6 +127,8 @@ class SeqOptimizer(Optimizer, list):
for optimizer in self:
try:
optimizer.optimize(env)
except AssertionError: # do not catch Assertion failures
raise
except Exception, e:
if self.failure_callback:
self.failure_callback(e, self, optimizer)
......
...@@ -5,8 +5,7 @@ __docformat__ = "restructuredtext en"
import copy
import utils
from utils import MethodNotDefined, object2
-from graph import Variable
-import traceback
+import graph
from theano import config
########
...@@ -202,6 +201,9 @@ class PureType(object):
"""
Variable = graph.Variable  # the class of Variable created by make_variable
Constant = graph.Constant  # the class of Constant created by make_constant
def filter(self, data, strict = False):
"""Required: Return data or an appropriately wrapped/converted data.
...@@ -233,8 +235,11 @@ class PureType(object):
A pretty string for printing and debugging.
"""
-r = Variable(self, name = name)
-return r
+return self.Variable(self, name = name)
def make_constant(self, value, name=None):
return self.Constant(type=self, data=value, name=name)
def __call__(self, name = None):
"""Return a new `Variable` instance of Type `self`.
...@@ -244,11 +249,7 @@
A pretty string for printing and debugging.
"""
-r = self.make_variable(name)
-limit = config.traceback.limit
-if limit == -1: limit = None
-r.tag.trace = traceback.extract_stack(limit=limit)[:-1]
-return r
+return utils.add_tag_trace(self.make_variable(name))
def values_eq(self, a, b):
"""
...@@ -319,9 +320,11 @@ class Type(object2, PureType, CLinkerType):
"""
## DELETEME ##
class SingletonType(Type):
-"""WRITEME"""
+"""Convenient base class for a Type subclass with no attributes.
+It saves having to implement __eq__ and __hash__.
+"""
__instance = None
def __new__(cls):
if cls.__instance is None:
...@@ -378,14 +381,5 @@ class Generic(SingletonType):
Py_INCREF(py_%(name)s);
""" % locals()
generic = Generic()
## DELETEME ##
class PropertiedType(Type):
"""WRITEME"""
def __eq__(self, other):
return type(self) == type(other) and self.__dict__ == other.__dict__
...@@ -2,7 +2,18 @@
# import op
# import variable
-import re, os
+from theano import config
+import re, os, traceback
def add_tag_trace(thing):
"""Add tag.trace to a node or variable.
The argument is returned after being modified in place.
"""
limit = config.traceback.limit
if limit == -1: limit = None
thing.tag.trace = traceback.extract_stack(limit=limit)[:-1]
return thing
def hashgen():
hashgen.next += 1
......
...@@ -53,6 +53,10 @@ def debugprint(obj, depth=-1, print_type=False, file=None):
results_to_print.extend(obj.outputs)
elif isinstance(obj, Function):
results_to_print.extend(obj.maker.env.outputs)
elif isinstance(obj, (list, tuple)):
results_to_print.extend(obj)
else:
raise TypeError("debugprint cannot print an object of this type", obj)
for r in results_to_print:
debugmode.debugprint(r, depth=depth, done=done, print_type=print_type, file=_file)
if file is _file:
......
...@@ -478,9 +478,16 @@ class GpuSum(Op):
PyErr_Format(PyExc_RuntimeError, "Failed to allocate output");
%(fail)s;
}
}
""" %locals()
# \begin bracket the reduction in a check that there is actually work to do
print >> sio, """
if (CudaNdarray_SIZE(%(z)s))
{
""" % locals()
# #
# Now perform the reduction # Now perform the reduction
# #
...@@ -499,9 +506,15 @@ class GpuSum(Op):
else:
getattr(self, 'c_code_reduce_%s'%(''.join(str(i) for i in self.reduce_mask)))(sio, node, name, x, z, fail)
# \end bracket the reduction ...
print >> sio, """
}
""" % locals()
return sio.getvalue()
-def _makecall(self, node, name, x, z, fail):
+def _makecall(self, node, name, x, z, fail, pattern=None):
"""Return a string for making a kernel call.
The return value looks something like:
...@@ -527,14 +540,18 @@ class GpuSum(Op):
}
"""
sio = StringIO.StringIO()
if pattern is None:
pattern = ''.join(str(c) for c in self.reduce_mask)
-ndim = len(pattern)
+ndim = len(self.reduce_mask)
nd_out = ndim - sum(self.reduce_mask)
print >> sio, """
if (verbose) printf("running kernel_reduce_sum_%(pattern)s_%(name)s\\n");
int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
-if (verbose>1) printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, n_blocks.x=%%d, n_blocks.y=%%d n_shared=%%d\\n",
-n_threads.x,n_threads.y,n_threads.z,n_blocks.x,n_blocks.y,n_shared);
+if (verbose>1) printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d, nb_block=%%d, n_shared=%%d\\n",
+n_threads.x,n_threads.y,n_threads.z,
n_threads.x*n_threads.y*n_threads.z,
n_blocks.x,n_blocks.y,
n_blocks.x*n_blocks.y, n_shared);
kernel_reduce_sum_%(pattern)s_%(name)s<<<n_blocks, n_threads, n_shared>>>(
""" %locals()
for i in xrange(ndim):
...@@ -574,7 +591,7 @@ class GpuSum(Op):
""" %locals()
return sio.getvalue()
-def _k_decl(self, node, nodename):
+def _k_decl(self, node, nodename, pattern = None, ndim = None, reduce_mask = None):
"""Return a string to declare a kernel function
.. code-block:: c
...@@ -591,27 +608,32 @@ class GpuSum(Op):
const int sZ0)
""" %locals()
-pattern = ''.join(str(i) for i in self.reduce_mask)
+if reduce_mask is None:
reduce_mask = self.reduce_mask
if ndim is None:
ndim = len(reduce_mask)
if pattern is None:
pattern = ''.join(str(i) for i in reduce_mask)
sio = StringIO.StringIO()
print >> sio, """
static __global__ void kernel_reduce_sum_%(pattern)s_%(nodename)s(
""" %locals()
-for i in xrange(len(self.reduce_mask)):
+for i in xrange(ndim):
print >> sio, """
const int d%(i)s,
""" %locals()
print >> sio, """
const float *A,
""" %locals()
-for i in xrange(len(self.reduce_mask)):
+for i in xrange(ndim):
print >> sio, """
const int sA%(i)s,
""" %locals()
print >> sio, """
float * Z
""" %locals()
-for i in xrange(len(self.reduce_mask) - sum(self.reduce_mask)):
+for i in xrange(ndim - sum(reduce_mask)):
print >> sio, """
, const int sZ%(i)s
""" %locals()
...@@ -681,6 +703,25 @@ class GpuSum(Op):
}
""" %locals()
# Threads must be organized so that threads with the same threadNum % nb_reduce
# contribute to the same sum; nb_reduce <= warpSize
def _k_reduce_buf_multiple(self, z_pos, nb_reduce):
return """
buf[threadNum] = mysum;
__syncthreads();
// rest of function is handled by one warp
if (threadNum < %(nb_reduce)s)
{
//gather all the partial sums into the first `nb_reduce` elements
for (int i = threadNum + %(nb_reduce)s; i < threadCount; i += %(nb_reduce)s)
{
mysum += buf[i];
}
%(z_pos)s = mysum;
}
""" %locals()
def c_code_reduce_ccontig(self, sio, node, name, x, z, fail):
print >> sio, """
{
...@@ -718,6 +759,7 @@ class GpuSum(Op):
""" %locals()
def c_code_reduce_1(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
{
int verbose = 0;
...@@ -725,31 +767,12 @@ class GpuSum(Op):
std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_THREADS_PER_BLOCK));
dim3 n_blocks(1);
-if (verbose) printf("running kernel_reduce_sum_1_%(name)s\\n");
-int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
-kernel_reduce_sum_1_%(name)s<<<n_blocks, n_threads, n_shared>>>(
-CudaNdarray_HOST_DIMS(%(x)s)[0],
-CudaNdarray_DEV_DATA(%(x)s),
-CudaNdarray_HOST_STRIDES(%(x)s)[0],
-CudaNdarray_DEV_DATA(%(z)s));
-CNDA_THREAD_SYNC;
-cudaError_t sts = cudaGetLastError();
-if (cudaSuccess != sts)
-{
-PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
-"kernel_reduce_sum_1_%(name)s",
-cudaGetErrorString(sts),
-n_blocks.x,
-n_blocks.y,
-n_threads.x,
-n_threads.y,
-n_threads.z);
-%(fail)s;
-}
+%(makecall)s
}
""" %locals()
def c_code_reduce_11(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
{
int verbose = 0;
...@@ -762,30 +785,7 @@ class GpuSum(Op):
n_threads.y = CudaNdarray_HOST_DIMS(%(x)s)[0];
dim3 n_blocks(1);
-if (verbose) fprintf(stdout, "running kernel_reduce_sum_11_%(name)s\\n");
-if (verbose) fprint_CudaNdarray(stdout, %(x)s);
-int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
-kernel_reduce_sum_11_%(name)s<<<n_blocks, n_threads, n_shared>>>(
-CudaNdarray_HOST_DIMS(%(x)s)[0],
-CudaNdarray_HOST_DIMS(%(x)s)[1],
-CudaNdarray_DEV_DATA(%(x)s),
-CudaNdarray_HOST_STRIDES(%(x)s)[0],
-CudaNdarray_HOST_STRIDES(%(x)s)[1],
-CudaNdarray_DEV_DATA(%(z)s));
-CNDA_THREAD_SYNC;
-cudaError_t sts = cudaGetLastError();
-if (cudaSuccess != sts)
-{
-PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
-"kernel_reduce_sum_11_%(name)s",
-cudaGetErrorString(sts),
-n_blocks.x,
-n_blocks.y,
-n_threads.x,
-n_threads.y,
-n_threads.z);
-%(fail)s;
-}
+%(makecall)s
}
""" %locals()
...@@ -824,16 +824,6 @@ class GpuSum(Op):
threads_z = ''
if len(self.reduce_mask)==3:
threads_z = ''
if config.warn.gpusum_01_011_0111_bug:
pattern = '0'+N_pattern
warn = '''
static bool warn_gpusum_01_011_0111_bug = true;
if(warn_gpusum_01_011_0111_bug && CudaNdarray_HOST_DIMS(%(x)s)[%(N)s]>4096){
printf("WARNING: old version of Theano had a silent bug with GpuSum pattern %(pattern)s when the first dimensions was bigger then 4096. Was fixed 31 may 2010. To disable this warning set the Theano flags warn.gpusum_01_011_0111_bug to False. Won't repeat the warning before we exit.\\n");
warn_gpusum_01_011_0111_bug = false;
}
'''%locals()
else: warn = ""
print >> sio, """
{
int verbose = 0;
...@@ -843,7 +833,6 @@ class GpuSum(Op):
%(threads_y)s
%(threads_z)s
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS));
%(warn)s
%(makecall)s
}
""" %locals()
...@@ -862,16 +851,19 @@ class GpuSum(Op):
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
NUM_VECTOR_OP_THREADS_PER_BLOCK));
-dim3 n_blocks(CudaNdarray_HOST_DIMS(%(x)s)[1]);
+dim3 n_blocks(1,CudaNdarray_HOST_DIMS(%(x)s)[1]);
if (verbose) printf("running kernel_reduce_sum_10_%(name)s\\n");
int n_shared = sizeof(float) * n_threads.x;
-kernel_reduce_sum_10_%(name)s<<<n_blocks, n_threads, n_shared>>>(
+kernel_reduce_sum_010_%(name)s<<<n_blocks, n_threads, n_shared>>>(
1,
CudaNdarray_HOST_DIMS(%(x)s)[0],
CudaNdarray_HOST_DIMS(%(x)s)[1],
CudaNdarray_DEV_DATA(%(x)s),
1,
CudaNdarray_HOST_STRIDES(%(x)s)[0],
CudaNdarray_HOST_STRIDES(%(x)s)[1],
CudaNdarray_DEV_DATA(%(z)s),
1,
CudaNdarray_HOST_STRIDES(%(z)s)[0]
);
CNDA_THREAD_SYNC;
...@@ -879,7 +871,7 @@ class GpuSum(Op):
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
n_blocks.y,
...@@ -890,7 +882,143 @@ class GpuSum(Op):
}
}
""" %locals()
def c_code_reduce_010(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
makecall_inner = self._makecall(node, name, x, z, fail, pattern="010_inner")
pattern = ''.join(str(i) for i in self.reduce_mask)
print >> sio, """
{
//int n_summations = CudaNdarray_HOST_DIMS(%(x)s)[0] * CudaNdarray_HOST_DIMS(%(x)s)[2];
//if ((n_summations >= 15 * 32) && (CudaNdarray_HOST_DIMS(%(x)s)[2]>=16))
if (1) // if the alternative is less buggy, consider not using this branch
{
// If there are a lot of summations to do, then we can use simple parallelization -
// use each thread to do one sum.
// we might as well launch blocks of 32 threads because that's the warp size.
// we could schedule more threads if we were maxing out the gridsize below, but
// the gridsize is way more than the physical hardware and I think 32 threads
// on a huge grid is enough to fully use the hardware.
dim3 n_threads(32,1,1);
// We kind of reshape the input implicitly to something 4D:
// the shape A,B,C -> A, B, D, E
// where C <= D*E < C+32
// where E==32
int A = CudaNdarray_HOST_DIMS(%(x)s)[0];
int B = CudaNdarray_HOST_DIMS(%(x)s)[1];
int C = CudaNdarray_HOST_DIMS(%(x)s)[2];
int D = C/32;
if (32*D < C) D+= 1;
assert ((C <= 32*D) && (32*D < C+32));
// The gridsize would ideally be (A, D). But we do the following logic to make
// sure we don't ask for a grid that is too big.
dim3 n_blocks(A,D);
if (n_blocks.x > NUM_VECTOR_OP_BLOCKS) n_blocks.x = NUM_VECTOR_OP_BLOCKS;
if (n_blocks.x*n_blocks.y > NUM_VECTOR_OP_BLOCKS) n_blocks.y = NUM_VECTOR_OP_BLOCKS/n_blocks.x;
int n_shared = 0;
kernel_reduce_sum_010_AD_%(name)s<<<n_blocks, n_threads, n_shared>>>(
A,B,C,D,
CudaNdarray_DEV_DATA(%(x)s),
CudaNdarray_HOST_STRIDES(%(x)s)[0],
CudaNdarray_HOST_STRIDES(%(x)s)[1],
CudaNdarray_HOST_STRIDES(%(x)s)[2],
CudaNdarray_DEV_DATA(%(z)s),
CudaNdarray_HOST_STRIDES(%(z)s)[0],
CudaNdarray_HOST_STRIDES(%(z)s)[1]
);
CNDA_THREAD_SYNC;
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_010_AD_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
n_blocks.y,
n_threads.x,
n_threads.y,
n_threads.z);
%(fail)s;
}
}
else
{
int verbose = 2;
dim3 n_threads(std::min(32,CudaNdarray_HOST_DIMS(%(x)s)[2]));
while( (n_threads.x*(n_threads.y+1)<=NUM_VECTOR_OP_THREADS_PER_BLOCK)
&& (n_threads.y<CudaNdarray_HOST_DIMS(%(x)s)[1])){
n_threads.y++;
}
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
(int)NUM_VECTOR_OP_BLOCKS));
n_blocks.y = std::min(
ceil_intdiv(CudaNdarray_HOST_DIMS(%(x)s)[2],(int)n_threads.x),
(int)(NUM_VECTOR_OP_BLOCKS / n_blocks.x)
);
if(std::min(std::min(CudaNdarray_HOST_STRIDES(%(x)s)[0],
CudaNdarray_HOST_STRIDES(%(x)s)[1]),
CudaNdarray_HOST_STRIDES(%(x)s)[2])
==CudaNdarray_HOST_STRIDES(%(x)s)[2]
&& n_blocks.y==ceil_intdiv(CudaNdarray_HOST_DIMS(%(x)s)[2],(int)n_threads.x)){
if(verbose>1)
printf("n_block.x.1=%%d, n_block.x.2=%%d, n_block.y.1=%%d, n_block.y.2=%%d,\\n",
CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS,
ceil_intdiv(CudaNdarray_HOST_DIMS(%(x)s)[2],(int)n_threads.x),
(int)(NUM_VECTOR_OP_BLOCKS / n_blocks.x));
assert(n_threads.x<=32);
%(makecall_inner)s
}else{
n_threads.x = std::min(CudaNdarray_HOST_DIMS(%(x)s)[1],
(int)NUM_VECTOR_OP_THREADS_PER_BLOCK);
n_blocks.x = std::min(CudaNdarray_HOST_DIMS(%(x)s)[0], (int)NUM_VECTOR_OP_BLOCKS);
n_blocks.y = std::min(
CudaNdarray_HOST_DIMS(%(x)s)[2],
(int)(NUM_VECTOR_OP_BLOCKS / n_blocks.x)
);
%(makecall)s
}
CNDA_THREAD_SYNC;
cudaError_t sts = cudaGetLastError();
if (cudaSuccess != sts)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
"kernel_reduce_sum_%(pattern)s_%(name)s",
cudaGetErrorString(sts),
n_blocks.x,
n_blocks.y,
n_threads.x,
n_threads.y,
n_threads.z);
%(fail)s;
}
}
}
""" %locals()
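The tiling arithmetic in the comments of the branch above (the shape A,B,C is implicitly treated as A,B,D,E with E==32 and C <= D*E < C+32) can be checked with a small Python sketch; `tile_shape` is a hypothetical helper name, not part of Theano:

```python
def tile_shape(A, B, C, warp=32):
    """Mimic the grid-sizing arithmetic of kernel_reduce_sum_010_AD:
    the last dimension C is covered by D tiles of `warp` threads each."""
    D = C // warp
    if warp * D < C:
        D += 1  # ceiling division: one extra partial tile for the remainder
    # same invariant the C code asserts: C <= 32*D < C+32
    assert C <= warp * D < C + warp
    return (A, D)  # ideal grid dimensions before clipping to NUM_VECTOR_OP_BLOCKS
```

The returned grid is then clipped against the block limit exactly as the C code does with `NUM_VECTOR_OP_BLOCKS`.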
def c_code_reduce_0101(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
print >> sio, """
{
int verbose = 0;
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(%(x)s)[3],
NUM_VECTOR_OP_THREADS_PER_BLOCK));
while (n_threads.x * n_threads.y <= NUM_VECTOR_OP_THREADS_PER_BLOCK)
{
if (n_threads.y > CudaNdarray_HOST_DIMS(%(x)s)[1]) break;
n_threads.y += 1;
}
n_threads.y -= 1;
dim3 n_blocks(CudaNdarray_HOST_DIMS(%(x)s)[0], CudaNdarray_HOST_DIMS(%(x)s)[2]);
%(makecall)s
}
""" %locals()
def c_code_reduce_100(self, sio, node, name, x, z, fail):
makecall = self._makecall(node, name, x, z, fail)
...@@ -1067,13 +1195,13 @@ class GpuSum(Op):
n_threads.z = CudaNdarray_HOST_DIMS(%(x)s)[0];
dim3 n_blocks(CudaNdarray_HOST_DIMS(%(x)s)[1]);
%(makecall)s
}
""" %locals()
def c_code_cache_version(self):
#return ()
return (19,)
def c_support_code_apply(self, node, nodename):
...@@ -1207,24 +1335,60 @@ class GpuSum(Op):
}
}
""" %locals()
if self.reduce_mask == (0,1,0) or self.reduce_mask == (1,0):
# this kernel uses one block for each column,
# threads per block for each element per column.
#TODO: This kernel is pretty inefficient in terms of reading, because if A is
# c_contiguous (typical case) then each warp is accessing non-contiguous
# memory (a segment of a column).
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2*sZ1]')
print >> sio, """
static __global__ void kernel_reduce_sum_010_%(nodename)s(
const int d0,
const int d1,
const int d2,
const float *A, const int sA0, const int sA1, const int sA2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
const int threadNum = threadIdx.x;
extern __shared__ float buf[];
if (warpSize != 32)
{
return; //TODO: set error code
}
for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x)
{
for (int i2 = blockIdx.y; i2 < d2; i2 += gridDim.y)
{
float mysum = 0.0f;
for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)
{
mysum += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
}
%(reducebuf)s
}
}
}
""" %locals()
if self.reduce_mask == (0,1,0):
print >> sio, """
static __global__ void kernel_reduce_sum_010_AD_%(nodename)s(
const int A,
const int B,
const int C,
const int D,
//const int E, // THIS is 32
const float *X, const int sX0, const int sX1, const int sX2,
float * Z, const int sZ0, const int sZ1)
{
const int threadCount = blockDim.x;
const int threadNum = threadIdx.x;
float mysum = 0.0f;
if (warpSize != 32)
...@@ -1232,13 +1396,62 @@ class GpuSum(Op):
return; //TODO: set error code
}
for (int a = blockIdx.x; a < A; a += gridDim.x)
{
for (int i2_D = blockIdx.y; i2_D < D; i2_D += gridDim.y)
{
int c = i2_D * 32 + threadIdx.x;
if (c < C)
{
mysum = 0;
for (int b = 0; b < B; ++b)
{
mysum += X[a * sX0 + b * sX1 + c * sX2];
}
Z[a * sZ0 + c * sZ1] = mysum;
}
}
}
}
""" %locals()
if self.reduce_mask == (0,1,0):
#
# This kernel is optimized for the case where the innermost dimension has the smallest stride.
# It uses one block for multiple columns (up to 32, TODO),
# and threads per block for each element per column.
#thread.x = dim 2 contiguous
#thread.y = dim 1
#block.x = dim 0
#block.y = dim 2 rest
init = self._k_init(node,nodename)
decl = self._k_decl(node, nodename, pattern="010_inner")
reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]','blockDim.x')
print >> sio, """
%(decl)s
{
if(warpSize<blockDim.x){
//TODO: set error code
Z[0] = -666;
return;
}
%(init)s
for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x)
{
for (int i2 = blockIdx.y*blockDim.x+threadIdx.x; i2 < d2; i2 += gridDim.y*blockDim.x)
{
for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
{
mysum += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
}
%(reducebuf)s
}
}
}
""" %locals()
if self.reduce_mask == (1,1,0):
# this kernel uses one block for each column,
...@@ -1387,6 +1600,34 @@ class GpuSum(Op):
}
}
""" %locals()
if self.reduce_mask == (0,1,0,1):
# this kernel uses one block for each row,
# threads per block for each element per row.
reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]')
decl = self._k_decl(node, nodename)
init = self._k_init(node, nodename)
print >> sio, """
%(decl)s
{
%(init)s
for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x)
{
for (int i2 = blockIdx.y; i2 < d2; i2 += gridDim.y)
{
float mysum = 0.0f;
for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
{
for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)
{
mysum += A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3];
}
}
%(reducebuf)s
}
}
}
""" %locals()
if self.reduce_mask == (1,1,1,1):
reducebuf = self._k_reduce_buf('Z[0]')
decl = self._k_decl(node, nodename)
...
...@@ -149,17 +149,15 @@ class GpuGemm(Op):
"""
implement the gemm on the gpu.
..note: This probably don't work correctly for no_inplace gemm.
Need to check al least refcount.
"""
def __init__(self, inplace):
self.__setstate__({'inplace':inplace})
def __str__(self):
if self.inplace:
return 'GpuGemm{inplace}'
else:
return 'GpuGemm{no_inplace}'
def __eq__(self, other):
return (type(self) == type(other)\
...@@ -183,7 +181,7 @@ class GpuGemm(Op):
return Apply(self, [z, a, x, y, b], [z.type()])
def c_code_cache_version(self):
return (3,)
def c_code(self, node, name, inputs, outputs, sub):
#z_out = alpha * dot(x,y) + beta * z_in
...@@ -192,8 +190,9 @@ class GpuGemm(Op):
z_in, a, x, y, b = inputs
z_out, = outputs
fail = sub['fail']
sio = StringIO.StringIO()
return """
print >> sio, """
#define REAL float
float %(name)s_a = (%(a)s->descr->type_num == PyArray_FLOAT)
...@@ -205,51 +204,48 @@ class GpuGemm(Op):
: (REAL)(((double*)%(b)s->data)[0]);
#undef REAL
"""
if self.inplace:
print >> sio, """
}
Py_XDECREF(%(z_out)s);
%(z_out)s = %(z_in)s;
Py_INCREF(%(z_out)s);
"""
else:
print >> sio, """
if (!%(z_out)s
|| (%(z_out)s->nd != 2)
? (REAL)(((float*)%(a)s->data)[0])
: (REAL)(((double*)%(a)s->data)[0]);
float %(name)s_b = (%(b)s->descr->type_num == PyArray_FLOAT) ?
(REAL)(((float*)%(b)s->data)[0])
: (REAL)(((double*)%(b)s->data)[0]);
#undef REAL
if ((NULL == %(z_out)s)
|| (CudaNdarray_HOST_DIMS(%(z_out)s)[0] != CudaNdarray_HOST_DIMS(%(z_in)s)[0])
|| (CudaNdarray_HOST_DIMS(%(z_out)s)[1] != CudaNdarray_HOST_DIMS(%(z_in)s)[1])
)
{
Py_XDECREF(%(z_out)s);
%(z_out)s = (CudaNdarray*)CudaNdarray_Copy(%(z_in)s);
if (!%(z_out)s)
{
%(fail)s;
}
}
else
{
if (CudaNdarray_CopyFromCudaNdarray(%(z_out)s, %(z_in)s))
{
%(fail)s;
}
}
"""
print >> sio, """
if (CudaNdarray_gemm(%(name)s_a, %(x)s, %(y)s, %(name)s_b, %(z_out)s))
{
%(fail)s;
}
"""
gpu_gemm_inplace = GpuGemm(True)
return sio.getvalue() % locals()
gpu_gemm_no_inplace = GpuGemm(inplace=False)
gpu_gemm_inplace = GpuGemm(inplace=True)
##
# Not really a BLAS operation, but whatever.
...
...@@ -558,6 +558,10 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape)
free(rval_dims);
return NULL;
}
if (rval_size==0)
{
return CudaNdarray_NewDims(rval_nd, rval_dims);
}
if(CudaNdarray_is_c_contiguous(self))
{
...@@ -1019,6 +1023,44 @@ CudaNdarray_inplace_add(PyObject* py_self, PyObject * py_other)
Py_INCREF(py_self);
return py_self;
}
case 5:
{
dim3 n_blocks(
std::min(CudaNdarray_HOST_DIMS(self)[1], NUM_VECTOR_OP_BLOCKS),
CudaNdarray_HOST_DIMS(self)[2]
);
while (n_blocks.x * n_blocks.y > NUM_VECTOR_OP_BLOCKS) n_blocks.y /= 2;
dim3 n_threads(
std::min(CudaNdarray_HOST_DIMS(self)[3], NUM_VECTOR_OP_THREADS_PER_BLOCK)
);
for (int i = 0; i < CudaNdarray_HOST_DIMS(self)[0]; ++i)
{
k_iAdd_4<<<n_blocks, n_threads>>>(
CudaNdarray_HOST_DIMS(self)[1],
CudaNdarray_HOST_DIMS(self)[2],
CudaNdarray_HOST_DIMS(self)[3],
CudaNdarray_HOST_DIMS(self)[4],
CudaNdarray_DEV_DATA(self) + i * CudaNdarray_HOST_STRIDES(self)[0],
CudaNdarray_HOST_STRIDES(self)[1],
CudaNdarray_HOST_STRIDES(self)[2],
CudaNdarray_HOST_STRIDES(self)[3],
CudaNdarray_HOST_STRIDES(self)[4],
CudaNdarray_DEV_DATA(other) + i * CudaNdarray_HOST_STRIDES(other)[0],
CudaNdarray_HOST_STRIDES(other)[1],
CudaNdarray_HOST_STRIDES(other)[2],
CudaNdarray_HOST_STRIDES(other)[3],
CudaNdarray_HOST_STRIDES(other)[4]);
CNDA_THREAD_SYNC;
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iAdd", cudaGetErrorString(err));
return NULL;
}
}
Py_INCREF(py_self);
return py_self;
}
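The `case 5` branch above handles 5-D arrays by looping over the outermost dimension on the host and launching the existing 4-D kernel once per slice. The same decomposition expressed in NumPy terms (a sketch; `inplace_add_5d` is a hypothetical name, not part of the codebase):

```python
import numpy as np

def inplace_add_5d(self_arr, other):
    """Loop over dim 0 and add 4-D slices in place, mirroring how
    case 5 calls k_iAdd_4 once per outermost index."""
    assert self_arr.ndim == other.ndim == 5
    for i in range(self_arr.shape[0]):
        self_arr[i] += other[i]  # each slice is a 4-D inplace add
    return self_arr
```

On the GPU side the per-slice offset is computed with `CudaNdarray_DEV_DATA(...) + i * CudaNdarray_HOST_STRIDES(...)[0]`, which is the pointer-arithmetic equivalent of the slicing above.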
}
PyErr_Format(PyExc_NotImplementedError, "inplace_add w nd=%i\n", self->nd);
...@@ -1370,7 +1412,7 @@ CudaNdarray_setitem(PyObject *o, PyObject *key, PyObject *v)
{
PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_setitem: syncing structure to device failed");
Py_DECREF(rval);
return -1;
}
CudaNdarray *viewCopyForComparison =
...
...@@ -2,12 +2,13 @@ import sys
import theano
import numpy
from theano import scalar as scal
from theano import tensor, compile, gof
from theano.gof import local_optimizer, EquilibriumDB, SequenceDB, Optimizer, toolbox, DestroyHandler
from theano.sandbox.cuda.basic_ops import *
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda.blas import (gpu_dot22, gpu_dot22scalar, gpu_gemm_inplace,
gpu_gemm_no_inplace, GpuConv)
from theano.sandbox.cuda.blas import GpuDownsampleFactorMax, GpuDownsampleFactorMaxGrad
from theano.sandbox.cuda.nnet import (
GpuCrossentropySoftmaxArgmax1HotWithBias,
...@@ -187,34 +188,25 @@ def local_gpu_dot22scalar(node):
@local_optimizer([])
def local_gpu_gemm(node):
"""
works for both inplace and non-inplace gemm
gpu_from_host(gemm) -> gpu_gemm(gpu_from_host)
gemm(host_from_gpu) -> host_from_gpu(gpu_gemm)
"""
gemms = {tensor.blas.gemm_inplace: gpu_gemm_inplace,
tensor.blas.gemm_no_inplace: gpu_gemm_no_inplace}
if node.op == gpu_from_host:
host_input = node.inputs[0]
if host_input.owner and host_input.owner.op in gemms:
op = host_input.owner.op
return [gpu_gemm_inplace(gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b)]
elif host_input.owner and host_input.owner.op == tensor.blas.gemm_no_inplace:
z, a, x, y, b = host_input.owner.inputs
return [gemms[op](gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b)]
if node.op in gemms:
z, a, x, y, b = node.inputs
x_on_gpu = (x.owner and x.owner.op == host_from_gpu)
y_on_gpu = (y.owner and y.owner.op == host_from_gpu)
z_on_gpu = (z.owner and z.owner.op == host_from_gpu)
if x_on_gpu or y_on_gpu or z_on_gpu:
return [host_from_gpu(gpu_gemm_inplace(gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b))]
elif node.op == tensor.blas.gemm_no_inplace:
z, a, x, y, b = node.inputs
x_on_gpu = (x.owner and x.owner.op == host_from_gpu)
y_on_gpu = (y.owner and y.owner.op == host_from_gpu)
z_on_gpu = (z.owner and z.owner.op == host_from_gpu)
if x_on_gpu or y_on_gpu or z_on_gpu:
return [host_from_gpu(gemms[node.op](gpu_from_host(z), a, gpu_from_host(x), gpu_from_host(y), b))]
return False
@register_opt()
...@@ -234,8 +226,44 @@ def local_gpu_sum(node):
gsum=GpuSum(reduce_mask)
pattern=(''.join(str(i) for i in reduce_mask))
if hasattr(gsum, 'c_code_reduce_%s'%pattern):
rval = host_from_gpu(gsum(gpu_from_host(x)))
if rval.type == node.outputs[0].type:
return [rval]
else:
print >> sys.stderr, "WARNING: local_gpu_sum got type wrong"
return None
else:
# Try to make a simpler pattern based on reshaping
# The principle is that if two adjacent dimensions have the same value in
# the reduce_mask, then we can reshape to make them a single dimension, do
# the sum, and then reshape to get them back.
shape_of = node.env.shape_feature.shape_of
x_shape = shape_of[x]
new_in_shp = [x_shape[0]]
new_mask = [reduce_mask[0]]
for i in range(1, x.type.ndim):
if reduce_mask[i] == reduce_mask[i-1]:
new_in_shp[-1] *= x_shape[i]
else:
new_mask.append(reduce_mask[i])
new_in_shp.append(x_shape[i])
pattern=(''.join(str(i) for i in new_mask))
new_gsum = GpuSum(new_mask)
if hasattr(new_gsum, 'c_code_reduce_%s'%pattern):
reshaped_x = x.reshape(tensor.stack(*new_in_shp))
sum_reshaped_x = host_from_gpu(new_gsum(gpu_from_host(reshaped_x)))
unreshaped_sum = sum_reshaped_x.reshape(tensor.stack(*shape_of[node.outputs[0]]))
if unreshaped_sum.type == node.outputs[0].type:
return [unreshaped_sum]
else:
print >> sys.stderr, "WARNING: local_gpu_sum got type wrong"
return None
raise Exception("GpuSum don't have implemented the pattern",pattern)
return False
...
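The reshape-based fallback in `local_gpu_sum` above merges adjacent dimensions that share a `reduce_mask` value before summing, so an unimplemented pattern can be mapped onto an implemented one. A standalone sketch of that merging step (`merge_mask` is a hypothetical name, not part of Theano):

```python
def merge_mask(reduce_mask, shape):
    """Collapse adjacent dimensions that share a reduce_mask value,
    multiplying their extents, as local_gpu_sum does before reshaping."""
    new_shape = [shape[0]]
    new_mask = [reduce_mask[0]]
    for m, s in zip(reduce_mask[1:], shape[1:]):
        if m == new_mask[-1]:
            new_shape[-1] *= s  # same mask value: fold into previous dim
        else:
            new_mask.append(m)
            new_shape.append(s)
    return new_mask, new_shape
```

For example, mask `[0,1,1,0]` on shape `(5,4,3,2)` becomes mask `[0,1,0]` on shape `(5,12,2)`, which matches the `c_code_reduce_010` implementation added in this commit.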
...@@ -34,11 +34,14 @@ def test_sum():
test sum pattern 1, 11, 10, 01, 100, 110, 011, 001, 111, 0011, 0111, 1011, 1111
TODO: test with broadcast
"""
for shape, pattern in [((100,3,1300),[1]),
((0,),[0]),((5,),[0]),
((0,0),[0,1]),((1,0),[0,1]),((5,4),[0,1]),((33,31),[0,1]),((5,4),[1]),((5,4),[0]),#need something bigger then 32 for some opt test.
((5,4,3),[0]),((5,4,3),[1]),((5,4,3),[0,1]),((5,4,3),[2]),((5,4,3),[1,2]),((5,4,3),[0,1,2]),
((0,0,0,0),[0,1,2,3]),
((5,4,3,20),[2,3]), ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3]),((5,4,3,2),[1,2,3]),
((5,4,3,10,11),[1,2]),
((5,4,3,20),[2,3]), ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3]),((5,4,3,2),[1,2,3]),
#test shape bigger then 4096 on each dimension to make sure that we work correctly when we don't have enought thread/block in each dimensions
((4100,3),[0]),((3,4101),[0]),#10
...@@ -47,7 +50,7 @@ def test_sum():
((4100,3),[0,1]),((3,4101),[0,1]),#11
((4100,4,3),[0]),((5,4100,3),[0]),((5,4,4100),[0]),#100
((4100,4,3),[1]),((5,4100,3),[1]),((5,4,4100),[1]),#010
((4100,4,3),[2]),((5,4100,3),[2]),((5,4,4100),[2]),#001
((4100,4,3),[0,1]),((5,4100,3),[0,1]),((5,4,4100),[0,1]),#110
((4100,4,3),[1,2]),((5,4100,3),[1,2]),((5,4,4100),[1,2]),#011
...@@ -70,9 +73,9 @@ def test_sum():
assert tcn.GpuSum in [x.op.__class__ for x in f.maker.env.toposort()]
assert T.Sum in [x.op.__class__ for x in f2.maker.env.toposort()]
if val.size==0:
assert f2(val)==f(val), ('shape', shape, 'pattern', pattern)
else:
assert numpy.allclose(f2(val),f(val)), ('shape', shape, 'pattern', pattern)
#test with dimshuffle
...
...@@ -191,25 +191,31 @@ class Scalar(Type):
typedef theano_complex%(nbits)s complex_type;
typedef npy_float%(half_nbits)s scalar_type;
complex_type operator +(const complex_type &y) const {
complex_type ret;
ret.real = this->real + y.real;
ret.imag = this->imag + y.imag;
return ret;
}
complex_type operator -() const {
complex_type ret;
ret.real = -this->real;
ret.imag = -this->imag;
return ret;
}
complex_type operator -(const complex_type &y) const {
complex_type ret;
ret.real = this->real - y.real;
ret.imag = this->imag - y.imag;
return ret;
}
complex_type operator *(const complex_type &y) const {
complex_type ret;
ret.real = this->real * y.real - this->imag * y.imag;
ret.imag = this->real * y.imag + this->imag * y.real;
return ret;
}
complex_type operator /(const complex_type &y) const {
complex_type ret;
scalar_type y_norm_square = y.real * y.real + y.imag * y.imag;
ret.real = (this->real * y.real + this->imag * y.imag) / y_norm_square;
...@@ -296,6 +302,7 @@ class Scalar(Type):
return ""
def c_code_cache_version(self):
return (8,) # put const around operators and added unary '-' operator
# no need to put lib.amdlibm here as c_compile_args() are put in the key.
return (7,) # make complex c code optional
return (6,) # added implementations of operators that work with scalar arguments
...@@ -459,7 +466,39 @@ def float_out_nocomplex(*types):
if type in complex_types:
raise TypeError('complex argument not supported')
return float64,
class unary_out_lookup(gof.utils.object2):
"""
get an output_types_preference object by passing a dictionary:
unary_out_lookup({int8:int32, float32:complex128})
The result is an op that maps int8 to int32 and float32 to complex128; other input
types raise a TypeError.
"""
def __init__(self, type_table):
self.tbl = type_table
def __call__(self, *types):
if len(types) == 1:
types = types[0]
try:
rval = self.tbl[types]
except:
raise TypeError(types)
if isinstance(types, (list, tuple)):
return rval
else:
return [rval]
def __eq__(self, other):
return type(self) == type(other) and self.tbl == other.tbl
def __hash__(self):
return hash(type(self)) # ignore hash of table
def real_out(type):
if type == complex64:
return float32,
if type == complex128:
return float64,
return type,
class ScalarOp(Op):
...@@ -810,6 +849,13 @@ class Mul(ScalarOp):
retval = []
for input in inputs:
if input.type in grad_types:
if input.type in complex_types:
# does casting from real to complex work?
dz_dinput = cast(mul(*(utils.difference(inputs, [input]))), input.type.dtype)
x = real(dz_dinput)
y = imag(dz_dinput)
retval += [complex(x*real(gz)+y*imag(gz), x*imag(gz)-y*real(gz))]
else:
retval += [cast(mul(*([gz] + utils.difference(inputs, [input]))), input.type.dtype)]
else:
retval += [None]
...@@ -1088,11 +1134,9 @@ class Abs(UnaryScalarOp):
return numpy.abs(x)
def grad(self, (x, ), (gz, )):
if x.type in grad_types:
return gz * x / abs(x), # formula works for complex and real
else:
return None,
#backport
#return gz * sgn(x) if x.type in grad_types else None,
def c_code(self, node, name, (x, ), (z, ), sub):
type = node.inputs[0].type
if type in int_types:
...@@ -1478,6 +1522,91 @@ class Tanh(UnaryScalarOp):
return "%(z)s = tanh(%(x)s);" % locals()
tanh = Tanh(upgrade_to_float, name = 'tanh')
class Real(UnaryScalarOp):
"""Extract the real coordinate of a complex number. """
def impl(self, x):
return numpy.real(x)
def grad(self, (x, ), (gz, )):
return [complex(gz, 0)]
real = Real(real_out, name='real')
class Imag(UnaryScalarOp):
def impl(self, x):
return numpy.imag(x)
def grad(self, (x, ), (gz, )):
return [complex(0, gz)]
imag = Imag(real_out, name='imag')
class Angle(UnaryScalarOp):
def impl(self, x):
return numpy.angle(x)
def grad(self, (c, ), (gtheta, )):
# y = x.imag
# r = sqrt(y**2 + x.real**2)
# g = y/r
# if x == 0 and y == 0:
# theta = 0
# elif x >= 0:
# theta = numpy.arcsin(g)
# else:
# theta = -numpy.arcsin(g)+numpy.pi
x = real(c)
y = imag(c)
r = abs(c)
gr = -gtheta * y / (r**2 * sqrt(1 - (y/r)**2))
gx = gr * x/r
gy = gr * y/r
return [complex(gx, gy)]
angle = Angle(specific_out(float64), name='angle')
class Complex(BinaryScalarOp):
@staticmethod
def output_types_preference(x,y):
if x in complex_types:
raise TypeError(x)
if y in complex_types:
raise TypeError(y)
up = Scalar.upcast(x, y)
if up in ('float64', 'int64', 'uint64', 'int32', 'uint32'):
return [complex128]
else:
return [complex64]
def impl(self, x, y):
return numpy.complex(x, y)
def grad(self, (x,y), (z,)):
return [real(z), imag(z)]
complex = Complex(name='complex')
class Conj(UnaryScalarOp):
def impl(self, x):
return numpy.conj(x)
def grad(self, (x, ), (gz, )):
return [conj(gz)]
conj = Conj(same_out, name='conj')
class ComplexFromPolar(BinaryScalarOp):
@staticmethod
def output_types_preference(x,y):
return Complex.output_types_preference(x,y)
def impl(self, r, theta):
if r < 0:
raise ValueError('polar radius must be non-negative', r)
x = r*numpy.cos(theta)
y = r*numpy.sin(theta)
if x.dtype == 'float32':
return numpy.complex64(numpy.complex(x,y))
else:
return numpy.complex128(numpy.complex(x,y))
def grad(self, (r,theta), (gz,)):
gr = cos(theta) * real(gz) + sin(theta) * imag(gz)
gtheta = -real(gz) * r * sin(theta) + imag(gz) * r * cos(theta)
return [gr, gtheta]
complex_from_polar = ComplexFromPolar(name='complex_from_polar')
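ComplexFromPolar.impl builds r*cos(theta) + i*r*sin(theta), which is the same number as r*exp(i*theta); a quick standalone numpy illustration (not from the diff):

```python
import numpy as np

r, theta = 2.0, 0.75
# same construction as ComplexFromPolar.impl: x = r*cos(theta), y = r*sin(theta)
z = r * np.cos(theta) + 1j * r * np.sin(theta)

assert np.allclose(z, r * np.exp(1j * theta))   # polar identity
assert np.allclose(abs(z), r)                   # radius recovered by abs
assert np.allclose(np.angle(z), theta)          # angle recovered by angle
```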
class Composite(ScalarOp):
...@@ -1642,3 +1771,4 @@ class Composite(ScalarOp):
#otherwise self.perform won't work.
self.__init__(self.inputs, self.outputs)
...@@ -10,7 +10,7 @@ if sys.version_info >= (2,5):
import functools
import numpy, theano
#from copy import copy as python_copy
from theano import gof
from theano.gof import Variable, Op, utils, Type, Constant, Value
...@@ -342,10 +342,10 @@ def get_constant_value(v):
# it is not a constant, but in some cases it *could* be replaced with one.
# Note that this would have an effect on the broadcasting of inputs and so on
try:
numpy.complex(v.data) #works for all numeric scalars
return v.data
except:
raise TypeError('v.data is non-numeric', v)
if v.owner:
if isinstance(v.owner.op, Alloc):
return get_constant_value(v.owner.inputs[0])
...@@ -484,12 +484,12 @@ class TensorType(Type):
return False
if 'int' in str(a.dtype):
return numpy.all(a==b)
#elif a.shape == (): #for comparing scalars, use broadcasting.
## Note: according to James B, there was a reason for the
## following two lines, that may seem weird at first glance.
## If someone can figure out what it is, please say it here!
#ones = numpy.ones(2)
#return _allclose(ones * a, ones*b) ### dtype handling is wrong here
else:
cmp = _allclose(a, b)
if cmp:
...@@ -1638,88 +1638,25 @@ def sinh(a):
def tanh(a):
"""hyperbolic tangent of a"""
class Real(Op):
"""Extract the real elements of a complex ndarray"""
view_map = {0:[0]}
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def make_node(self, x):
_x = as_tensor(x)
y_dtype = _x.type.dtype
if y_dtype == 'complex64':
y_dtype = 'float32'
if y_dtype == 'complex128':
y_dtype = 'float64'
_y = Tensor(y_dtype, _x.type.broadcastable)()
return Apply(self, [_x], [_y])
def perform(self, node, (x,), (y,)):
if str(x.dtype).startswith('complex'):
y[0] = x.real
else:
y[0] = x
def grad(self, inputs, (g_y,)):
#TODO: waiting on a Complex(real=, imag=) op that can merge
#things back into a complex tensor
raise NotImplementedError()
_real = Real()
@constructor
def real(x):
"""Return the real part of real or complex-valued `x`
For real-valued `x`, `x` itself is returned.
"""
_x = as_tensor_variable(x)
if _x.type.dtype.startswith('complex'):
return _real(x)
else:
return _x
class Imag(Op):
"""Extract the imaginary elements of a complex ndarray"""
view_map = {0:[0]}
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def make_node(self, x):
_x = as_tensor_variable(x)
if not _x.type.dtype.startswith('complex'):
raise TypeError('Imag(x) requires complex x', x)
if _x.type.dtype == 'complex64': y_dtype = 'float32'
elif _x.type.dtype == 'complex128': y_dtype = 'float64'
else:
raise NotImplementedError('what is this?', y_dtype)
_y = Tensor(y_dtype, _x.type.broadcastable)()
return Apply(self, [_x], [_y])
def perform(self, node, (x,), (y,)):
if str(x.dtype).startswith('complex'):
y[0] = x.imag
else:
y[0] = x * 0
def grad(self, inputs, (g_y,)):
# TODO: waiting on a complex(real=, imag=) op that can merge
# things back into a complex tensor
raise NotImplementedError()
_imag = Imag()
@constructor
def imag(x):
"""Return the imaginary part of real or complex-valued `x`
For real-valued `x` this returns `zeros_like(x)`.
"""
_x = as_tensor_variable(x)
if _x.type.dtype.startswith('complex'):
return _imag(x)
else:
return zeros_like(x)
@constructor
def angle(x):
"""Return the angular component of complex-valued `x`"""
raise NotImplementedError()
@_scal_elemwise
def real(z):
"""Return real component of complex-valued tensor `z`"""
@_scal_elemwise
def imag(z):
"""Return imaginary component of complex-valued tensor `z`"""
@_scal_elemwise
def angle(z):
"""Return polar-coordinate angle of complex-valued tensor `z`"""
@_scal_elemwise
def complex(real, imag):
"""Return complex-valued tensor with `real` and `imag` components"""
@_scal_elemwise
def complex_from_polar(abs, angle):
"""Return complex-valued tensor from polar coordinate specification"""
##########################
# Misc
...@@ -2150,6 +2087,7 @@ def pow(a, b):
def clip(x, min, max):
"""clip x to be between min and max"""
# see decorator for function body
# for grep: clamp, bound
pprint.assign(add, printing.OperatorPrinter('+', -2, 'either'))
pprint.assign(mul, printing.OperatorPrinter('*', -1, 'either'))
...@@ -2344,7 +2282,7 @@ class Subtensor(Op):
and (idx.step is None or idx.step == 1):
outshp.append(xl)
else:
# Not implemented yet
outshp.append(shape_i(i)(node.outputs[0]))
i += 1
else:
...@@ -2443,31 +2381,48 @@ class SubtensorPrinter:
pprint.assign(lambda pstate, r: r.owner and isinstance(r.owner.op, Subtensor), SubtensorPrinter())
def setsubtensor(x, y, idx_list, inplace=False):
"""
setsubtensor is meant to replicate the following numpy behaviour: x[i,j,k] = y
:param x: symbolic variable for the lvalue of = operation
:param y: symbolic variable for the rvalue of = operation
:param idx_list: tuple of length x.dim, containing indices with which to index x.
:param inplace: boolean to declare whether the operation is in place or not (False unless
called from within an optimization)
:Details: idx_list can be a tuple containing a mixture of numeric constants, symbolic
scalar values and standard numpy slice objects. i.e:
idx_list=(1,2,3), idx_list=(1,b,3) where b is an iscalar variable,
idx_list=(slice(start,stop,step),b,3) equivalent to x[start:stop:step, b, 3]
"""
the_op = IncSubtensor(idx_list, inplace, True)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def incsubtensor(x, y, idx_list, inplace=False):
"""
incsubtensor is meant to replicate the following numpy behaviour: x[i,j,k] += y
:see: theano.tensor.basic.setsubtensor
"""
the_op = IncSubtensor(idx_list, inplace, False)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def setsubtensor(x, y, idx_list, inplace=False):
print >> sys.stderr, "tensor.setsubtensor is deprecated - please use set_subtensor"
the_op = IncSubtensor(idx_list, inplace, set_instead_of_inc=True)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def incsubtensor(x, y, idx_list, inplace=False):
print >> sys.stderr, "tensor.incsubtensor is deprecated - please use inc_subtensor"
the_op = IncSubtensor(idx_list, inplace, set_instead_of_inc=False)
return the_op(x, y, *Subtensor.collapse(idx_list, lambda entry: isinstance(entry, Variable)))
def set_subtensor(x, y):
"""Return x with the given subtensor overwritten by y.
Example: To replicate the numpy expression "r[10:] = 5", type
>>> new_r = set_subtensor(r[10:], 5)
:param x: symbolic variable for the lvalue of = operation
:param y: symbolic variable for the rvalue of = operation
:see: theano.tensor.basic.setsubtensor
"""
return inc_subtensor(x, y, set_instead_of_inc=True)
def inc_subtensor(x, y, set_instead_of_inc=False):
"""Return x with the given subtensor incremented by y.
:param x: the symbolic result of a Subtensor operation.
:param y: the amount by which to increment the subtensor in question
Example: To replicate the numpy expression "r[10:] += 5", type
>>> new_r = inc_subtensor(r[10:], 5)
:see: theano.tensor.basic.setsubtensor
"""
# retrieve idx_list from x.owner
if not isinstance(x.owner.op, Subtensor):
raise TypeError('x must be result of a subtensor operation')
the_op = IncSubtensor(x.owner.op.idx_list, False, set_instead_of_inc)
real_x = x.owner.inputs[0]
real_idxargs = x.owner.inputs[1:]
return the_op(real_x, y, *real_idxargs)
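The numpy behaviour that set_subtensor and inc_subtensor replicate functionally (returning a new tensor rather than mutating x) can be spelled out with plain numpy arrays; this is an illustrative sketch, not Theano code:

```python
import numpy as np

r = np.arange(20, dtype='float64')

# set_subtensor(r[10:], 5): like "r[10:] = 5", but without touching r
set_r = r.copy()
set_r[10:] = 5

# inc_subtensor(r[10:], 5): like "r[10:] += 5", but without touching r
inc_r = r.copy()
inc_r[10:] += 5

assert r[10] == 10.0                  # original is unchanged
assert np.all(set_r[10:] == 5)
assert np.all(inc_r[10:] == r[10:] + 5)
assert np.all(set_r[:10] == r[:10])   # untouched region preserved
```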
class IncSubtensor(Op):
"""Increment a subtensor.
...@@ -3125,6 +3080,8 @@ class Reshape(Op):
raise ValueError('Cannot reshape input of shape %s to shape %s' % (x.shape,shp))
def grad(self, (x, shp), (g_out,)):
return [reshape(g_out, shape(x), ndim=x.ndim), None]
def infer_shape(self, node, ishapes):
return [node.inputs[1]]
def reshape(x, newshape, ndim=None, name=None):
if ndim is None:
...@@ -3155,7 +3112,10 @@ class Flatten(Op):
def perform(self, node, (x,), (out,)):
outdim = self.outdim
if outdim == 1:
try:
out[0] = x.reshape(x.size)
except AttributeError:
out[0] = x.reshape((numpy.prod(x.shape),))
elif outdim == len(x.shape):
out[0] = x
else:
...@@ -3865,8 +3825,34 @@ def grad(cost, wrt, g_cost=None, consider_constant=[], warn_type=False):
class numeric_grad:
"""WRITEME"""
# Note on step sizes and tolerances:
#
# There is a relationship between the step size and the function value and the measurement
# error that is incurred due to rounding. The finite difference we measure is
# delta = f(x0) - f(x0+eps)
#
# For maximum precision, f should be close to zero.
# For every power of 2 that f departs from zero, we lose a bit of precision in delta.
#
# Even in this case of maximum accuracy, there is a tradeoff between stepsize and
# measurement error.
# Taking small steps allows us to measure large derivatives accurately, but longer steps
# are required to measure small derivatives accurately. However longer steps introduce
# bias into our measurement in general for non-linear functions.
#
# It would be interesting to have a version of numeric grad that used an adaptive stepsize.
#
# For now, we use a heuristic that catches very bad gradients, but is not perfectly
# accurate.
type_eps = {'float64': 1e-7,
'float32': 3e-3,
numpy.dtype('float64'):1e-7,
numpy.dtype('float32'):3e-3}
type_tol = {'float64': 1e-4,
numpy.dtype('float64'):1e-4,
'float32': 1e-1,
numpy.dtype('float32'):1e-1}
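The step-size tradeoff described in the comment above is easy to reproduce: with a forward difference, a large eps incurs truncation bias while a tiny eps loses bits to cancellation, so an intermediate eps (like the type_eps defaults) wins. A standalone numpy sketch, not part of the diff:

```python
import numpy as np

# Forward-difference estimate of d/dx exp(x) at x0 = 1 (exact answer: e).
x0 = 1.0
exact = np.exp(x0)

err = {}
for eps in (1e-2, 1e-7, 1e-13):
    approx = (np.exp(x0 + eps) - np.exp(x0)) / eps
    err[eps] = abs(approx - exact)

assert err[1e-7] < err[1e-2]    # large step: truncation bias dominates
assert err[1e-7] < err[1e-13]   # tiny step: floating-point cancellation dominates
```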
def __init__(self, f, pt, eps=None):
"""Return the gradient of f at pt.
...@@ -3942,12 +3928,29 @@ class numeric_grad:
self.gf = self.gf[0]
@staticmethod
def abs_rel_err(a,b,eps=1.0e-10):
"""Return a small number when a and b are close, relative to how big they are"""
return abs(a-b) / (abs(a)+abs(b)+eps)
def abs_rel_err(a,b,tol=None):
"""Return a small number when a and b are close, relative to how big they are.
Formula used: abs(a - b) / (abs(a) + abs(b) + tol)
`tol` defaults to a relatively generous value that is suitable for catching completely
wrong gradients (`self.type_tol`).
"""
if tol is None:
tol = __builtin__.max(numeric_grad.type_tol[a.dtype], numeric_grad.type_tol[b.dtype])
return abs(a-b) / (abs(a)+abs(b)+tol)
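Worked through on concrete numbers, the formula behaves as follows; this is a standalone restatement of abs_rel_err with a fixed default tol, not the class method itself:

```python
import numpy as np

def abs_rel_err(a, b, tol=1e-4):
    # same formula as numeric_grad.abs_rel_err, with a fixed default tol
    return abs(a - b) / (abs(a) + abs(b) + tol)

assert abs_rel_err(1.0, 1.0) == 0.0
assert abs_rel_err(0.0, 0.0) == 0.0    # tol prevents 0/0
assert np.isclose(abs_rel_err(1.0, 1.01, tol=0.0), 0.01 / 2.01)
```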
def abs_rel_errors(self, g_pt):
"""Return the abs rel error of gradient estimate `g_pt`
`g_pt` must be a list of ndarrays of the same length as self.gf, otherwise a ValueError
is raised.
Corresponding ndarrays in `g_pt` and `self.gf` must have the same shape or ValueError
is raised.
"""
if len(g_pt) != len(self.gf):
raise ValueError('argument has wrong number of elements', len(g_pt))
errs = []
...@@ -3955,7 +3958,12 @@ class numeric_grad:
if a.shape != b.shape:
raise ValueError('argument element %i has wrong shape %s' %(i,str((a.shape,
b.shape))))
errs.append(numeric_grad.abs_rel_err(a,b))
return errs
def max_err(self, g_pt):
"""Return the biggest relative error between g_pt and self.gf"""
errs = [numpy.max(e) for e in self.abs_rel_errors(g_pt)]
if numpy.all(numpy.isfinite(errs)):
return numpy.max(errs), numpy.argmax(errs)
else:
...@@ -4058,7 +4066,19 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps=None, tol=None, mode=None, cast
max_err, max_err_pos = num_grad.max_err(analytic_grad)
if max_err > tol:
raise verify_grad.E_grad(tol, num_grad, analytic_grad)
class GradientError(Exception):
"""This error is raised when a gradient is calculated, but incorrect."""
def __init__(self, tol, num_grad, analytic_grad):
self.num_grad = num_grad
self.analytic_grad = analytic_grad
self.tol = tol
def __str__(self):
max_errs = [numpy.max(e) for e in self.num_grad.abs_rel_errors(self.analytic_grad)]
return "GradientError: numeric gradient and analytic gradient differ by more than %f (%s)" %(
self.tol, max_errs)
def abs_rel_errors(self):
return self.num_grad.abs_rel_errors(self.analytic_grad)
verify_grad.E_grad = GradientError
"""This error is raised when a gradient is calculated, but incorrect."""
...@@ -618,7 +618,7 @@ class Elemwise(Op):
errormsg = 'Failed calling ufunc for op', self.scalar_op,\
'for params of shape', [arg.shape for arg in ufunc_args]
e.args = e.args + errormsg
raise
if ufunc.nout == 1: variables = [variables]
for variable, storage in zip(variables, output_storage):
if storage[0].shape:
......
...@@ -438,13 +438,13 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
y_idx = tensor.as_tensor_variable(y_idx)
if x.type.ndim != 2 \
or x.type.dtype not in ['float32', 'float64']:
raise ValueError('x must be 2-d tensor of floats', x.type)
if b.type.ndim != 1 \
or b.type.dtype not in ['float32', 'float64']:
raise ValueError('b must be 1-d tensor of floats', b.type)
if y_idx.type.ndim != 1 \
or y_idx.type.dtype not in ['int8', 'int16', 'int32', 'int64']:
raise ValueError('y_idx must be 1-d tensor of ints', y_idx.type)
# TODO: Is this correct? It used to be y, not y_idx
nll = tensor.TensorType(x.type.dtype,
......
...@@ -180,11 +180,12 @@ def local_dimshuffle_lift(node):
new_order = [x == 'x' and 'x' or inode.op.new_order[x] for x in op.new_order]
inplace = op.inplace and inode.op.inplace
iinput = inode.inputs[0]
if new_order == range(len(new_order)) and (len(new_order) == iinput.type.ndim):
return [iinput]
else:
return DimShuffle(iinput.type.broadcastable, new_order, inplace).make_node(iinput).outputs
register_canonicalize(local_dimshuffle_lift)
register_specialize(local_dimshuffle_lift)
...@@ -519,8 +520,8 @@ def local_fill_to_alloc(node):
# TODO: cut out un-necessary dimshuffles of v
rval = [T.alloc(T.cast(v, node.outputs[0].dtype), *shape_of[node.outputs[0]])]
#if rval[0].type != node.outputs[0].type:
#print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
assert rval[0].type == node.outputs[0].type, ('rval', rval[0].type,
'orig', node.outputs[0].type,
...@@ -1048,9 +1049,41 @@ def local_reshape_chain(node):
if not opt.check_chain(node, T.Reshape, T.Reshape):
return False
# TODO: this can permit a failing program to run by eliminating the lower
# reshape
return [node.op(node.inputs[0].owner.inputs[0], node.inputs[1])]
register_canonicalize(local_reshape_chain)
if 0:
# TODO: Test that this optimization works.
@register_canonicalize
@gof.local_optimizer([])
def local_scalar_reshape(node):
"""Eliminate reshape Ops whose inputs and outputs are scalars """
if isinstance(node.op, T.Reshape):
x, shp = node.inputs
if x.ndim == 0 and T.get_vector_length(shp)==0:
return [x]
if 0:
# TODO: Finish writing and testing this optimization.
# The idea is that if we can prove the output to this sum
# has a zero-size dimension, then it can be replaced by an appropriately typed and
# broadcasted zero.
@register_canonicalize
@gof.local_optimizer([])
def local_sum_over_empty(node):
if isinstance(node.op, T.Sum):
y, = node.outputs
y_shape = node.env.shape_feature.shape_of[y]
def tmp(thing):
try:
return T.get_constant_value(thing)
except (TypeError, ValueError), e:
print e, thing.owner.inputs[0]
return None
print 'LOCAL SUM EMPTY', [tmp(s) for s in y_shape]
##################
# Middleman cuts #
...@@ -1063,11 +1096,18 @@ def local_fill_cut(node):
If c.type == a.type.
"""
# this optimization is essentially for getting broadcasting to replace fill.
# This is always possible when using a Compound Elemwise operation,
# but it is not always possible without one (consider filling a large matrix with a scalar,
# and then adding another scalar. The only numbers that count are the two scalars, but we
# can't ignore the large matrix because it gives the shape of the result).
if not opt.check_chain(node, T.Elemwise):
return False
output = node.outputs[0]
try:
#reference is some input with the same type as the output but that is not produced by a fill
reference = [input
for input in node.inputs
if input.type == output.type and (not input.owner or input.owner.op != T.fill)][0]
...@@ -1086,7 +1126,12 @@ def local_fill_cut(node):
if new_inputs == node.inputs:
return False
return node.op.make_node(*new_inputs).outputs
rval = node.op(*new_inputs)
if isinstance(rval, gof.Variable):
return rval.owner.outputs
else:
return rval[0].owner.outputs
register_canonicalize(local_fill_cut)
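The comment in local_fill_cut can be restated with plain numpy: when a filled matrix only feeds an elemwise op, the matrix contributes nothing but its shape, which broadcasting supplies for free. A hypothetical numpy sketch of the reasoning:

```python
import numpy as np

big = np.ones((100, 100))
a, b = 2.0, 3.0

# "fill(big, a) + b": the values depend only on the two scalars...
filled = np.full(big.shape, a) + b

# ...so the fill can be cut, as long as something still supplies the shape.
cut = np.broadcast_to(a + b, big.shape)

assert filled.shape == cut.shape == (100, 100)
assert np.all(filled == cut)
```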
......
import unittest
import theano
from theano.tensor import *
from theano.tests import unittest_tools as utt
class TestRealImag(unittest.TestCase):
...@@ -15,9 +16,60 @@ class TestRealImag(unittest.TestCase):
x = dvector()
rng = numpy.random.RandomState(23)
xval = rng.randn(10)
try:
numpy.all( 0 == theano.function([x], imag(x))(xval))
assert 0
except TypeError:
pass
try:
numpy.all( xval == theano.function([x], real(x))(xval))
assert 0
except TypeError:
pass
def test_cast(self):
x = zvector()
self.failUnlessRaises(TypeError, cast, x, 'int32')
def test_complex(self):
rng = numpy.random.RandomState(2333)
m = fmatrix()
c = complex(m[0], m[1])
assert c.type == cvector
r,i = [real(c), imag(c)]
assert r.type == fvector
assert i.type == fvector
f = theano.function([m], [r,i] )
mval = numpy.asarray(rng.randn(2,5), dtype='float32')
rval, ival = f(mval)
assert numpy.all(rval == mval[0]), (rval,mval[0])
assert numpy.all(ival == mval[1]), (ival, mval[1])
def test_complex_grads(self):
def f(m):
c = complex(m[0], m[1])
return .5 * real(c) + .9 * imag(c)
rng = numpy.random.RandomState(9333)
mval = numpy.asarray(rng.randn(2,5))
utt.verify_grad(f, [mval])
def test_polar_grads(self):
def f(m):
c = complex_from_polar(abs(m[0]), m[1])
return .5 * real(c) + .9 * imag(c)
rng = numpy.random.RandomState(9333)
mval = numpy.asarray(rng.randn(2,5))
utt.verify_grad(f, [mval])
def test_abs_grad(self):
def f(m):
c = complex(m[0], m[1])
return .5 * abs(c)
rng = numpy.random.RandomState(9333)
mval = numpy.asarray(rng.randn(2,5))
utt.verify_grad(f, [mval])
...@@ -63,3 +63,15 @@ def verify_grad(op, pt, n_tests=2, rng=None, *args, **kwargs):
rng = numpy.random
T.verify_grad(op, pt, n_tests, rng, *args, **kwargs)
#
# This supports the following syntax:
#
# try:
# verify_grad(...)
# except verify_grad.E_grad, e:
# print e.num_grad.gf
# print e.analytic_grad
# ...
#
#
verify_grad.E_grad = T.verify_grad.E_grad