Commit d95e876d authored by nouiz

Merge pull request #899 from goodfeli/rebase_fix_grad

Rebase fix grad
...@@ -98,6 +98,31 @@ following methods:
lifetime of self. Op instances should be immutable in this
sense.
.. function:: connection_pattern():
Optional (but in extremely rare cases needed to have it work with
{tensor,sparse}.grad).
Returns a list of bools the same length as the op's inputs list.
True signifies that the elements of an input have an effect on the
op's output.
False signifies that they do not; in other words, the op acts only
on the input's metadata, such as its shape.
If no connection_pattern is implemented, tensor.grad will assume
it is a list containing only True.
Failing to implement this function for an op that needs it can
result in tensor.grad erroneously reporting that a gradient is
undefined. Returning 0 for an input in the grad method is not
the same as specifying that the elements of that input are not
connected to the output: if the gradient with respect to the
op's output is NaN but the elements of the input are not connected
to it, then the NaN never enters into the expression for the
gradient.
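The contract above can be sketched in plain Python. The class and function names below are illustrative stand-ins, not the real Theano API:

```python
# Hypothetical stand-ins (not real Theano classes) illustrating the
# connection_pattern contract described above.
class ShapeOfLike(object):
    """An op whose output depends only on the input's shape metadata."""
    def connection_pattern(self):
        return [False]          # elements of the input have no effect

class SquareLike(object):
    """An elementwise op: the input's elements do affect the output."""
    def connection_pattern(self):
        return [True]

def effective_pattern(op, n_inputs):
    # When the method is missing, tensor.grad assumes every input
    # is connected (a list containing only True).
    if hasattr(op, 'connection_pattern'):
        return op.connection_pattern()
    return [True] * n_inputs

assert effective_pattern(ShapeOfLike(), 1) == [False]
assert effective_pattern(SquareLike(), 1) == [True]
assert effective_pattern(object(), 2) == [True, True]
```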
.. function:: grad(inputs, output_gradients)
Optional (but needed to have it work with {tensor,sparse}.grad()).
...@@ -106,31 +131,62 @@ following methods:
symbolically in this method. Both ``inputs`` and ``output_gradients``
are lists of symbolic Theano Variables and those must be operated on using
Theano's symbolic language. The grad method must return a list containing
one Variable for each input. Each returned Variable represents
the gradient with respect to that input computed based on the symbolic gradients with
respect to each output.
If the output is not differentiable with respect to an input,
then this method should be defined to return a variable of type
NullType for that input.
If an element of ``output_gradients`` is of type theano.gradient.DisconnectedType,
it means that the cost is not a function of this output. If any of the
op's inputs participate in the computation of only disconnected outputs,
then Op.grad should return DisconnectedType variables for those inputs.
If the grad method is not defined, then Theano assumes it has been
forgotten. Symbolic differentiation will fail on a graph that
includes this Op.
It must be understood that the Op's grad method is not meant to return the
gradient of the Op's output. theano.tensor.grad computes gradients; Op.grad
is a helper function that computes terms that appear in gradients.
If an Op has a single vector-valued output y and a single vector-valued input x,
then the grad method will be passed x and a second vector z. Define J to be
the Jacobian of y with respect to x. The Op's grad method should return
dot(J.T,z). When theano.tensor.grad calls the grad method, it will set z to
be the gradient of the cost C with respect to y. If this op is the only op
that acts on x, then dot(J.T,z) is the gradient of C with respect to x.
If there are other ops that act on x, theano.tensor.grad will have to add up
the terms of x's gradient contributed by those ops' grad methods.
In practice, an op's input and output are rarely implemented as single vectors.
Even if an op's output consists of a list containing a scalar, a sparse matrix,
and a 4D tensor, you can think of these objects as being formed by rearranging
a vector. Likewise for the input. In this view, the values computed by the grad
method still represent a Jacobian-vector product.
In practice, it is probably not a good idea to explicitly construct the Jacobian,
which might be very large and very sparse. However, the returned value should
be equal to the Jacobian-vector product.
So long as you implement this product correctly, you need not understand what
theano.tensor.grad is doing, but for the curious the mathematical justification
is as follows:
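A small numeric toy of the Jacobian-vector product contract, assuming an elementwise op y = x ** 2 (the function name is illustrative, not part of Theano):

```python
# For elementwise y = x ** 2 the Jacobian J is diagonal with entries
# 2 * x_i, so dot(J.T, z) reduces to 2 * x * z and can be computed
# without ever materializing J.
def square_op_grad(x, z):
    return [2.0 * xi * zi for xi, zi in zip(x, z)]

x = [1.0, 2.0, 3.0]
z = [0.5, 0.5, 0.5]   # gradient of the cost C with respect to y
assert square_op_grad(x, z) == [1.0, 2.0, 3.0]
```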
In essence, the grad method must simply implement through symbolic Variables
and operations the chain rule of differential calculus. The chain rule
is the mathematical procedure that allows one to calculate the total derivative
:math:`\frac{d C}{d x}` of the final scalar symbolic Variable C with respect to a
primitive symbolic Variable x found in the list ``inputs``.
The grad method does this using ``output_gradients``, which provides the total
derivative :math:`\frac{d C}{d f}` of C with respect to a symbolic Variable f
that is returned by the Op, as well as the knowledge of the total derivative
:math:`\frac{d f}{d x}` of the
latter with respect to the primitive Variable (this has to be computed).
In mathematics, the total derivative of a scalar variable (C) with respect to a vector of
scalar variables (x), i.e. the gradient, is customarily represented as the
row vector of the partial derivatives, whereas the total derivative of a vector of
scalar variables (f) with respect to another (x), is customarily represented by the matrix of
...
...@@ -150,24 +150,6 @@ def std_fgraph(input_specs, output_specs, accept_inplace = False):
std_fgraph.features = [gof.toolbox.PreserveNames]
class UncomputableFeature(gof.Feature):
"""A feature that ensures the graph never contains any
uncomputable nodes. This check must be made at compile time
rather than runtime in order to make sure that NaN nodes are
not optimized out. It must be done as a Feature so that
the fgraph will continually check that optimizations have
not introduced any uncomputable nodes."""
def on_attach(self, fgraph):
for node in fgraph.nodes:
self.on_import(fgraph, node)
def on_import(self, fgraph, node):
gof.op.raise_if_uncomputable(node)
std_fgraph.features.append(UncomputableFeature)
class AliasedMemoryError(Exception):
"""Memory is aliased that should not be"""
pass
...
...@@ -11,7 +11,7 @@ import toolbox
from python25 import all
from theano import config
import warnings
NullType = None
class InconsistencyError(Exception):
"""
...@@ -211,6 +211,9 @@ class FunctionGraph(utils.object2):
### import ###
def __import_r__(self, variables):
global NullType
if NullType is None:
from null_type import NullType
# Imports the owners of the variables
r_owner_done = set(self.nodes)
for node in [r.owner for r in variables if r.owner is not None]:
...@@ -219,6 +222,8 @@ class FunctionGraph(utils.object2):
self.__import__(node)
for r in variables:
if r.owner is None and not isinstance(r, graph.Constant) and r not in self.inputs:
if isinstance(r.type, NullType):
raise TypeError("Computation graph contains a NaN. " + r.type.why_null)
raise MissingInputError("Undeclared input", r)
if not getattr(r, 'fgraph', None) is self:
self.__setup_r__(r)
...
from theano.gof.type import Type
class NullType(Type):
"""
A type that allows no values. Used to represent expressions
that are undefined, either because they do not exist mathematically
or because the code to generate the expression has not been
implemented yet.
"""
def __init__(self, why_null='(no explanation given)'):
"""
why_null: A string explaining why this variable
can't take on any values
"""
self.why_null = why_null
def filter(self, data, strict=False, allow_downcast=None):
raise ValueError("No values may be assigned to a NullType")
def filter_variable(self, other):
raise ValueError("No values may be assigned to a NullType")
def may_share_memory(a, b):
return False
def values_eq(a, b, force_same_dtype=True):
raise ValueError("NullType has no values to compare")
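The contract above can be illustrated with a small standalone sketch. NullTypeSketch is a hypothetical stand-in that does not subclass Theano's Type:

```python
# Hypothetical standalone sketch of the NullType contract: it carries
# an explanation (why_null) and rejects every value assignment.
class NullTypeSketch(object):
    def __init__(self, why_null='(no explanation given)'):
        self.why_null = why_null
    def filter(self, data, strict=False, allow_downcast=None):
        raise ValueError("No values may be assigned to a NullType")

t = NullTypeSketch("grad of abs(x) is undefined at x == 0")
try:
    t.filter(0.0)
    raised = False
except ValueError:
    raised = True
assert raised
assert "undefined" in t.why_null
```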
...@@ -609,59 +609,6 @@ class Op(utils.object2, PureOp, CLinkerOp):
rval.lazy = False
return rval
class UncomputableOp(Op):
"""
An Op representing an expression that cannot be computed.
theano.function checks that the subgraph it implements
does not contain these ops, and that optimization does not
introduce any such ops.
theano.tensor.grad checks the graphs it returns to ensure
they do not contain these ops.
"""
def __init__(self, exc, msg=""):
"""
exc: the exception type to raise if a subgraph contains
this op.
msg: the message to include in the exception.
"""
self.exc = exc
self.msg = msg
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash((type(self)))
def __str__(self):
return "Uncomputable{%s,%s}"%(self.exc,self.msg)
def make_node(self,x):
if x is None:
x = graph.Constant(theano.gof.type.generic,None)
return graph.Apply(self, [x], [x.type()] )
def perform(self, node, inputs, out_storage):
"""This should never be called."""
raise AssertionError("An UncomputableOp should never be compiled, "
"and certainly not executed.")
# Note: essentially, this op should just be NaNs_like(inputs[0])
# but 0 * UncomputableOp(x) + y optimizes to just y
# so until we develop a way of symbolically representing a variable
# that is always NaN and implement the logic for 0 * NaN = NaN, etc.
# the only way we can guarantee correctness of a theano function
# is to guarantee that its initial subgraph contained no UncomputableOps
def raise_exc(self):
raise self.exc(self.msg)
def raise_if_uncomputable(node):
if node is not None:
if isinstance(node.op, UncomputableOp):
node.op.raise_exc()
def get_test_value(v):
"""
Extract test value from `v`. Raises AttributeError if there is none.
...
...@@ -20,9 +20,18 @@ from theano import gof
from theano.gof import Variable
from theano.gof.python25 import all
import theano.gof.utils
from theano.gof.null_type import NullType
from theano.printing import min_informative_str
# we can't do "import theano.tensor"
# tensor depends on theano.compile
# theano.compile depends on theano.gradient (this file)
# the reason theano.compile depends on theano.gradient
# is that theano.compile.builders contains the op from graph
# functionality and it uses theano.gradient to implement
# the new op's grad method
tensor = None
_msg_retType = 'op.grad(...) returned a non-list'
_msg_badlen = 'op.grad(...) returned wrong number of gradients'
def format_as(use_list, use_tuple, outputs):
...@@ -54,171 +63,7 @@ def format_as(use_list, use_tuple, outputs):
return outputs
def grad_sources_inputs(sources, graph_inputs, warn_type=True):
"""
A gradient source is a pair (``v``, ``g_v``), in which ``v`` is
a `Variable`, and ``g_v`` is a `Variable` that is a gradient wrt
``v``. More specifically, ``g_v`` is the gradient of an external
scalar cost, ``cost`` (that is not explicitly used), wrt ``v``.
This function traverses the graph backward from the ``r`` sources,
calling ``op.grad(...)`` for all ops with some non-None gradient
on an output, to compute gradients of ``cost`` wrt intermediate
variables and ``graph_inputs``.
The ``op.grad(...)`` functions are called like this:
.. code-block:: python
op.grad(op.inputs[:], [total_gradient(v) for v in op.outputs])
This call to ``op.grad`` should return a list or tuple: one symbolic
gradient per input. These gradients represent the gradients of
the same implicit ``cost`` mentioned above, wrt ``op.inputs``. Note
that this is **not** the same as the gradient of ``op.outputs`` wrt
``op.inputs``.
If ``op`` has a single input, then ``op.grad`` should return a list
or tuple of length 1.
For each input wrt to which ``op`` is not differentiable, it should
return ``None`` instead of a `Variable` instance.
If a source ``r`` receives a gradient from another source ``r2``,
then the effective gradient on ``r`` is the sum of both gradients.
:type sources: list of pairs of Variable: (v, gradient-on-v) to
initialize the total_gradient dictionary
:param sources: gradients to back-propagate using chain rule
:type graph_inputs: list of Variable
:param graph_inputs: variables considered to be constant
(do not backpropagate through them)
:type warn_type: bool
:param warn_type: True will trigger warnings via the logging module when
the gradient on an expression has a different type than the original
expression
:rtype: dictionary whose keys and values are of type Variable
:return: mapping from each Variable encountered in the backward
traversal to the gradient with respect to that Variable.
It is assumed that there is some objective J shared between all members of
sources, so that for each v, gradient-on-v is the gradient of J with
respect to v
"""
gmap = {}
for (r, g_r) in sources:
if not hasattr(r, 'type'):
raise TypeError('sources must be Variables', r)
if g_r is not None:
if r in gmap:
gmap[r] = gmap[r] + g_r
else:
gmap[r] = g_r
graph_outputs = gof.utils.uniq([r for r, g in sources])
if graph_inputs is None:
graph_inputs = gof.graph.inputs(graph_outputs)
for node in gof.graph.io_toposort(graph_inputs,
graph_outputs).__reversed__():
g_outputs = [gmap.get(o, None) for o in node.outputs]
#if all output gradients are None, continue
if all(map(lambda x: x is None, g_outputs)): continue
#Disable all grad operation on complex. verify_grad don't
#support them and we don't know we want to handle them.
for var in node.inputs + node.outputs:
if (hasattr(var.type, 'dtype') and "complex" in var.type.dtype):
raise Exception("We do not support grad/Rop/Lop/verify_grad"
" on complex.")
output_arg = g_outputs
input_arg = node.inputs
# Each Op's grad function requires inputs and output_grads
# If the Op destroys any input, but the grad expression uses it,
# then chances are the resulting graph will have a dependency
# cycle. We avoid this cycle by passing (symbolic) copies of
# each destroyed input.
try:
dinputs = [node.inputs[x[0]] for x in node.op.destroy_map.values()]
except AttributeError:
dinputs = []
new_input_arg = []
for input in input_arg:
if input in dinputs and hasattr(input, 'copy'):
new_input_arg.append(input.copy())
else:
new_input_arg.append(input)
input_arg = new_input_arg
#note that this function is not in a try-except block
# the rationale:
# If the op implements grad, then any exception should be passed to
# the caller
# If the op doesn't implement grad, this entire function should fail.
# Other possibilities:
# * return a partial back-prop
#
op_grad = node.op.grad(input_arg, output_arg)
if not isinstance(op_grad, (list, tuple)):
raise ValueError(_msg_retType, node.op)
g_inputs = op_grad
assert isinstance(g_inputs, (list, tuple))
if len(g_inputs) != len(node.inputs):
raise ValueError(_msg_badlen,
node.op,
len(g_inputs),
len(node.inputs))
for ii, (r, g_r) in enumerate(zip(node.inputs, g_inputs)):
if warn_type:
if g_r and (getattr(r, 'type', 0) != getattr(g_r, 'type', 1)):
r_type = getattr(r, 'type', None)
g_r_type = getattr(g_r, 'type', None)
_logger.warning('%s.grad returned a different type (%s) '
'for input %i of type (%s)',
node.op, g_r_type, ii, r_type)
if g_r is not None:
assert r is not None
if r in gmap:
gmap[r] = gmap[r] + g_r
else:
gmap[r] = g_r
return gmap
class GradNotImplementedOp(gof.op.UncomputableOp):
""" An UncomputableOp representing a gradient that hasn't been implemented yet.
"""
def __init__(self, op, x_pos, comment = ""):
"""
op: A theano op whose grad is not implemented for some input
x_pos: An int, giving the index in the op's input list of
a variable for which the gradient is not implemented
(if op has unimplemented gradients for several inputs,
it must still return a separate UnimplementedGradOp for
each)
comment: An optional comment explaining why the gradient isn't
implemented.
"""
assert isinstance(op, gof.Op)
assert isinstance(x_pos, int)
assert x_pos >= 0
super(GradNotImplementedOp,self).__init__(NotImplementedError,
"%s does not implement its gradient with respect to input %d. %s" \
% (str(type(op)), x_pos, comment))
def grad_not_implemented(op, x_pos, x, comment = ""):
"""
Return an un-computable symbolic variable of type `x.type`.
...@@ -232,40 +77,14 @@ def grad_not_implemented(op, x_pos, x, comment = ""):
gradient is not implemented.
"""
return (NullType(
(
"This variable is Null because the grad method for "
"input %s (%s) of the %s op is not implemented. %s"
) % (x_pos, x, op, comment)))()
class GradUndefinedError(Exception):
""" An exception raised upon attempts to use an undefined gradient.
"""
class GradUndefinedOp(gof.op.UncomputableOp):
""" An UncomputableOp representing a gradient that is mathematically
undefined.
"""
def __init__(self, op, x_pos, comment = ""):
"""
op: A theano op whose grad is mathematically undefined for
some input
x_pos: An int, giving the index in the op's input list of
a variable for which the gradient is undefined
(if op has undefined gradients for several inputs,
it must still return a separate GradUndefinedOp for
each)
comment: An optional comment explaining why the gradient isn't
defined.
"""
assert isinstance(op, gof.Op)
assert isinstance(x_pos, int)
assert x_pos >= 0
super(GradUndefinedOp, self).__init__(GradUndefinedError,
"The gradient of %s with respect to input %d is undefined. %s" \
% (str(type(op)), x_pos, comment))
def grad_undefined(op, x_pos, x, comment = ""):
"""
Return an un-computable symbolic variable of type `x.type`.
...@@ -279,9 +98,49 @@ def grad_undefined(op, x_pos, x, comment = ""):
gradient is not defined.
"""
return (NullType(
(
"This variable is Null because the grad method for "
"input %s (%s) of the %s op is mathematically undefined. %s"
) % (x_pos, x, op, comment)))()
class DisconnectedType(theano.gof.type.Type):
""" A type indicating that a variable is a result
of taking the gradient of c with respect to x
when c is not a function of x.
A symbolic placeholder for 0, but to convey
the extra information that this gradient is 0
because it is disconnected.
"""
def filter(self, data, strict=False, allow_downcast=None):
raise AssertionError(
(
"If you're assigning to a DisconnectedType you're"
" doing something wrong. It should only be used as"
" a symbolic placeholder."
))
def filter_variable(self, other):
raise AssertionError(
(
"If you're assigning to a DisconnectedType you're"
" doing something wrong. It should only be used as"
" a symbolic placeholder."
))
def may_share_memory(a, b):
return False
def values_eq(a, b, force_same_dtype=True):
raise AssertionError(
(
"If you're assigning to a DisconnectedType you're"
" doing something wrong. It should only be used as"
" a symbolic placeholder."
))
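A toy sketch of how a DisconnectedType result is meant to be consumed: the caller swaps the placeholder for zeros of the right shape. The names below are hypothetical stand-ins, with plain lists standing in for tensors:

```python
# Hypothetical stand-in for the DisconnectedType placeholder.
class Disconnected(object):
    pass

def finalize_grads(rval, wrt):
    # Replace each disconnected result with zeros shaped like the
    # corresponding wrt (stand-in for wrt.zeros_like()).
    out = []
    for g, w in zip(rval, wrt):
        if isinstance(g, Disconnected):
            out.append([0.0] * len(w))
        else:
            out.append(g)
    return out

grads = finalize_grads([Disconnected(), [1.0, 2.0]],
                       [[5.0, 6.0], [7.0, 8.0]])
assert grads == [[0.0, 0.0], [1.0, 2.0]]
```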
########################
...@@ -418,7 +277,7 @@ def Rop(f, wrt, eval_points):
def Lop(f, wrt, eval_points, consider_constant=None, warn_type=False,
disconnected_inputs='raise'):
"""
Computes the L operation on `f` wrt to `wrt` evaluated at points given
in `eval_points`. Mathematically this stands for the jacobian of `f` wrt
...@@ -453,10 +312,24 @@ def Lop(f, wrt, eval_points, consider_constant=None, warn_type=False,
if not isinstance(f, (list, tuple)):
f = [f]
# make copies of f and grads so we don't modify the client's copy
f = list(f)
grads = list(eval_points)
for elem in consider_constant:
assert elem not in f
f.append(elem)
grads.append(elem.zeros_like())
if not isinstance(wrt, (list, tuple)):
wrt = [wrt]
arg1 = zip(f, eval_points)
arg2 = list(wrt)
gmap = grad_sources_inputs(
arg1,
arg2,
warn_type=warn_type)
# Note : If p is not in gmap there can be several reasons, among which
...@@ -466,17 +339,16 @@ def Lop(f, wrt, eval_points, consider_constant=None, warn_type=False,
# such subtle cases can be fixed by a more careful implementation of the
# gradient, but for now Theano needs to throw an exception, and make the
# user aware that it does not know how to compute that gradient
if not isinstance(wrt, (list, tuple)):
wrt = [wrt]
ret = []
for p in wrt:
if p in gmap:
ret.append(gmap[p])
else:
message = (
"Lop method was asked to compute the gradient "
"with respect to a variable that is not part of "
"the computational graph of the cost, or is used "
"only by a non-differentiable operator: %s" % p)
if disconnected_inputs == 'ignore':
pass
elif disconnected_inputs == 'warn':
...@@ -484,9 +356,10 @@ def Lop(f, wrt, eval_points, consider_constant=None, warn_type=False,
elif disconnected_inputs == 'raise':
raise ValueError(message)
else:
raise ValueError(
"Invalid value for keyword "
"'disconnected_inputs', valid values are "
"'ignore', 'warn' and 'raise'.")
ret.append(p.zeros_like())
return format_as(using_list, using_tuple, ret)
...@@ -497,7 +370,7 @@ def Lop(f, wrt, eval_points, consider_constant=None, warn_type=False,
#########################
def grad(cost, wrt, g_cost=None, consider_constant=None, warn_type=False,
disconnected_inputs='raise', add_names=True):
"""
:type cost: Scalar (0-dimensional) Variable.
:type wrt: Variable or list of Variables.
...@@ -518,6 +391,11 @@ def grad(cost, wrt, g_cost=None, consider_constant=None, warn_type=False,
- 'warn': consider the gradient zero, and print a warning.
- 'raise': raise an exception.
:type add_names: bool
:param add_names: If True, variables generated by grad will be named
(d<cost.name>/d<wrt.name>) provided that both cost and wrt have
names
:rtype: Variable or list/tuple of Variables (depending upon `wrt`)
:return: symbolic expression of gradient of `cost` with respect to `wrt`.
...@@ -526,14 +404,23 @@ def grad(cost, wrt, g_cost=None, consider_constant=None, warn_type=False,
It returns an object of same type as `wrt`: a list/tuple
or Variable in all cases.
This function is a wrapper around the more general function
``theano.gradient.grad_sources_inputs``.
"""
global tensor
if tensor is None:
from theano import tensor
if isinstance(cost.type, NullType):
raise ValueError("Can't differentiate a NaN cost. "
"cost is NaN because " + cost.type.why_null)
if cost.ndim != 0:
raise TypeError("cost must be a scalar.")
if consider_constant is None:
consider_constant = []
else:
# error checking on consider_constant: verify that it is a collection
# of theano variables
# this is important, if someone accidentally passes a nested data
# structure with theano variables at the leaves, only the root will
...@@ -546,47 +433,34 @@ def grad(cost, wrt, g_cost=None, consider_constant=None, warn_type=False,
raise TypeError('Elements of consider_constant must be '
'variables, but got ' + str(type(elem)))
using_list = isinstance(wrt, list)
using_tuple = isinstance(wrt, tuple)
if not using_list and not using_tuple:
wrt = [wrt]
var_to_node_to_idx = _populate_var_to_node_to_idx([cost])
# build a dict mapping var to the gradient of cost with respect to var
grad_dict = {}
# by default, the gradient of the cost is 1
if g_cost is None:
g_cost = tensor.ones_like(cost)
grad_dict[cost] = g_cost
# the gradient of the constants is 0
for const in consider_constant:
grad_dict[const] = DisconnectedType()()
# variables that do not influence the cost have zero gradient.
# if wrt is such a variable, populate the grad_dict with this info
# so that wrt not being in var_to_node_to_idx won't cause an error below
# according to the flag, possibly raise an error if wrt is disconnected
for elem in wrt:
if elem not in var_to_node_to_idx and elem is not cost:
message = ("grad method was asked to compute the gradient "
"with respect to a variable that is not part of "
"the computational graph of the cost, or is used "
"only by a non-differentiable operator: %s" % elem)
if disconnected_inputs == 'ignore':
pass
elif disconnected_inputs == 'warn':
...@@ -597,20 +471,331 @@ def grad(cost, wrt, g_cost=None, consider_constant=None, warn_type=False,
elif disconnected_inputs == 'raise':
raise ValueError(message)
else:
raise ValueError("Invalid value for keyword "
"'disconnected_inputs', valid values are "
"'ignore', 'warn' and 'raise'.")
grad_dict[elem] = DisconnectedType()()
cost_name = None
if add_names:
cost_name = cost.name
rval = _populate_grad_dict(var_to_node_to_idx,
grad_dict, wrt, warn_type,
cost_name)
for i in xrange(len(rval)):
if isinstance(rval[i].type, DisconnectedType):
rval[i] = wrt[i].zeros_like()
if using_tuple:
rval = tuple(rval)
elif not using_list:
rval, = rval
return rval
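The tail of `grad` above replaces any `DisconnectedType` result with zeros before returning. A minimal stand-alone sketch of that post-processing, using a hypothetical `Disconnected` marker class and plain lists in place of tensors (not Theano's actual `DisconnectedType`):

```python
class Disconnected(object):
    """Stand-in marker for a gradient that is disconnected from the cost."""
    pass


def finalize(rval, wrt_values):
    # Callers expect a numeric gradient per `wrt` entry, so disconnected
    # markers are replaced by zeros of the matching shape.
    out = []
    for g, w in zip(rval, wrt_values):
        if isinstance(g, Disconnected):
            out.append([0.0] * len(w))  # zeros_like for a toy list "tensor"
        else:
            out.append(g)
    return out


print(finalize([Disconnected(), [1.0, 2.0]],
               [[5.0, 6.0], [7.0, 8.0]]))  # -> [[0.0, 0.0], [1.0, 2.0]]
```

This mirrors the contract that callers of `grad` never see a `DisconnectedType` value unless they opt in.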
def _populate_var_to_node_to_idx(outputs):
    """
    Common code shared between grad and grad_sources_inputs

    outputs: a list of nodes we want to take gradients of

    returns:
        var_to_node_to_idx: a dictionary mapping a variable to
            a second dictionary.
            the second dictionary maps apply nodes acting on
            this variable to the variable's index in the apply
            node's input list
    """

    # var_to_node_to_idx[var][node] = [i, j] means node has
    # var as input at positions i and j
    var_to_node_to_idx = {}

    # set of variables or nodes that have been added to their parents
    accounted_for = set([])

    def account_for(var):
        if var in accounted_for:
            return
        accounted_for.add(var)

        if var.owner is not None:
            node = var.owner
            if node not in accounted_for:
                accounted_for.add(node)

                for i, ipt in enumerate(node.inputs):
                    if ipt not in var_to_node_to_idx:
                        var_to_node_to_idx[ipt] = {}
                    node_to_idx = var_to_node_to_idx[ipt]
                    if node not in node_to_idx:
                        node_to_idx[node] = []
                    idx = node_to_idx[node]
                    assert i not in idx
                    idx.append(i)

                    account_for(ipt)

    for output in outputs:
        account_for(output)

    return var_to_node_to_idx
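As a sanity check of the traversal above, the same bookkeeping can be run on a toy graph in plain Python. The `Node` and `Var` classes below are hypothetical stand-ins for Theano's `Apply` and `Variable`, just enough to exercise the mapping:

```python
class Node(object):
    """Minimal stand-in for an apply node: a name and a list of inputs."""
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs


class Var(object):
    """Minimal stand-in for a variable; `owner` is the node that made it."""
    def __init__(self, name, owner=None):
        self.name = name
        self.owner = owner


def populate_var_to_node_to_idx(outputs):
    # var_to_node_to_idx[var][node] == [i, j] means `node` consumes
    # `var` at input positions i and j
    var_to_node_to_idx = {}
    accounted_for = set()

    def account_for(var):
        if var in accounted_for:
            return
        accounted_for.add(var)
        node = var.owner
        if node is not None and node not in accounted_for:
            accounted_for.add(node)
            for i, ipt in enumerate(node.inputs):
                var_to_node_to_idx.setdefault(ipt, {}) \
                                  .setdefault(node, []).append(i)
                account_for(ipt)

    for output in outputs:
        account_for(output)
    return var_to_node_to_idx


# x is used twice by the same node, so it is recorded at positions 0 and 1
x = Var('x')
square = Node('mul', [x, x])
y = Var('y', owner=square)
mapping = populate_var_to_node_to_idx([y])
print(mapping[x][square])  # -> [0, 1]
```

Recording one index per use is what later lets the backward pass collect one gradient term per use of a variable.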
def _populate_grad_dict(var_to_node_to_idx,
                        grad_dict, wrt, warn_type, cost_name=None):
    """
    Common code shared between grad_sources_inputs and grad

    var_to_node_to_idx: a dictionary mapping a variable to
        a second dictionary.
        the second dictionary maps apply nodes acting on
        this variable to the variable's index in the apply
        node's input list

    grad_dict: a dictionary mapping variables to their gradients
        should be populated by grad or grad_sources_inputs
            grad should set gradients to DisconnectedType()() for
            variables to be considered constant, set the
            gradient for the cost variable to g_cost, etc.
            both should set the gradient for disconnected
            inputs to a variable with type DisconnectedType()

    wrt: the minimal set of variables that must be included in grad_dict

    warn_type: if True, log a warning when a gradient term for a variable
        has a different type from that variable

    cost_name: The name of the cost being differentiated, optional.
        used to name the grad with respect to x as (d<cost_name>/dx)

    returns: a list of gradients corresponding to wrt
    """

    # build a dict mapping node to the terms node contributes to each of
    # its inputs' gradients
    term_dict = {}

    # populate term_dict[node] and return it
    def access_term_cache(node):
        if node not in term_dict:
            inputs = node.inputs

            # Each Op's grad function requires inputs and output_grads.
            # If the Op destroys any input, but the grad expression uses
            # it, then chances are the resulting graph will have a
            # dependency cycle. We avoid this cycle by passing (symbolic)
            # copies of each destroyed input.
            try:
                dinputs = [node.inputs[x[0]] for x in
                           node.op.destroy_map.values()]
            except AttributeError:
                dinputs = []

            def try_to_copy_if_needed(var):
                if var in dinputs and hasattr(var, 'copy'):
                    return var.copy()
                return var

            inputs = [try_to_copy_if_needed(ipt) for ipt in inputs]

            output_grads = [access_grad_cache(var) for var in node.outputs]

            if False in [isinstance(g.type, DisconnectedType)
                         for g in output_grads]:
                # Some outputs of this op are connected to the cost, so we
                # must call the op's grad method
                input_grads = node.op.grad(inputs, output_grads)

                if input_grads is None:
                    raise TypeError("%s.grad returned NoneType, "
                                    "expected iterable." % str(node.op))

                if len(input_grads) != len(inputs):
                    raise ValueError(("%s returned the wrong number of" +
                                      " gradient terms.") % str(node.op))
            else:
                # All outputs of this op are disconnected, so we can skip
                # calling the op's grad method and report that the inputs
                # are disconnected.
                # (The op's grad method could do this too, but this saves
                # the implementer the trouble of worrying about this case)
                input_grads = [DisconnectedType()() for ipt in inputs]

            # must convert to list in case the op returns a tuple;
            # we won't be able to post-process out the Nones if it does that
            term_dict[node] = list(input_grads)

            for i in xrange(len(term_dict[node])):
                if term_dict[node][i] is None:
                    # We don't know what None means. In the past it has
                    # been used to mean undefined, zero, or disconnected,
                    # so for now we assume it is zero. Assuming it is zero
                    # prevents us from disconnecting NaNs above.
                    # Eventually we should disallow this return type and
                    # force all ops to return the correct thing.
                    # raise AssertionError('%s returned None for'
                    #                      ' a gradient term, '
                    #                      'this is prohibited' % node.op)
                    term_dict[node][i] = node.inputs[i].zeros_like()

                if warn_type:
                    g_r_type = term_dict[node][i].type
                    r_type = inputs[i].type
                    if g_r_type != r_type:
                        _logger.warning(
                            '%s.grad returned a different type (%s) '
                            'for input %i of type (%s)',
                            node.op, g_r_type, i, r_type)

        return term_dict[node]

    # populate grad_dict[var] and return it
    def access_grad_cache(var):
        if var not in grad_dict:
            if var in var_to_node_to_idx:
                terms = []
                node_to_idx = var_to_node_to_idx[var]
                for node in node_to_idx:
                    for idx in node_to_idx[node]:

                        if hasattr(node.op, 'connection_pattern'):
                            pattern = node.op.connection_pattern()
                            if not pattern[idx]:
                                continue

                        term = access_term_cache(node)[idx]

                        if not isinstance(term, gof.Variable):
                            raise TypeError("%s.grad returned %s, expected"
                                            " Variable instance." % (
                                                str(node.op), type(term)))

                        if isinstance(term.type, NullType):
                            raise TypeError("tensor.grad "
                                            "encountered a NaN. " +
                                            term.type.why_null)

                        terms.append(term)

                # the next line is like sum(terms), but doesn't add an
                # extraneous TensorConstant(0)
                grad_dict[var] = reduce(lambda x, y: x + y, terms)
                if cost_name is not None and var.name is not None:
                    grad_dict[var].name = '(d%s/d%s)' % (cost_name, var.name)
            else:
                # this variable isn't connected to the cost in the
                # computational graph
                grad_dict[var] = DisconnectedType()()
        return grad_dict[var]

    rval = [access_grad_cache(elem) for elem in wrt]

    return rval
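The term-summing scheme above (each consuming node contributes one term per use of a variable, and the variable's gradient is the sum of those terms) is ordinary reverse-mode differentiation. A minimal scalar sketch on a hypothetical two-op graph, without any of Theano's Op machinery:

```python
# Reverse-mode on y = (x * x) + x, evaluated at x = 3.0.
# Each use of x contributes one term to dy/dx, and the terms are summed,
# mirroring access_grad_cache's reduce(lambda x, y: x + y, terms).
x = 3.0
u = x * x           # forward pass
y = u + x

g_y = 1.0           # gradient of the cost with respect to y
g_u = g_y           # add node: gradient passes through to both inputs

terms_x = [g_y]     # term from the direct use of x in y = u + x
terms_x.append(g_u * x)  # mul node: one term per use of x in u = x * x
terms_x.append(g_u * x)

g_x = sum(terms_x)  # 1 + 3 + 3 = 7, matching d/dx (x**2 + x) = 2x + 1
print(g_x)  # -> 7.0
```

The real implementation sums symbolic Variables instead of floats, which is why it uses `reduce` rather than the builtin `sum` (avoiding a spurious constant-zero start value in the graph).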
def grad_sources_inputs(sources, graph_inputs, warn_type=True):
    """
    Used to compute the gradient of a cost with respect to all the
    variables between graph_inputs and the cost, in the special
    case where you don't know the cost, only its gradient
    on a set of intermediate values.

    A gradient source is a pair (``v``, ``g_v``), in which ``v`` is
    a `Variable`, and ``g_v`` is a `Variable` that is a gradient wrt
    ``v``. More specifically, ``g_v`` is the gradient of an external
    scalar cost, ``cost`` (that is not explicitly used), wrt ``v``.

    This function traverses the graph backward from the ``r`` sources,
    calling ``op.grad(...)`` for all ops with some non-None gradient
    on an output, to compute gradients of ``cost`` wrt intermediate
    variables and ``graph_inputs``.

    The ``op.grad(...)`` functions are called like this:

    .. code-block:: python

        op.grad(op.inputs[:], [total_gradient(v) for v in op.outputs])

    This call to ``op.grad`` should return a list or tuple: one symbolic
    gradient per input. These gradients represent the gradients of
    the same implicit ``cost`` mentioned above, wrt ``op.inputs``. Note
    that this is **not** the same as the gradient of ``op.outputs`` wrt
    ``op.inputs``.

    If ``op`` has a single input, then ``op.grad`` should return a list
    or tuple of length 1.

    For each input wrt to which ``op`` is not differentiable, it should
    return ``None`` instead of a `Variable` instance.

    If a source ``r`` receives a gradient from another source ``r2``,
    then the effective gradient on ``r`` is the sum of both gradients.

    :type sources: list of pairs of Variable: (v, gradient-on-v) to
        initialize the total_gradient dictionary
    :param sources: gradients to back-propagate using chain rule

    :type graph_inputs: list of Variable
    :param graph_inputs: variables considered to be constant
        (do not backpropagate through them)

    :type warn_type: bool
    :param warn_type: True will trigger warnings via the logging module when
        the gradient on an expression has a different type than the original
        expression

    :rtype: dictionary whose keys and values are of type Variable
    :return: mapping from each Variable encountered in the backward
        traversal to the gradient with respect to that Variable.

    It is assumed that there is some objective J shared between all members
    of sources, so that for each v, gradient-on-v is the gradient of J with
    respect to v.
    """

    outputs, output_grads = zip(*sources)

    for output_grad in output_grads:
        if not hasattr(output_grad, 'type'):
            raise TypeError('output grads must be theano variables. '
                            'Ambiguous whether %s should be made into tensor'
                            ' or sparse theano variable'
                            % str(type(output_grad)))

    if graph_inputs is None:
        graph_inputs = gof.graph.inputs(outputs)
    wrt = graph_inputs

    var_to_node_to_idx = _populate_var_to_node_to_idx(outputs)

    # build a dict mapping var to the gradient of cost with respect to var
    grad_dict = {}

    # by default, the gradient of the cost is 1
    for output, output_grad in sources:
        grad_dict[output] = output_grad

    # variables that do not influence the cost have zero gradient.
    # if wrt is such a variable, populate the grad_dict with this info
    # so that wrt not being in var_to_node_to_idx won't cause an error below
    for elem in wrt:
        if elem not in var_to_node_to_idx and elem not in outputs:
            grad_dict[elem] = DisconnectedType()()

    _populate_grad_dict(var_to_node_to_idx,
                        grad_dict, wrt, warn_type)

    # post-process out the DisconnectedTypes
    for key in grad_dict:
        if isinstance(grad_dict[key].type, DisconnectedType):
            if hasattr(key, 'zeros_like'):
                grad_dict[key] = key.zeros_like()

    return grad_dict
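The calling convention of `grad_sources_inputs` (seed `grad_dict` with `(variable, gradient-on-variable)` pairs, then sweep backward accumulating chain-rule terms) can be sketched numerically. The variables, Jacobian values, and graph below are hypothetical toy data, not Theano objects:

```python
# Toy version of the seeding plus backward sweep in grad_sources_inputs:
# sources are (variable, gradient-on-variable) pairs for an implicit cost J.
# Here J depends on v1 and v2, with dJ/dv1 = 2.0 and dJ/dv2 = 0.5, and both
# v1 and v2 are produced from the same input r: v1 = 3*r, v2 = r.
sources = [('v1', 2.0), ('v2', 0.5)]
local_dv_dr = {'v1': 3.0, 'v2': 1.0}   # Jacobians of each source wrt r

grad_dict = dict(sources)              # seed the dictionary with the sources
# chain rule: dJ/dr is the sum of the contributions through v1 and v2
grad_dict['r'] = sum(grad_dict[v] * local_dv_dr[v] for v in local_dv_dr)
print(grad_dict['r'])  # -> 6.5
```

This also illustrates the docstring's note that a source reached from another source accumulates both gradients: every path's contribution is summed into one entry of the returned dictionary.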
class numeric_grad(object):
...@@ -902,7 +1087,7 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None,
                 as_tensor_variable(p).broadcastable)(name='input %i' % i)
                 for i, p in enumerate(pt)]

    # fun can be either a function or an actual Op instance
    o_output = fun(*tensor_pt)

    if isinstance(o_output, list):
...@@ -929,6 +1114,7 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None,
        return plain

    t_r = shared(random_projection())
    t_r.name = 'random_projection'

    # random projection of o onto t_r
    # This sum() is defined above, it's not the builtin sum.
...@@ -936,7 +1122,7 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None,
    cost_fn = function(tensor_pt, cost)

    # todo-- determine if this is actually needed
    g_cost = as_tensor_variable(1.0, name='g_cost')
    if cast_to_output_type:
        g_cost = cast(g_cost, o_output.dtype)
...@@ -958,10 +1144,11 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None,
            num_grad.max_err(analytic_grad, abs_tol, rel_tol)

        if max_abs_err > abs_tol and max_rel_err > rel_tol:
            raise verify_grad.E_grad(max_arg, max_err_pos,
                                     max_abs_err, max_rel_err,
                                     abs_tol, rel_tol)

        # get new random projection for next test
        if test_num < n_tests - 1:
            t_r.set_value(random_projection(), borrow=True)
...
...@@ -456,7 +456,7 @@ def test_elemwise_composite_support_code():
    P = T.exp(-(Y - U) ** 2)
    epsilon = numpy.asarray(0.001, dtype="float32")
    NLL = -T.mean(T.log(P + epsilon))  # SupportCodeError
    G = theano.gradient.grad(NLL, wrt=[W])

    backup = theano.config.warn.identify_1pexp_bug
    theano.config.warn.identify_1pexp_bug = False
...@@ -468,6 +468,7 @@ def test_elemwise_composite_support_code():
    topo = f_grad.maker.fgraph.toposort()
    assert sum([isinstance(node.op, T.Elemwise) for node in topo]) == 1
    # I suspect this was failing in the original branch too
    assert sum([isinstance(node.op, tcn.GpuElemwise) for node in topo]) == 1
...
...@@ -258,7 +258,7 @@ class T_Images2Neibs(unittest_tools.InferShapeTester):
        def fn(images):
            return images2neibs(images, (3, 3), mode='wrap_centered')
        self.assertRaises(TypeError, unittest_tools.verify_grad,
                          fn, [images_val], mode=self.mode)
...@@ -276,7 +276,7 @@ class T_Images2Neibs(unittest_tools.InferShapeTester):
        # are not the same.
        def fn(images):
            return images2neibs(images, (2, 2), (1, 1))
        self.assertRaises(TypeError,
                          unittest_tools.verify_grad, fn, [images_val],
                          mode=self.mode)
...
...@@ -488,6 +488,9 @@ class _scalar_py_operators:
    def __rmod__(self,other): return mod(other,self)
    def __rpow__(self,other): return pow(other,self)

    def zeros_like(self):
        return ScalarConstant(Scalar(str(self.type.dtype)), 0)


class ScalarVariable(_scalar_py_operators, Variable):
    pass
...
...@@ -29,6 +29,8 @@ from theano import gof
from theano.tensor import TensorType
from theano import tensor
from theano.tensor.opt import Shape_i
from theano.gradient import grad_undefined
from theano.gradient import DisconnectedType
#from theano.sandbox import cuda
from theano.compile.profiling import ScanProfileStats
...@@ -431,7 +433,7 @@ class Scan(PureOp):
                aux_txt += str(k) + ','
            aux_txt += '},%s,%s}'
        else:
            aux_txt += '{%s,%s}'
        aux_txt = aux_txt % (name, gpu_str, str(self.name))
        return aux_txt
...@@ -1161,6 +1163,17 @@ class Scan(PureOp):
    ### GRAD FUNCTION
    def grad(self, args, g_outs):
        # This discards information about whether incoming gradients are 0
        # or disconnected from the cost
        # TODO: upgrade scan op to report disconnection correctly
        def strip_disconnected(g):
            if isinstance(g.type, DisconnectedType):
                return None
            return g

        g_outs = [strip_disconnected(g) for g in g_outs]

        # 1. forward pass - get the outputs after applying scan
        scan_outputs = self(*args)
        # 2. make sure they are given as a list
...@@ -1512,7 +1525,7 @@ class Scan(PureOp):
        if type(outputs) not in (list, tuple):
            outputs = [outputs]
        # Re-order the gradients correctly
        gradients = [grad_undefined(self, 0, args[0], 'Number of steps')]

        offset = (self.n_mit_mot +
                  self.n_mit_sot +
...@@ -1522,8 +1535,16 @@ class Scan(PureOp):
        end = self.n_mit_mot + self.n_mit_sot + self.n_sit_sot
        gradients += [x[::-1] for x in outputs[:end]]

        start = len(gradients)
        gradients += [
            grad_undefined(self, x + start, args[x + start],
                           'Shared Variable with update')
            for x in xrange(self.n_shared_outs)]

        start = len(gradients)
        gradients += [
            grad_undefined(self, x + start, args[x + start],
                           'Dimension of memory buffer for output')
            for x in xrange(self.n_nit_sot)]

        begin = end
        end = begin + n_sitsot_outs
...@@ -1547,7 +1568,8 @@ class Scan(PureOp):
        rop_self_outputs = self_outputs
        if self.info['n_shared_outs'] > 0:
            rop_self_outputs = rop_self_outputs[:-self.info['n_shared_outs']]
        rop_outs = tensor.Rop(rop_self_outputs, rop_of_inputs,
                              inner_eval_points)
        if type(rop_outs) not in (list, tuple):
            rop_outs = [rop_outs]
        # Step 2. Figure out what corresponds to what in the scan
...@@ -1653,7 +1675,7 @@ class Scan(PureOp):
        scan_sit_sot = inputs[b:e] + clean_eval_points
        inner_sit_sot = self_inputs[ib:ie] + inner_eval_points[ib:ie]

        # Shared outs ...
        b = e
        e = e + self.n_shared_outs
        ib = ie
...@@ -1738,7 +1760,7 @@ class Scan(PureOp):
        b = e + self.n_nit_sot
        e = e + self.n_nit_sot * 2
        final_outs += outputs[b:e]
        final_outs += [None] * self.n_shared_outs

        return final_outs
...
...@@ -1816,10 +1816,12 @@ class T_Scan(unittest.TestCase):
    def test_scan_extra_inputs_hessian(self):
        x = theano.tensor.vector('x')
        A = theano.tensor.matrix('A')
        fc1 = theano.shared(0.5, name='fc1')
        fc2 = theano.shared(0.9, name='fc2')
        y = fc1 * theano.dot(x * x, theano.dot(A, x))
        y.name = 'y'
        gy = theano.tensor.grad(y, x)
        gy.name = 'gy'
        hy, updates = theano.scan(
            lambda i, gy, x: theano.tensor.grad(gy[i] * fc2, x),
            sequences=theano.tensor.arange(gy.shape[0]),
...@@ -1829,7 +1831,9 @@ class T_Scan(unittest.TestCase):
        vx = numpy.array([1., 1.], dtype=theano.config.floatX)
        vA = numpy.array([[1., 1.], [1., 0.]], dtype=theano.config.floatX)
        vR = numpy.array([[3.6, 1.8], [1.8, 0.9]], dtype=theano.config.floatX)
        out = f(vx, vA)
        assert numpy.allclose(out, vR)

    def test_cloning_no_replace_strict_copy_inputs(self):
        # This has nothing to do with scan, but it refers to the clone
...@@ -3479,14 +3483,15 @@ def test_compute_test_value():
    backup = theano.config.compute_test_value
    theano.config.compute_test_value = 'raise'
    try:
        x = tensor.vector('x')
        xv = numpy.ones(3, dtype=theano.config.floatX)
        x.tag.test_value = xv

        y = theano.shared(numpy.arange(3, dtype=theano.config.floatX),
                          name='y')
        z, _ = theano.scan(
            fn=lambda u, v: u + v,
            sequences=[x, y])
        assert not _
        z.name = 'z'
        # The gradient computation used to crash before 6af465e.
        g = tensor.grad(z.sum(), x)
        #f = theano.function([x], g)
...
...@@ -7,7 +7,6 @@ http://www-users.cs.umn.edu/~saad/software/SPARSKIT/paper.ps
# TODO
# Automatic methods for determining best sparse format?

import sys

import numpy
...@@ -16,14 +15,14 @@ import scipy.sparse
from theano import gof, tensor, compile, scalar, config
from theano.gof.python25 import all
from theano.gradient import DisconnectedType
from theano.sparse.utils import hash_from_sparse
import theano.tests.unittest_tools as utt

sparse_formats = ['csc', 'csr']

# TODO: move this decorator to the compile submodule
def register_specialize(lopt, *tags, **kwargs):
    compile.optdb['specialize'].register((kwargs and kwargs.pop('name')) or
                                         lopt.__name__, lopt, 'fast_run',
...@@ -256,7 +255,7 @@ def sp_zeros_like(x):
    :return: The same as `x` with zero entries
             for all element.
    """
    # TODO: don't restrict to CSM formats
    _, _, indptr, shape = csm_properties(x)
    return CSM(format=x.format)(numpy.array([], dtype=x.type.dtype),
                                numpy.array([]), tensor.zeros_like(indptr),
...@@ -291,7 +290,7 @@ class _sparse_py_operators:
    def __rmul__(left, right):
        return mul(left, right)

    # extra pseudo-operator symbols
    def __dot__(left, right):
        return structured_dot(left, right)
...@@ -299,12 +298,12 @@ class _sparse_py_operators:
    def __rdot__(right, left):
        return structured_dot(left, right)

    # N.B. THIS IS COMMENTED OUT ON PURPOSE!!!
    # Discussion with Fred & James (at least, and maybe others before)
    # we decided that casting from a sparse to dense should be explicit
    # because it's usually something you just want to be pretty careful
    # about, and not to do by accident.
    # def _as_TensorVariable(self):
    #     return dense_from_sparse(self)

    shape = property(lambda self: tensor.shape(dense_from_sparse(self)))
...@@ -441,7 +440,7 @@ class SparseType(gof.Type):
        if strict:
            raise TypeError("%s is not sparse, or not the right dtype (is %s, "
                            "expected %s)" % (value, value.dtype, self.dtype))
        # The input format could be converted here
        if allow_downcast:
            sp = self.format_cls[self.format](value, dtype=self.dtype)
        else:
...@@ -488,7 +487,7 @@ class SparseType(gof.Type):
        return "Sparse[%s, %s]" % (str(self.dtype), str(self.format))

    def values_eq_approx(self, a, b, eps=1e-6):
        # WARNING: equality comparison of sparse matrices is not fast or easy
        # we definitely do not want to be doing this un-necessarily during
        # a FAST_RUN computation..
        if not scipy.sparse.issparse(a) or not scipy.sparse.issparse(b):
...@@ -504,7 +503,7 @@ class SparseType(gof.Type):
        return max(diff.data) < eps

    def values_eq(self, a, b):
        # WARNING: equality comparison of sparse matrices is not fast or easy
        # we definitely do not want to be doing this un-necessarily during
        # a FAST_RUN computation..
        return scipy.sparse.issparse(a) \
...@@ -619,14 +618,25 @@ class CSMProperties(gof.Op):
        out[0][0] = csm.data[self.kmap]
        if str(csm.data.dtype) == 'int32':
            out[0][0] = theano._asarray(out[0][0], dtype='int32')
        # backport
        # out[0][0] = csm.data if self.kmap is None else csm.data[self.kmap]
        out[1][0] = theano._asarray(csm.indices, dtype='int32')
        out[2][0] = theano._asarray(csm.indptr, dtype='int32')
        out[3][0] = theano._asarray(csm.shape, dtype='int32')

    def grad(self, (csm,), g):
        # g[1:] is all integers, so their Jacobian in this op
        # is 0. We thus don't need to worry about what their values
        # are.

        # if g[0] is disconnected, then this op doesn't contribute
        # any gradient anywhere. but we know that at least one of
        # g[1:] is connected, or this grad method wouldn't have been
        # called, so we should report zeros
        if isinstance(g[0].type, DisconnectedType):
            return [csm.zeros_like()]

        data, indices, indptr, shape = csm_properties(csm)
        return [CSM(csm.format)(g[0], indices, indptr, shape)]

# don't make this a function or it breaks some optimizations below
...@@ -662,10 +672,10 @@ class CSM(gof.Op):
    :param data: One dimensional tensor representing
                 the data of the sparse matrix to construct.
    :param indices: One dimensional tensor of integers
                    representing the indices of the sparse
                    matrix to construct.
    :param indptr: One dimensional tensor of integers
                   representing the indice pointer for
                   the sparse matrix to construct.
    :param shape: One dimensional tensor of integers
...@@ -673,9 +683,9 @@ class CSM(gof.Op):
                  matrix to construct.

    :return: A sparse matrix having the properties
             specified by the inputs.

    :note: The grad method returns a dense vector, so it provides
           a regular grad.
    """
...@@ -774,10 +784,10 @@ class CSM(gof.Op):
    def grad(self, (x_data, x_indices, x_indptr, x_shape), (g_out,)):
        g_data, g_indices, g_indptr, g_shape = csm_properties(g_out)
        # unpack the data vector and wrap it as a 1d TensorType
        g_data = csm_grad(self.kmap)(x_data, x_indices, x_indptr, x_shape,
                                     g_data, g_indices, g_indptr, g_shape)
        return [g_data, DisconnectedType()(),
                DisconnectedType()(), DisconnectedType()()]

    def infer_shape(self, node, shapes):
        if self.kmap is None:
...@@ -1195,7 +1205,7 @@ class GetItemScalar(gof.op.Op):
        if isinstance(ind, slice):
            raise Exception("GetItemScalar called with a slice as index!")

        # in case of indexing using int instead of theano variable
        elif isinstance(ind, int):
            ind = theano.tensor.constant(ind)
            input_op += [ind]
...@@ -2026,7 +2036,7 @@ class MulSD(gof.op.Op):
    def make_node(self, x, y):
        x, y = as_sparse_variable(x), tensor.as_tensor_variable(y)

        # upcast the tensor. Is the cast of sparse done implemented?
        dtype = scalar.upcast(x.type.dtype, y.type.dtype)
        if y.type.dtype != dtype:
            y = tensor.cast(y, dtype)
...@@ -2049,7 +2059,7 @@ class MulSD(gof.op.Op):
        elif len(y.shape) == 2:
            # if we have enough memory to fit y, maybe we can fit x.asarray()
            # too?
            # TODO: change runtime from O(M*N) to O(nonzeros)
            M, N = x.shape
            assert x.shape == y.shape
...@@ -2810,7 +2820,7 @@ class StructuredDot(gof.Op):
            raise ValueError('shape mismatch in StructuredDot.perform',
                             (a.shape, b.shape))

        # variable = a.dot(b)  # deprecated
        variable = a * b
        if isinstance(node.outputs[0].type, SparseType):
            assert _is_sparse(variable)
...@@ -2843,8 +2853,8 @@ class StructuredDot(gof.Op):
            raise Exception("a.shape=%s, b.shape=%s, variable.shape=%s "
                            " ??? I have no idea why")
#The cast is needed as otherwise we hit the bug mentioned into # The cast is needed as otherwise we hit the bug mentioned into
#theano._asarray function documentation. # theano._asarray function documentation.
out[0] = theano._asarray(variable, str(variable.dtype)) out[0] = theano._asarray(variable, str(variable.dtype))
def grad(self, (a, b), (g_out,)): def grad(self, (a, b), (g_out,)):
...@@ -3229,7 +3239,7 @@ class SamplingDot(gof.op.Op): ...@@ -3229,7 +3239,7 @@ class SamplingDot(gof.op.Op):
if not _is_sparse_variable(p): if not _is_sparse_variable(p):
raise TypeError(p) raise TypeError(p)
#TODO: use it. # TODO: use it.
dtype_out = scalar.upcast(x.type.dtype, y.type.dtype, p.type.dtype) dtype_out = scalar.upcast(x.type.dtype, y.type.dtype, p.type.dtype)
return gof.Apply(self, [x, y, p], [p.type()]) return gof.Apply(self, [x, y, p], [p.type()])
......
...@@ -25,6 +25,7 @@ from theano.tensor.utils import hash_from_ndarray ...@@ -25,6 +25,7 @@ from theano.tensor.utils import hash_from_ndarray
from theano.scalar import ComplexError, IntegerDivisionError from theano.scalar import ComplexError, IntegerDivisionError
import theano.scalar.sharedvar import theano.scalar.sharedvar
from theano.gradient import grad_undefined from theano.gradient import grad_undefined
from theano.gradient import DisconnectedType
### set up the external interface ### set up the external interface
from elemwise import Elemwise, DimShuffle, CAReduce, Sum from elemwise import Elemwise, DimShuffle, CAReduce, Sum
...@@ -32,7 +33,7 @@ from elemwise import Elemwise, DimShuffle, CAReduce, Sum ...@@ -32,7 +33,7 @@ from elemwise import Elemwise, DimShuffle, CAReduce, Sum
import logging import logging
_logger = logging.getLogger("theano.tensor.basic") _logger = logging.getLogger("theano.tensor.basic")
#This is needed as we will hide it later # This is needed as we will hide it later
python_complex = complex python_complex = complex
python_any = any python_any = any
python_all = all python_all = all
...@@ -47,6 +48,7 @@ continuous_dtypes = map(str, scal.continuous_types) ...@@ -47,6 +48,7 @@ continuous_dtypes = map(str, scal.continuous_types)
discrete_dtypes = map(str, scal.discrete_types) discrete_dtypes = map(str, scal.discrete_types)
all_dtypes = map(str, scal.all_types) all_dtypes = map(str, scal.all_types)
class ShapeError(Exception): class ShapeError(Exception):
"""Raised when the shape cannot be computed.""" """Raised when the shape cannot be computed."""
pass pass
...@@ -108,7 +110,7 @@ if 0: ...@@ -108,7 +110,7 @@ if 0:
transfer the value on the gpu transfer the value on the gpu
""" """
if hasattr(x, '_as_CudaNdarrayVariable'): if hasattr(x, '_as_CudaNdarrayVariable'):
#TODO: pass name and ndim arguments # TODO: pass name and ndim arguments
return x._as_CudaNdarrayVariable() return x._as_CudaNdarrayVariable()
return as_tensor_variable(x, name, ndim) return as_tensor_variable(x, name, ndim)
...@@ -142,7 +144,7 @@ def as_tensor_variable(x, name=None, ndim=None): ...@@ -142,7 +144,7 @@ def as_tensor_variable(x, name=None, ndim=None):
return x._as_TensorVariable() # TODO: pass name and ndim arguments return x._as_TensorVariable() # TODO: pass name and ndim arguments
if isinstance(x, gof.Apply): if isinstance(x, gof.Apply):
#TODO: use Apply's default output mechanism # TODO: use Apply's default output mechanism
if len(x.outputs) != 1: if len(x.outputs) != 1:
raise ValueError( raise ValueError(
"It is ambiguous which output of a multi-output Op has" "It is ambiguous which output of a multi-output Op has"
...@@ -161,7 +163,7 @@ def as_tensor_variable(x, name=None, ndim=None): ...@@ -161,7 +163,7 @@ def as_tensor_variable(x, name=None, ndim=None):
return x return x
else: else:
if (x.type.ndim > ndim): if (x.type.ndim > ndim):
#TODO: strip off leading broadcastable dimensions # TODO: strip off leading broadcastable dimensions
raise ValueError( raise ValueError(
'TensorType could not be cast to have %i dimensions' % 'TensorType could not be cast to have %i dimensions' %
ndim, x.type) ndim, x.type)
...@@ -369,7 +371,7 @@ def constant_or_value(x, rtype, name=None, ndim=None, dtype=None): ...@@ -369,7 +371,7 @@ def constant_or_value(x, rtype, name=None, ndim=None, dtype=None):
if len(bcastable) < ndim: if len(bcastable) < ndim:
bcastable = [True] * (ndim - len(bcastable)) + bcastable bcastable = [True] * (ndim - len(bcastable)) + bcastable
elif len(bcastable) > ndim: elif len(bcastable) > ndim:
#TODO: strip off dimensions of size 1 # TODO: strip off dimensions of size 1
raise ValueError( raise ValueError(
'ndarray could not be cast to constant with %i dimensions' % 'ndarray could not be cast to constant with %i dimensions' %
ndim) ndim)
...@@ -394,6 +396,7 @@ def constant(x, name=None, ndim=None, dtype=None): ...@@ -394,6 +396,7 @@ def constant(x, name=None, ndim=None, dtype=None):
return constant_or_value(x, rtype=TensorConstant, name=name, ndim=ndim, return constant_or_value(x, rtype=TensorConstant, name=name, ndim=ndim,
dtype=dtype) dtype=dtype)
def _obj_is_wrappable_as_tensor(x): def _obj_is_wrappable_as_tensor(x):
try: try:
constant(x) constant(x)
...@@ -405,7 +408,7 @@ def _obj_is_wrappable_as_tensor(x): ...@@ -405,7 +408,7 @@ def _obj_is_wrappable_as_tensor(x):
def _wrap_tensor_into_member(x): def _wrap_tensor_into_member(x):
return compile.module.Member(constant(x)) return compile.module.Member(constant(x))
compile.module.register_wrapper(_obj_is_wrappable_as_tensor, compile.module.register_wrapper(_obj_is_wrappable_as_tensor,
_wrap_tensor_into_member, no_warn = True) _wrap_tensor_into_member, no_warn=True)
if int(config.tensor.cmp_sloppy) > 1: if int(config.tensor.cmp_sloppy) > 1:
...@@ -427,15 +430,15 @@ elif int(config.tensor.cmp_sloppy): ...@@ -427,15 +430,15 @@ elif int(config.tensor.cmp_sloppy):
float64_rtol = 1e-4 float64_rtol = 1e-4
float64_atol = 1e-3 float64_atol = 1e-3
else: else:
#If you change those value in test don't forget to put them back # If you change those value in test don't forget to put them back
#when the test end. Don't forget the case when the test fail. # when the test end. Don't forget the case when the test fail.
float32_atol = 1e-5 float32_atol = 1e-5
float32_rtol = 1e-5 float32_rtol = 1e-5
# defaults in numpy.allclose # defaults in numpy.allclose
float64_rtol = 1.0000000000000001e-05 float64_rtol = 1.0000000000000001e-05
float64_atol = 1e-8 float64_atol = 1e-8
#more strict. Atleast float32 precision. # more strict. Atleast float32 precision.
float64_rtol = 1.0000000000000001e-06 float64_rtol = 1.0000000000000001e-06
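The tolerance pairs above follow the `numpy.allclose` comparison rule, where two values match if `|a - b| <= atol + rtol * |b|`. A scalar sketch of that rule, with a hypothetical helper name:

```python
def close_scalar(a, b, rtol, atol):
    # numpy.allclose convention for a single pair of values
    return abs(a - b) <= atol + rtol * abs(b)

# the strict float64 settings from this block
rtol, atol = 1e-06, 1e-8
within = close_scalar(1.0, 1.0 + 5e-7, rtol, atol)   # inside rtol
outside = close_scalar(1.0, 1.0 + 5e-5, rtol, atol)  # well outside
```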
...@@ -494,9 +497,9 @@ def get_constant_value(v): ...@@ -494,9 +497,9 @@ def get_constant_value(v):
shape, val = v.owner.inputs shape, val = v.owner.inputs
# fill(a,b) fills the shape of 'a' filled with 'b' # fill(a,b) fills the shape of 'a' filled with 'b'
return get_constant_value(val) return get_constant_value(val)
#Don't act as the constant_folding optimization here as this # Don't act as the constant_folding optimization here as this
#fct is used too early in the optimization phase. This would # fct is used too early in the optimization phase. This would
#mess with the stabilization optimization. # mess with the stabilization optimization.
if isinstance(v.owner.op, Elemwise) and isinstance( if isinstance(v.owner.op, Elemwise) and isinstance(
v.owner.op.scalar_op, scal.Cast): v.owner.op.scalar_op, scal.Cast):
const = get_constant_value(v.owner.inputs[0]) const = get_constant_value(v.owner.inputs[0])
...@@ -529,7 +532,7 @@ def get_constant_value(v): ...@@ -529,7 +532,7 @@ def get_constant_value(v):
ret = v.owner.inputs[0].owner.inputs[ ret = v.owner.inputs[0].owner.inputs[
v.owner.op.idx_list[0] + 1] v.owner.op.idx_list[0] + 1]
ret = get_constant_value(ret) ret = get_constant_value(ret)
#join can cast implicitly its input in some case. # join can cast implicitly its input in some case.
return theano._asarray(ret, dtype=v.type.dtype) return theano._asarray(ret, dtype=v.type.dtype)
if (v.owner.inputs[0].owner and if (v.owner.inputs[0].owner and
isinstance(v.owner.inputs[0].owner.op, isinstance(v.owner.inputs[0].owner.op,
...@@ -542,7 +545,7 @@ def get_constant_value(v): ...@@ -542,7 +545,7 @@ def get_constant_value(v):
ret = v.owner.inputs[0].owner.inputs[v.owner.op.idx_list[0]] ret = v.owner.inputs[0].owner.inputs[v.owner.op.idx_list[0]]
ret = get_constant_value(ret) ret = get_constant_value(ret)
#MakeVector can cast implicitly its input in some case. # MakeVector can cast implicitly its input in some case.
return theano._asarray(ret, dtype=v.type.dtype) return theano._asarray(ret, dtype=v.type.dtype)
# This is needed when we take the grad as the Shape op # This is needed when we take the grad as the Shape op
...@@ -747,8 +750,8 @@ class TensorType(Type): ...@@ -747,8 +750,8 @@ class TensorType(Type):
This function is used internally as part of C code generation. This function is used internally as part of C code generation.
""" """
#TODO: add more type correspondances for e.g. int32, int64, float32, # TODO: add more type correspondances for e.g. int32, int64, float32,
#complex64, etc. # complex64, etc.
try: try:
return { return {
'float32': (float, 'npy_float32', 'NPY_FLOAT32'), 'float32': (float, 'npy_float32', 'NPY_FLOAT32'),
...@@ -786,7 +789,7 @@ class TensorType(Type): ...@@ -786,7 +789,7 @@ class TensorType(Type):
@staticmethod @staticmethod
def values_eq(a, b, force_same_dtype=True): def values_eq(a, b, force_same_dtype=True):
#TODO: check to see if the shapes must match # TODO: check to see if the shapes must match
# for now, we err on safe side... # for now, we err on safe side...
if a.shape != b.shape: if a.shape != b.shape:
return False return False
...@@ -863,14 +866,14 @@ class TensorType(Type): ...@@ -863,14 +866,14 @@ class TensorType(Type):
# Find places where both a and b have inf of the same sign. # Find places where both a and b have inf of the same sign.
both_inf = a_inf * numpy.isinf(b) both_inf = a_inf * numpy.isinf(b)
#cmp_elemwise is weird when we have inf and -inf. # cmp_elemwise is weird when we have inf and -inf.
#set it to False # set it to False
cmp_elemwise = numpy.where( cmp_elemwise = numpy.where(
both_inf & cmp_elemwise, both_inf & cmp_elemwise,
a == b, a == b,
cmp_elemwise) cmp_elemwise)
#check the sign of the inf # check the sign of the inf
both_inf = numpy.where(both_inf, (a == b), both_inf) both_inf = numpy.where(both_inf, (a == b), both_inf)
if allow_remove_inf: if allow_remove_inf:
...@@ -1244,21 +1247,21 @@ tensor4s, ftensor4s, dtensor4s, itensor4s, ltensor4s = _multi( ...@@ -1244,21 +1247,21 @@ tensor4s, ftensor4s, dtensor4s, itensor4s, ltensor4s = _multi(
class _tensor_py_operators: class _tensor_py_operators:
#UNARY # UNARY
def __abs__(self): def __abs__(self):
return abs_(self) return abs_(self)
def __neg__(self): def __neg__(self):
return neg(self) return neg(self)
#CASTS # CASTS
#### REMOVED THESE BECAUSE PYTHON appears to require __int__ to return #### REMOVED THESE BECAUSE PYTHON appears to require __int__ to return
#### an int. -JB 20081112 #### an int. -JB 20081112
#def __int__(self): return convert_to_int32(self) #def __int__(self): return convert_to_int32(self)
#def __float__(self): return convert_to_float64(self) #def __float__(self): return convert_to_float64(self)
#def __complex__(self): return convert_to_complex128(self) #def __complex__(self): return convert_to_complex128(self)
#COMPARISONS # COMPARISONS
_is_nonzero = True _is_nonzero = True
def __lt__(self, other): def __lt__(self, other):
...@@ -1294,7 +1297,7 @@ class _tensor_py_operators: ...@@ -1294,7 +1297,7 @@ class _tensor_py_operators:
else: else:
raise TypeError("Variable does not support boolean operations.") raise TypeError("Variable does not support boolean operations.")
#BITWISE # BITWISE
def __invert__(self): def __invert__(self):
return invert(self) return invert(self)
...@@ -1316,16 +1319,16 @@ class _tensor_py_operators: ...@@ -1316,16 +1319,16 @@ class _tensor_py_operators:
def __rxor__(self, other): def __rxor__(self, other):
return xor(other, self) return xor(other, self)
#def __iand__(self, other): # def __iand__(self, other):
# return _and_inplace(self, other) # return _and_inplace(self, other)
# #
#def __ior__(self, other): # def __ior__(self, other):
# return _or_inplace(self, other) # return _or_inplace(self, other)
# #
#def __ixor__(self, other): #def __ixor__(self, other):
# return _xor_inplace(self, other) # return _xor_inplace(self, other)
#ARITHMETIC - NORMAL # ARITHMETIC - NORMAL
def __add__(self, other): def __add__(self, other):
try: try:
return add(self, other) return add(self, other)
...@@ -1439,7 +1442,7 @@ class _tensor_py_operators: ...@@ -1439,7 +1442,7 @@ class _tensor_py_operators:
def __rpow__(self, other): def __rpow__(self, other):
return pow(other, self) return pow(other, self)
#TRANSPOSE # TRANSPOSE
T = property(lambda self: transpose(self)) T = property(lambda self: transpose(self))
def transpose(self, *axes): def transpose(self, *axes):
...@@ -1502,10 +1505,9 @@ class _tensor_py_operators: ...@@ -1502,10 +1505,9 @@ class _tensor_py_operators:
""" """
if ndim is not None: if ndim is not None:
if not isinstance(ndim,int): if not isinstance(ndim, int):
raise ValueError("Expected ndim to be an integer, is "\ raise ValueError("Expected ndim to be an integer, is "\
+str(type(ndim))) + str(type(ndim)))
return reshape(self, shape, ndim=ndim) return reshape(self, shape, ndim=ndim)
...@@ -1542,7 +1544,7 @@ class _tensor_py_operators: ...@@ -1542,7 +1544,7 @@ class _tensor_py_operators:
def astype(self, dtype): def astype(self, dtype):
return cast(self, dtype) return cast(self, dtype)
#SLICING # SLICING
# Do not define __getslice__ here: # Do not define __getslice__ here:
# When calling t[1:], for instance, the arguments passed to __getslice__ # When calling t[1:], for instance, the arguments passed to __getslice__
# are (1, sys.maxsize), which is a pain to deal with, and can even not be # are (1, sys.maxsize), which is a pain to deal with, and can even not be
...@@ -1602,7 +1604,7 @@ class _tensor_py_operators: ...@@ -1602,7 +1604,7 @@ class _tensor_py_operators:
return Subtensor(args)(self, *Subtensor.collapse(args, return Subtensor(args)(self, *Subtensor.collapse(args,
lambda entry: isinstance(entry, Variable))) lambda entry: isinstance(entry, Variable)))
#COPYING # COPYING
def copy(self): def copy(self):
return tensor_copy(self) return tensor_copy(self)
...@@ -1629,7 +1631,7 @@ class _tensor_py_operators: ...@@ -1629,7 +1631,7 @@ class _tensor_py_operators:
dtype = property(lambda self: self.type.dtype) dtype = property(lambda self: self.type.dtype)
""" The dtype of this tensor. """ """ The dtype of this tensor. """
#extra pseudo-operator symbols # extra pseudo-operator symbols
def __dot__(left, right): def __dot__(left, right):
return dot(left, right) return dot(left, right)
...@@ -1649,7 +1651,7 @@ class _tensor_py_operators: ...@@ -1649,7 +1651,7 @@ class _tensor_py_operators:
raise NotImplementedError() raise NotImplementedError()
if numpy.isinf(L): if numpy.isinf(L):
raise NotImplementedError() raise NotImplementedError()
#optimizations will/should catch cases like L=1, L=2 # optimizations will/should catch cases like L=1, L=2
return pow(pow(abs_(self), L).sum(axis=axis), 1.0 / L) return pow(pow(abs_(self), L).sum(axis=axis), 1.0 / L)
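The `norm` expression above computes the L-norm: sum the absolute values raised to the power `L`, then take the L-th root. A pure-Python sketch of the same formula:

```python
def lnorm(xs, L):
    # (sum_i |x_i| ** L) ** (1 / L); L = 2 gives the Euclidean norm
    return sum(abs(x) ** L for x in xs) ** (1.0 / L)

n2 = lnorm([3.0, 4.0], 2)   # Euclidean norm of (3, 4)
n1 = lnorm([3.0, -4.0], 1)  # L1 norm: 3 + 4
```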
def mean(self, axis=None, dtype=None, keepdims=False): def mean(self, axis=None, dtype=None, keepdims=False):
...@@ -1668,7 +1670,7 @@ class _tensor_py_operators: ...@@ -1668,7 +1670,7 @@ class _tensor_py_operators:
"""See `theano.tensor.max`""" """See `theano.tensor.max`"""
return max(self, axis, keepdims=keepdims) return max(self, axis, keepdims=keepdims)
#TO TRUMP NUMPY OPERATORS # TO TRUMP NUMPY OPERATORS
__array_priority__ = 1000 __array_priority__ = 1000
def get_constant_value(self): def get_constant_value(self):
...@@ -1697,7 +1699,7 @@ class TensorConstantSignature(tuple): ...@@ -1697,7 +1699,7 @@ class TensorConstantSignature(tuple):
except Exception: except Exception:
return False return False
#N.B. compare shape to ensure no broadcasting in == # N.B. compare shape to ensure no broadcasting in ==
if t0 != t1 or d0.shape != d1.shape: if t0 != t1 or d0.shape != d1.shape:
return False return False
...@@ -1802,7 +1804,6 @@ class TensorConstant(_tensor_py_operators, Constant): ...@@ -1802,7 +1804,6 @@ class TensorConstant(_tensor_py_operators, Constant):
TensorType.Constant = TensorConstant TensorType.Constant = TensorConstant
Tensor = TensorType Tensor = TensorType
...@@ -1816,6 +1817,7 @@ elemwise.TensorConstant = TensorConstant ...@@ -1816,6 +1817,7 @@ elemwise.TensorConstant = TensorConstant
# Utilities # Utilities
######################### #########################
def _redefine(real_symbol_value, module='tensor'): def _redefine(real_symbol_value, module='tensor'):
"""Replace the value associated with a function symbol. """Replace the value associated with a function symbol.
...@@ -1872,7 +1874,7 @@ def _scal_elemwise_with_nfunc(nfunc, nin, nout): ...@@ -1872,7 +1874,7 @@ def _scal_elemwise_with_nfunc(nfunc, nin, nout):
if getattr(symbol, '__doc__', False): if getattr(symbol, '__doc__', False):
rval.__doc__ = symbol.__doc__ + '\n' + rval.__doc__ rval.__doc__ = symbol.__doc__ + '\n' + rval.__doc__
#for the meaning of this see the ./epydoc script # for the meaning of this see the ./epydoc script
# it makes epydoc display rval as if it were a function, not an object # it makes epydoc display rval as if it were a function, not an object
rval.__epydoc_asRoutine = symbol rval.__epydoc_asRoutine = symbol
rval.__module__ = 'tensor' rval.__module__ = 'tensor'
...@@ -1965,7 +1967,7 @@ class ScalarFromTensor(Op): ...@@ -1965,7 +1967,7 @@ class ScalarFromTensor(Op):
scalar_from_tensor = ScalarFromTensor() scalar_from_tensor = ScalarFromTensor()
#to be removed as we get the epydoc routine-documenting thing going # to be removed as we get the epydoc routine-documenting thing going
#-JB 20080924 #-JB 20080924
def _conversion(real_value, name): def _conversion(real_value, name):
__oplist_tag(real_value, 'casting') __oplist_tag(real_value, 'casting')
...@@ -2061,6 +2063,7 @@ def cast(x, dtype): ...@@ -2061,6 +2063,7 @@ def cast(x, dtype):
# Unary Operations # Unary Operations
########################## ##########################
class Shape(Op): class Shape(Op):
""" """
L{Op} to return the shape of a matrix. L{Op} to return the shape of a matrix.
...@@ -2077,13 +2080,13 @@ class Shape(Op): ...@@ -2077,13 +2080,13 @@ class Shape(Op):
return self.__class__.__name__ return self.__class__.__name__
def make_node(self, x): def make_node(self, x):
#Must work for all type that have a shape attribute. # Must work for all type that have a shape attribute.
#This will fail at execution time. # This will fail at execution time.
x = as_tensor_variable(x) x = as_tensor_variable(x)
#Each type variable should implement their .shape attribute # Each type variable should implement their .shape attribute
#and have the fct infer_shape() implemented in the op that convert # and have the fct infer_shape() implemented in the op that convert
#the type to TensorVariable to have the optimization working # the type to TensorVariable to have the optimization working
#correctly. # correctly.
return Apply(self, [x], [lvector()]) return Apply(self, [x], [lvector()])
def perform(self, node, inp, out_): def perform(self, node, inp, out_):
...@@ -2094,8 +2097,21 @@ class Shape(Op): ...@@ -2094,8 +2097,21 @@ class Shape(Op):
def infer_shape(self, node, in_shapes): def infer_shape(self, node, in_shapes):
return [[len(in_shapes[0])]] return [[len(in_shapes[0])]]
def connection_pattern(self):
# the grad returns the gradient with respect to the
# elements of a tensor variable
# the elements of the tensor variable do not participate
# in the computation of the shape, so they are not really
# part of the graph
return [False]
def grad(self, inp, grads): def grad(self, inp, grads):
return [grad_undefined(self,0,inp[0])] # the grad returns the gradient with respect to the
# elements of a tensor variable
# the elements of the tensor variable do not participate
# in the computation of the shape, so they are not really
# part of the graph
return [None]
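`connection_pattern` returning `[False]` tells the gradient machinery that `Shape`'s input is connected only through metadata, so even a NaN output gradient must not leak into the input's gradient. A toy sketch of how a traversal might consult it; the function names are hypothetical, not Theano's internals:

```python
def shape_connection_pattern():
    # Shape's single input has no element-level influence on the output
    return [False]

def input_grads(pattern, output_grad, inputs):
    # Inputs whose pattern entry is False never receive a gradient,
    # even if output_grad is NaN -- the NaN never enters the expression.
    return [output_grad if connected else None
            for connected, _ in zip(pattern, inputs)]

g = input_grads(shape_connection_pattern(), float('nan'), ['x'])
```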
def R_op(self, inputs, eval_points): def R_op(self, inputs, eval_points):
return [None] return [None]
...@@ -2113,7 +2129,7 @@ def old_shape(a): ...@@ -2113,7 +2129,7 @@ def old_shape(a):
shape at graph-execution time. shape at graph-execution time.
""" """
va = as_tensor_variable(a) va = as_tensor_variable(a)
#print 'HERE', va, va.type # print 'HERE', va, va.type
if None in va.type.shape: if None in va.type.shape:
# Some shape components are unknown at this time # Some shape components are unknown at this time
return _shape(va) return _shape(va)
...@@ -2314,9 +2330,21 @@ class MaxAndArgmax(Op): ...@@ -2314,9 +2330,21 @@ class MaxAndArgmax(Op):
x, axis = inp x, axis = inp
g_max, g_max_idx = grads g_max, g_max_idx = grads
# Check to see if the gradient on max is None g_max_disconnected = isinstance(g_max.type, DisconnectedType)
if g_max is None: g_max_idx_disconnected = isinstance(g_max_idx.type, DisconnectedType)
return None, None
# if the op is totally disconnected, so are its inputs
if g_max_disconnected and g_max_idx_disconnected:
return [DisconnectedType()(), DisconnectedType()()]
axis_grad = grad_undefined(self, 1, axis,
"argmax is not defined for non-integer axes so"
" argmax(x, axis+eps) is undefined")
# if the max is disconnected but the argmax is not,
# the gradient on its inputs is zero
if g_max_disconnected:
return [x.zeros_like(), axis_grad]
xmax = max(x, axis) xmax = max(x, axis)
# Raise the g_max and xmax to the same number of dim as the input. # Raise the g_max and xmax to the same number of dim as the input.
...@@ -2336,7 +2364,7 @@ class MaxAndArgmax(Op): ...@@ -2336,7 +2364,7 @@ class MaxAndArgmax(Op):
# Set the grad to the correct position. # Set the grad to the correct position.
g_x = eq(xmax_pad, x) * g_max_pad g_x = eq(xmax_pad, x) * g_max_pad
return g_x, grad_undefined(self, 1, axis) return g_x, axis_grad
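The new branching in `MaxAndArgmax.grad` can be summarized in a small stand-alone sketch; the marker objects below are hypothetical stand-ins for `DisconnectedType()()` and `grad_undefined(...)`:

```python
DISCONNECTED = object()  # stand-in for DisconnectedType()()
UNDEFINED = object()     # stand-in for grad_undefined(...)

def max_argmax_input_grads(g_max, g_argmax, zero_grad=0.0):
    # both outputs unused downstream -> both inputs disconnected
    if g_max is DISCONNECTED and g_argmax is DISCONNECTED:
        return [DISCONNECTED, DISCONNECTED]
    # the integer axis argument never has a defined gradient
    if g_max is DISCONNECTED:
        # only argmax is used downstream: gradient on x is exactly zero
        return [zero_grad, UNDEFINED]
    # otherwise g_max flows to the max locations (elided in this sketch)
    return [g_max, UNDEFINED]

both = max_argmax_input_grads(DISCONNECTED, DISCONNECTED)
only_argmax = max_argmax_input_grads(DISCONNECTED, 1.0)
```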
def __str__(self): def __str__(self):
return self.__class__.__name__ return self.__class__.__name__
...@@ -2458,7 +2486,7 @@ def min(x, axis=None, keepdims=False): ...@@ -2458,7 +2486,7 @@ def min(x, axis=None, keepdims=False):
if str_x_type.startswith('float') or str_x_type in int_dtypes: if str_x_type.startswith('float') or str_x_type in int_dtypes:
return -max(-x, axis=axis, keepdims=keepdims) return -max(-x, axis=axis, keepdims=keepdims)
else: else:
#Be careful about unsigned integers, complex # Be careful about unsigned integers, complex
raise NotImplementedError() raise NotImplementedError()
...@@ -2479,7 +2507,7 @@ def argmin(x, axis=None, keepdims=False): ...@@ -2479,7 +2507,7 @@ def argmin(x, axis=None, keepdims=False):
if str_x_type.startswith('float') or str_x_type in int_dtypes: if str_x_type.startswith('float') or str_x_type in int_dtypes:
return argmax(-x, axis=axis, keepdims=keepdims) return argmax(-x, axis=axis, keepdims=keepdims)
else: else:
#Be careful about unsigned integers, complex # Be careful about unsigned integers, complex
raise NotImplementedError() raise NotImplementedError()
...@@ -2707,7 +2735,7 @@ def sqr(a): ...@@ -2707,7 +2735,7 @@ def sqr(a):
"""square of a""" """square of a"""
#alias to sqr, included to maintain similarity with numpy interface # alias to sqr, included to maintain similarity with numpy interface
square = sqr square = sqr
...@@ -2849,7 +2877,8 @@ def complex_from_polar(abs, angle): ...@@ -2849,7 +2877,8 @@ def complex_from_polar(abs, angle):
# Misc # Misc
########################## ##########################
#fill, _fill_inplace = _elemwise(scal.second, 'fill',
# fill, _fill_inplace = _elemwise(scal.second, 'fill',
#"""fill WRITEME (elemwise)""") #"""fill WRITEME (elemwise)""")
@_scal_elemwise @_scal_elemwise
def second(a, b): def second(a, b):
...@@ -2917,7 +2946,7 @@ class Eye(gof.Op): ...@@ -2917,7 +2946,7 @@ class Eye(gof.Op):
return [out_shape] return [out_shape]
def grad(self, inp, grads): def grad(self, inp, grads):
return [ grad_undefined(self,i,inp[i]) for i in xrange(3) ] return [grad_undefined(self, i, inp[i]) for i in xrange(3)]
def __eq__(self, other): def __eq__(self, other):
return type(self) == type(other) and self.dtype == other.dtype return type(self) == type(other) and self.dtype == other.dtype
...@@ -3092,7 +3121,7 @@ class Alloc(gof.Op): ...@@ -3092,7 +3121,7 @@ class Alloc(gof.Op):
out[0] = numpy.empty(sh, dtype=v.dtype) out[0] = numpy.empty(sh, dtype=v.dtype)
out[0][...] = v # broadcast v to fill us up out[0][...] = v # broadcast v to fill us up
else: else:
#reuse the allocated memory. # reuse the allocated memory.
out[0][...] = v # broadcast v to fill us up out[0][...] = v # broadcast v to fill us up
def c_code(self, node, name, inp, out, sub): def c_code(self, node, name, inp, out, sub):
...@@ -3280,12 +3309,12 @@ class Mean(elemwise.CAReduce): ...@@ -3280,12 +3309,12 @@ class Mean(elemwise.CAReduce):
if self.axis is not None: if self.axis is not None:
return super(Op, self).c_code(node, name, inames, onames, sub) return super(Op, self).c_code(node, name, inames, onames, sub)
ret = elemwise.CAReduce.c_code(self, node, name, inames, onames, sub) ret = elemwise.CAReduce.c_code(self, node, name, inames, onames, sub)
#TODO: c_code perform support only axis is None # TODO: c_code perform support only axis is None
return ret + """ return ret + """
*((double *)PyArray_DATA(%s)) /= PyArray_SIZE(%s); *((double *)PyArray_DATA(%s)) /= PyArray_SIZE(%s);
""" % (onames[0], inames[0]) """ % (onames[0], inames[0])
#TODO: implement the grad. When done and tested, you can make this the default # TODO: implement the grad. When done and tested, you can make this the default
# version. # version.
# def grad(self, (x,), (gout,)): # def grad(self, (x,), (gout,)):
# import pdb;pdb.set_trace() # import pdb;pdb.set_trace()
...@@ -3379,28 +3408,33 @@ def var(input, axis=None, keepdims=False): ...@@ -3379,28 +3408,33 @@ def var(input, axis=None, keepdims=False):
if isinstance(axis, int): if isinstance(axis, int):
axis = [axis] axis = [axis]
#compute the axis-wise mean # compute the axis-wise mean
mean_input = mean(input, axis, keepdims=True) mean_input = mean(input, axis, keepdims=True)
#center the input # center the input
centered_input = input - mean_input centered_input = input - mean_input
#return the mean sqr # return the mean sqr
return mean((centered_input ** 2), axis, keepdims=keepdims) return mean((centered_input ** 2), axis, keepdims=keepdims)
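`var` above is the mean of squared deviations from the axis-wise mean. A pure-Python sketch of the same three steps (mean, center, mean of squares):

```python
def var(xs):
    # variance = mean((x - mean(x)) ** 2)
    m = sum(xs) / len(xs)
    centered = [x - m for x in xs]
    return sum(c * c for c in centered) / len(centered)

v = var([1.0, 2.0, 3.0, 4.0])  # mean 2.5; squared deviations average to 1.25
```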
@constructor @constructor
def std(input, axis=None, keepdims=False): def std(input, axis=None, keepdims=False):
""" """
Computes the standard deviation along the given axis(es) of a tensor `input`. Computes the standard deviation along the given axis(es)
of a tensor `input`.
:param axis: Compute the standard deviation along this axis of the tensor. :param axis: Compute the standard deviation along this
axis of the tensor.
None means all axes (like numpy). None means all axes (like numpy).
:type axis: None or int or (list of int) (see `Sum`) :type axis: None or int or (list of int) (see `Sum`)
:param keepdims: If this is set to True, the axes which are reduced are :param keepdims: If this is set to True, the axes
left in the result as dimensions with size one. With this option, which are reduced are
the result will broadcast correctly against the original tensor. left in the result as dimensions with size one.
With this option,
the result will broadcast correctly against the
original tensor.
""" """
return sqrt(var(input=input, axis=axis, keepdims=keepdims)) return sqrt(var(input=input, axis=axis, keepdims=keepdims))
...@@ -3423,8 +3457,8 @@ if 0: ...@@ -3423,8 +3457,8 @@ if 0:
type = TensorType(dtype=input.type.dtype, type = TensorType(dtype=input.type.dtype,
broadcastable=broadcastable) broadcastable=broadcastable)
#backport # backport
#type = TensorType(dtype=input.type.dtype, # type = TensorType(dtype=input.type.dtype,
# broadcastable=[ # broadcastable=[
# False if i==axis else x # False if i==axis else x
# for i, x in enumerate(input.broadcastable)]) # for i, x in enumerate(input.broadcastable)])
...@@ -3859,7 +3893,7 @@ class Subtensor(Op): ...@@ -3859,7 +3893,7 @@ class Subtensor(Op):
exception.subtensor_invalid = True exception.subtensor_invalid = True
raise exception raise exception
#infer the broadcasting pattern # infer the broadcasting pattern
padded = (idx_list padded = (idx_list
+ [slice(None, None, None)] * (x.type.ndim - len(idx_list))) + [slice(None, None, None)] * (x.type.ndim - len(idx_list)))
broadcastable = [bc for p, bc in zip(padded, x.type.broadcastable) broadcastable = [bc for p, bc in zip(padded, x.type.broadcastable)
...@@ -3942,7 +3976,7 @@ class Subtensor(Op): ...@@ -3942,7 +3976,7 @@ class Subtensor(Op):
return type(self) == type(other) and self.idx_list == other.idx_list return type(self) == type(other) and self.idx_list == other.idx_list
def __hash__(self): def __hash__(self):
#TODO: optimize by cache this hash value # TODO: optimize by cache this hash value
msg = [] msg = []
for entry in self.idx_list: for entry in self.idx_list:
if isinstance(entry, slice): if isinstance(entry, slice):
...@@ -3951,8 +3985,8 @@ class Subtensor(Op): ...@@ -3951,8 +3985,8 @@ class Subtensor(Op):
msg += [entry] msg += [entry]
idx_list = tuple(msg) idx_list = tuple(msg)
#backport # backport
#idx_list = tuple((entry.start, entry.stop, entry.step) # idx_list = tuple((entry.start, entry.stop, entry.step)
# if isinstance(entry, slice) # if isinstance(entry, slice)
# else entry # else entry
# for entry in self.idx_list) # for entry in self.idx_list)
@@ -3989,7 +4023,7 @@ class Subtensor(Op):
        fail = sub['fail']
        init_cmds = []  # initialization for subtensor_spec
        is_slice = []
-        #TODO: change that, it might lead to unexpected results,
+        # TODO: change that, it might lead to unexpected results,
        #      see assembla-#767
        NONE_CODE = maxsize - 1
@@ -4040,7 +4074,7 @@ class Subtensor(Op):
        for entry in idx_list:
            init_entry(entry)
-        #make sure we used all inputs
+        # make sure we used all inputs
        assert input_pos() == len(inputs), input_pos()
        assert len(is_slice) <= node.inputs[0].ndim, node.inputs[0].ndim
@@ -4213,7 +4247,7 @@ class Subtensor(Op):
        }
        PyArray_UpdateFlags(xview, NPY_C_CONTIGUOUS|NPY_F_CONTIGUOUS);
        """ % locals()
-        #print rval
+        # print rval
        return rval
    @staticmethod
@@ -4398,7 +4432,7 @@ class IncSubtensor(Op):
                msg += [entry]
        idx_list = tuple(msg)
-        #backport
+        # backport
        #idx_list = tuple((entry.start, entry.stop, entry.step)
        #             if isinstance(entry, slice)
        #             else entry
@@ -4675,7 +4709,7 @@ class Split(Op):
    def perform(self, node, inputs, outputs):
        """WRITEME"""
        x, axis, splits = inputs
-        #in python 2.4, x.shape[numpy.asarray(1)] don't work.
+        # in python 2.4, x.shape[numpy.asarray(1)] don't work.
        if sys.version_info[0:2] == (2, 4) and axis.size == 1:
            axis = int(axis)
@@ -5376,7 +5410,6 @@ class Reshape(Op):
            raise ValueError('Cannot reshape input of shape %s to shape %s' %
                             (x.shape, shp))
    def grad(self, inp, grads):
        x, shp = inp
        g_out, = grads
@@ -5399,7 +5432,7 @@ class Reshape(Op):
        # The following expression leads to cycles in feature_shape,
        # because it tries to replace the Shape_i node by the switch
        # statement, which depends on Shape_i.
-        #return [tuple([switch(eq(node.inputs[1][i], -1),
+        # return [tuple([switch(eq(node.inputs[1][i], -1),
        #               theano.tensor.opt.Shape_i(i)(node.outputs[0]),
        #               node.inputs[1][i])
        #               for i in xrange(self.ndim)]
@@ -5462,7 +5495,8 @@ class Reshape(Op):
                %(shp)s->data + ii * %(shp)s->strides[0]))[0];
        }
        Py_XDECREF(%(z)s);
-        %(z)s = (PyArrayObject *) PyArray_Newshape(%(x)s, &newshape, PyArray_CORDER);
+        %(z)s = (PyArrayObject *) PyArray_Newshape(%(x)s, &newshape,
+                                                   PyArray_CORDER);
        if (!%(z)s)
        {
            PyErr_Format(PyExc_ValueError,
@@ -5557,7 +5591,7 @@ def flatten(x, outdim=1):
#     """
#     Calculates the gradient of the Tile Op.
#     """
-#     #this is so weird, I can't think of how to make this a general thing.
+#     # this is so weird, I can't think of how to make this a general thing.
#     def make_node(self, x, reps, g_out):
#         return gof.Apply(self, [x, reps, g_out], [x.type()])
#
@@ -5645,11 +5679,11 @@ def tile(x, reps, ndim=None):
    TODO: expand this.
    """
    try:
        assert python_all([int(i) == i for i in iter(reps)])
    except (TypeError, AssertionError):
        raise ValueError("reps argument to tile must be a constant (e.g. "
                         "tuple, list of integers)")
    if len(reps) != x.ndim:
        raise ValueError("len(reps) != x.ndim not currently supported")
    elif (ndim is not None) and ndim != x.ndim:
@@ -5663,7 +5697,7 @@ def tile(x, reps, ndim=None):
        ndim = len(reps)
    # backport
-    # ndim = len(reps) if ndim is None else ndim #not sure if len(shp) is going
+    # ndim = len(reps) if ndim is None else ndim # not sure if len(shp) is going
    #     to work.
    if ndim not in tile.op:
        tile.op[ndim] = Tile(ndim)
@@ -6146,7 +6180,7 @@ class AdvancedSubtensor(Op):
    def make_node(self, x, *inputs):
        x = as_tensor_variable(x)
-        #FIXME
+        # FIXME
        # Note (9 Jul 2012): what does this 'FIXME' mean? Possibly that the
        # current implementation must be generalized? Please specify.
        if x.ndim == 2 and len(inputs) == 2:
@@ -6209,7 +6243,7 @@ class AdvancedSubtensor(Op):
                    'are too big (>= 2^32 elements). It is possible that '
                    'out[0] (%s), with shape %s, is not correctly filled.'
                    % (out[0], out[0].shape))
-        #return
+        # return
        #raise NotImplementedError()
    def grad(self, inputs, grads):
@@ -6232,8 +6266,8 @@ class AdvancedIncSubtensor(Op):
    def __init__(self, inplace=False, set_instead_of_inc=False):
        self.inplace = inplace
        self.set_instead_of_inc = set_instead_of_inc
-        #The assert is needed as in the pass the first argument was
-        #something else that was not used.
+        # The assert is needed as in the pass the first argument was
+        # something else that was not used.
        assert isinstance(inplace, bool)
        if self.inplace:
            raise NotImplementedError('In place computation is not'
@@ -6325,6 +6359,7 @@ advanced_inc_subtensor = AdvancedIncSubtensor()
#
# TODO: Dotinv should go here, Eigs, Svd, etc.
class Dot(Op):
    """Compute matrix-matrix, matrix-vector products and vector inner-products.
@@ -6351,7 +6386,7 @@ class Dot(Op):
        numpy_semantics = 0
        if numpy_semantics:
-            #numpy defines dot for tensor pairs with any rank
+            # numpy defines dot for tensor pairs with any rank
            if len(inputs) != 2:
                raise TypeError(
                    "Wrong number of inputs for %s (got %i, expected 2)" %
@@ -6712,7 +6747,7 @@ def tensordot(x, y=None, axes=2):
    return tensordot.op[axes](x, y)
-#TODO: tensordot should be function as described in rst docs.
+# TODO: tensordot should be function as described in rst docs.
def outer(x, y):
...
@@ -98,26 +98,26 @@ class Conv3D(theano.Op):
        if 'name' in dir(dCdH) and dCdH.name is not None:
            dCdH_name = dCdH.name
        else:
-            dCdH_name = 'anon'
+            dCdH_name = 'anon_dCdH'
        if 'name' in dir(V) and V.name is not None:
            V_name = V.name
        else:
-            V_name = 'anon'
+            V_name = 'anon_V'
        if 'name' in dir(W) and W.name is not None:
            W_name = W.name
        else:
-            W_name = 'anon'
+            W_name = 'anon_W'
        if 'name' in dir(b) and b.name is not None:
            b_name = b.name
        else:
-            b_name = 'anon'
+            b_name = 'anon_b'
-        dCdV.name = 'Conv3D_dCdV.dCdH='+dCdH_name+',V='+V_name
-        dCdW.name = 'Conv3D_dCdW.dCdH='+dCdH_name+',V='+V_name+',W='+W_name
-        dCdb.name = 'Conv3D_dCdb.dCdH='+dCdH_name+',V='+V_name+',W='+W_name+',b='+b_name
+        dCdV.name = 'Conv3D_dCdV(dCdH='+dCdH_name+',V='+V_name+')'
+        dCdW.name = 'Conv3D_dCdW(dCdH='+dCdH_name+',V='+V_name+',W='+W_name+')'
+        dCdb.name = 'Conv3D_dCdb(dCdH='+dCdH_name+',V='+V_name+',W='+W_name+',b='+b_name+')'
...
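The repeated if/else blocks in this hunk all implement one pattern: use a variable's `.name` when it is set, otherwise fall back to a distinct per-argument anonymous label (the change in the commit is precisely to make those fallbacks distinct, `anon_dCdH` vs a shared `anon`). A condensed sketch of that pattern; the helper `name_or_anon` and the `Var` stub are illustrative, not part of the commit:

```python
def name_or_anon(var, anon_label):
    """Return var.name when set, else a distinct anonymous label."""
    name = getattr(var, 'name', None)
    return name if name is not None else 'anon_' + anon_label


class Var(object):
    """Minimal stand-in for a Theano variable with an optional name."""
    def __init__(self, name=None):
        self.name = name


dCdH_name = name_or_anon(Var('my_grad'), 'dCdH')
V_name = name_or_anon(Var(), 'V')
print(dCdH_name)  # my_grad
print(V_name)     # anon_V
# Debug names for derived variables are then built from the pieces:
print('Conv3D_dCdV(dCdH=%s,V=%s)' % (dCdH_name, V_name))
```

A helper like this would collapse the four near-identical if/else blocks into four one-liners.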
@@ -56,22 +56,22 @@ class ConvTransp3D(theano.Op):
        if 'name' in dir(dCdR) and dCdR.name is not None:
            dCdR_name = dCdR.name
        else:
-            dCdR_name = 'anon'
+            dCdR_name = 'anon_dCdR'
        if 'name' in dir(H) and H.name is not None:
            H_name = H.name
        else:
-            H_name = 'anon'
+            H_name = 'anon_H'
        if 'name' in dir(W) and W.name is not None:
            W_name = W.name
        else:
-            W_name = 'anon'
+            W_name = 'anon_W'
        if 'name' in dir(b) and b.name is not None:
            b_name = b.name
        else:
-            b_name = 'anon'
+            b_name = 'anon_b'
        dCdW.name = 'ConvTransp3D_dCdW.H='+H_name+',dCdR='+dCdR_name+',W='+W_name
...
@@ -780,9 +780,19 @@ class ConvOp(OpenMPOp):
        # build a "node", that should be equivalent to the one given by
        # self.make_node, but using conv3D instead of self.
+        shuffled_inputs = inputs.dimshuffle(0, 2, 3, 'x', 1)
+        if inputs.name is not None:
+            shuffled_inputs.name = 'shuffle_for_conv3D(%s)' % inputs.name
+        flipped_kerns = kerns[:, :, ::-1, ::-1]
+        if kerns.name is not None:
+            flipped_kerns.name = 'flipped(%s)' % kerns.name
+        shuffled_kerns = flipped_kerns.dimshuffle(0, 2, 3, 'x', 1)
+        if flipped_kerns.name is not None:
+            shuffled_kerns.name = 'shuffled_for_conv3D(%s)' % flipped_kerns.name
        tmp_node = theano.tensor.nnet.conv3D(
-            V=inputs.dimshuffle(0, 2, 3, 'x', 1),
-            W=kerns[:, :, ::-1, ::-1].dimshuffle(0, 2, 3, 'x', 1),
+            V=shuffled_inputs,
+            W=shuffled_kerns,
            b=theano.tensor.alloc(numpy.asarray(0, dtype=kerns.dtype),
                                  kerns.shape[0]),
            d=(self.dx, self.dy, 1))
...
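The `kerns[:, :, ::-1, ::-1]` expression in this hunk reverses only the two trailing spatial axes of the 4-D kernel tensor before the dimshuffle, which is what turns correlation-style indexing into true convolution. The flip itself can be checked with plain NumPy (this toy array is illustrative, not data from the commit):

```python
import numpy as np

# (nkern, stack, row, col) layout, matching ConvOp's 4-D kernels
kerns = np.arange(2 * 3 * 2 * 2).reshape(2, 3, 2, 2)

# Reverse the row and column axes only; leading axes are untouched.
flipped = kerns[:, :, ::-1, ::-1]

# Each 2x2 spatial patch is rotated 180 degrees.
print(kerns[0, 0].tolist())    # [[0, 1], [2, 3]]
print(flipped[0, 0].tolist())  # [[3, 2], [1, 0]]
```

Because basic slicing with negative strides returns a view, the flip itself costs no copy; the copy, if any, happens later in the dimshuffle.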
@@ -14,6 +14,7 @@ from theano.compile import optdb
from theano.gof import Apply
from theano.tensor.nnet.sigm import sigmoid, softplus
+from theano.gradient import DisconnectedType
############
@@ -76,6 +77,10 @@ class SoftmaxWithBias(gof.Op):
    def grad(self, inp, grads):
        x, b = inp
        g_sm, = grads
+        if isinstance(g_sm.type, DisconnectedType):
+            return [DisconnectedType()(), DisconnectedType()()]
        sm = softmax_with_bias(x, b)
        dx = softmax_grad(g_sm, sm)
        db = tensor.sum(dx, axis=0)
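The new guard in `SoftmaxWithBias.grad` short-circuits: when the incoming output gradient is disconnected, it returns a `DisconnectedType` instance per input instead of building the softmax gradient graph. A minimal pure-Python sketch of that dispatch; the `DisconnectedType` and `Grad` stubs below stand in for the real Theano types and are not part of the commit:

```python
class DisconnectedType(object):
    """Stub for theano.gradient.DisconnectedType; calling an instance
    yields a marker value standing in for a disconnected variable."""
    def __call__(self):
        return 'disconnected'


class Grad(object):
    """Stub gradient variable carrying only a .type attribute."""
    def __init__(self, type_):
        self.type = type_


def softmax_with_bias_grad(inp, grads):
    x, b = inp
    g_sm, = grads
    if isinstance(g_sm.type, DisconnectedType):
        # No gradient flows through the softmax output: both inputs
        # are reported as disconnected, one marker per input.
        return [DisconnectedType()(), DisconnectedType()()]
    return ['dx(%s)' % x, 'db(%s)' % b]  # placeholder for the symbolic grads


print(softmax_with_bias_grad(('x', 'b'), [Grad(DisconnectedType())]))
print(softmax_with_bias_grad(('x', 'b'), [Grad(object())]))
```

The point of the early return is that a disconnected output gradient must not be fed into `softmax_grad`, which would otherwise build a bogus gradient expression.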
@@ -710,21 +715,40 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
    def grad(self, inp, grads):
        x, b, y_idx = inp
        g_nll, g_sm, g_am = grads
-        if g_am is not None:
-            raise NotImplementedError()
-        elif g_sm is not None:
-            # There is a gradient w.r.t. the softmax's output itself.
-            if g_nll is not None or g_am is not None:
-                raise NotImplementedError()
-            return softmax_with_bias.grad((x, b, ), (g_sm, )) + (None, )
-        else:
-            # There is a gradient w.r.t. the NLL.
-            assert g_nll is not None
-            nll, sm = crossentropy_softmax_1hot_with_bias(x, b, y_idx)
-            #dx = CrossentropySoftmax1HotWithBiasDx()(g_nll, sm, y_idx)
-            dx = crossentropy_softmax_1hot_with_bias_dx(g_nll, sm, y_idx)
-            db = tensor.sum(dx, axis=[0])
-            return dx, db, None
+        dx_terms = []
+        db_terms = []
+        d_idx_terms = []
+        if not isinstance(g_nll.type, DisconnectedType):
+            nll, sm = crossentropy_softmax_1hot_with_bias(x, b, y_idx)
+            dx = crossentropy_softmax_1hot_with_bias_dx(g_nll, sm, y_idx)
+            db = tensor.sum(dx, axis=[0])
+            dx_terms.append(dx)
+            db_terms.append(db)
+        if not isinstance(g_sm.type, DisconnectedType):
+            dx, db = softmax_with_bias.grad((x, b), (g_sm, ))
+            dx_terms.append(dx)
+            db_terms.append(db)
+        if not isinstance(g_am.type, DisconnectedType):
+            dx_terms.append(x.zeros_like())
+            db_terms.append(b.zeros_like())
+            d_idx_terms.append(y_idx.zeros_like())
+        def fancy_sum(terms):
+            if len(terms) == 0:
+                return DisconnectedType()()
+            rval = terms[0]
+            for term in terms[1:]:
+                rval = rval + term
+            return rval
+        return [fancy_sum(terms) for terms in
+                [dx_terms, db_terms, d_idx_terms]]
    def c_headers(self):
        return ['<iostream>', '<cmath>']
...
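The `fancy_sum` helper in the hunk above reduces each per-input list of gradient contributions: an empty list means no connected output contributed, so the input is reported as disconnected; otherwise the terms are summed. The reduction is ordinary Python and can be sketched with numbers in place of symbolic terms; `DISCONNECTED` here is an illustrative stand-in for `DisconnectedType()()`:

```python
DISCONNECTED = object()  # stand-in marker for DisconnectedType()()


def fancy_sum(terms):
    """Sum gradient contributions; no contributions means disconnected."""
    if len(terms) == 0:
        return DISCONNECTED
    rval = terms[0]
    for term in terms[1:]:
        rval = rval + term
    return rval


print(fancy_sum([]) is DISCONNECTED)  # True: input disconnected
print(fancy_sum([1.5]))               # 1.5
print(fancy_sum([1.0, 2.0, 3.0]))     # 6.0
```

With symbolic Theano variables the same `+` builds an add node per extra term, so a one-element list returns the term itself with no extra graph nodes.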
@@ -18,7 +18,9 @@ class TestConv2D(utt.InferShapeTester):
    def setUp(self):
        super(TestConv2D, self).setUp()
        self.input = T.dtensor4('input')
+        self.input.name = 'default_V'
        self.filters = T.dtensor4('filters')
+        self.filters.name = 'default_filters'
    def validate(self, image_shape, filter_shape,
                 border_mode='valid', subsample=(1, 1),
@@ -34,7 +36,7 @@ class TestConv2D(utt.InferShapeTester):
        N_filter_shape = [T.get_constant_value(T.
            as_tensor_variable(x)) for x in filter_shape]
-        if not input:
+        if input is None:
            input = self.input
        if not filters:
            filters = self.filters
@@ -44,11 +46,16 @@ class TestConv2D(utt.InferShapeTester):
        # we create a symbolic function so that verify_grad can work
        def sym_conv2d(input, filters):
            # define theano graph and function
-            return conv.conv2d(input, filters, image_shape, filter_shape,
-                               border_mode, subsample, unroll_batch=unroll_batch,
-                               unroll_kern=unroll_kern, unroll_patch=unroll_patch)
+            input.name = 'input'
+            filters.name = 'filters'
+            rval = conv.conv2d(input, filters, image_shape, filter_shape,
                               border_mode, subsample, unroll_batch=unroll_batch,
                               unroll_kern=unroll_kern, unroll_patch=unroll_patch)
+            rval.name = 'conv_output'
+            return rval
        output = sym_conv2d(input, filters)
+        output.name = 'conv2d(%s,%s)' % (input.name, filters.name)
        theano_conv = theano.function([input, filters], output)
        # initialize input and compute result
...
@@ -121,33 +121,49 @@ class TestConv3D(utt.InferShapeTester):
        mode.check_py_code = False
        self.W = shared(N.ndarray(shape=(1, 1, 1, 1, 1), dtype=floatX))
+        self.W.name = 'W'
        self.b = shared(N.zeros(1, dtype=floatX))
+        self.b.name = 'b'
        self.rb = shared(N.zeros(1, dtype=floatX))
+        self.rb.name = 'rb'
        self.V = shared(N.ndarray(shape=(1, 1, 1, 1, 1), dtype=floatX))
+        self.V.name = 'V'
        self.d = shared(N.ndarray(shape=(3, ), dtype=int))
+        self.d.name = 'd'
        self.H = conv3D(self.V, self.W, self.b, self.d)
+        self.H.name = 'H'
        self.H_func = function([], self.H, mode=mode)
        self.H_shape_func = function([], self.H.shape, mode=mode)
        self.RShape = T.vector(dtype='int64')
+        self.RShape.name = 'RShape'
        self.otherH = T.TensorType(floatX,
            (False, False, False, False, False))(name='otherH')
        self.transp = convTransp3D(self.W, self.rb, self.d,
                                   self.otherH, self.RShape)
+        self.transp.name = 'transp'
        self.transp_func = function([self.otherH, self.RShape],
                                    self.transp, mode=mode)
        self.R = convTransp3D(self.W, self.rb, self.d, self.H, self.RShape)
+        self.R.name = 'R'
        self.R_func = function([self.RShape], self.R, mode=mode)
        self.R_shape_func = function([self.RShape], self.R.shape)
-        self.reconsObj = T.sum(T.sqr(self.V - self.R))
+        diff = self.V - self.R
+        diff.name = 'diff'
+        sqr = T.sqr(diff)
+        sqr.name = 'sqr'
+        self.reconsObj = T.sum(sqr)
+        self.reconsObj.name = 'reconsObj'
        self.reconsObjFunc = function([self.RShape], self.reconsObj, mode=mode)
+        W_grad = T.grad(self.reconsObj, self.W)
        self.gradientsFunc = function([self.RShape],
-                                      [T.grad(self.reconsObj, self.W), T.grad(self.reconsObj,
+                                      [W_grad, T.grad(self.reconsObj,
                                      self.H), T.grad(self.reconsObj, self.V),
                                      T.grad(self.reconsObj, self.b)], mode=mode)
...
@@ -2832,16 +2832,16 @@ class Canonizer(gof.LocalOptimizer):
        # this canonized graph... if so, we do nothing and wait for
        # them to be transformed.
        def _bypass_dimshuffle(n):
-            if isinstance(n.op, DimShuffle) and len(n.outputs[0].clients) <= 1:
-                return _bypass_dimshuffle(n.outputs[0].clients.__iter__(
-                    ).next()[0])
+            if (isinstance(getattr(n, 'op', None), DimShuffle) and
+                    len(n.outputs[0].clients) <= 1):
+                return _bypass_dimshuffle(n.outputs[0].clients[0][0])
            else:
                return n
        for c, c_idx in out.clients:
            if c == 'output':
                continue
-            if _bypass_dimshuffle(c).op in [self.main, self.inverse,
-                                            self.reciprocal]:
+            if getattr(_bypass_dimshuffle(c), 'op', '') in [
+                    self.main, self.inverse, self.reciprocal]:
                return False
        # Here we make the canonical version of the graph around this node
...
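`_bypass_dimshuffle` walks forward through chains of single-client `DimShuffle` nodes before inspecting the op, and the rewrite guards with `getattr` because an entry in a client list may be the string `'output'` rather than an `Apply` node. A self-contained sketch of the traversal; the `Node` and `DimShuffle` stubs are illustrative, and the sketch requires exactly one client before recursing so that indexing an empty client list cannot raise:

```python
class DimShuffle(object):
    """Stand-in for theano.tensor.elemwise.DimShuffle."""
    pass


class Node(object):
    """Minimal stand-in for an Apply node with a single output."""
    def __init__(self, op, clients=()):
        self.op = op
        self.outputs = [self]         # the node doubles as its own output
        self.clients = list(clients)  # list of (client_node, input_index)


def bypass_dimshuffle(n):
    """Follow single-client DimShuffle nodes to the real consumer.

    getattr() tolerates objects without an .op attribute, mirroring
    the guarded version in the hunk above."""
    if (isinstance(getattr(n, 'op', None), DimShuffle)
            and len(n.outputs[0].clients) == 1):
        return bypass_dimshuffle(n.outputs[0].clients[0][0])
    return n


consumer = Node(op='add')
middle = Node(op=DimShuffle(), clients=[(consumer, 0)])
print(bypass_dimshuffle(middle) is consumer)    # True: DimShuffle skipped
print(bypass_dimshuffle(consumer) is consumer)  # True: returned unchanged
```

A `DimShuffle` with several clients is deliberately not bypassed, since rewriting past it could affect the other consumers.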
@@ -2023,6 +2023,10 @@ class T_max_and_argmax(unittest.TestCase):
        because there is no differentiable path from cost to the input and
        not because of an error of the grad method of the op
        """
+        raise KnownFailureTest("The desired behavior of the grad method in "
+                               "this case is currently under debate. In any case, "
+                               "the result should be to return NaN or 0, not to "
+                               "report a disconnected input.")
        x = matrix()
        cost = argmax(x, axis=0).sum()
        value_error_raised = False
@@ -2220,6 +2224,7 @@ class T_argmin_argmax(unittest.TestCase):
    def test_grad_argmin(self):
        data = rand(2, 3)
        n = as_tensor_variable(data)
+        n.name = 'n'
        #test grad of argmin
        utt.verify_grad(lambda v: argmin(v, axis=-1), [data])
@@ -2231,7 +2236,9 @@ class T_argmin_argmax(unittest.TestCase):
        utt.verify_grad(lambda v: argmin(v.flatten()), [data])
        try:
-            grad(argmin(n, axis=-1), n)
+            cost = argmin(n, axis=-1)
+            cost.name = None
+            g = grad(cost, n)
            raise Exception('Expected an error')
        except TypeError:
            pass
@@ -4375,6 +4382,7 @@ class test_grad(unittest.TestCase):
        o = test_grad.O()
        a1 = o.make_node()
        g0,g1 = grad(a1.outputs[0], a1.inputs)
+        g0.name = None
        self.assertTrue(o.gval0 is g0)
        self.assertTrue(o.gval1 is g1)
@@ -4435,10 +4443,8 @@ class test_grad(unittest.TestCase):
        v = vector()
        m = matrix()
        # grad(v,...) and grad(m,...) should fail
-        self.assertRaises(TypeError, grad, v, s)
-        self.assertRaises(TypeError, grad, v, m)
-        self.assertRaises(TypeError, grad, m, s)
-        self.assertRaises(TypeError, grad, m, v)
+        self.assertRaises(TypeError, grad, v, v)
+        self.assertRaises(TypeError, grad, m, m)
class T_op_cache(unittest.TestCase):
    def setUp(self):
...
@@ -10,19 +10,22 @@ from theano.gradient import grad_sources_inputs
from theano import gradient
from theano.tensor.nnet.Conv3D import conv3D
from theano import config
+import numpy as np
+one = theano.tensor.as_tensor_variable(1.)
def _grad_sources_inputs(*args):
    # warn_type was introduced after this code, it complains throughout for nothing.
    return grad_sources_inputs(warn_type=False, *args)
class test_grad_sources_inputs(unittest.TestCase):
    def test_retNone1(self):
        """Test that it is not ok to return None from op.grad()"""
        class retNone(gof.op.Op):
            def make_node(self):
-                inputs = [gof.generic()]
-                outputs = [gof.generic()]
+                inputs = [theano.tensor.vector()]
+                outputs = [theano.tensor.vector()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inp, grads):
                x, = inp
@@ -30,240 +33,118 @@ class test_grad_sources_inputs(unittest.TestCase):
                pass
        a = retNone().make_node()
        try:
-            _grad_sources_inputs([(a.out, 1)], None)
-        except ValueError, e:
-            self.assertTrue(e[0] is gradient._msg_retType)
+            _grad_sources_inputs([(a.out, one)], None)
+        except TypeError, e:
            return
        self.fail()
-    def test_retNone1_b(self):
-        """Test that it is ok to return [None] from op.grad()"""
-        class retNone(gof.op.Op):
-            def make_node(self, *inputs):
-                outputs = [gof.generic()]
-                return gof.Apply(self, inputs, outputs)
-            def grad(self, inp, grads):
-                return [None]
-        i = gof.generic()
-        a = retNone().make_node(i)
-        g = _grad_sources_inputs([(a.out, 1)], None)
-        self.assertTrue(not i in g)
    def test_wrong_rval_len1(self):
-        """Test that it is not ok to return the wrong number of gradients"""
+        """Test that it is not ok to return the wrong number of gradient terms"""
        class retNone(gof.op.Op):
            def make_node(self, *inputs):
-                outputs = [gof.generic()]
+                outputs = [theano.tensor.vector()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inputs, grads):
                return [None]
-        i = gof.generic()
-        j = gof.generic()
+        i = theano.tensor.vector()
+        j = theano.tensor.vector()
        a1 = retNone().make_node(i)
-        g = _grad_sources_inputs([(a1.out, 1)], None)
+        g = _grad_sources_inputs([(a1.out, one)], None)
        a2 = retNone().make_node(i,j)
        try:
-            g = _grad_sources_inputs([(a2.out, 1)], None)
+            g = _grad_sources_inputs([(a2.out, one)], None)
        except ValueError, e:
-            self.assertTrue(e[0] is gradient._msg_badlen)
            return
        self.fail()
-    def test_stop_on_all_none(self):
-        """Test that op.grad() is not called when output grads are all None"""
-        class retNone(gof.op.Op):
-            def __init__(self, tst):
-                self.tst = tst
-            def make_node(self, *inputs):
-                outputs = [gof.generic()]
-                return gof.Apply(self, inputs, outputs)
-            def grad(self, inputs, grads):
-                self.tst.fail()
-        i = gof.generic()
-        a1 = retNone(self).make_node(i)
-        g = _grad_sources_inputs([(a1.out, None)], None)
    def test_1in_1out(self):
        """Test grad is called correctly for a 1-to-1 op"""
-        gval = gof.generic()
+        gval = theano.tensor.matrix()
        class O(gof.op.Op):
            def make_node(self):
-                inputs = [gof.generic()]
-                outputs = [gof.generic()]
+                inputs = [theano.tensor.matrix()]
+                outputs = [theano.tensor.matrix()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inp, grads):
                return gval,
        a1 = O().make_node()
-        g = _grad_sources_inputs([(a1.outputs[0], 1)], None)
+        g = _grad_sources_inputs([(a1.outputs[0], one)], None)
        self.assertTrue(g[a1.inputs[0]] is gval)
    def test_1in_Nout(self):
        """Test grad is called correctly for a 1-to-many op"""
-        gval = gof.generic()
+        gval = theano.tensor.matrix()
        class O(gof.op.Op):
            def make_node(self):
-                inputs = [gof.generic()]
-                outputs = [gof.generic(),gof.generic()]
+                inputs = [theano.tensor.matrix()]
+                outputs = [theano.tensor.scalar(),theano.tensor.scalar()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inp, grads):
                x, = inp
                gz1, gz2 = grads
                return gval,
        a1 = O().make_node()
-        g = _grad_sources_inputs([(a1.outputs[0], 1)], None)
+        g = _grad_sources_inputs([(a1.outputs[0], one)], None)
        self.assertTrue(g[a1.inputs[0]] is gval)
    def test_Nin_1out(self):
        """Test grad is called correctly for a many-to-1 op"""
-        gval0 = gof.generic()
-        gval1 = gof.generic()
+        gval0 = theano.tensor.scalar()
+        gval1 = theano.tensor.scalar()
        class O(gof.op.Op):
            def make_node(self):
-                inputs = [gof.generic(),gof.generic()]
-                outputs = [gof.generic()]
+                inputs = [theano.tensor.scalar(), theano.tensor.scalar()]
+                outputs = [theano.tensor.matrix()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inp, grads):
                x0, x1 = inp
                gz, = grads
                return (gval0, gval1)
        a1 = O().make_node()
-        g = _grad_sources_inputs([(a1.outputs[0], 1)], None)
+        g = _grad_sources_inputs([(a1.outputs[0], one)], None)
        self.assertTrue(g[a1.inputs[0]] is gval0)
        self.assertTrue(g[a1.inputs[1]] is gval1)
    def test_Nin_Nout(self):
        """Test grad is called correctly for a many-to-many op"""
-        gval0 = gof.generic()
-        gval1 = gof.generic()
+        gval0 = theano.tensor.matrix()
+        gval1 = theano.tensor.matrix()
        class O(gof.op.Op):
            def make_node(self):
-                inputs = [gof.generic(),gof.generic()]
-                outputs = [gof.generic(),gof.generic()]
+                inputs = [theano.tensor.matrix(),theano.tensor.matrix()]
+                outputs = [theano.tensor.matrix(),theano.tensor.matrix()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inp, grads):
                return gval0, gval1
        a1 = O().make_node()
-        g = _grad_sources_inputs([(a1.outputs[0], 1)], None)
+        g = _grad_sources_inputs([(a1.outputs[0], one)], None)
        self.assertTrue(g[a1.inputs[0]] is gval0)
        self.assertTrue(g[a1.inputs[1]] is gval1)
    def test_some_None_ograds(self):
        """Test grad is called when some output gradients are None"""
        class O(gof.op.Op):
            def __init__(self, tst):
                self.tst = tst
            def make_node(self, *inputs):
-                outputs = [gof.generic(),gof.generic()]
+                outputs = [theano.tensor.matrix(),theano.tensor.matrix()]
                return gof.Apply(self, inputs, outputs)
            def grad(self, inputs, g_out):
-                return [1]
-        i = gof.generic()
+                return [one]
+        i = theano.tensor.matrix()
        a1 = O(self).make_node(i)
-        g = grad_sources_inputs([(a1.outputs[0], 1)], None, warn_type=False)
-        self.assertTrue(g[i] is 1)
+        g = grad_sources_inputs([(a1.outputs[0], one)], None, warn_type=False)
+        self.assertTrue(g[i] is one)
    def test_some_None_igrads(self):
        """Test that traversal works properly when an op returns some None"""
        class O(gof.op.Op):
            def __init__(self, tst, grad_ok):
                self.tst = tst
                self.grad_ok = grad_ok

            def make_node(self, *inputs):
                outputs = [gof.generic(), gof.generic()]
                return gof.Apply(self, inputs, outputs)

            def grad(self, inputs, g_out):
                if not self.grad_ok:
                    self.tst.fail()
                else:
                    return [1, None]
        i = gof.generic()
        j = gof.generic()
        k = gof.generic()
        a1 = O(self, True).make_node(i, j)
        a2 = O(self, True).make_node(a1.outputs[1], k)
        g = grad_sources_inputs([(a2.outputs[0], 1)], None, warn_type=False)
        self.assertTrue(g[i] is 1 and j not in g and k not in g)
        a1 = O(self, True).make_node(i, j)
        a2 = O(self, True).make_node(k, a1.outputs[1])
        g = _grad_sources_inputs([(a2.outputs[0], 1)], None)
        self.assertTrue(g[k] is 1 and i not in g and j not in g)

    def test_inputs(self):
        """Test that passing inputs shortens the traversal"""
        class O(gof.op.Op):
            def __init__(self, tst, grad_ok):
                self.tst = tst
                self.grad_ok = grad_ok

            def make_node(self, *inputs):
                outputs = [gof.generic(), gof.generic()]
                return gof.Apply(self, inputs, outputs)

            def grad(self, inputs, grads):
                g0, g1 = grads
                if not self.grad_ok:
                    self.tst.fail()
                else:
                    if g1:
                        return [g0, g0 + g1]
                    else:
                        return [g0, g0]
        i = gof.generic()
        j = gof.generic()
        k = gof.generic()
        a1 = O(self, True).make_node(i, j)
        a2 = O(self, True).make_node(k, a1.outputs[1])
        g = _grad_sources_inputs([(a2.outputs[0], 1), (a1.outputs[1], 4),
                                  (a1.outputs[0], 3), (a1.outputs[0], 3)],
                                 a1.outputs)
        self.assertTrue(g[a2.inputs[0]] == 1)
        self.assertTrue(g[a2.inputs[1]] == 5)
        self.assertTrue(g[a1.outputs[0]] == 6)
        self.assertTrue(g[a1.outputs[1]] == 5)
        self.assertTrue(a1.inputs[0] not in g)
        self.assertTrue(a1.inputs[1] not in g)

    def test_multiple_sources(self):
        """Test that passing multiple sources works"""
        class O(gof.op.Op):
            def __init__(self, tst, grad_ok):
                self.tst = tst
                self.grad_ok = grad_ok

            def make_node(self, *inputs):
                outputs = [gof.generic(), gof.generic()]
                return gof.Apply(self, inputs, outputs)

            def grad(self, inputs, grads):
                g0, g1 = grads
                if not self.grad_ok:
                    self.tst.fail()
                else:
                    if g1:
                        return [g0, g0 + g1]
                    else:
                        return [g0, g0]
        i = gof.generic()
        j = gof.generic()
        k = gof.generic()
        a1 = O(self, True).make_node(i, j)
        a2 = O(self, True).make_node(k, a1.outputs[1])
        g = _grad_sources_inputs([(a2.outputs[0], 1), (a1.outputs[1], 4),
                                  (a1.outputs[0], 3), (a1.outputs[0], 3)], None)
        self.assertTrue(g[a2.inputs[0]] == 1)
        self.assertTrue(g[a2.inputs[1]] == 5)
        self.assertTrue(g[a1.outputs[0]] == 6)
        self.assertTrue(g[a1.outputs[1]] == 5)
        self.assertTrue(g[a1.inputs[0]] == 6)
        self.assertTrue(g[a1.inputs[1]] == 11)
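The expected numbers in test_multiple_sources can be reproduced with a plain-Python sketch of the accumulate-then-backprop traversal. String keys stand in for the symbolic variables (i, j, k and the outputs of a1, a2); this is an illustrative assumption, not Theano's actual implementation.

```python
# Mimic the gradient accumulation of test_multiple_sources with plain numbers.
g = {}

def accum(var, val):
    # gradients that reach the same variable are summed
    g[var] = g.get(var, 0) + val

# seed with the four (variable, gradient) sources; the duplicate
# (o10, 3) entries add up to 6
for var, val in [('o20', 1), ('o11', 4), ('o10', 3), ('o10', 3)]:
    accum(var, val)

# backprop through a2 = O(k, o11): O.grad returns [g0, g0 + g1],
# and with no gradient on o21 it degenerates to [g0, g0]
g0 = g.get('o20', 0)
accum('k', g0)
accum('o11', g0)

# backprop through a1 = O(i, j)
g0, g1 = g['o10'], g['o11']
accum('i', g0)
accum('j', g0 + g1)

assert g['o10'] == 6 and g['o11'] == 5
assert g['i'] == 6 and g['j'] == 11
```

These are exactly the values the test asserts: the source on a1.outputs[1] (4) plus the contribution flowing back through a2 (1) gives 5, and a1's grad rule turns (6, 5) into input gradients 6 and 11.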
def test_unimplemented_grad_func():
    # Tests that function compilation catches unimplemented grads
    # in the graph.
    a = theano.tensor.vector()
    b = theano.gradient.grad_not_implemented(theano.tensor.add, 0, a)
    try:
        f = theano.function([a], b, on_unused_input='ignore')
        assert 0
        # Note: it's important that the NotImplementedGradOp is caught
        # at COMPILATION time, not execution time.
        # If the uncomputable variable is, for example, multiplied by 0,
        # it could be optimized out of the final graph.
    except TypeError:
        pass
def test_undefined_grad_func():
...@@ -271,13 +152,9 @@ def test_undefined_grad_func():
    a = theano.tensor.vector()
    b = theano.gradient.grad_undefined(theano.tensor.add, 0, a)
    try:
        f = theano.function([a], b, on_unused_input='ignore')
        assert 0
        # Note: it's important that the GradUndefinedOp is caught at
        # COMPILATION time, not execution time.
        # If the uncomputable variable is, for example, multiplied by 0,
        # it could be optimized out of the final graph.
    except TypeError:
        pass
def test_unimplemented_grad_grad():
...@@ -296,7 +173,7 @@ def test_unimplemented_grad_grad():
    try:
        g = theano.gradient.grad(b, a)
        assert False
    except TypeError:
        pass
def test_undefined_grad_grad():
...@@ -314,7 +191,7 @@ def test_undefined_grad_grad():
    try:
        g = theano.gradient.grad(Z.sum(), d)
        assert False
    except TypeError:
        pass
def test_grad_name():
...@@ -325,5 +202,97 @@ def test_grad_name():
    g = theano.tensor.grad(f, x)
    assert g.name == '(df/dx)'
def test_grad_duplicate_input():
    # Test that the grad works when a variable
    # appears in more than one place in a node's input list.
    def output(x):
        return x * x
    rng = np.random.RandomState([2012, 8, 28])
    vx = rng.randn(2)
    theano.tests.unittest_tools.verify_grad(output, [vx])


def test_grad_quadratic():
    # Test the gradient on a tiny graph.
    def cost(x, A):
        return theano.tensor.dot(x, theano.tensor.dot(A, x))
    rng = np.random.RandomState([2012, 8, 28])
    vx = rng.randn(2)
    vA = rng.randn(2, 2)
    theano.tests.unittest_tools.verify_grad(cost, [vx, vA])


def test_grad_quadratic_vector():
    # Test the gradient on a small graph.
    def output(x, A):
        return theano.tensor.dot(x * x, A)
    rng = np.random.RandomState([2012, 8, 28])
    vx = rng.randn(2)
    vA = rng.randn(2, 2)
    theano.tests.unittest_tools.verify_grad(output, [vx, vA])


def test_grad_cubic():
    # Test the gradient on a bigger graph.
    def cost(x, A):
        return theano.tensor.dot(x * x, theano.tensor.dot(A, x))
    rng = np.random.RandomState([2012, 8, 28])
    vx = rng.randn(2)
    vA = rng.randn(2, 2)
    theano.tests.unittest_tools.verify_grad(cost, [vx, vA])


def test_grad_grad_quadratic():
    # Test the gradient on a graph constructed using the gradient.
    def output(x, A):
        orig_cost = theano.tensor.dot(x, theano.tensor.dot(A, x))
        return theano.gradient.grad(orig_cost, x)
    rng = np.random.RandomState([2012, 8, 28])
    vx = rng.randn(2)
    vA = rng.randn(2, 2)
    theano.tests.unittest_tools.verify_grad(output, [vx, vA])


def test_grad_grad_cubic():
    # Test the gradient on a bigger graph constructed using the gradient.
    def output(x, A):
        orig_cost = theano.tensor.dot(x * x, theano.tensor.dot(A, x))
        return theano.gradient.grad(orig_cost, x)
    rng = np.random.RandomState([2012, 8, 28])
    vx = rng.randn(2)
    vA = rng.randn(2, 2)
    theano.tests.unittest_tools.verify_grad(output, [vx, vA])
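The verify_grad calls above check a symbolic gradient against finite differences at random points. The same idea can be sketched in plain NumPy for the quadratic cost used in test_grad_quadratic; numeric_grad here is a hypothetical helper written for illustration, not part of Theano's API.

```python
import numpy as np

def cost(x, A):
    # quadratic form: f(x) = x . (A x)
    return x @ (A @ x)

def numeric_grad(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.RandomState([2012, 8, 28])
x = rng.randn(2)
A = rng.randn(2, 2)

analytic = (A + A.T) @ x                       # d/dx of x.(Ax)
numeric = numeric_grad(lambda v: cost(v, A), x)
assert np.allclose(analytic, numeric, atol=1e-5)
```

verify_grad automates this comparison for arbitrary Theano graphs, which is why the tests above only need to supply a function and sample input values.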
if __name__ == '__main__':
    unittest.main()