Commit d95e876d, authored by nouiz

Merge pull request #899 from goodfeli/rebase_fix_grad

Rebase fix grad
......@@ -98,6 +98,31 @@ following methods:
lifetime of self. Op instances should be immutable in this
sense.
.. function:: connection_pattern():
Optional (but in extremely rare cases needed to have it work with
{tensor,sparse}.grad).
Returns a list of bools the same length as the op's inputs list.
True signifies that the elements of an input have an effect on its
output.
False signifies that they do not--in other words, the op acts only
on the input's metadata such as its shape.
If no connection_pattern is implemented, tensor.grad will assume
it is a list containing only True.
Failing to implement this function for an op that needs it can
result in tensor.grad erroneously reporting that a gradient is
undefined. Returning 0 for this input in the grad method is not
the same as specifying that the elements of this input are not
connected to the output. If the gradient with respect to the
op's output is NaN but the elements of the input are not connected
to it, then the NaN never enters into the expression for the
gradient.
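As a sketch of the interface described above (the op and class names here are invented for illustration), an op whose second input contributes only its shape metadata might implement connection_pattern like this, returning one bool per input:

```python
# Hypothetical sketch: an imaginary op that copies input 0's values
# into an output shaped like input 1, so input 1 contributes only
# its shape metadata, never its element values.
class FillLikeOp(object):
    def connection_pattern(self, node):
        # input 0: its elements affect the output      -> True
        # input 1: only its shape is consulted          -> False
        return [True, False]
```

With such a pattern, tensor.grad can report input 1 as disconnected rather than erroneously treating its gradient as undefined.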
.. function:: grad(inputs, output_gradients)
Optional (but needed to have it work with {tensor,sparse}.grad()).
......@@ -106,31 +131,62 @@ following methods:
symbolically in this method. Both ``inputs`` and ``output_gradients``
are lists of symbolic Theano Variables and those must be operated on using
Theano's symbolic language. The grad method must return a list containing
one Variable (or ``None``) for each input. Each returned Variable represents
one Variable for each input. Each returned Variable represents
the gradient with respect to that input computed based on the symbolic gradients with
respect to each output.
If the output is not differentiable with respect to any inputs,
then this method should be defined to return ``[None for i in
inputs]``. If this method is not defined, then Theano assumes it has been
If the output is not differentiable with respect to an input
then this method should be defined to return a variable of type
NullType for that input.
If an element of output_gradients is of type theano.gradient.DisconnectedType,
it means that the cost is not a function of this output. If any of the
op's inputs participate in the computation of only disconnected outputs,
then Op.grad should return DisconnectedType variables for those inputs.
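A Theano-free sketch of the convention just described; DisconnectedType here is a plain stand-in for theano.gradient.DisconnectedType, and the grad signature is simplified:

```python
class DisconnectedType(object):
    """Stand-in marker: the cost is not a function of this variable."""

def grad(inputs, output_gradients):
    # If every output gradient is disconnected, no input can receive
    # a gradient through this op, so report disconnection for each.
    if all(isinstance(g, DisconnectedType) for g in output_gradients):
        return [DisconnectedType() for _ in inputs]
    # A real op would build symbolic gradient terms for the
    # connected outputs here (elided in this sketch).
    raise NotImplementedError("connected case elided in this sketch")
```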
If the grad method is not defined, then Theano assumes it has been
forgotten. Symbolic differentiation will fail on a graph that
includes this Op.
It must be understood that the grad method is not meant to return the
gradient of the Op's output but rather the gradient of some other scalar
criterion C with respect to the Op's input.
It must be understood that the Op's grad method is not meant to return the
gradient of the Op's output. theano.tensor.grad computes gradients; Op.grad
is a helper function that computes terms that appear in gradients.
If an Op has a single vector-valued output y and a single vector-valued input x,
then the grad method will be passed x and a second vector z. Define J to be
the Jacobian of y with respect to x. The Op's grad method should return
dot(J.T,z). When theano.tensor.grad calls the grad method, it will set z to
be the gradient of the cost C with respect to y. If this op is the only op
that acts on x, then dot(J.T,z) is the gradient of C with respect to x.
If there are other ops that act on x, theano.tensor.grad will have to add up
the terms of x's gradient contributed by the other ops' grad methods.
In practice, an op's input and output are rarely implemented as single vectors.
Even if an op's output consists of a list containing a scalar, a sparse matrix,
and a 4D tensor, you can think of these objects as being formed by rearranging
a vector. Likewise for the input. In this view, the values computed by the grad
method still represent a Jacobian-vector product.
In practice, it is probably not a good idea to explicitly construct the Jacobian,
which might be very large and very sparse. However, the returned value should
be equal to the Jacobian-vector product.
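A small numerical illustration (plain NumPy, not Theano) of the dot(J.T, z) convention above, for the elementwise map y = x**2, whose Jacobian is diag(2*x):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, 0.5, 0.5])   # gradient of the cost C w.r.t. y

# Jacobian of y = x**2 with respect to x is diagonal: J = diag(2*x)
J = np.diag(2.0 * x)

# What a grad method should return: dot(J.T, z). For an elementwise
# op this collapses to 2*x*z, so the explicit (potentially huge and
# sparse) Jacobian never needs to be materialized.
explicit = J.T.dot(z)
implicit = 2.0 * x * z
assert np.allclose(explicit, implicit)
```

The same identity is why practical grad implementations return the product directly instead of constructing J.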
So long as you implement this product correctly, you need not understand what
theano.tensor.grad is doing, but for the curious the mathematical justification
is as follows:
In essence, the grad method must simply implement through symbolic Variables
and operations the chain rule of differential calculus. The chain rule
is the mathematical procedure that allows to calculate the total derivative
is the mathematical procedure that allows one to calculate the total derivative
:math:`\frac{d C}{d x}` of the final scalar symbolic Variable C with respect to a
primitive symbolic Variable x found in the list ``inputs``,
based on the knowledge of the total derivative :math:`\frac{d C}{d f}` of
C with respect to a symbolic Variable that is returned by the Op (this is provided
primitive symbolic Variable x found in the list ``inputs``.
The grad method does this using ``output_gradients`` which provides the total
derivative :math:`\frac{d C}{d f}` of C with respect to a symbolic Variable
that is returned by the Op (this is provided
in ``output_gradients``), as well as the knowledge of the total derivative :math:`\frac{d f}{d x}` of the
latter with respect to the primitive Variable (this has to be computed).
In Mathematics, the total derivative of a scalar variable (C) with respect to a vector of
In mathematics, the total derivative of a scalar variable (C) with respect to a vector of
scalar variables (x), i.e. the gradient, is customarily represented as the
row vector of the partial derivatives, whereas the total derivative of a vector of
scalar variables (f) with respect to another (x), is customarily represented by the matrix of
......
......@@ -150,24 +150,6 @@ def std_fgraph(input_specs, output_specs, accept_inplace = False):
std_fgraph.features = [gof.toolbox.PreserveNames]
class UncomputableFeature(gof.Feature):
"""A feature that ensures the graph never contains any
uncomputable nodes. This check must be made at compile time
rather than runtime in order to make sure that NaN nodes are
not optimized out. It must be done as a Feature so that
the fgraph will continually check that optimizations have
not introduce any uncomputable nodes."""
def on_attach(self, fgraph):
for node in fgraph.nodes:
return self.on_import(fgraph, node)
def on_import(self, fgraph, node):
gof.op.raise_if_uncomputable(node)
std_fgraph.features.append(UncomputableFeature)
class AliasedMemoryError(Exception):
"""Memory is aliased that should not be"""
pass
......
......@@ -11,7 +11,7 @@ import toolbox
from python25 import all
from theano import config
import warnings
NullType = None
class InconsistencyError(Exception):
"""
......@@ -211,6 +211,9 @@ class FunctionGraph(utils.object2):
### import ###
def __import_r__(self, variables):
global NullType
if NullType is None:
from null_type import NullType
# Imports the owners of the variables
r_owner_done = set(self.nodes)
for node in [r.owner for r in variables if r.owner is not None]:
......@@ -219,6 +222,8 @@ class FunctionGraph(utils.object2):
self.__import__(node)
for r in variables:
if r.owner is None and not isinstance(r, graph.Constant) and r not in self.inputs:
if isinstance(r.type,NullType):
raise TypeError("Computation graph contains a NaN. "+r.type.why_null)
raise MissingInputError("Undeclared input", r)
if not getattr(r, 'fgraph', None) is self:
self.__setup_r__(r)
......
from theano.gof.type import Type
class NullType(Type):
"""
A type that allows no values. Used to represent expressions
that are undefined, either because they do not exist mathematically
or because the code to generate the expression has not been
implemented yet.
"""
def __init__(self, why_null='(no explanation given)'):
"""
why_null: A string explaining why this variable
can't take on any values
"""
self.why_null = why_null
def filter(self, data, strict=False, allow_downcast=None):
raise ValueError("No values may be assigned to a NullType")
def filter_variable(self, other):
raise ValueError("No values may be assigned to a NullType")
def may_share_memory(a, b):
return False
def values_eq(a, b, force_same_dtype=True):
raise ValueError("NullType has no values to compare")
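A self-contained sketch of how the class above is consumed elsewhere in this merge (the Variable and check_computable names are stand-ins; the real check lives in FunctionGraph.__import_r__):

```python
class NullType(object):
    """Minimal stand-in for the NullType defined above."""
    def __init__(self, why_null='(no explanation given)'):
        self.why_null = why_null

class Variable(object):
    """Hypothetical stand-in for a graph variable carrying a type."""
    def __init__(self, type):
        self.type = type

def check_computable(variable):
    # Mirrors the FunctionGraph.__import_r__ check in this diff:
    # importing a NullType variable is a compile-time error, and the
    # stored why_null string explains the failure to the user.
    if isinstance(variable.type, NullType):
        raise TypeError("Computation graph contains a NaN. "
                        + variable.type.why_null)
```

Raising at import time (rather than at runtime) is what keeps the NaN from being optimized away before anyone notices it.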
......@@ -609,59 +609,6 @@ class Op(utils.object2, PureOp, CLinkerOp):
rval.lazy = False
return rval
class UncomputableOp(Op):
"""
An Op representing an expression that cannot be computed.
theano.function checks that the subgraph it implements
does not contain these ops, and that optimization does not
introduce any such ops.
theano.tensor.grad checks the graphs it returns to ensure
they do not contain these ops.
"""
def __init__(self, exc, msg=""):
"""
exc: the exception type to raise if a subgraph contains
this op.
msg: the message to include in the exception.
"""
self.exc = exc
self.msg = msg
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash((type(self)))
def __str__(self):
return "Uncomputable{%s,%s}"%(self.exc,self.msg)
def make_node(self,x):
if x is None:
x = graph.Constant(theano.gof.type.generic,None)
return graph.Apply(self, [x], [x.type()] )
def perform(self, node, inputs, out_storage):
""" This should never be called"""
raise AssertionError("A BadGradOp should never be compiled, "+\
"and certainly not executed.")
#Note: essentially, this op should just be NaNs_like(inputs[0])
#but 0 * BadGradOp(x) + y optimizes to just y
#so until we develop a way of symbolically representing a variable
#that is always NaN and implement the logic for 0 * NaN = NaN, etc.
#the only way we can guarantee correctness of a theano function
#is to guarantee that its initial subgraph contained no BadGradOps
def raise_exc(self):
raise self.exc(self.msg)
def raise_if_uncomputable(node):
if node is not None:
if isinstance(node.op, UncomputableOp):
node.op.raise_exc()
def get_test_value(v):
"""
Extract test value from `v`. Raises AttributeError if there is none.
......
(Diff collapsed.)
......@@ -456,7 +456,7 @@ def test_elemwise_composite_support_code():
P = T.exp(-(Y - U) ** 2)
epsilon = numpy.asarray(0.001, dtype="float32")
NLL = -T.mean(T.log(P + epsilon)) # SupportCodeError
G = T.grad(NLL, wrt=[W])
G = theano.gradient.grad(NLL, wrt=[W])
backup = theano.config.warn.identify_1pexp_bug
theano.config.warn.identify_1pexp_bug = False
......@@ -468,6 +468,7 @@ def test_elemwise_composite_support_code():
topo = f_grad.maker.fgraph.toposort()
assert sum([isinstance(node.op, T.Elemwise) for node in topo]) == 1
#I suspect this was failing in the original branch too
assert sum([isinstance(node.op, tcn.GpuElemwise) for node in topo]) == 1
......
......@@ -258,7 +258,7 @@ class T_Images2Neibs(unittest_tools.InferShapeTester):
def fn(images):
return images2neibs(images, (3, 3), mode='wrap_centered')
self.assertRaises(NotImplementedError, unittest_tools.verify_grad,
self.assertRaises(TypeError, unittest_tools.verify_grad,
fn, [images_val], mode=self.mode)
......@@ -276,7 +276,7 @@ class T_Images2Neibs(unittest_tools.InferShapeTester):
# are not the same.
def fn(images):
return images2neibs(images, (2, 2), (1, 1))
self.assertRaises(NotImplementedError,
self.assertRaises(TypeError,
unittest_tools.verify_grad, fn, [images_val],
mode=self.mode)
......
......@@ -488,6 +488,9 @@ class _scalar_py_operators:
def __rmod__(self,other): return mod(other,self)
def __rpow__(self,other): return pow(other,self)
def zeros_like(self):
return ScalarConstant(Scalar(str(self.type.dtype)), 0)
class ScalarVariable(_scalar_py_operators, Variable):
pass
......
......@@ -29,6 +29,8 @@ from theano import gof
from theano.tensor import TensorType
from theano import tensor
from theano.tensor.opt import Shape_i
from theano.gradient import grad_undefined
from theano.gradient import DisconnectedType
#from theano.sandbox import cuda
from theano.compile.profiling import ScanProfileStats
......@@ -431,7 +433,7 @@ class Scan(PureOp):
aux_txt += str(k) + ','
aux_txt += '},%s,%s}'
else:
aux_txt +='{%s,%s}'
aux_txt += '{%s,%s}'
aux_txt = aux_txt % (name, gpu_str, str(self.name))
return aux_txt
......@@ -1161,6 +1163,17 @@ class Scan(PureOp):
### GRAD FUNCTION
def grad(self, args, g_outs):
# This discards information about whether incoming gradients are 0
# or disconnected from the cost
# TODO: upgrade scan op to report disconnection correctly
def strip_disconnected(g):
if isinstance(g.type, DisconnectedType):
return None
return g
g_outs = [strip_disconnected(g) for g in g_outs]
# 1. forward pass - get the outputs after applying scan
scan_outputs = self(*args)
# 2. make sure they are given as a list
......@@ -1512,7 +1525,7 @@ class Scan(PureOp):
if type(outputs) not in (list, tuple):
outputs = [outputs]
# Re-order the gradients correctly
gradients = [None]
gradients = [grad_undefined(self, 0, args[0], 'Number of steps')]
offset = (self.n_mit_mot +
self.n_mit_sot +
......@@ -1522,8 +1535,16 @@ class Scan(PureOp):
end = self.n_mit_mot + self.n_mit_sot + self.n_sit_sot
gradients += [x[::-1] for x in outputs[:end]]
gradients += [None for x in xrange(self.n_shared_outs)]
gradients += [None for x in xrange(self.n_nit_sot)]
start = len(gradients)
gradients += [
grad_undefined(self, x + start, args[x + start],
'Shared Variable with update')
for x in xrange(self.n_shared_outs)]
start = len(gradients)
gradients += [
grad_undefined(self, x + start, args[x + start],
'Dimension of memory buffer for output')
for x in xrange(self.n_nit_sot)]
begin = end
end = begin + n_sitsot_outs
......@@ -1547,7 +1568,8 @@ class Scan(PureOp):
rop_self_outputs = self_outputs
if self.info['n_shared_outs'] > 0:
rop_self_outputs = rop_self_outputs[:-self.info['n_shared_outs']]
rop_outs = tensor.Rop(rop_self_outputs, rop_of_inputs, inner_eval_points)
rop_outs = tensor.Rop(rop_self_outputs, rop_of_inputs,
inner_eval_points)
if type(rop_outs) not in (list, tuple):
rop_outs = [rop_outs]
# Step 2. Figure out what corresponds to what in the scan
......@@ -1653,7 +1675,7 @@ class Scan(PureOp):
scan_sit_sot = inputs[b:e] + clean_eval_points
inner_sit_sot = self_inputs[ib:ie] + inner_eval_points[ib:ie]
#Shared outs ...
# Shared outs ...
b = e
e = e + self.n_shared_outs
ib = ie
......@@ -1738,7 +1760,7 @@ class Scan(PureOp):
b = e + self.n_nit_sot
e = e + self.n_nit_sot * 2
final_outs += outputs[b:e]
final_outs += [None]*self.n_shared_outs
final_outs += [None] * self.n_shared_outs
return final_outs
......
......@@ -1816,10 +1816,12 @@ class T_Scan(unittest.TestCase):
def test_scan_extra_inputs_hessian(self):
x = theano.tensor.vector('x')
A = theano.tensor.matrix('A')
fc1 = theano.shared(0.5)
fc2 = theano.shared(0.9)
fc1 = theano.shared(0.5, name = 'fc1')
fc2 = theano.shared(0.9, name = 'fc2')
y = fc1 * theano.dot(x * x, theano.dot(A, x))
y.name = 'y'
gy = theano.tensor.grad(y, x)
gy.name = 'gy'
hy, updates = theano.scan(
lambda i, gy, x: theano.tensor.grad(gy[i] * fc2, x),
sequences=theano.tensor.arange(gy.shape[0]),
......@@ -1829,7 +1831,9 @@ class T_Scan(unittest.TestCase):
vx = numpy.array([1., 1.], dtype=theano.config.floatX)
vA = numpy.array([[1., 1.], [1., 0.]], dtype=theano.config.floatX)
vR = numpy.array([[3.6, 1.8], [1.8, 0.9]], dtype=theano.config.floatX)
assert numpy.allclose(f(vx, vA), vR)
out = f(vx, vA)
assert numpy.allclose(out, vR)
def test_cloning_no_replace_strict_copy_inputs(self):
# This has nothing to do with scan, but it refers to the clone
......@@ -3479,14 +3483,15 @@ def test_compute_test_value():
backup = theano.config.compute_test_value
theano.config.compute_test_value = 'raise'
try:
x = tensor.vector()
x = tensor.vector('x')
xv = numpy.ones(3, dtype=theano.config.floatX)
x.tag.test_value = xv
y = theano.shared(numpy.arange(3, dtype=theano.config.floatX))
y = theano.shared(numpy.arange(3, dtype=theano.config.floatX), name='y')
z, _ = theano.scan(
fn=lambda u, v: u + v,
sequences=[x, y])
assert not _
z.name='z'
# The gradient computation used to crash before 6af465e.
g = tensor.grad(z.sum(), x)
#f = theano.function([x], g)
......
......@@ -7,7 +7,6 @@ http://www-users.cs.umn.edu/~saad/software/SPARSKIT/paper.ps
# TODO
# Automatic methods for determining best sparse format?
from itertools import izip
import sys
import numpy
......@@ -16,14 +15,14 @@ import scipy.sparse
from theano import gof, tensor, compile, scalar, config
from theano.gof.python25 import all
from theano.tensor import blas
from theano.gradient import DisconnectedType
from theano.sparse.utils import hash_from_sparse
import theano.tests.unittest_tools as utt
sparse_formats = ['csc', 'csr']
#TODO: move this decorator to the compile submodule
# TODO: move this decorator to the compile submodule
def register_specialize(lopt, *tags, **kwargs):
compile.optdb['specialize'].register((kwargs and kwargs.pop('name')) or
lopt.__name__, lopt, 'fast_run',
......@@ -256,7 +255,7 @@ def sp_zeros_like(x):
:return: The same as `x` with zero entries
for all element.
"""
#TODO: don't restrict to CSM formats
# TODO: don't restrict to CSM formats
_, _, indptr, shape = csm_properties(x)
return CSM(format=x.format)(numpy.array([], dtype=x.type.dtype),
numpy.array([]), tensor.zeros_like(indptr),
......@@ -291,7 +290,7 @@ class _sparse_py_operators:
def __rmul__(left, right):
return mul(left, right)
#extra pseudo-operator symbols
# extra pseudo-operator symbols
def __dot__(left, right):
return structured_dot(left, right)
......@@ -299,12 +298,12 @@ class _sparse_py_operators:
def __rdot__(right, left):
return structured_dot(left, right)
#N.B. THIS IS COMMENTED OUT ON PURPOSE!!!
# N.B. THIS IS COMMENTED OUT ON PURPOSE!!!
# Discussion with Fred & James (at least, and maybe others before)
# we decided that casting from a sparse to dense should be explicit
# because it's usually something you just want to be pretty careful
# about, and not to do by accident.
#def _as_TensorVariable(self):
# def _as_TensorVariable(self):
# return dense_from_sparse(self)
shape = property(lambda self: tensor.shape(dense_from_sparse(self)))
......@@ -441,7 +440,7 @@ class SparseType(gof.Type):
if strict:
raise TypeError("%s is not sparse, or not the right dtype (is %s, "
"expected %s)" % (value, value.dtype, self.dtype))
#The input format could be converted here
# The input format could be converted here
if allow_downcast:
sp = self.format_cls[self.format](value, dtype=self.dtype)
else:
......@@ -488,7 +487,7 @@ class SparseType(gof.Type):
return "Sparse[%s, %s]" % (str(self.dtype), str(self.format))
def values_eq_approx(self, a, b, eps=1e-6):
#WARNING: equality comparison of sparse matrices is not fast or easy
# WARNING: equality comparison of sparse matrices is not fast or easy
# we definitely do not want to be doing this un-necessarily during
# a FAST_RUN computation..
if not scipy.sparse.issparse(a) or not scipy.sparse.issparse(b):
......@@ -504,7 +503,7 @@ class SparseType(gof.Type):
return max(diff.data) < eps
def values_eq(self, a, b):
#WARNING: equality comparison of sparse matrices is not fast or easy
# WARNING: equality comparison of sparse matrices is not fast or easy
# we definitely do not want to be doing this un-necessarily during
# a FAST_RUN computation..
return scipy.sparse.issparse(a) \
......@@ -619,14 +618,25 @@ class CSMProperties(gof.Op):
out[0][0] = csm.data[self.kmap]
if str(csm.data.dtype) == 'int32':
out[0][0] = theano._asarray(out[0][0], dtype='int32')
#backport
#out[0][0] = csm.data if self.kmap is None else csm.data[self.kmap]
# backport
# out[0][0] = csm.data if self.kmap is None else csm.data[self.kmap]
out[1][0] = theano._asarray(csm.indices, dtype='int32')
out[2][0] = theano._asarray(csm.indptr, dtype='int32')
out[3][0] = theano._asarray(csm.shape, dtype='int32')
def grad(self, (csm,), g):
assert [gg is None for gg in g[1:]]
# g[1:] is all integers, so their Jacobian in this op
# is 0. We thus don't need to worry about what their values
# are.
# if g[0] is disconnected, then this op doesn't contribute
# any gradient anywhere. but we know that at least one of
# g[1:] is connected, or this grad method wouldn't have been
# called, so we should report zeros
if isinstance(g[0].type, DisconnectedType):
return [csm.zeros_like()]
data, indices, indptr, shape = csm_properties(csm)
return [CSM(csm.format)(g[0], indices, indptr, shape)]
# don't make this a function or it breaks some optimizations below
......@@ -662,10 +672,10 @@ class CSM(gof.Op):
:param data: One dimensional tensor representing
the data of the sparse matrix to construct.
:param indices: One dimensionnal tensor of integers
:param indices: One dimensional tensor of integers
representing the indices of the sparse
matrix to construct.
:param indptr: One dimensionnal tensor of integers
:param indptr: One dimensional tensor of integers
representing the index pointer for
the sparse matrix to construct.
:param shape: One dimensional tensor of integers
......@@ -673,9 +683,9 @@ class CSM(gof.Op):
matrix to construct.
:return: A sparse matrix having the properties
speficied by the inputs.
specified by the inputs.
:note: The grad method returns a dense vector, so it provide
:note: The grad method returns a dense vector, so it provides
a regular grad.
"""
......@@ -774,10 +784,10 @@ class CSM(gof.Op):
def grad(self, (x_data, x_indices, x_indptr, x_shape), (g_out,)):
g_data, g_indices, g_indptr, g_shape = csm_properties(g_out)
#unpack the data vector and wrap it as a 1d TensorType
# unpack the data vector and wrap it as a 1d TensorType
g_data = csm_grad(self.kmap)(x_data, x_indices, x_indptr, x_shape,
g_data, g_indices, g_indptr, g_shape)
return [g_data, None, None, None]
return [g_data, DisconnectedType()(), DisconnectedType()(), DisconnectedType()()]
def infer_shape(self, node, shapes):
if self.kmap is None:
......@@ -1195,7 +1205,7 @@ class GetItemScalar(gof.op.Op):
if isinstance(ind, slice):
raise Exception("GetItemScalar called with a slice as index!")
#in case of indexing using int instead of theano variable
# in case of indexing using int instead of theano variable
elif isinstance(ind, int):
ind = theano.tensor.constant(ind)
input_op += [ind]
......@@ -2026,7 +2036,7 @@ class MulSD(gof.op.Op):
def make_node(self, x, y):
x, y = as_sparse_variable(x), tensor.as_tensor_variable(y)
#upcast the tensor. Is the cast of sparse done implemented?
# upcast the tensor. Is the cast of sparse done implemented?
dtype = scalar.upcast(x.type.dtype, y.type.dtype)
if y.type.dtype != dtype:
y = tensor.cast(y, dtype)
......@@ -2049,7 +2059,7 @@ class MulSD(gof.op.Op):
elif len(y.shape) == 2:
# if we have enough memory to fit y, maybe we can fit x.asarray()
# too?
#TODO: change runtime from O(M*N) to O(nonzeros)
# TODO: change runtime from O(M*N) to O(nonzeros)
M, N = x.shape
assert x.shape == y.shape
......@@ -2810,7 +2820,7 @@ class StructuredDot(gof.Op):
raise ValueError('shape mismatch in StructuredDot.perform',
(a.shape, b.shape))
#variable = a.dot(b) # deprecated
# variable = a.dot(b) # deprecated
variable = a * b
if isinstance(node.outputs[0].type, SparseType):
assert _is_sparse(variable)
......@@ -2843,8 +2853,8 @@ class StructuredDot(gof.Op):
raise Exception("a.shape=%s, b.shape=%s, variable.shape=%s "
" ??? I have no idea why")
#The cast is needed as otherwise we hit the bug mentioned into
#theano._asarray function documentation.
# The cast is needed as otherwise we hit the bug mentioned into
# theano._asarray function documentation.
out[0] = theano._asarray(variable, str(variable.dtype))
def grad(self, (a, b), (g_out,)):
......@@ -3229,7 +3239,7 @@ class SamplingDot(gof.op.Op):
if not _is_sparse_variable(p):
raise TypeError(p)
#TODO: use it.
# TODO: use it.
dtype_out = scalar.upcast(x.type.dtype, y.type.dtype, p.type.dtype)
return gof.Apply(self, [x, y, p], [p.type()])
......
(Diff collapsed.)
......@@ -98,26 +98,26 @@ class Conv3D(theano.Op):
if 'name' in dir(dCdH) and dCdH.name is not None:
dCdH_name = dCdH.name
else:
dCdH_name = 'anon'
dCdH_name = 'anon_dCdH'
if 'name' in dir(V) and V.name is not None:
V_name = V.name
else:
V_name = 'anon'
V_name = 'anon_V'
if 'name' in dir(W) and W.name is not None:
W_name = W.name
else:
W_name = 'anon'
W_name = 'anon_W'
if 'name' in dir(b) and b.name is not None:
b_name = b.name
else:
b_name = 'anon'
b_name = 'anon_b'
dCdV.name = 'Conv3D_dCdV.dCdH='+dCdH_name+',V='+V_name
dCdW.name = 'Conv3D_dCdW.dCdH='+dCdH_name+',V='+V_name+',W='+W_name
dCdb.name = 'Conv3D_dCdb.dCdH='+dCdH_name+',V='+V_name+',W='+W_name+',b='+b_name
dCdV.name = 'Conv3D_dCdV(dCdH='+dCdH_name+',V='+V_name+')'
dCdW.name = 'Conv3D_dCdW(dCdH='+dCdH_name+',V='+V_name+',W='+W_name+')'
dCdb.name = 'Conv3D_dCdb(dCdH='+dCdH_name+',V='+V_name+',W='+W_name+',b='+b_name+')'
......
......@@ -56,22 +56,22 @@ class ConvTransp3D(theano.Op):
if 'name' in dir(dCdR) and dCdR.name is not None:
dCdR_name = dCdR.name
else:
dCdR_name = 'anon'
dCdR_name = 'anon_dCdR'
if 'name' in dir(H) and H.name is not None:
H_name = H.name
else:
H_name = 'anon'
H_name = 'anon_H'
if 'name' in dir(W) and W.name is not None:
W_name = W.name
else:
W_name = 'anon'
W_name = 'anon_W'
if 'name' in dir(b) and b.name is not None:
b_name = b.name
else:
b_name = 'anon'
b_name = 'anon_b'
dCdW.name = 'ConvTransp3D_dCdW.H='+H_name+',dCdR='+dCdR_name+',W='+W_name
......
......@@ -780,9 +780,19 @@ class ConvOp(OpenMPOp):
# build a "node", that should be equivalent to the one given by
# self.make_node, but using conv3D instead of self.
shuffled_inputs = inputs.dimshuffle(0, 2, 3, 'x', 1)
if inputs.name is not None:
shuffled_inputs.name = 'shuffle_for_conv3D(%s)' % inputs.name
flipped_kerns = kerns[:, :, ::-1, ::-1]
if kerns.name is not None:
flipped_kerns.name = 'flipped(%s)' % kerns.name
shuffled_kerns = flipped_kerns.dimshuffle(0, 2, 3, 'x', 1)
if flipped_kerns.name is not None:
shuffled_kerns.name = 'shuffled_for_conv3D(%s)' % flipped_kerns.name
tmp_node = theano.tensor.nnet.conv3D(
V=inputs.dimshuffle(0, 2, 3, 'x', 1),
W=kerns[:, :, ::-1, ::-1].dimshuffle(0, 2, 3, 'x', 1),
V = shuffled_inputs,
W= shuffled_kerns,
b=theano.tensor.alloc(numpy.asarray(0, dtype=kerns.dtype),
kerns.shape[0]),
d=(self.dx, self.dy, 1))
......
......@@ -14,6 +14,7 @@ from theano.compile import optdb
from theano.gof import Apply
from theano.tensor.nnet.sigm import sigmoid, softplus
from theano.gradient import DisconnectedType
############
......@@ -76,6 +77,10 @@ class SoftmaxWithBias(gof.Op):
def grad(self, inp, grads):
x, b = inp
g_sm, = grads
if isinstance(g_sm.type, DisconnectedType):
return [ DisconnectedType()(), DisconnectedType()() ]
sm = softmax_with_bias(x, b)
dx = softmax_grad(g_sm, sm)
db = tensor.sum(dx, axis=0)
......@@ -710,21 +715,40 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
def grad(self, inp, grads):
x, b, y_idx = inp
g_nll, g_sm, g_am = grads
if g_am is not None:
raise NotImplementedError()
elif g_sm is not None:
# There is a gradient w.r.t. the softmax's output itself.
if g_nll is not None or g_am is not None:
raise NotImplementedError()
return softmax_with_bias.grad((x, b, ), (g_sm, )) + (None, )
else:
# There is a gradient w.r.t. the NLL.
assert g_nll is not None
dx_terms = []
db_terms = []
d_idx_terms = []
if not isinstance(g_nll.type, DisconnectedType):
nll, sm = crossentropy_softmax_1hot_with_bias(x, b, y_idx)
#dx = CrossentropySoftmax1HotWithBiasDx()(g_nll, sm, y_idx)
dx = crossentropy_softmax_1hot_with_bias_dx(g_nll, sm, y_idx)
db = tensor.sum(dx, axis=[0])
return dx, db, None
dx_terms.append(dx)
db_terms.append(db)
if not isinstance(g_sm.type, DisconnectedType):
dx, db = softmax_with_bias.grad((x, b), (g_sm, ))
dx_terms.append(dx)
db_terms.append(db)
if not isinstance(g_am.type, DisconnectedType):
dx_terms.append(x.zeros_like())
db_terms.append(b.zeros_like())
d_idx_terms.append(y_idx.zeros_like())
def fancy_sum( terms ):
if len(terms) == 0:
return DisconnectedType()()
rval = terms[0]
for term in terms[1:]:
rval = rval + term
return rval
return [ fancy_sum(terms) for terms in
[dx_terms, db_terms, d_idx_terms ] ]
def c_headers(self):
return ['<iostream>', '<cmath>']
......
......@@ -18,7 +18,9 @@ class TestConv2D(utt.InferShapeTester):
def setUp(self):
super (TestConv2D, self).setUp()
self.input = T.dtensor4('input')
self.input.name = 'default_V'
self.filters = T.dtensor4('filters')
self.filters.name = 'default_filters'
def validate(self, image_shape, filter_shape,
border_mode='valid', subsample=(1, 1),
......@@ -34,7 +36,7 @@ class TestConv2D(utt.InferShapeTester):
N_filter_shape = [T.get_constant_value(T.
as_tensor_variable(x)) for x in filter_shape]
if not input:
if input is None:
input = self.input
if not filters:
filters = self.filters
......@@ -44,11 +46,16 @@ class TestConv2D(utt.InferShapeTester):
# we create a symbolic function so that verify_grad can work
def sym_conv2d(input, filters):
# define theano graph and function
return conv.conv2d(input, filters, image_shape, filter_shape,
input.name = 'input'
filters.name = 'filters'
rval = conv.conv2d(input, filters, image_shape, filter_shape,
border_mode, subsample, unroll_batch=unroll_batch,
unroll_kern=unroll_kern, unroll_patch=unroll_patch)
rval.name = 'conv_output'
return rval
output = sym_conv2d(input, filters)
output.name = 'conv2d(%s,%s)' % (input.name, filters.name)
theano_conv = theano.function([input, filters], output)
# initialize input and compute result
......
......@@ -121,33 +121,49 @@ class TestConv3D(utt.InferShapeTester):
mode.check_py_code = False
self.W = shared(N.ndarray(shape=(1, 1, 1, 1, 1), dtype=floatX))
self.W.name = 'W'
self.b = shared(N.zeros(1, dtype=floatX))
self.b.name = 'b'
self.rb = shared(N.zeros(1, dtype=floatX))
self.rb.name = 'rb'
self.V = shared(N.ndarray(shape=(1, 1, 1, 1, 1), dtype=floatX))
self.V.name = 'V'
self.d = shared(N.ndarray(shape=(3, ), dtype=int))
self.d.name = 'd'
self.H = conv3D(self.V, self.W, self.b, self.d)
self.H.name = 'H'
self.H_func = function([], self.H, mode=mode)
self.H_shape_func = function([], self.H.shape, mode=mode)
self.RShape = T.vector(dtype='int64')
self.RShape.name = 'RShape'
self.otherH = T.TensorType(floatX,
(False, False, False, False, False))(name='otherH')
self.transp = convTransp3D(self.W, self.rb, self.d,
self.otherH, self.RShape)
self.transp.name = 'transp'
self.transp_func = function([self.otherH, self.RShape],
self.transp, mode=mode)
self.R = convTransp3D(self.W, self.rb, self.d, self.H, self.RShape)
self.R.name = 'R'
self.R_func = function([self.RShape], self.R, mode=mode)
self.R_shape_func = function([self.RShape], self.R.shape)
self.reconsObj = T.sum(T.sqr(self.V - self.R))
diff = self.V - self.R
diff.name = 'diff'
sqr = T.sqr(diff)
sqr.name = 'sqr'
self.reconsObj = T.sum(sqr)
self.reconsObj.name = 'reconsObj'
self.reconsObjFunc = function([self.RShape], self.reconsObj, mode=mode)
W_grad = T.grad(self.reconsObj, self.W)
self.gradientsFunc = function([self.RShape],
[T.grad(self.reconsObj, self.W), T.grad(self.reconsObj,
[W_grad, T.grad(self.reconsObj,
self.H), T.grad(self.reconsObj, self.V),
T.grad(self.reconsObj, self.b)], mode=mode)
......
......@@ -2832,16 +2832,16 @@ class Canonizer(gof.LocalOptimizer):
# this canonized graph... if so, we do nothing and wait for
# them to be transformed.
def _bypass_dimshuffle(n):
if isinstance(n.op, DimShuffle) and len(n.outputs[0].clients) <= 1:
return _bypass_dimshuffle(n.outputs[0].clients.__iter__(
).next()[0])
if (isinstance(getattr(n, 'op', None), DimShuffle) and
len(n.outputs[0].clients) <= 1):
return _bypass_dimshuffle(n.outputs[0].clients[0][0])
else:
return n
for c, c_idx in out.clients:
if c == 'output':
continue
if _bypass_dimshuffle(c).op in [self.main, self.inverse,
self.reciprocal]:
if getattr(_bypass_dimshuffle(c), 'op', '') in [
self.main, self.inverse, self.reciprocal]:
return False
# Here we make the canonical version of the graph around this node
......
......@@ -2023,6 +2023,10 @@ class T_max_and_argmax(unittest.TestCase):
because there is no differentiable path from cost to the input and
not because of an error of the grad method of the op
"""
raise KnownFailureTest("The desired behavior of the grad method in this case is currently under debate. In any case, the result should be to return NaN or 0, not to report a disconnected input.")
x = matrix()
cost = argmax(x, axis=0).sum()
value_error_raised = False
......@@ -2220,6 +2224,7 @@ class T_argmin_argmax(unittest.TestCase):
def test_grad_argmin(self):
data = rand(2, 3)
n = as_tensor_variable(data)
n.name = 'n'
#test grad of argmin
utt.verify_grad(lambda v: argmin(v, axis=-1), [data])
......@@ -2231,7 +2236,9 @@ class T_argmin_argmax(unittest.TestCase):
utt.verify_grad(lambda v: argmin(v.flatten()), [data])
try:
grad(argmin(n, axis=-1), n)
cost = argmin(n, axis=-1)
cost.name = None
g = grad(cost, n)
raise Exception('Expected an error')
except TypeError:
pass
......@@ -4375,6 +4382,7 @@ class test_grad(unittest.TestCase):
o = test_grad.O()
a1 = o.make_node()
g0,g1 = grad(a1.outputs[0], a1.inputs)
g0.name = None
self.assertTrue(o.gval0 is g0)
self.assertTrue(o.gval1 is g1)
......@@ -4435,10 +4443,8 @@ class test_grad(unittest.TestCase):
v = vector()
m = matrix()
# grad(v,...) and grad(m,...) should fail
self.assertRaises(TypeError, grad, v, s)
self.assertRaises(TypeError, grad, v, m)
self.assertRaises(TypeError, grad, m, s)
self.assertRaises(TypeError, grad, m, v)
self.assertRaises(TypeError, grad, v, v)
self.assertRaises(TypeError, grad, m, m)
class T_op_cache(unittest.TestCase):
def setUp(self):
......