Commit c0c25559 authored by lamblin

Merge pull request #910 from goodfeli/int_grad

Consistent & correct handling of integers and gradients
- Documentation and implementation of a consistent way of handling gradients and integers
- Type checks that ensure the gradient is always floating point and not an integer
- Type checks that ensure the gradient of an integer is always undefined or 0
- An upgraded version of connection_pattern that provides theano with enough information to accurately answer questions like "is variable x a function of variable y?"
......@@ -98,34 +98,56 @@ following methods:
lifetime of self. Op instances should be immutable in this
sense.
.. function:: connection_pattern():
.. function:: connection_pattern( node ):
Optional (but in extremely rare cases needed to have it work with
{tensor,sparse}.grad).
Optional method; sometimes needed for gradient.grad to
work correctly.
Returns a list of bools the same length as the op's inputs list.
Returns a list of list of bools.
True signifies that the elements of an input have an effect on its
output.
Op.connection_pattern[input_idx][output_idx] is true if the
elements of inputs[input_idx] have an effect on the elements of
outputs[output_idx].
False signifies that they do not--in other words, the op acts only
on the input's metadata, such as its shape.
The ``node`` parameter is needed to determine the number of
inputs. Some ops such as Subtensor take a variable number of
inputs.
If no connection_pattern is implemented, tensor.grad will assume
it is a list containing only True.
If no connection_pattern is specified, gradient.grad will
assume that all inputs have some elements connected to some
elements of all outputs.
This method conveys two pieces of information that are otherwise
not part of the theano graph:
1) Which of the op's inputs are truly ancestors of each of the
op's outputs. Suppose an op has two inputs, x and y, and
outputs f(x) and g(y). y is not really an ancestor of f, but
it appears to be so in the theano graph.
2) Whether the actual elements of each input/output are relevant
to a computation.
For example, the shape op does not read its input's elements,
only its shape metadata. d shape(x) / dx should thus raise
a disconnected input exception (if these exceptions are
enabled).
As another example, the elements of the Alloc op's outputs
are not affected by the shape arguments to the Alloc op.
Failing to implement this function for an op that needs it can
result in tensor.grad erroneously reporting that a gradient is
undefined. Returning 0 for this input in the grad method is not
the same as specifying that the elements of this input are not
connected to the output. If the gradient with respect to the
op's output is NaN but the elements of the input are not connected
to it, then the NaN never enters into the expression for the
gradient.
result in two types of incorrect behavior:
1) gradient.grad erroneously raising a TypeError reporting that
a gradient is undefined.
2) gradient.grad failing to raise a ValueError reporting that
an input is disconnected.
Even if connection_pattern is not implemented correctly,
if gradient.grad returns an expression, that expression will
be numerically correct.
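The contract above can be sketched in plain Python. This is a hypothetical, simplified illustration (the class and helper names are invented; Theano's real gradient.grad machinery is far more involved) of how a connection_pattern entry maps inputs to outputs for an Alloc-like op:

```python
# Hypothetical sketch of the connection_pattern contract described above.
# An Alloc-like op takes inputs [value, shape] and fills an array of the
# given shape with `value`: the output's *elements* depend only on `value`.
class AllocLikeOp(object):
    def connection_pattern(self, node):
        # pattern[input_idx][output_idx]
        return [[True],   # elements of `value` affect the output elements
                [False]]  # `shape` affects only the output's shape metadata

def is_connected(op, node, input_idx, output_idx):
    # Default when connection_pattern is not implemented: assume every
    # input is connected to every output, as gradient.grad does.
    if hasattr(op, 'connection_pattern'):
        return op.connection_pattern(node)[input_idx][output_idx]
    return True

op = AllocLikeOp()
assert is_connected(op, None, 0, 0)      # value -> output: connected
assert not is_connected(op, None, 1, 0)  # shape -> output: disconnected
```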
.. function:: grad(inputs, output_gradients)
Optional (but needed to have it work with {tensor,sparse}.grad()).
Optional (but needed to have it work with gradient.grad()).
If the Op being defined is differentiable, its gradient may be specified
symbolically in this method. Both ``inputs`` and ``output_gradients``
......@@ -217,6 +239,70 @@ following methods:
Both the partial differentiation and the multiplication have to be performed by
:func:`grad`.
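As a minimal numeric illustration (plain Python, not Theano's symbolic API; the function names are invented), a grad method for an op computing f(x) = x**2 must itself apply the chain rule, multiplying the incoming output gradient by the partial derivative:

```python
# Hypothetical sketch of the grad contract: grad receives the op's inputs
# and the gradients of the cost with respect to the op's outputs, and
# returns the gradients of the cost with respect to the inputs.
def square_grad(inputs, output_gradients):
    (x,), (gz,) = inputs, output_gradients
    # d(x**2)/dx = 2*x; the multiplication by gz (chain rule) happens here.
    return [2.0 * x * gz]

(g,) = square_grad((3.0,), (1.0,))
assert g == 6.0
```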
Theano currently imposes the following constraints on the values returned by the grad method:
1) They must be Variable instances.
2) When they are types that have dtypes, they must never have an integer dtype.
Integers are a tricky subject. Integers are the main reason for having DisconnectedType,
NullType or zero gradient. When you have an integer as an argument to your grad method,
recall the definition of a derivative to help you decide what value to return:
:math:`\frac{d f}{d x} = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon`.
Suppose your function f has an integer-valued output. For most functions you're likely
to implement in theano, this means your gradient should be zero, because f(x+epsilon)
= f(x) for almost all x. (The only other option is that the gradient could be undefined,
if your function is discontinuous everywhere, like the rational indicator function.)
Suppose your function f has an integer-valued input. This is a little trickier, because
you need to think about what you mean mathematically when you make a variable integer-valued
in theano. Most of the time in machine learning we mean "f is a function of a real-valued
x, but we are only going to pass in integer-values of x". In this case, f(x+epsilon) exists,
so the gradient through f should be the same whether x is an integer or a floating point
variable. Sometimes what we mean is "f is a function of an integer-valued x, and f is only
defined where x is an integer." Since f(x+epsilon) doesn't exist, the gradient is undefined.
Finally, many times in theano, integer valued inputs don't actually affect the elements of
the output, only its shape.
If your function f has both an integer-valued input and an
integer-valued output, then both rules have to be combined:
- If f is defined at (x+epsilon), then the input gradient is
defined. Since f(x+epsilon) would be equal to f(x) almost
everywhere, the gradient should be 0 (first rule).
- If f is only defined where x is an integer, then the gradient
is undefined, regardless of what the gradient with respect to the
output is.
Examples:
1) f(x,y) = dot product between x and y. x and y are integers.
Since the output is also an integer, f is a step function.
Its gradient is zero almost everywhere, so Op.grad should return
zeros in the shape of x and y.
2) f(x,y) = dot product between x and y. x is floating point and y is an integer.
In this case the output is floating point. It doesn't matter that y is an integer.
We consider f to still be defined at f(x,y+epsilon). The gradient is exactly the
same as if y were floating point.
3) f(x,y) = argmax of x along axis y.
The gradient with respect to y is undefined, because f(x,y) is not defined for
floating point y. How could you take an argmax along a fractional axis?
The gradient with respect to x is 0, because f(x+epsilon, y) = f(x) almost
everywhere.
4) f(x,y) = a vector with y elements, each of which takes on the value x
The grad method should return DisconnectedType()() for y, because the elements of
f don't depend on y. Only the shape of f depends on y. You probably also want to
implement a connection_pattern method to encode this.
5) f(x) = int(x) converts float x into an int. g(y) = float(y) converts an integer y into a float.
If the final cost C = 0.5 * g(y) = 0.5 * g(f(x)), then the
gradient with respect to y will be 0.5, even if y is an
integer. However, the gradient with respect to x will be 0,
because the output of f is integer-valued.
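The integer rules above can be checked numerically with plain NumPy (an illustration of the limit definition, not Theano code):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    # Finite-difference approximation of df/dx from the definition above.
    return (f(x + eps) - f(x)) / eps

# Integer-valued *output* (example 5's f = int(x), here np.floor): the
# function is a step function, so its gradient is zero almost everywhere.
rng = np.random.RandomState(0)
x = rng.uniform(0.1, 0.9, size=100)  # stay away from the jump points
assert np.all(numeric_grad(np.floor, x) == 0.0)

# Integer-valued *input* to a real-valued function (example 5's g):
# d(0.5 * y)/dy is 0.5 even if we only ever evaluate at integer y.
assert np.isclose(numeric_grad(lambda v: 0.5 * v, 3.0), 0.5)
```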
.. function:: infer_shape(node, shapes)
Optional.
......
......@@ -29,3 +29,9 @@ class NullType(Type):
def values_eq(a, b, force_same_dtype=True):
raise ValueError("NullType has no values to compare")
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
Diff collapsed.
......@@ -4,8 +4,8 @@ linkers). It resembles the if clause of any programming language, that
has a `then` and `else` branch, and executes either one or the other
according to the condition provided.
This op contrast the already existent `switch` op, that will evaluate both
branches of the clause and afterwards pick (according to the condition)
This op differs from the already existent `switch` op, that evaluates both
branches of the clause and afterwards picks (according to the condition)
which value to report. Note also that `switch` is an elemwise operation (so
it picks each entry of a matrix according to the condition) while `ifelse`
is a global operation with a scalar condition.
......@@ -60,7 +60,7 @@ class IfElse(PureOp):
:note:
Other Linkers than CVM and VM are INCOMPATIBLE with this Op, and
will ingnore its lazy characteristic, computing both the True and
will ignore its lazy characteristic, computing both the True and
False branch before picking one.
"""
......@@ -212,7 +212,14 @@ class IfElse(PureOp):
for t in ts])
if_false = ([ins[0]] + [theano.tensor.zeros_like(f)
for f in fs] + grads)
return ([None] +
condition = ins[0]
# condition does affect the elements of the output so it is connected.
# For the sake of making the gradient convenient we assume that
# condition + epsilon always triggers the same branch as condition
condition_grad = condition.zeros_like().astype(theano.config.floatX)
return ([condition_grad] +
if_true_op.make_node(*if_true).outputs +
if_false_op.make_node(*if_false).outputs)
......
# Skip test if cuda_ndarray is not available.
from nose.plugins.skip import SkipTest
import numpy
import theano
import theano.sandbox.cuda as cuda_ndarray
if cuda_ndarray.cuda_available == False:
......
......@@ -2,10 +2,10 @@
TODO: implement Images2Neibs.{perform,infer_shape}() methods
"""
import theano
from theano import Op, Apply
import theano.tensor as T
from theano.gradient import grad_not_implemented
from theano.gradient import grad_undefined
class Images2Neibs(Op):
......@@ -59,7 +59,8 @@ class Images2Neibs(Op):
for j in xrange(list 2 dim)
for k in <image column coordinates>
for l in <image row coordinates>
output[idx,:] = flattened version of ten4[i,j,l:l+r,k:k+c]
output[idx,:]
= flattened version of ten4[i,j,l:l+r,k:k+c]
idx += 1
(note: the op isn't necessarily implemented internally with these
for loops, they're just the easiest way to describe the output pattern)
......@@ -90,8 +91,11 @@ class Images2Neibs(Op):
(hasattr(neib_shape, "equals") and
neib_shape.equals(neib_step))):
return [neibs2images(gz, neib_shape, x.shape, mode=self.mode),
None, None]
return [grad_not_implemented(self, 0, x), None, None]
grad_undefined(self, 1, neib_shape),
grad_undefined(self, 2, neib_step)]
return [grad_not_implemented(self, 0, x),
grad_undefined(self, 1, neib_shape),
grad_undefined(self, 2, neib_step)]
def c_code_cache_version(self):
return (5,)
......@@ -307,5 +311,3 @@ def neibs2images(neibs, neib_shape, original_shape, mode='valid'):
raise NotImplementedError("neibs2images does not support mode=%s" % mode)
return output_4d
Diff collapsed.
......@@ -260,12 +260,16 @@ class Scan(PureOp):
zip(self.inner_seqs(self.inputs),
self.outer_seqs(inputs))):
if inner_seq.type.dtype != outer_seq[idx].type.dtype:
assert isinstance(idx, int)
raise ValueError(err_msg1 % ('sequence',
str(outer_seq),
idx,
outer_seq.type.dtype,
outer_seq.ndim,
str(inner_seq),
inner_seq.type.dtype))
inner_seq.type.dtype,
inner_seq.ndim))
argoffset += len(self.outer_seqs(inputs))
# Check that these 3 things have the same dtype for mit_mot:
# - initial state of the output
......@@ -1260,7 +1264,7 @@ class Scan(PureOp):
# the gradients with respect to all outputs)
def compute_gradient(y, g_y):
gmp = gradient.grad_sources_inputs(
[(y, g_y)], diff_inputs, False)
[(y, g_y)], diff_inputs)
return [gmp.get(p, None) for p in diff_inputs]
# 6. clean the outputs (i.e. remove update rules)
......@@ -1301,7 +1305,13 @@ class Scan(PureOp):
# 7.3. compute gradients of the inputs given one output
for dx, out in enumerate(clean_outputs):
inner_g_out = safe_new(out)
if g_outs[dx] != None:
inner_g_out = safe_new(g_outs[dx][0])
else:
# We do not have a gradient on this output so we need a
# placeholder, which for now has the same dtype as the
# output
inner_g_out = safe_new(out)
###
#### I need to clip the gradient HERE !!
......
......@@ -18,6 +18,7 @@ from theano.gof.python25 import all
from theano.gradient import DisconnectedType
from theano.sparse.utils import hash_from_sparse
import theano.tests.unittest_tools as utt
from theano.gradient import grad_not_implemented
sparse_formats = ['csc', 'csr']
......@@ -255,11 +256,13 @@ def sp_zeros_like(x):
:return: The same as `x` with zero entries
for all elements.
"""
# TODO: don't restrict to CSM formats
_, _, indptr, shape = csm_properties(x)
return CSM(format=x.format)(numpy.array([], dtype=x.type.dtype),
numpy.array([]), tensor.zeros_like(indptr),
shape)
return CSM(format=x.format)(data=numpy.array([], dtype=x.type.dtype),
indices=numpy.array([]),
indptr=tensor.zeros_like(indptr),
shape=shape)
class _sparse_py_operators:
......@@ -670,7 +673,7 @@ class CSM(gof.Op):
the sparse matrix. Fancy indexing with numpy.ndarray
should be used for this purpose.
:param data: One dimensionnal tensor representing
:param data: One dimensional tensor representing
the data of the sparse to construct.
:param indices: One dimensional tensor of integers
representing the indices of the sparse
......@@ -678,7 +681,7 @@ class CSM(gof.Op):
:param indptr: One dimensional tensor of integers
representing the index pointer for
the sparse matrix to construct.
:param shape: One dimensionnal tensor of integers
:param shape: One dimensional tensor of integers
representing the shape of the sparse
matrix to construct.
......@@ -782,6 +785,9 @@ class CSM(gof.Op):
indptr.copy()), shape.copy(),
copy=False)
def connection_pattern(self, node):
return [[True], [False], [False], [False]]
def grad(self, (x_data, x_indices, x_indptr, x_shape), (g_out,)):
g_data, g_indices, g_indptr, g_shape = csm_properties(g_out)
# unpack the data vector and wrap it as a 1d TensorType
......@@ -984,7 +990,19 @@ class DenseFromSparse(gof.op.Op):
def grad(self, (x, ), (gz, )):
if self.sparse_grad:
return [sp_ones_like(x) * gz]
left = sp_ones_like(x)
right = gz
# Do upcasting if necessary to avoid an unimplemented case
# of mul
if right.dtype == 'float64' and left.dtype == 'float32':
left = left.astype('float64')
if right.dtype == 'float32' and left.dtype == 'float64':
right = right.astype('float64')
return [left * right]
else:
return [SparseFromDense(x.type.format)(gz)]
......@@ -1993,7 +2011,9 @@ class MulSS(gof.op.Op):
def make_node(self, x, y):
x, y = as_sparse_variable(x), as_sparse_variable(y)
if x.type != y.type:
raise NotImplementedError()
raise NotImplementedError(
"MulSS not supported for differing types. "
"Got %s and %s." % (str(x.type), str(y.type)))
return gof.Apply(self, [x, y], [x.type()])
def perform(self, node, (x, y), (out, )):
......@@ -2042,7 +2062,9 @@ class MulSD(gof.op.Op):
y = tensor.cast(y, dtype)
if x.type.dtype != y.type.dtype:
raise NotImplementedError()
raise NotImplementedError(
"MulSD not implemented for different input dtypes. "
"Got %s and %s." % (x.type.dtype, y.type.dtype))
# The magic number two here arises because L{scipy.sparse}
# objects must be matrices (have dimension 2)
# Broadcasting of the sparse matrix is not supported.
......@@ -2128,7 +2150,9 @@ class MulSV(gof.op.Op):
assert y.type.ndim == 1
if x.type.dtype != y.type.dtype:
raise NotImplementedError()
raise NotImplementedError(
"MulSV not implemented for differing dtypes."
"Got %s and %s." % (str(x.type.dtype), str(y.type.dtype)))
return gof.Apply(self,
[x, y],
[SparseType(dtype=x.type.dtype,
......@@ -2142,6 +2166,15 @@ class MulSV(gof.op.Op):
def grad(self, (x, y), (gz,)):
assert _is_sparse_variable(x) and _is_dense_variable(y)
assert _is_sparse_variable(gz)
# mul_s_v is not implemented if the types vary
if gz.dtype == 'float64' and y.dtype == 'float32':
y = y.astype('float64')
if gz.dtype == 'float32' and y.dtype == 'float64':
gz = gz.astype('float64')
return mul_s_v(gz, y), sp_sum(x * gz, axis=0, sparse_grad=True)
def infer_shape(self, node, ins_shapes):
......@@ -2176,8 +2209,18 @@ def mul(x, y):
assert x_is_sparse_variable or y_is_sparse_variable
if x_is_sparse_variable and y_is_sparse_variable:
# mul_s_s is not implemented if the types differ
if y.dtype == 'float64' and x.dtype == 'float32':
x = x.astype('float64')
return mul_s_s(x, y)
elif x_is_sparse_variable and not y_is_sparse_variable:
# mul is unimplemented if the dtypes differ
if y.dtype == 'float64' and x.dtype == 'float32':
x = x.astype('float64')
return mul_s_d(x, y)
elif y_is_sparse_variable and not x_is_sparse_variable:
return mul_s_d(y, x)
......@@ -3260,7 +3303,7 @@ class SamplingDot(gof.op.Op):
rval = [
dot(p * gz, y),
dot((p * gz).T, x),
None
grad_not_implemented(self, 2, p)
]
return rval
......
Diff collapsed.
......@@ -14,6 +14,7 @@ from theano.scalar import Scalar
from theano.printing import min_informative_str, pprint
from theano.gof.python25 import all, any
from theano.tensor.utils import hash_from_dict
from theano.gradient import DisconnectedType
config = theano.config
......@@ -277,7 +278,8 @@ class DimShuffle(Op):
#get the copy / view of the input depending on whether we're doing
# things inplace or not.
if self.inplace:
get_base = ['{ PyArrayObject * %(basename)s = %(input)s', 'Py_INCREF((PyObject*)%(basename)s)']
get_base = [
'{ PyArrayObject * %(basename)s = %(input)s', 'Py_INCREF((PyObject*)%(basename)s)']
else:
get_base = [('{ PyArrayObject * %(basename)s = (PyArrayObject*)PyArray_FromAny((PyObject*)%(input)s, NULL,'
'0, 0, NPY_ALIGNED|NPY_ENSURECOPY, NULL)')]
......@@ -285,7 +287,8 @@ class DimShuffle(Op):
shape_statements = ['npy_intp dimensions[%i]' % nd_out]
for i, o in enumerate(self.new_order):
if o != 'x':
shape_statements += [('dimensions[' + str(i) + '] = %(basename)s->dimensions[' + str(o) + ']')]
shape_statements += [('dimensions[' + str(
i) + '] = %(basename)s->dimensions[' + str(o) + ']')]
else:
shape_statements += [('dimensions[' + str(i) + '] = 1')]
......@@ -294,7 +297,8 @@ class DimShuffle(Op):
#set the strides of the non-broadcasted dimensions
for i, o in enumerate(self.new_order):
if o != 'x':
strides_statements += [('strides[' + str(i) + '] = %(basename)s->strides[' + str(o) + ']')]
strides_statements += [('strides[' + str(i)
+ '] = %(basename)s->strides[' + str(o) + ']')]
else:
strides_statements += [('strides[' + str(i) + '] = 0')]
......@@ -310,7 +314,8 @@ class DimShuffle(Op):
'-1] = %(basename)s->descr->elsize'
)
for i in xrange(nd_out - 2, -1, -1):
strides_statements.append("if (strides[%(i)s] == 0) strides[%(i)s] = strides[%(i)s+1] * dimensions[%(i)s+1]" % dict(i=str(i)))
strides_statements.append(
"if (strides[%(i)s] == 0) strides[%(i)s] = strides[%(i)s+1] * dimensions[%(i)s+1]" % dict(i=str(i)))
#
# PyObject* PyArray_New(PyTypeObject* subtype, int nd, npy_intp* dims, int type_num,
......@@ -605,7 +610,8 @@ class Elemwise(Op):
# the right thing to do .. have to talk to Ian and James
# about it
if bgrads[jdx] is None:
if bgrads[jdx] is None or \
isinstance(bgrads[jdx].type, DisconnectedType):
pass
elif eval_point is not None:
if rop_out is None:
......@@ -617,6 +623,13 @@ class Elemwise(Op):
return rval
def connection_pattern(self, node):
if hasattr(self.scalar_op, 'connection_pattern'):
return self.scalar_op.connection_pattern(node)
return [[True for output in node.outputs] for ipt in node.inputs]
def grad(self, inputs, ograds):
#compute grad with respect to broadcasted input
......@@ -676,10 +689,16 @@ class Elemwise(Op):
theano.config.compute_test_value = prev_setting
if not isinstance(scalar_igrads, (list, tuple)):
raise TypeError('%s.grad returned %s instead of list or tuple' %
(str(self.scalar_op), str(type(scalar_igrads))))
nd = len(inputs[0].type.broadcastable) # this is the same for everyone
def transform(r):
# From a graph of ScalarOps, make a graph of Broadcast ops.
if isinstance(r.type, DisconnectedType):
return r
if r in scalar_inputs:
return inputs[scalar_inputs.index(r)]
if r in scalar_ograds:
......@@ -803,7 +822,7 @@ class Elemwise(Op):
errormsg = ('While computing ' + str(node.outputs) +
': Failed calling ufunc for op ' +
str(self.scalar_op) +
'for params of shape ' +
' for params of shape ' +
str([arg.shape for arg in ufunc_args]))
if config.exception_verbosity == 'high':
......@@ -1324,7 +1343,8 @@ class CAReduce(Op):
alloc += """
for(int i=0;i<%(iname)s->nd;i++){
if(PyArray_DIMS(%(iname)s)[i]==0 && tosum[i]){
PyErr_Format(PyExc_ValueError, "Input of CAReduce{%(scal_name)s} has zero-size on axis %%d",i);
PyErr_Format(PyExc_ValueError,
"Input of CAReduce{%(scal_name)s} has zero-size on axis %%d",i);
%(fail)s;
}
}
......@@ -1585,6 +1605,12 @@ class Sum(CAReduceDtype):
def grad(self, inp, grads):
x, = inp
out = self(*inp)
if out.dtype.find('int') != -1:
return [x.zeros_like().astype(theano.config.floatX)]
gz, = grads
gz = as_tensor_variable(gz)
axis = self.axis
......@@ -1601,7 +1627,7 @@ class Sum(CAReduceDtype):
new_dims.append(i)
i += 1
ds_op = DimShuffle(gz.type.broadcastable, new_dims)
gx = Elemwise(scalar.second)(x, ds_op(gz).astype(x.dtype))
gx = Elemwise(scalar.second)(x, ds_op(gz))
return [gx]
def R_op(self, inputs, eval_points):
......@@ -1646,7 +1672,7 @@ class Prod(CAReduceDtype):
def grad(self, inp, grads):
'''
The grad of this Op could be very easy, it is was not for the case
The grad of this Op could be very easy, if it were not for the case
where zeros are present in a given "group" (ie. elements reduced
together to form the product).
......@@ -1692,8 +1718,11 @@ class Prod(CAReduceDtype):
'''
prod_in, = inp
gz, = grads
if prod_in.dtype[0:3] in ('int', 'uin'):
return [None]
out = self(*inp)
if out.dtype[0:3] in ('int', 'uin'):
return [prod_in.zeros_like().astype(theano.config.floatX)]
# Prepare the broadcasting that is used everywhere to broadcast
# over the original groups (ie. broadcast over the elements of a given
......
......@@ -5,6 +5,7 @@ import theano
import basic
from theano import gof, scalar
import basic as tensor
from theano.gradient import DisconnectedType
class DiffOp(theano.Op):
......@@ -148,7 +149,13 @@ class BinCountOp(theano.Op):
z[0] = np.bincount(x, weights=weights, minlength=self.minlength)
def grad(self, inputs, outputs_gradients):
return [None for i in inputs]
output = self(*inputs)
if output.dtype.find('int') != -1:
return [inp.zeros_like().astype(theano.config.floatX)
for inp in inputs]
raise NotImplementedError()
def infer_shape(self, node, ins_shapes):
x = node.inputs[0]
......@@ -252,6 +259,10 @@ class RepeatOp(theano.Op):
z = output_storage[0]
z[0] = np.repeat(x, repeats=repeats, axis=self.axis)
def connection_pattern(self, node):
return [[True], [False]]
def grad(self, (x, repeats), (gz, )):
if repeats.ndim == 0:
if self.axis is None:
......@@ -265,7 +276,8 @@ class RepeatOp(theano.Op):
shape = [x.shape[k] for k in range(x.ndim)]
shape.insert(axis, repeats)
return [gz.reshape(shape, x.ndim + 1).sum(axis=axis), None]
return [gz.reshape(shape, x.ndim + 1).sum(axis=axis),
DisconnectedType()()]
elif repeats.ndim == 1:
# For this implementation, we would need to specify the length
# of repeats in order to split gz in the right way to sum
......@@ -387,7 +399,6 @@ def bartlett(M):
return bartlett_(M)
class FillDiagonal(gof.Op):
# See function fill_diagonal for docstring
def __eq__(self, other):
......
......@@ -2,6 +2,8 @@ import theano
from theano.tensor import basic as T
from theano.misc import strutil
import numpy as N
from theano.gradient import grad_undefined
from theano.gradient import DisconnectedType
#TODO: speed up by reordering loops. Should pass through the videos once, incrementing all weight gradients, rather
......@@ -9,7 +11,7 @@ import numpy as N
class ConvGrad3D(theano.Op):
""" Gradient of Conv3D with respect to W """
def __eq__(self,other):
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
......@@ -27,20 +29,26 @@ class ConvGrad3D(theano.Op):
return theano.Apply(self, inputs=[V_, d_, WShape_, dCdH_], outputs = [ T.TensorType(V_.dtype, (False,False,False,False,False))() ] )
def infer_shape(self, node, input_shapes):
V,d,W_shape, dCdH = node.inputs
V, d, W_shape, dCdH = node.inputs
return [ ( W_shape[0], W_shape[1], W_shape[2], W_shape[3], W_shape[4] ) ]
def grad(self,inputs, output_gradients):
C,d, WShape, B = inputs
dLdA ,= output_gradients
def connection_pattern(self, node):
z = T.zeros_like(C[0,0,0,0,:])
dLdC = convTransp3D( dLdA, z, d, B, C.shape[1:4])
dLdd = None #not differentiable, since d is not continuous
dLdWShape = None #not differentiable, since d is not continuous
dLdB = conv3D( C, dLdA, T.zeros_like(B[0,0,0,0,:]), d)
return [[True], [True], [False], [True]]
return [ dLdC, dLdd, dLdWShape, dLdB ]
def grad(self, inputs, output_gradients):
C, d, WShape, B = inputs
dLdA, = output_gradients
z = T.zeros_like(C[0, 0, 0, 0, :])
dLdC = convTransp3D(dLdA, z, d, B, C.shape[1:4])
# d actually does affect the outputs, so it's not disconnected
dLdd = grad_undefined(self, 1, d)
# The shape of the weights doesn't affect the output elements
dLdWShape = DisconnectedType()()
dLdB = conv3D(C, dLdA, T.zeros_like(B[0, 0, 0, 0, :]), d)
return [dLdC, dLdd, dLdWShape, dLdB]
def perform(self, node, inputs, output_storage):
V, d, WShape, dCdH = inputs
......@@ -64,17 +72,15 @@ class ConvGrad3D(theano.Op):
#print 'computing output of shape '+str(WShape)
for k in xrange(0,WShape[1]):
for l in xrange(0,WShape[2]):
for m in xrange(0,WShape[3]):
for i in xrange(0,batchSize):
for p in xrange(0,outputHeight):
for q in xrange(0,outputWidth):
for r in xrange(0,outputDur):
for j in xrange(0,WShape[0]):
for z in xrange(0,WShape[4]):
for k in xrange(0, WShape[1]):
for l in xrange(0, WShape[2]):
for m in xrange(0, WShape[3]):
for i in xrange(0, batchSize):
for p in xrange(0, outputHeight):
for q in xrange(0, outputWidth):
for r in xrange(0, outputDur):
for j in xrange(0, WShape[0]):
for z in xrange(0, WShape[4]):
dCdW[j,k,l,m,z] += dCdH[i,p,q,r,j] * V[i,dr*p+k,dc*q+l,dt*r+m,z]
output_storage[0][0] = dCdW
......@@ -89,7 +95,7 @@ class ConvGrad3D(theano.Op):
dCdW = outputs[0]
codeSource = """
codeSource = """
///////////// < code generated by ConvGradW3D >
//printf("\t\t\t\tConvGradW3D c code\\n");
......@@ -269,7 +275,7 @@ class ConvGrad3D(theano.Op):
///////////// < /code generated by ConvGradW3D >
"""
return strutil.renderString(codeSource,locals())
return strutil.renderString(codeSource, locals())
convGrad3D = ConvGrad3D()
......
......@@ -2,10 +2,13 @@ import numpy as N
from theano.tensor import basic as T
from theano.misc import strutil
import theano
from theano.gradient import grad_undefined
from theano.gradient import DisconnectedType
class ConvTransp3D(theano.Op):
""" "Transpose" of Conv3D (Conv3D implements multiplication by an implicitly defined matrix W. This implements multiplication by its transpose) """
def __eq__(self,other):
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
......@@ -14,7 +17,7 @@ class ConvTransp3D(theano.Op):
def c_code_cache_version(self):
return (3,)
def make_node(self, W, b, d, H, RShape = None):
def make_node(self, W, b, d, H, RShape=None):
"""
:param W: Weights, filter
:param b: bias, shape == (W.shape[0],)
......@@ -28,7 +31,7 @@ class ConvTransp3D(theano.Op):
if RShape:
RShape_ = T.as_tensor_variable(RShape)
else:
RShape_ = T.as_tensor_variable([-1,-1,-1])
RShape_ = T.as_tensor_variable([-1, -1, -1])
return theano.Apply(self, inputs=[W_,b_,d_,H_, RShape_], outputs = [ T.TensorType(H_.dtype, (False,False,False,False,False))() ] )
......@@ -36,22 +39,25 @@ class ConvTransp3D(theano.Op):
flags = ['-Werror']
return flags
def infer_shape(self, node, input_shapes):
W,b,d,H,RShape = node.inputs
W, b, d, H, RShape = node.inputs
W_shape, b_shape, d_shape, H_shape, RShape_shape = input_shapes
return [(H_shape[0], RShape[0], RShape[1], RShape[2], W_shape[4])]
def grad(self,inputs, output_gradients):
W,b,d,H, RShape = inputs
dCdR ,= output_gradients
dCdH = conv3D( dCdR, W, T.zeros_like(H[0,0,0,0,:]), d)
WShape = W.shape
dCdW = convGrad3D(dCdR,d,WShape,H)
dCdb = T.sum(dCdR,axis=(0,1,2,3))
dCdd = None #not differentiable, since d is not continuous
dCdRShape = None #not differentiable, since RShape is not continuous
def connection_pattern(self, node):
return [[True], [True], [True], [True], [False]]
def grad(self, inputs, output_gradients):
W, b, d, H, RShape = inputs
dCdR, = output_gradients
dCdH = conv3D(dCdR, W, T.zeros_like(H[0, 0, 0, 0, :]), d)
WShape = W.shape
dCdW = convGrad3D(dCdR, d, WShape, H)
dCdb = T.sum(dCdR, axis=(0, 1, 2, 3))
# not differentiable, since d affects the output elements
dCdd = grad_undefined(self, 2, d)
# disconnected, since RShape just determines the output shape
dCdRShape = DisconnectedType()()
if 'name' in dir(dCdR) and dCdR.name is not None:
dCdR_name = dCdR.name
......@@ -76,15 +82,14 @@ class ConvTransp3D(theano.Op):
dCdW.name = 'ConvTransp3D_dCdW.H='+H_name+',dCdR='+dCdR_name+',W='+W_name
dCdb.name = 'ConvTransp3D_dCdb.H='+H_name+',dCdR='+dCdR_name+',W='+W_name+',b='+b_name
dCdH.name = 'ConvTransp3D_dCdH.H='+H_name+',dCdR='+dCdR_name
return [ dCdW, dCdb, dCdd, dCdH, dCdRShape ]
dCdH.name = 'ConvTransp3D_dCdH.H=' + H_name + ',dCdR=' + dCdR_name
return [dCdW, dCdb, dCdd, dCdH, dCdRShape]
def perform(self, node, inputs, output_storage):
W, b, d, H, RShape = inputs
# print "\t\t\t\tConvTransp3D python code"
output_storage[0][0] = computeR(W,b,d,H,RShape)
output_storage[0][0] = computeR(W, b, d, H, RShape)
def c_code(self, node, nodename, inputs, outputs, sub):
W, b, d, H, RShape = inputs
......@@ -321,33 +326,35 @@ class ConvTransp3D(theano.Op):
///////////// < /code generated by ConvTransp3D >
"""
return strutil.renderString(codeSource,locals())
return strutil.renderString(codeSource, locals())
convTransp3D = ConvTransp3D()
#If the input size wasn't a multiple of D we may need to cause some automatic padding to get the right size of reconstruction
def computeR(W,b,d,H,Rshape = None):
def computeR(W, b, d, H, Rshape=None):
assert len(W.shape) == 5
assert len(H.shape) == 5
assert len(b.shape) == 1
assert len(d) == 3
outputChannels, filterHeight, filterWidth, filterDur, inputChannels = W.shape
batchSize, outputHeight, outputWidth, outputDur, outputChannelsAgain = H.shape
outputChannels, filterHeight, filterWidth, filterDur, \
inputChannels = W.shape
batchSize, outputHeight, outputWidth, outputDur, \
outputChannelsAgain = H.shape
assert outputChannelsAgain == outputChannels
assert b.shape[0] == inputChannels
dr,dc,dt = d
dr, dc, dt = d
assert dr > 0
assert dc > 0
assert dt > 0
videoHeight = (outputHeight-1) * dr + filterHeight
videoWidth = (outputWidth-1) * dc + filterWidth
videoDur = (outputDur-1) * dt + filterDur
videoHeight = (outputHeight - 1) * dr + filterHeight
videoWidth = (outputWidth - 1) * dc + filterWidth
videoDur = (outputDur - 1) * dt + filterDur
if Rshape is not None and Rshape[0] != -1:
if Rshape[0] < videoHeight:
......@@ -364,24 +371,27 @@ def computeR(W,b,d,H,Rshape = None):
#print "video size: "+str((videoHeight, videoWidth, videoDur))
R = N.zeros( (batchSize, videoHeight,
videoWidth, videoDur, inputChannels ) , dtype=H.dtype)
R = N.zeros((batchSize, videoHeight,
videoWidth, videoDur, inputChannels), dtype=H.dtype)
#R[i,j,r,c,t] = b_j + sum_{rc,rk | d \circ rc + rk = r} sum_{cc,ck | ...} sum_{tc,tk | ...} sum_k W[k, j, rk, ck, tk] * H[i,k,rc,cc,tc]
for i in xrange(0,batchSize):
for i in xrange(0, batchSize):
#print '\texample '+str(i+1)+'/'+str(batchSize)
for j in xrange(0,inputChannels):
for j in xrange(0, inputChannels):
#print '\t\tfeature map '+str(j+1)+'/'+str(inputChannels)
for r in xrange(0,videoHeight):
for r in xrange(0, videoHeight):
#print '\t\t\trow '+str(r+1)+'/'+str(videoHeight)
for c in xrange(0,videoWidth):
for t in xrange(0,videoDur):
R[i,r,c,t,j] = b[j]
for c in xrange(0, videoWidth):
for t in xrange(0, videoDur):
R[i, r, c, t, j] = b[j]
ftc = max([0, int(N.ceil(float(t-filterDur +1 )/float(dt))) ])
fcc = max([0, int(N.ceil(float(c-filterWidth +1)/float(dc))) ])
ftc = max([0, int(N.ceil(
float(t - filterDur + 1) / float(dt)))])
fcc = max([0, int(N.ceil(
float(c - filterWidth + 1) / float(dc)))])
rc = max([0, int(N.ceil(float(r-filterHeight+1)/float(dr))) ])
rc = max([0, int(N.ceil(
float(r - filterHeight + 1) / float(dr)))])
while rc < outputHeight:
rk = r - rc * dr
if rk < 0:
......@@ -399,20 +409,21 @@ def computeR(W,b,d,H,Rshape = None):
if tk < 0:
break
-R[i,r,c,t,j] += N.dot(W[:,rk,ck,tk,j], H[i,rc,cc,tc,:] )
+R[
+i,r,c,t,j] += N.dot(W[:,rk,ck,tk,j], H[i,rc,cc,tc,:] )
tc += 1
"" #close loop over tc
"" # close loop over tc
cc += 1
"" #close loop over cc
"" # close loop over cc
rc += 1
"" #close loop over rc
"" #close loop over t
"" #close loop over c
"" #close loop over r
"" #close loop over j
"" #close loop over i
"" # close loop over rc
"" # close loop over t
"" # close loop over c
"" # close loop over r
"" # close loop over j
"" # close loop over i
return R
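The nested loops above implement the reconstruction formula given in the comment (a transposed 3-D convolution). As a minimal sketch of the same index relation `r = rc * dr + rk` in plain NumPy, with hypothetical toy sizes, consider the 1-D analogue:

```python
import numpy as np

# 1-D analogue of the reconstruction above (hypothetical toy sizes):
# R[r] = sum over (rc, rk) with r = rc * dr + rk of W[rk] * H[rc]
def reconstruct_1d(W, H, dr):
    filterLen = len(W)
    outLen = len(H)
    # same size formula as videoHeight = (outputHeight - 1) * dr + filterHeight
    R = np.zeros((outLen - 1) * dr + filterLen)
    for rc in range(outLen):
        for rk in range(filterLen):
            R[rc * dr + rk] += W[rk] * H[rc]
    return R

R = reconstruct_1d(np.array([1.0, 2.0]), np.array([1.0, 1.0]), dr=2)
# R has length (2 - 1) * 2 + 2 = 4
```

The real op does the same accumulation over rows, columns, time, and channels at once.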
@@ -15,6 +15,7 @@ from theano.gof import Apply
from theano.tensor.nnet.sigm import sigmoid, softplus
from theano.gradient import DisconnectedType
+from theano.gradient import grad_not_implemented
############
@@ -79,7 +80,7 @@ class SoftmaxWithBias(gof.Op):
g_sm, = grads
if isinstance(g_sm.type, DisconnectedType):
-return [ DisconnectedType()(), DisconnectedType()() ]
+return [DisconnectedType()(), DisconnectedType()()]
sm = softmax_with_bias(x, b)
dx = softmax_grad(g_sm, sm)
@@ -560,8 +561,8 @@ if 0:
axis = ds_input.owner.op.axis
sum_input = ds_input.owner.inputs[0]
-if ((ds_order!=(0,'x')) or
-(axis!=(1,)) or
+if ((ds_order != (0, 'x')) or
+(axis != (1,)) or
(sum_input is not prod_term)):
rest.append(add_in)
#print 'ds_order =', ds_order
@@ -712,16 +713,20 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
am_shp = idx_shp
return [nll_shp, sm_shp, am_shp]
+def connection_pattern(self, node):
+return [[True, True, True], # x
+[True, True, True], # b
+[False, False, True]] # y_idx
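For reference, `connection_pattern` is indexed as `pattern[input_idx][output_idx]`. A plain-Python sketch of querying the table above (the `is_connected` helper is hypothetical; `gradient.grad` performs this lookup internally):

```python
# The table above as nested lists: pattern[input_idx][output_idx] is
# True when elements of that input affect elements of that output.
pattern = [[True, True, True],    # x
           [True, True, True],    # b
           [False, False, True]]  # y_idx

def is_connected(pattern, input_idx, output_idx):
    # hypothetical helper mirroring the lookup gradient.grad performs
    return pattern[input_idx][output_idx]

# per the table, y_idx (input 2) is connected only to the third output
```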
def grad(self, inp, grads):
x, b, y_idx = inp
g_nll, g_sm, g_am = grads
dx_terms = []
db_terms = []
d_idx_terms = []
if not isinstance(g_nll.type, DisconnectedType):
nll, sm = crossentropy_softmax_1hot_with_bias(x, b, y_idx)
dx = crossentropy_softmax_1hot_with_bias_dx(g_nll, sm, y_idx)
@@ -739,7 +744,7 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
db_terms.append(b.zeros_like())
d_idx_terms.append(y_idx.zeros_like())
-def fancy_sum( terms ):
+def fancy_sum(terms):
if len(terms) == 0:
return DisconnectedType()()
rval = terms[0]
@@ -747,8 +752,8 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
rval = rval + term
return rval
-return [ fancy_sum(terms) for terms in
-[dx_terms, db_terms, d_idx_terms ] ]
+return [fancy_sum(terms) for terms in
+[dx_terms, db_terms, d_idx_terms]]
def c_headers(self):
return ['<iostream>', '<cmath>']
@@ -897,7 +902,7 @@ class CrossentropySoftmax1HotWithBiasDx (gof.Op):
sm, tensor.fill(dy, -1), y_idx_range, y_idx),
axis=1)
g_sm = dy.dimshuffle(0, 'x') * g_dx
-g_y_idx = None
+g_y_idx = grad_not_implemented(self, 2, y_idx)
return [g_dy, g_sm, g_y_idx]
def c_code_cache_version(self):
@@ -1136,7 +1141,7 @@ class CrossentropyCategorical1Hot(gof.Op):
coding, one_of_n = inp
g_y, = grads
return [crossentropy_categorical_1hot_grad(g_y, coding, one_of_n),
-None]
+grad_not_implemented(self, 1, one_of_n)]
crossentropy_categorical_1hot = CrossentropyCategorical1Hot()
@@ -1325,7 +1330,6 @@ def local_advanced_indexing_crossentropy_onehot(node):
except Exception:
pass
if sm is not None and sm.owner and sm.owner.op in (softmax,
softmax_with_bias):
sm_w_bias = local_softmax_with_bias.transform(sm.owner)
@@ -1481,7 +1485,8 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
if adv_subtensor is not None:
try:
-maybe_sm, maybe_rows, maybe_labels = adv_subtensor.owner.inputs
+maybe_sm, maybe_rows, \
+maybe_labels = adv_subtensor.owner.inputs
except Exception:
return
@@ -1691,7 +1696,6 @@ class Prepend_scalar_constant_to_each_row(gof.Op):
shp = (in_shapes[0][0], in_shapes[0][1] + 1)
return [shp]
def grad(self, inp, grads):
mat, = inp
goutput, = grads
@@ -1758,18 +1762,19 @@ prepend_1_to_each_row = Prepend_scalar_constant_to_each_row(1.)
#numerically stabilize log softmax (X)
# as X-X.max(axis=1).dimshuffle(0,'x') - log(exp(X-X.max(axis=1).dimshuffle(0,'x')).sum(axis=1)).dimshuffle(0,'x')
def make_out_pattern(X):
-stabilized_X = X - X.max(axis=1).dimshuffle(0,'x')
-out_var = stabilized_X - tensor.log(tensor.exp(stabilized_X).sum(axis=1)).dimshuffle(0,'x')
+stabilized_X = X - X.max(axis=1).dimshuffle(0, 'x')
+out_var = stabilized_X - tensor.log(tensor.exp(stabilized_X).sum(
+axis=1)).dimshuffle(0, 'x')
#tell DEBUG_MODE that it's OK if the original graph produced NaN and the optimized graph does not
out_var.values_eq_approx = out_var.type.values_eq_approx_remove_nan
return out_var
-local_log_softmax = gof.PatternSub( in_pattern = (tensor.log, (softmax, 'x')),
-out_pattern = (make_out_pattern, 'x'),
+local_log_softmax = gof.PatternSub(in_pattern=(tensor.log, (softmax, 'x')),
+out_pattern=(make_out_pattern, 'x'),
allow_multiple_clients=True)
#don't do register_stabilize, this is to make local_log_softmax run
#only after another more specific optimization that stabilizes cross entropy
#opt.register_stabilize(local_log_softmax, name = 'local_log_softmax')
-opt.register_specialize(local_log_softmax, name = 'local_log_softmax')
+opt.register_specialize(local_log_softmax, name='local_log_softmax')
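The rewrite registered here relies on the identity log(softmax(X)) = (X - m) - log(sum(exp(X - m))) with m the row-wise max. A plain-NumPy sketch (not Theano) of why the stabilized form is preferred:

```python
import numpy as np

# log(softmax(X)) rewritten as (X - m) - log(sum(exp(X - m))), m = row max
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 1000.0]])  # second row overflows the naive form

m = X.max(axis=1, keepdims=True)
stab = X - m
log_sm = stab - np.log(np.exp(stab).sum(axis=1, keepdims=True))

with np.errstate(over='ignore', divide='ignore', invalid='ignore'):
    naive = np.log(np.exp(X) / np.exp(X).sum(axis=1, keepdims=True))

# log_sm stays finite everywhere; naive blows up on the large row
```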
@@ -30,13 +30,20 @@ class ScalarSigmoid(scalar.UnaryScalarOp):
if x > 30.0:
return 1.0
return 1.0 / (1.0 + numpy.exp(-x))
def impl(self, x):
return ScalarSigmoid.st_impl(x)
def grad(self, inp, grads):
x, = inp
gz, = grads
y = scalar_sigmoid(x)
-return [gz * y * (1.0 - y)]
+rval = gz * y * (1.0 - y)
+assert rval.type.dtype.find('float') != -1
+return [rval]
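A quick finite-difference spot-check (plain NumPy, toy value) of the rule the grad method implements, d sigmoid(x)/dx = y * (1 - y):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.3, 1e-6
# central difference vs. the analytic rule used above
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
y = sigmoid(x)
analytic = y * (1.0 - y)
# the two agree closely
```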
def c_code(self, node, name, inp, out, sub):
x, = inp
z, = out
@@ -50,6 +57,7 @@ class ScalarSigmoid(scalar.UnaryScalarOp):
return """%(z)s = %(x)s < -709.0 ? 0.0 : %(x)s > 19.0 ? 1.0 : 1.0 /(1.0+exp(-%(x)s));""" % locals()
else:
raise NotImplementedError('only floatingpoint is implemented')
def c_code_cache_version(self):
v = super(ScalarSigmoid, self).c_code_cache_version()
if v:
@@ -61,7 +69,7 @@ sigmoid = elemwise.Elemwise(scalar_sigmoid, name='sigmoid')
sigmoid_inplace = elemwise.Elemwise(
ScalarSigmoid(scalar.transfer_type(0)),
-inplace_pattern={0:0},
+inplace_pattern={0: 0},
name='sigmoid_inplace',
)
@@ -76,12 +84,15 @@ class ScalarSoftplus(scalar.UnaryScalarOp):
if x > 30.0:
return x
return numpy.log1p(numpy.exp(x))
def impl(self, x):
return ScalarSoftplus.static_impl(x)
def grad(self, inp, grads):
x, = inp
gz, = grads
return [gz * scalar_sigmoid(x)]
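The same finite-difference check applies to the softplus rule above, d/dx log(1 + exp(x)) = sigmoid(x), sketched here in plain NumPy with a toy value:

```python
import numpy as np

x, eps = -0.7, 1e-6
# central difference of softplus vs. the sigmoid the grad method returns
numeric = (np.log1p(np.exp(x + eps)) - np.log1p(np.exp(x - eps))) / (2 * eps)
analytic = 1.0 / (1.0 + np.exp(-x))
```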
def c_code(self, node, name, inp, out, sub):
x, = inp
z, = out
@@ -95,27 +106,29 @@ class ScalarSoftplus(scalar.UnaryScalarOp):
return """%(z)s = %(x)s < -745.0 ? 0.0 : %(x)s > 16.0 ? %(x)s : log1p(exp(%(x)s));""" % locals()
else:
raise NotImplementedError('only floatingpoint is implemented')
def c_code_cache_version(self):
v = super(ScalarSoftplus, self).c_code_cache_version()
if v:
return (2,) + v
else:
return v
-scalar_softplus = ScalarSoftplus(scalar.upgrade_to_float, name= 'scalar_softplus')
+scalar_softplus = ScalarSoftplus(scalar.upgrade_to_float, name='scalar_softplus')
softplus = elemwise.Elemwise(scalar_softplus, name='softplus')
pprint.assign(softplus, printing.FunctionPrinter('softplus'))
def _skip_mul_1(r):
if r.owner and r.owner.op == tensor.mul:
-not_is_1 = [i for i in r.owner.inputs if not _is_1(i) ]
-if len(not_is_1)==1:
+not_is_1 = [i for i in r.owner.inputs if not _is_1(i)]
+if len(not_is_1) == 1:
return not_is_1[0]
logsigm_to_softplus = gof.PatternSub(
(tensor.log, (sigmoid, 'x')),
(tensor.neg, (softplus, (tensor.neg, 'x'))),
-allow_multiple_clients = True,
+allow_multiple_clients=True,
skip_identities_fn=_skip_mul_1)
@@ -131,21 +144,22 @@ def _is_1(expr):
log1msigm_to_softplus = gof.PatternSub(
(tensor.log,
(tensor.sub,
-dict(pattern='y', constraint = _is_1),
+dict(pattern='y', constraint=_is_1),
(sigmoid, 'x'))),
(tensor.neg, (softplus, 'x')),
-allow_multiple_clients = True,
+allow_multiple_clients=True,
skip_identities_fn=_skip_mul_1)
log1pexp_to_softplus = gof.PatternSub(
(tensor.log1p,
(tensor.exp, 'x')),
(softplus, 'x'),
-allow_multiple_clients = True)
+allow_multiple_clients=True)
-opt.register_stabilize(logsigm_to_softplus, name = 'logsigm_to_softplus')
-opt.register_stabilize(log1msigm_to_softplus, name = 'log1msigm_to_softplus')
-opt.register_stabilize(log1pexp_to_softplus, name = 'log1pexp_to_softplus')
+opt.register_stabilize(logsigm_to_softplus, name='logsigm_to_softplus')
+opt.register_stabilize(log1msigm_to_softplus, name='log1msigm_to_softplus')
+opt.register_stabilize(log1pexp_to_softplus, name='log1pexp_to_softplus')
def is_1pexp(t):
"""
@@ -239,7 +253,7 @@ def partition_num_or_denom(r, f):
else:
neg_t, f_t = f_t
f_terms.append(f_t)
-neg ^= neg_t #bit flip if neg_t is true
+neg ^= neg_t  # bit flip if neg_t is true
return f_terms, rest, neg
@@ -291,7 +305,8 @@ def local_exp_over_1_plus_exp(node):
#find all the exp() terms in the numerator
num, denom = node.inputs
num_exp_x, num_rest, num_neg = partition_num_or_denom(num, is_exp)
-denom_1pexp, denom_rest, denom_neg = partition_num_or_denom(denom, is_1pexp)
+denom_1pexp, denom_rest, \
+denom_neg = partition_num_or_denom(denom, is_1pexp)
sigmoids = []
for t in denom_1pexp:
@@ -303,7 +318,7 @@ def local_exp_over_1_plus_exp(node):
# case: 1/(1+exp(x))
sigmoids.append(sigmoid(-t))
-if not sigmoids: # we didn't find any. abort
+if not sigmoids:  # we didn't find any. abort
return
# put the new numerator together
new_num = sigmoids + [tensor.exp(t) for t in num_exp_x] + num_rest
@@ -322,6 +337,7 @@ def local_exp_over_1_plus_exp(node):
else:
return [new_num / tensor.mul(*denom_rest)]
def parse_mul_tree(root):
"""
Parse a tree of multiplications starting at the given root.
@@ -504,7 +520,7 @@ def perform_sigm_times_exp(tree, exp_x=None, exp_minus_x=None, sigm_x=None,
sigm_minus_x = []
if full_tree is None:
full_tree = tree
-if False: # Debug code.
+if False:  # Debug code.
print '<perform_sigm_times_exp>'
print ' full_tree = %s' % full_tree
print ' tree = %s' % tree
@@ -613,10 +629,13 @@ def local_inv_1_plus_exp(node):
if nonconsts[0].owner and nonconsts[0].owner.op == tensor.exp:
if scalars and numpy.allclose(numpy.sum(scalars), 1):
return opt._fill_chain(
-sigmoid(tensor.neg(nonconsts[0].owner.inputs[0])),
+sigmoid(
+tensor.neg(nonconsts[0].owner.inputs[0])),
scalar_inputs)
# Registration is below, and conditional.
@gof.local_optimizer([tensor.sub])
def local_1msigmoid(node):
"""
@@ -625,7 +644,7 @@ def local_1msigmoid(node):
if node.op == tensor.sub:
sub_l, sub_r = node.inputs
if len(sub_r.clients) > 1:
-return # graph is using both sigm and 1-sigm
+return  # graph is using both sigm and 1-sigm
if sub_r.owner and sub_r.owner.op == sigmoid:
try:
val_l = opt.get_constant_value(sub_l)
@@ -678,13 +697,14 @@ if 0:
assert t0.owner.op == div
t0top, t0bot = t0.owner.inputs
t1top, t1bot = t1.owner.inputs
-rval.append(div(mul(*(t0top+t1top)), mul(*(t0bot+t1bot))))
+rval.append(div(mul(*(
+t0top + t1top)), mul(*(t0bot + t1bot))))
if len(rval) > 100:
# This loop can be exponentially long.
# aborting
return []
-elif len(node.outputs)>1:
+elif len(node.outputs) > 1:
return []
else:
return [node.outputs[0]]
@@ -542,15 +542,12 @@ class MakeVector(T.Op):
def grad(self, inputs, output_gradients):
# If the output is of an integer dtype, no gradient shall pass
if 'int' in self.dtype:
-return [None] * len(inputs)
+return [ipt.zeros_like().astype(theano.config.floatX)
+for ipt in inputs]
grads = []
for i, inp in enumerate(inputs):
-if 'int' in inp.dtype:
-# No gradient wrt integer inputs
-grads.append(None)
-else:
-grads.append(output_gradients[0][i])
+grads.append(output_gradients[0][i])
return grads
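This grad method reflects the commit's integer-gradient policy: an integer-dtype output propagates a zero floating-point gradient rather than None. A plain-Python sketch of that dispatch, with dtypes as strings and a hypothetical helper name:

```python
# Hypothetical sketch of the policy above: integer outputs get zero
# float gradients, never None; float outputs pass gradients through.
def makevector_grads(input_dtypes, out_dtype, upstream_grads):
    if 'int' in out_dtype:
        # no meaningful gradient; return float zeros rather than None
        return [0.0 for _ in input_dtypes]
    return list(upstream_grads)
```

This keeps gradient.grad's type checks happy: the gradient is always floating point, and the gradient through an integer is zero.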
def R_op(self, inputs, eval_points):
@@ -1914,6 +1911,8 @@ def local_subtensor_of_alloc(node):
nw_val = val[tuple(val_slices)]
nw_dims += dims[len(slices):]
+if nw_val.ndim > len(nw_dims):
+return False
rval = T.alloc(nw_val, *nw_dims)
if type(rval) not in (list, tuple):
rval = [rval]
@@ -136,7 +136,7 @@ class RandomStreams(Component, raw_random.RandomStreamsBase):
"""
-def __init__(self, seed=None, no_warn = False):
+def __init__(self, seed=None, no_warn=False):
""":type seed: None or int
:param seed: a default seed to initialize the RandomState
@@ -146,7 +146,7 @@ class RandomStreams(Component, raw_random.RandomStreamsBase):
"""
if not no_warn:
deprecation_warning()
-super(RandomStreams, self).__init__(no_warn = True)
+super(RandomStreams, self).__init__(no_warn=True)
self.random_state_variables = []
self.default_instance_seed = seed
@@ -164,7 +164,6 @@ class RandomStreams(Component, raw_random.RandomStreamsBase):
def build(self, mode, memo):
"""override `Component.build` """
if self not in memo:
-print 'creating RandomStreamsInstance'
memo[self] = RandomStreamsInstance(self, memo,
self.default_instance_seed)
return memo[self]
This source diff could not be displayed because it is too large. You can view the blob instead.
Diff is collapsed.