Commit 52057806 authored by Yann N. Dauphin

merge

@@ -20,7 +20,9 @@ since 2007. But it is also approachable enough to be used in the classroom
 News
 ====
-* Theano 0.6rc3 was released. Everybody is encouraged to update.
+* Ian Goodfellow did a `12h class with exercises on Theano <https://github.com/goodfeli/theano_exercises>`_.
+* Theano 0.6 was released. Everybody is encouraged to update.
 * New technical report on Theano: `Theano: new features and speed improvements <http://arxiv.org/abs/1211.5590>`_.
   However, please keep citing the other paper below in scientific work involving Theano.
......
@@ -3,8 +3,8 @@
 Easy Installation of an optimized Theano on Ubuntu
 ==================================================
-These instructions were tested on Ubuntu 11.04, 11.10 and 12.04. You can
-probably do something similar on older releases.
+These instructions were tested on Ubuntu 11.04, 11.10, 12.04, 12.10, 13.04
+and 13.10. You can probably do something similar on older releases.
 .. note::
@@ -49,7 +49,7 @@ probably do something similar on older computer.
 Installation steps
 ~~~~~~~~~~~~~~~~~~
-Ubuntu 11.10/12.04/12.10/13.04:
+Ubuntu 11.10/12.04/12.10/13.04/13.10:
 1) ``sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git``
 2) ``sudo pip install Theano``
@@ -236,15 +236,4 @@ Test GPU configuration
 Ubuntu 12.10: default gcc version 4.7.2. gcc 4.4.7, 4.5.4 and 4.6.3 available.
+Ubuntu 13.10: default gcc version 4.8.1. gcc 4.4.7, 4.6.4 and 4.7.3 available.
@@ -607,6 +607,27 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
 have shape (2, 60).
.. function:: tile(x, reps, ndim=None)
Construct an array by repeating the input `x` according to the `reps`
pattern.

Tiles its input according to `reps`. The length of `reps` is the
number of dimensions of `x`, and it contains the number of times to
tile `x` in each dimension.
:see: `numpy.tile
<http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html>`_
documentation for examples.
:see: :func:`theano.tensor.extra_ops.repeat
<theano.tensor.extra_ops.repeat>`
:note: Currently, `reps` must be a constant, `x.ndim` and
`len(reps)` must be equal and, if specified, `ndim` must be
equal to both.
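The ``reps`` semantics mirror ``numpy.tile``, so the behaviour described above can be sketched in pure numpy (runnable without Theano):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
# reps has one entry per dimension of x: tile 2x along rows, 3x along columns
tiled = np.tile(x, (2, 3))
print(tiled.shape)  # (4, 6)
```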
Creating Tensor
===============
@@ -1542,6 +1563,86 @@ Gradient / Differentiation
 :rtype: variable or list of variables (matching `wrt`)
 :returns: gradients of the cost with respect to each of the `wrt` terms
.. function:: subgraph_grad(wrt, end, start=None, cost=None, details=False)
With respect to `wrt`, computes gradients of cost and/or from existing
`start` gradients, up to the `end` variables of a symbolic digraph.
In other words, computes gradients for a subgraph of the
symbolic theano function. Ignores all disconnected inputs.
This can be useful when one needs to perform the gradient descent
iteratively (e.g. one layer at a time in an MLP), or when a particular
operation is not differentiable in theano (e.g. stochastic sampling
from a multinomial). In the latter case, the gradient of the
non-differentiable process could be approximated by a user-defined
formula, which could be calculated using the gradients of a cost
with respect to samples (0s and 1s). These gradients are obtained
by performing a subgraph_grad from the `cost` or previously known gradients
(`start`) up to the outputs of the stochastic process (`end`).
A dictionary mapping gradients obtained from the user-defined
differentiation of the process, to variables, could then be fed into
another subgraph_grad as `start` with any other `cost` (e.g. weight decay).
In an MLP, we could use subgraph_grad to iteratively backpropagate:
>>> import numpy as np
>>> import theano
>>> x, t = theano.tensor.fvector('x'), theano.tensor.fvector('t')
>>> w1 = theano.shared(np.random.randn(3,4))
>>> w2 = theano.shared(np.random.randn(4,2))
>>> a1 = theano.tensor.tanh(theano.tensor.dot(x,w1))
>>> a2 = theano.tensor.tanh(theano.tensor.dot(a1,w2))
>>> cost2 = theano.tensor.sqr(a2 - t).sum()
>>> cost2 += theano.tensor.sqr(w2.sum())
>>> cost1 = theano.tensor.sqr(w1.sum())
>>> params = [[w2],[w1]]
>>> costs = [cost2,cost1]
>>> grad_ends = [[a1], [x]]
>>> next_grad = None
>>> param_grads = []
>>> for i in xrange(2):
...     param_grad, next_grad = theano.subgraph_grad(
...         wrt=params[i], end=grad_ends[i],
...         start=next_grad, cost=costs[i]
...     )
...     next_grad = dict(zip(grad_ends[i], next_grad))
...     param_grads.extend(param_grad)
:type wrt: list of variables
:param wrt: Gradients are computed with respect to `wrt`.
:type end: list of variables
:param end: Theano variables at which to end gradient descent (they are
    considered constant in theano.grad). For convenience, the gradients
    with respect to these variables are also returned.
:type start: dictionary of variables
:param start: If not None, a dictionary mapping variables to their
    gradients. This is useful when the gradient of some variables is
    known. These are used to compute the gradients backwards up to the
    variables in `end` (they are used as known_grads in theano.grad).
:type cost: scalar (0-dimensional) variable
:param cost: Additional costs for which to compute the gradients. For
    example, these could be weight decay, an l1 constraint, MSE, NLL,
    etc. May optionally be None if `start` is provided.

    .. warning:: If the gradient of `cost` with respect to any of the
        `start` variables is already part of the `start` dictionary,
        then it may be counted twice with respect to `wrt` and `end`.

:type details: bool
:param details: When True, additionally returns the lists of gradients
    from `start` and of `cost`, respectively, with respect to `wrt`
    (not `end`).
:rtype: tuple of 2 or 4 lists of variables
:return: Lists of gradients with respect to `wrt` and `end`,
    respectively.
.. _R_op_list:
......
@@ -24,6 +24,246 @@ Scan
 The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
**Scan Example: Computing tanh(x(t).dot(W) + b) elementwise**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# defining the tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
results, updates = theano.scan(lambda v:T.tanh(T.dot(v,W)+b_sym), sequences=X)
compute_elementwise = theano.function(inputs = [X, W, b_sym], outputs=[results])
# test values
x = np.eye(2)
w = np.ones((2,2))
b = np.ones((2))
b[1] = 2
print compute_elementwise(x, w, b)[0]
# comparison with numpy
print np.tanh(x.dot(w) + b)
**Scan Example: Computing the sequence x(t) = tanh(x(t-1).dot(W) + y(t).dot(U) + p(T-t).dot(V))**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.vector("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
U = T.matrix("U")
Y = T.matrix("Y")
V = T.matrix("V")
P = T.matrix("P")
results, updates = theano.scan(
    lambda y, p, x_tm1: T.tanh(T.dot(x_tm1, W) + T.dot(y, U) + T.dot(p, V)),
    sequences=[Y, P[::-1]], outputs_info=[X])
compute_seq = theano.function(inputs = [X, W, Y, U, P, V], outputs=[results])
# test values
x = np.zeros((2))
x[1] = 1
w = np.ones((2,2))
y = np.ones((5,2))
y[0,:] = -3
u = np.ones((2,2))
p = np.ones((5,2))
p[0,:] = 3
v = np.ones((2,2))
print compute_seq(x,w,y,u,p,v)[0]
# comparison with numpy
x_res = np.zeros((5,2))
x_res[0] = np.tanh(x.dot(w) + y[0].dot(u) + p[4].dot(v))
for i in range(1,5):
    x_res[i] = np.tanh(x_res[i-1].dot(w) + y[i].dot(u) + p[4-i].dot(v))
print x_res
**Scan Example: Computing norms of lines of X**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda x_i:T.sqrt((x_i**2).sum()), sequences=[X])
compute_norm_lines = theano.function(inputs = [X], outputs=[results])
# test value
x = np.diag(np.arange(1,6),1)
print compute_norm_lines(x)[0]
# comparison with numpy
print np.sqrt((x**2).sum(1))
**Scan Example: Computing norms of columns of X**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda x_i:T.sqrt((x_i**2).sum()), sequences=[X.T])
compute_norm_cols = theano.function(inputs = [X], outputs=[results])
# test value
x = np.diag(np.arange(1,6),1)
print compute_norm_cols(x)[0]
# comparison with numpy
print np.sqrt((x**2).sum(0))
**Scan Example: Computing trace of X**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
floatX = "float32"
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda i, j, t_f:T.cast(X[i,j]+t_f, floatX), \
sequences=[T.arange(X.shape[0]), T.arange(X.shape[1])], \
outputs_info=np.asarray(0., dtype=floatX))
result = results[-1]
compute_trace = theano.function(inputs = [X], outputs=[result])
# test value
x = np.eye(5)
x[0] = np.arange(5)
print compute_trace(x)[0]
# comparison with numpy
print np.diagonal(x).sum()
**Scan Example: Computing the sequence x(t) = x(t-2).dot(U) + x(t-1).dot(V) + tanh(x(t-1).dot(W) + b)**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
U = T.matrix("U")
V = T.matrix("V")
n_sym = T.iscalar("n_sym")
results, updates = theano.scan(lambda x_tm2,x_tm1:T.dot(x_tm2,U) + T.dot(x_tm1,V) \
+ T.tanh(T.dot(x_tm1,W) + b_sym), \
n_steps=n_sym, outputs_info=[dict(initial = X, taps = [-2,-1])])
compute_seq2 = theano.function(inputs = [X, U, V, W, b_sym, n_sym], outputs=[results])
# test values
x = np.zeros((2,2)) # the initial value must be able to return x[-2]
x[1,1] = 1
w = 0.5*np.ones((2,2))
u = 0.5*(np.ones((2,2))-np.eye(2))
v = 0.5*np.ones((2,2))
n = 10
b = np.ones((2))
print compute_seq2(x,u,v,w,b,n)
# comparison with numpy
x_res = np.zeros((10,2))
x_res[0] = x[0].dot(u) + x[1].dot(v) + np.tanh(x[1].dot(w) + b)
x_res[1] = x[1].dot(u) + x_res[0].dot(v) + np.tanh(x_res[0].dot(w) + b)
x_res[2] = x_res[0].dot(u) + x_res[1].dot(v) \
    + np.tanh(x_res[1].dot(w) + b)
for i in range(2,10):
    x_res[i] = (x_res[i-2].dot(u) + x_res[i-1].dot(v) \
        + np.tanh(x_res[i-1].dot(w) + b))
print x_res
**Scan Example: Computing the Jacobian of y = tanh(v.dot(A)) wrt x**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
v = T.vector()
A = T.matrix()
y = T.tanh(T.dot(v,A))
results, updates = theano.scan(lambda i:T.grad(y[i], v), sequences = [T.arange(y.shape[0])])
compute_jac_t = theano.function([A,v], [results], allow_input_downcast = True) # shape (d_out, d_in)
# test values
x = np.eye(5)[0]
w = np.eye(5,3)
w[2] = np.ones((3))
print compute_jac_t(w,x)[0]
# compare with numpy
print ((1 - np.tanh(x.dot(w))**2)*w).T
Note that we need to iterate over the indices of ``y`` and not over the elements of ``y``. The reason is that scan creates a placeholder variable for its internal function, and this placeholder variable does not have the same dependencies as the variables that will replace it.
**Scan Example: Accumulate number of loop during a scan**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define shared variables
k = theano.shared(0)
n_sym = T.iscalar("n_sym")
results, updates = theano.scan(lambda:{k:(k+1)}, n_steps=n_sym)
accumulator = theano.function([n_sym], [], updates=updates, allow_input_downcast = True)
k.get_value()
accumulator(5)
k.get_value()
**Scan Example: Computing tanh(v.dot(W) + b)*d where b is binomial**
.. code-block:: python
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
# define shared random stream
trng = T.shared_randomstreams.RandomStreams(1234)
d=trng.binomial(size=W[1].shape)
results, updates = theano.scan(lambda v:T.tanh(T.dot(v,W)+b_sym)*d, sequences=X)
compute_with_bnoise = theano.function(inputs = [X, W, b_sym], outputs=[results], \
updates=updates, allow_input_downcast = True)
x = np.eye(10,2)
w = np.ones((2,2))
b = np.ones((2))
print compute_with_bnoise(x, w, b)
Note that if you want to use a random variable ``d`` that will not be updated through scan loops, you should pass this variable as a ``non_sequences`` argument.
**Scan Example: Computing pow(A,k)**
.. code-block:: python
......
@@ -79,7 +79,7 @@ from theano.updates import Updates, OrderedUpdates
 #we don't import by default as we don't want to force having scipy installed.
 #import sparse
-from theano.gradient import Rop, Lop, grad
+from theano.gradient import Rop, Lop, grad, subgraph_grad
 if config.device.startswith('gpu') or config.init_gpu_device.startswith('gpu'):
     import theano.sandbox.cuda
@@ -1077,6 +1077,7 @@ class FunctionMaker(object):
         self.mode = mode
         self.accept_inplace = accept_inplace
         self.function_builder = function_builder
+        self.on_unused_input = on_unused_input  # Used only for the pickling
         self.required = [(i.value is None) for i in self.inputs]
         self.refeed = [
@@ -1215,6 +1216,7 @@ def _pickle_FunctionMaker(self):
         accept_inplace=self.accept_inplace,
         function_builder=self.function_builder,
         profile=self.profile,
+        on_unused_input=self.on_unused_input,
     )
     return (_constructor_FunctionMaker, (kwargs,))
......
@@ -507,13 +507,22 @@ class ProfileStats(object):
         print >> file, header_str
-        atimes = [(
-            t * 100 / local_time,
-            t,
-            a,
-            a.fgraph.toposort().index(a),
-            self.apply_callcount[a])
-            for a, t in self.apply_time.items()]
+        topos = {}  # Only do the topo once per fct.
+        atimes = []
+        for a, t in self.apply_time.items():
+            if a.fgraph not in topos:
+                topo = a.fgraph.toposort()
+                topos[a.fgraph] = topo
+            else:
+                topo = topos[a.fgraph]
+            atimes.append((
+                t * 100 / local_time,
+                t,
+                a,
+                topo.index(a),
+                self.apply_callcount[a]))
+        del topos
         atimes.sort()
         atimes.reverse()
         tot = 0
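The hunk above is a plain memoization pattern: compute ``toposort()`` once per fgraph and reuse it for every apply node. A self-contained sketch of the same idea, using a stand-in graph class (not Theano's):

```python
class FakeGraph(object):
    """Stand-in for an fgraph whose toposort() is expensive."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.sort_calls = 0

    def toposort(self):
        self.sort_calls += 1
        return list(self.nodes)

g = FakeGraph(['add', 'mul', 'sum'])
applies = ['mul', 'sum', 'add']  # pretend each apply node belongs to g

topos = {}  # one cached toposort per graph
positions = []
for a in applies:
    if g not in topos:
        topos[g] = g.toposort()
    positions.append(topos[g].index(a))

print(positions)     # [1, 2, 0]
print(g.sort_calls)  # 1: toposort ran once instead of once per apply node
```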
......
@@ -117,19 +117,10 @@ AddConfigVar('mode',
 enum = EnumStr("g++", "")
 # Test whether or not g++ is present: disable C code if it is not.
-# Using the dummy file descriptor below is a workaround for a crash experienced
-# in an unusual Python 2.4.4 Windows environment with the default stdin=None.
-dummy_stdin = open(os.devnull)
 try:
-    try:
-        rc = call_subprocess_Popen(['g++', '-v'], stdout=subprocess.PIPE,
-                                   stderr=subprocess.PIPE,
-                                   stdin=dummy_stdin).wait()
-    except OSError:
-        rc = 1
-finally:
-    dummy_stdin.close()
-    del dummy_stdin
+    rc = call_subprocess_Popen(['g++', '-v'])
+except OSError:
+    rc = 1
 if rc == 0:
     # Keep the default linker the same as the one for the mode FAST_RUN
     AddConfigVar('linker',
......
@@ -57,7 +57,10 @@ from theano.gof.link import \
 from theano.gof.op import \
     Op, OpenMPOp, PureOp, ops_with_inner_function
-from theano.gof.opt import (Optimizer, optimizer, SeqOptimizer,
+from theano.gof.opt import (
+    Optimizer,
+    optimizer, inplace_optimizer,
+    SeqOptimizer,
     MergeOptimizer, MergeOptMerge,
     LocalOptimizer, local_optimizer, LocalOptGroup,
     OpSub, OpRemove, PatternSub,
......
@@ -29,7 +29,8 @@ from theano.compat.six import b, BytesIO, StringIO
 from theano.gof.utils import flatten
 from theano.configparser import config
 from theano.gof.cc import hash_from_code
-from theano.misc.windows import call_subprocess_Popen
+from theano.misc.windows import (subprocess_Popen, call_subprocess_Popen,
+                                 output_subprocess_Popen)
 # we will abuse the lockfile mechanism when reading and writing the registry
 from theano.gof import compilelock
@@ -1438,8 +1439,12 @@ def get_gcc_shared_library_arg():
 def std_include_dirs():
-    return (numpy.distutils.misc_util.get_numpy_include_dirs()
-            + [distutils.sysconfig.get_python_inc()])
+    numpy_inc_dirs = numpy.distutils.misc_util.get_numpy_include_dirs()
+    py_inc = distutils.sysconfig.get_python_inc()
+    py_plat_spec_inc = distutils.sysconfig.get_python_inc(plat_specific=True)
+    python_inc_dirs = ([py_inc] if py_inc == py_plat_spec_inc
+                       else [py_inc, py_plat_spec_inc])
+    return numpy_inc_dirs + python_inc_dirs

 def std_lib_dirs_and_libs():
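The ``std_include_dirs`` change adds the platform-specific Python include directory only when it differs from the generic one. ``distutils`` is deprecated on modern Pythons; the same logic expressed with the stdlib ``sysconfig`` module (an analogue for illustration, not what Theano used) looks like:

```python
import sysconfig

py_inc = sysconfig.get_path('include')           # generic Python headers
py_plat_inc = sysconfig.get_path('platinclude')  # platform-specific headers

# Keep both only when they differ, mirroring the diff above.
python_inc_dirs = ([py_inc] if py_inc == py_plat_inc
                   else [py_inc, py_plat_inc])
print(python_inc_dirs)
```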
@@ -1512,11 +1517,8 @@ def gcc_llvm():
         pass
     p = None
     try:
-        p = call_subprocess_Popen(['g++', '--version'],
-                                  stdout=subprocess.PIPE,
-                                  stderr=subprocess.PIPE)
-        p.wait()
-        output = p.stdout.read() + p.stderr.read()
+        p_out = output_subprocess_Popen(['g++', '--version'])
+        output = p_out[0] + p_out[1]
     except OSError:
         # Typically means g++ cannot be found.
         # So it is not an llvm compiler.
@@ -1569,11 +1571,11 @@ class GCC_compiler(object):
         GCC_compiler.march_flags = []

         def get_lines(cmd, parse=True):
-            p = call_subprocess_Popen(cmd,
-                                      stdout=subprocess.PIPE,
-                                      stderr=subprocess.PIPE,
-                                      stdin=subprocess.PIPE,
-                                      shell=True)
+            p = subprocess_Popen(cmd,
+                                 stdout=subprocess.PIPE,
+                                 stderr=subprocess.PIPE,
+                                 stdin=subprocess.PIPE,
+                                 shell=True)
             # For mingw64 with GCC >= 4.7, passing os.devnull
             # as stdin (which is the default) results in the process
             # waiting forever without returning. For that reason,
@@ -1713,7 +1715,7 @@ class GCC_compiler(object):
                 continue
             mj, mn, patch = [int(vp) for vp in version]
             if (((mj, mn) == (4, 6) and patch < 4) or
-                ((mj, mn) == (4, 7) and patch < 3) or
+                ((mj, mn) == (4, 7) and patch <= 3) or
                 ((mj, mn) == (4, 8) and patch < 1)):
                 new_flags[i] = p.rstrip('-avx')
@@ -1811,21 +1813,15 @@ class GCC_compiler(object):
             os.write(fd, src_code)
             os.close(fd)
             fd = None
-            proc = call_subprocess_Popen(
-                ['g++', path, '-o', exe_path] + flags,
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode != 0:
+            p_ret = call_subprocess_Popen(
+                ['g++', path, '-o', exe_path] + flags)
+            if p_ret != 0:
                 compilation_ok = False
             elif try_run:
                 # Try to execute the program
                 try:
-                    proc = call_subprocess_Popen([exe_path],
-                                                 stdout=subprocess.PIPE,
-                                                 stderr=subprocess.PIPE)
-                    proc.wait()
-                    run_ok = (proc.returncode == 0)
+                    p_ret = call_subprocess_Popen([exe_path])
+                    run_ok = (p_ret == 0)
                 finally:
                     os.remove(exe_path)
         finally:
@@ -1958,14 +1954,14 @@ class GCC_compiler(object):
             print >> sys.stderr, ' '.join(cmd)
         try:
-            p = call_subprocess_Popen(cmd, stderr=subprocess.PIPE)
-            compile_stderr = decode(p.communicate()[1])
+            p_out = output_subprocess_Popen(cmd)
+            compile_stderr = decode(p_out[1])
         except Exception:
             # An exception can occur e.g. if `g++` is not found.
             print_command_line_error()
             raise
-        status = p.returncode
+        status = p_out[2]
         if status:
             print '==============================='
......
@@ -16,27 +16,17 @@ import numpy
 import theano
 from theano.configparser import config, AddConfigVar, ConfigParam, StrParam
 from theano.gof.utils import flatten
-from theano.misc.windows import call_subprocess_Popen
+from theano.misc.windows import output_subprocess_Popen
 _logger = logging.getLogger("theano.gof.compiledir")
-# Using the dummy file descriptors below is a workaround for a crash
-# experienced in an unusual Python 2.4.4 Windows environment with the default
-# None values.
-dummy_err = open(os.devnull, 'w')
-p = None
 try:
-    p = call_subprocess_Popen(['g++', '-dumpversion'],
-                              stdout=subprocess.PIPE,
-                              stderr=dummy_err.fileno())
-    p.wait()
-    gcc_version_str = p.stdout.readline().strip().decode()
+    p_out = output_subprocess_Popen(['g++', '-dumpversion'])
+    gcc_version_str = p_out[0].strip().decode()
 except OSError:
     # Typically means gcc cannot be found.
     gcc_version_str = 'GCC_NOT_FOUND'
-del p
-del dummy_err

 def local_bitwidth():
......
@@ -165,8 +165,12 @@ def lock(tmp_dir, timeout=120, min_wait=5, max_wait=10, verbosity=1):
     my_pid = os.getpid()
     no_display = (verbosity == 0)
-    # Acquire lock.
     nb_error = 0
+    # Number of times we have slept without an error. Used to skip the
+    # message on the first wait, so it is displayed less frequently
+    # (and generates fewer notification emails about it).
+    nb_wait = 0
+    # Acquire lock.
     while True:
         try:
             last_owner = 'no_owner'
@@ -214,7 +218,7 @@ def lock(tmp_dir, timeout=120, min_wait=5, max_wait=10, verbosity=1):
                 last_owner = read_owner
                 time_start = time.time()
                 no_display = (verbosity == 0)
-            if not no_display:
+            if not no_display and nb_wait > 0:
                 if read_owner == 'failure':
                     msg = 'unknown process'
                 else:
@@ -225,6 +229,7 @@ def lock(tmp_dir, timeout=120, min_wait=5, max_wait=10, verbosity=1):
                     tmp_dir)
                 if verbosity <= 1:
                     no_display = True
+            nb_wait += 1
             time.sleep(random.uniform(min_wait, max_wait))
         try:
......
Diff collapsed.
@@ -179,23 +179,33 @@ class Query(object):
 class EquilibriumDB(DB):
-    """ A set of potential optimizations which should be applied in an
+    """A set of potential optimizations which should be applied in an
     arbitrary order until equilibrium is reached.

     Canonicalize, Stabilize, and Specialize are all equilibrium optimizations.

+    :param ignore_newtrees: If False, also apply local optimizations to the
+        new nodes introduced while applying local optimizations. This can
+        mean fewer fgraph iterations, but it is not necessarily faster
+        globally.

     .. note::
-        We can put LocalOptimizer and Optimizer as EquilibriumOptimizer
-        suppor both.
+        We can put both LocalOptimizer and Optimizer here, as
+        EquilibriumOptimizer supports both.
     """
+    def __init__(self, ignore_newtrees=True):
+        super(EquilibriumDB, self).__init__()
+        self.ignore_newtrees = ignore_newtrees

     def query(self, *tags, **kwtags):
         opts = super(EquilibriumDB, self).query(*tags, **kwtags)
-        return opt.EquilibriumOptimizer(opts,
-            max_use_ratio=config.optdb.max_use_ratio,
-            failure_callback=opt.NavigatorOptimizer.warn_inplace)
+        return opt.EquilibriumOptimizer(
+            opts,
+            max_use_ratio=config.optdb.max_use_ratio,
+            ignore_newtrees=self.ignore_newtrees,
+            failure_callback=opt.NavigatorOptimizer.warn_inplace)

 class SequenceDB(DB):
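An EquilibriumOptimizer keeps applying its rewrites until a full pass changes nothing. A toy, Theano-free sketch of that fixed-point loop (the string rewrite rules here are purely illustrative), with a `max_use_ratio`-style guard against non-terminating rule sets:

```python
def run_to_equilibrium(expr, rules, max_iters=100):
    """Apply rewrite rules until a full pass changes nothing."""
    for _ in range(max_iters):
        changed = False
        for rule in rules:
            new = rule(expr)
            if new is not None and new != expr:
                expr = new
                changed = True
        if not changed:
            return expr  # equilibrium reached
    raise RuntimeError("no equilibrium within max_iters")

# toy rewrites on strings: drop "*1" and "+0" factors
rules = [
    lambda e: e.replace("*1", "") if "*1" in e else None,
    lambda e: e.replace("+0", "") if "+0" in e else None,
]
print(run_to_equilibrium("a*1+0*1", rules))  # "a"
```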
......
...@@ -544,6 +544,109 @@ def grad(cost, wrt, consider_constant=None, ...@@ -544,6 +544,109 @@ def grad(cost, wrt, consider_constant=None,
rval, = rval rval, = rval
return rval return rval
def subgraph_grad(wrt, end, start=None, cost=None, details=False):
'''
With respect to `wrt`, computes gradients of cost and/or from existing
`start` gradients, up to the `end` variables of a symbolic digraph.
In other words, computes gradients for a subgraph of the
symbolic theano function. Ignores all disconnected inputs.
This can be useful when one needs to perform the gradient descent
iteratively (e.g. one layer at a time in an MLP), or when a particular
operation is not differentiable in theano (e.g. stochastic sampling
from a multinomial). In the latter case, the gradient of the
non-differentiable process could be approximated by a user-defined
formula, which could be calculated using the gradients of a cost
with respect to samples (0s and 1s). These gradients are obtained
by performing a subgraph_grad from the `cost` or previously known gradients
(`start`) up to the outputs of the stochastic process (`end`).
A dictionary mapping gradients obtained from the user-defined
differentiation of the process, to variables, could then be fed into
another subgraph_grad as `start` with any other `cost` (e.g. weight decay).
:type wrt: list of variables
:param wrt: Gradients are computed with respect to `wrt`.
:type end: list of variables
:param end: Theano variables at which to end gradient descent (they are
    considered constant in theano.grad). For convenience, the gradients
    with respect to these variables are also returned.
:type start: dictionary of variables
:param start: If not None, a dictionary mapping variables to their
    gradients. This is useful when the gradient of some variables is
    known. These are used to compute the gradients backwards up to the
    variables in `end` (they are used as known_grads in theano.grad).
:type cost: scalar (0-dimensional) variable
:param cost: Additional costs for which to compute the gradients. For
    example, these could be weight decay, an l1 constraint, MSE, NLL,
    etc. May optionally be None if `start` is provided.

    .. warning:: If the gradient of `cost` with respect to any of the
        `start` variables is already part of the `start` dictionary,
        then it may be counted twice with respect to `wrt` and `end`.

:type details: bool
:param details: When True, additionally returns the lists of gradients
    from `start` and of `cost`, respectively, with respect to `wrt`
    (not `end`).
:rtype: tuple of 2 or 4 lists of variables
:return: Lists of gradients with respect to `wrt` and `end`,
    respectively.
'''
assert ((cost is not None) or (start is not None))
assert isinstance(end, list)
assert isinstance(wrt, list)
if start is not None:
assert isinstance(start, dict)
params = list(set(wrt + end))
start_grads = None
cost_grads = None
if start is not None:
start_grads = list(
theano.grad(
cost=None, wrt=params, known_grads=start,
consider_constant=end,
disconnected_inputs='ignore'
)
)
if cost is not None:
cost_grads = list(
theano.grad(
cost=cost, wrt=params,
consider_constant=end,
disconnected_inputs='ignore'
)
)
grads = None
if start is None:
grads = cost_grads
else:
grads = start_grads
if cost_grads is not None:
for i in range(len(grads)):
grads[i] += cost_grads[i]
pgrads = OrderedDict(zip(params, grads))
# separate wrt from end grads:
wrt_grads = list(pgrads[k] for k in wrt)
end_grads = list(pgrads[k] for k in end)
if details:
return wrt_grads, end_grads, start_grads, cost_grads
return wrt_grads, end_grads
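The two-stage chain rule that this function implements can be checked numerically. A scalar sketch in plain Python (not Theano): first take the gradient of the cost with respect to an intermediate variable (the `end` role), then feed it back as the known `start` gradient to reach an earlier parameter:

```python
import math

x, w1, w2 = 0.5, 1.5, -2.0
h = math.tanh(w1 * x)          # intermediate variable (plays the 'end' role)
cost = (w2 * h) ** 2

# stage 1: gradients of cost wrt w2 (a 'wrt' term) and wrt h (an 'end' term)
dcost_dout = 2 * w2 * h
g_w2 = dcost_dout * h
g_h = dcost_dout * w2          # the gradient returned for `end`

# stage 2: g_h becomes the known 'start' gradient for the earlier subgraph
g_w1 = g_h * (1.0 - h ** 2) * x

# check against differentiating the whole expression in one shot
g_w1_direct = 2 * (w2 * h) * w2 * (1.0 - h ** 2) * x
print(abs(g_w1 - g_w1_direct) < 1e-12)  # True
```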
def _node_to_pattern(node):
    """ given an apply node, obtain its connection pattern
......
@@ -203,6 +203,7 @@ if __name__ == "__main__":
 cuda version 5.5 5.0 4.2 4.1 4.0 3.2 3.0 # note
 gpu
+K6000/NOECC 0.06s
 K20m/ECC 0.07s
 K20/NOECC 0.07s
 M2090 0.19s
......
@@ -2,9 +2,11 @@ import os
 import subprocess

-def call_subprocess_Popen(command, **params):
+def subprocess_Popen(command, **params):
     """
-    Utility function to work around windows behavior that open windows
+    Utility function to work around windows behavior that open windows.

+    :see: call_subprocess_Popen and output_subprocess_Popen
     """
     startupinfo = None
     if os.name == 'nt':
@@ -36,3 +38,40 @@ def call_subprocess_Popen(command, **params):
     if stdin is not None:
         del stdin
     return proc
def call_subprocess_Popen(command, **params):
    """
    Calls subprocess_Popen and discards the output, returning only the
    exit code.
    """
    if 'stdout' in params or 'stderr' in params:
        raise TypeError("don't use stderr or stdout with call_subprocess_Popen")
    null = open(os.devnull, 'wb')
    # stdin to devnull is a workaround for a crash in a weird Windows
    # environment where sys.stdin was None
    params.setdefault('stdin', null)
    params['stdout'] = null
    params['stderr'] = null
    p = subprocess_Popen(command, **params)
    p.wait()
    return p.returncode
def output_subprocess_Popen(command, **params):
"""
Calls subprocess_Popen, returning the output, error and exit code
in a tuple.
"""
if 'stdout' in params or 'stderr' in params:
raise TypeError("don't use stderr or stdout with output_subprocess_Popen")
# stdin to devnull is a workaround for a crash in a weird Windows
    # environment where sys.stdin was None
    if 'stdin' not in params:
        null = open(os.devnull, 'wb')
        params['stdin'] = null
params['stdout'] = subprocess.PIPE
params['stderr'] = subprocess.PIPE
p = subprocess_Popen(command, **params)
# we need to use communicate to make sure we don't deadlock around
    # the stdout/stderr pipes.
out = p.communicate()
return out + (p.returncode,)
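In short, `subprocess_Popen` is the raw wrapper, `call_subprocess_Popen` discards output and returns only the exit code, and `output_subprocess_Popen` captures everything. A minimal stand-alone sketch of the capturing variant (simplified; not Theano's exact implementation, and without the Windows `startupinfo` handling):

```python
import os
import subprocess
import sys

def output_subprocess_Popen(command, **params):
    # Simplified stand-in for the helper above: feed stdin from devnull
    # (the sys.stdin-is-None workaround), capture stdout/stderr, and
    # return (out, err, returncode).
    with open(os.devnull, 'rb') as null:
        params.setdefault('stdin', null)
        params['stdout'] = subprocess.PIPE
        params['stderr'] = subprocess.PIPE
        p = subprocess.Popen(command, **params)
        out, err = p.communicate()  # communicate() avoids pipe deadlocks
    return out, err, p.returncode

out, err, code = output_subprocess_Popen([sys.executable, '-c', "print('hi')"])
assert code == 0 and out.strip() == b'hi'
```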
@@ -296,38 +296,15 @@ class GpuDimShuffle(GpuOp):
    def __init__(self, input_broadcastable, new_order):
        input_broadcastable = tuple(input_broadcastable)
        self.input_broadcastable = input_broadcastable
+       new_order = tuple(new_order)
        self.new_order = new_order
-       # list of dimensions of the input to drop
-       self.drop = []
-       # this maps i before dropping dimensions to j after dropping
-       # dimensions so self.shuffle can be set properly later on
-       i2j = {}
-       j = 0
        for i, b in enumerate(input_broadcastable):
            if i not in new_order:
-               # we want to drop this dimension because it's not a
-               # value in new_order
-               if b == 1:  # 1 aka True
-                   self.drop.append(i)
-               else:
+               if not b:
                    # we cannot drop non-broadcastable dimensions
                    raise ValueError("You cannot drop a non-broadcastable"
                                     " dimension.",
                                     (input_broadcastable, new_order))
-           else:
-               i2j[i] = j
-               j += 1
-       # transposition of non-broadcastable dimensions. This is how
-       # the dimensions will be permuted, without accounting for the
-       # extra 'x' broadcastable dimensions to insert.
-       self.shuffle = [i2j[x] for x in new_order if x != 'x']
-       # list of dimensions of the output that are broadcastable and
-       # were not in the original input
-       self.augment = [i for i, x in enumerate(new_order) if x == 'x']
        self.view_map = {0: [0]}
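The bookkeeping this hunk deletes from `GpuDimShuffle.__init__` (`drop`, `shuffle`, `augment`) is worth seeing in isolation. A plain-Python re-derivation of the same pattern computation (illustrative only, not Theano's API):

```python
def dimshuffle_pattern(input_broadcastable, new_order):
    # Which input dims are dropped, how the survivors are permuted, and
    # where new broadcastable ('x') dims are inserted.
    drop = []
    for i, b in enumerate(input_broadcastable):
        if i not in new_order:
            if not b:
                # we cannot drop non-broadcastable dimensions
                raise ValueError("You cannot drop a non-broadcastable"
                                 " dimension.")
            drop.append(i)
    kept = [i for i in range(len(input_broadcastable)) if i not in drop]
    i2j = {i: j for j, i in enumerate(kept)}
    shuffle = [i2j[x] for x in new_order if x != 'x']
    augment = [i for i, x in enumerate(new_order) if x == 'x']
    return drop, shuffle, augment

# Drop the broadcastable leading dim of a (1, n) row vector:
assert dimshuffle_pattern((True, False), (1,)) == ([0], [0], [])
# Transpose and prepend a fresh broadcastable dim:
assert dimshuffle_pattern((True, False), ('x', 1, 0)) == ([], [1, 0], [0])
```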
@@ -481,8 +458,6 @@ class GpuDimShuffle(GpuOp):
        print self
        print "IN BROAD", self.input_broadcastable
        print "NEW ORDER", self.new_order
-       print "SHUFFLE", self.shuffle
-       print "AUGMENT", self.augment
        print '------------'
        print ''
        print sio.getvalue()
@@ -1198,7 +1173,11 @@ class GpuCAReduce(GpuOp):
                n_threads.z += 1;
            else
                break;
-       }""" % locals()
+       }
+       //Maximum for Fermi GPUs on that dimension.
+       n_threads.z = std::min(n_threads.z, (unsigned)64);
+       """ % locals()

        if len(self.reduce_mask) == 2:
            threads_y = ''
@@ -1509,6 +1488,8 @@ class GpuCAReduce(GpuOp):
            n_threads.z += 1;
        }
        n_threads.z -= 1;
+       //Maximum for Fermi GPUs on that dimension.
+       n_threads.z = std::min(n_threads.z, (unsigned)64);
        dim3 n_blocks(1,1,1);
        %(makecall)s
@@ -1605,7 +1586,7 @@ class GpuCAReduce(GpuOp):
        """ % locals()

    def c_code_cache_version_apply(self, node):
-       version = [8]  # the version corresponding to the c code in this Op
+       version = [9]  # the version corresponding to the c code in this Op
        # now we insert versions for the ops on which we depend...
        scalar_node = Apply(self.scalar_op,
@@ -3192,13 +3173,27 @@ class GpuAlloc(GpuOp):
                # If the output is a constant, it will have to be deepcopied
                # each time the function is called. So we do not fold.
                return False
-           elif (not isinstance(client[0], basestring)
-                 and isinstance(client[0].op, (
-                     tensor.IncSubtensor,
-                     tensor.AdvancedIncSubtensor1,
-                     GpuIncSubtensor,
-                     GpuAdvancedIncSubtensor1
-                     ))):
+           elif (# The following ops work inplace on their input id 0.
+                 client[1] == 0 and
+                 isinstance(client[0].op, (
+                     # Ops that will work inplace on the Alloc. So if they
+                     # get constant_folded, they would copy the
+                     # constant and this is less efficient.
+                     # Not doing the constant folding could also lower
+                     # the peak memory usage, as the "constant" won't
+                     # always exist.
+                     #theano.tensor.subtensor.AdvancedIncSubtensor,
+                     GpuIncSubtensor,
+                     GpuAdvancedIncSubtensor1,
+                     theano.sandbox.cuda.blas.GpuGemm,
+                     theano.sandbox.cuda.blas.GpuGemv,
+                     theano.sandbox.cuda.blas.GpuGer,
+                     ))):
+               return False
+           # If the client is a transfer, we don't want to fold. We
+           # let the moving opt finish before deciding what to do.
+           elif isinstance(client[0].op, HostFromGpu):
                return False
            return True
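The folding rule above boils down to: keep the Alloc symbolic whenever any client would deep-copy it (a function output), overwrite it in place (input 0 of an inplace op), or transfer it off the GPU. A toy sketch of that decision, with plain strings standing in for the real Theano op classes (names are illustrative only):

```python
# Ops assumed (for this sketch) to overwrite their input 0 in place.
INPLACE_ON_INPUT0 = {"IncSubtensor", "Gemm", "Gemv", "Ger"}

def should_constant_fold(clients):
    """clients: list of (op_name_or_'output', input_index) pairs."""
    for op, idx in clients:
        if op == "output":
            return False  # would be deepcopied on every function call
        if idx == 0 and op in INPLACE_ON_INPUT0:
            return False  # the client overwrites its input 0 in place
        if op == "HostFromGpu":
            return False  # let the transfer-moving opts run first
    return True

assert should_constant_fold([("Elemwise", 1)])
assert not should_constant_fold([("Gemm", 0)])
assert should_constant_fold([("Gemm", 1)])  # read-only use is fine
```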
......
@@ -5093,7 +5093,7 @@ int fprint_CudaNdarray(FILE * fd, const CudaNdarray *self)
int CudaNdarray_prep_output(CudaNdarray ** arr, int nd,
-                           const int * dims)
+                           const int * dims, int fortran)
{
    bool allocated = false;
    if (*arr == NULL)
@@ -5105,7 +5105,7 @@ int CudaNdarray_prep_output(CudaNdarray ** arr, int nd,
        allocated = true;
    }
-   if (CudaNdarray_alloc_contiguous(*arr, nd, dims))
+   if (CudaNdarray_alloc_contiguous(*arr, nd, dims, fortran))
    {
        if (allocated)
        {
......
@@ -160,6 +160,12 @@ CudaNdarray_CheckExact(const PyObject * ob);
DllExport bool
CudaNdarray_is_c_contiguous(const CudaNdarray * self);

+/**
+ * Return true for a F-contiguous CudaNdarray, else false
+ */
+DllExport bool
+CudaNdarray_is_f_contiguous(const CudaNdarray * self);

/****
 * Returns the number of elements necessary in host_structure and dev_structure for a given number of dimensions.
 */
@@ -326,10 +332,13 @@ CudaNdarray_set_nd(CudaNdarray * self, const int nd)
 * Allocate storage space for a tensor of rank 'nd' and given dimensions.
 * (No-op if self already has a contiguous tensor of the right dimensions)
 *
+ * If fortran is non-zero, a fortran order is made, otherwise it is a c order.
+ *
 * Note: CudaNdarray_alloc_contiguous is templated to work for both int dimensions and npy_intp dimensions
 */
template<typename inttype>
-static int CudaNdarray_alloc_contiguous(CudaNdarray *self, const int nd, const inttype * dim)
+static int CudaNdarray_alloc_contiguous(CudaNdarray *self, const int nd,
+                                        const inttype * dim, int fortran=0)
{
    // allocate an empty ndarray with c_contiguous access
    // return 0 on success
@@ -342,11 +351,23 @@ static int CudaNdarray_alloc_contiguous(CudaNdarray *self, const int nd, const i
    {
        return -1;
    }
-   for (int i = nd-1; i >= 0; --i)
+   if (fortran)
+   {
+       for (int i = 0; i < nd; i++)
+       {
+           CudaNdarray_set_stride(self, i, (dim[i] == 1) ? 0 : size);
+           CudaNdarray_set_dim(self, i, dim[i]);
+           size = size * dim[i];
+       }
+   }
+   else
    {
-       CudaNdarray_set_stride(self, i, (dim[i] == 1) ? 0 : size);
-       CudaNdarray_set_dim(self, i, dim[i]);
-       size = size * dim[i];
+       for (int i = nd-1; i >= 0; --i)
+       {
+           CudaNdarray_set_stride(self, i, (dim[i] == 1) ? 0 : size);
+           CudaNdarray_set_dim(self, i, dim[i]);
+           size = size * dim[i];
+       }
    }
    // If the allocated buffer is already of the right size, we don't need to
@@ -497,6 +518,27 @@ CudaNdarray_is_c_contiguous(const CudaNdarray * self)
    return c_contiguous;
}
/**
* True iff the strides look like [1, dim[0], dim[0]*dim[1], ...]
*/
DllExport inline bool ALWAYS_INLINE
CudaNdarray_is_f_contiguous(const CudaNdarray * self)
{
bool f_contiguous = true;
int size = 1;
for (int i = 0; (i < self->nd) && f_contiguous; i++)
{
if (CudaNdarray_HOST_DIMS(self)[i] == 1)
continue;
if (CudaNdarray_HOST_STRIDES(self)[i] != size)
{
f_contiguous = false;
}
size = size * CudaNdarray_HOST_DIMS(self)[i];
}
return f_contiguous;
}
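The stride layouts built by `CudaNdarray_alloc_contiguous` and checked by `CudaNdarray_is_f_contiguous` mirror NumPy's C/F orders, with the twist that size-1 (broadcastable) dimensions get stride 0 and are skipped by the contiguity check. A plain-Python sketch of both routines (strides in element counts, not bytes):

```python
def alloc_strides(dims, fortran=False):
    # Mirrors CudaNdarray_alloc_contiguous: C order walks dims right-to-left,
    # F order left-to-right; size-1 dims get stride 0.
    strides = [0] * len(dims)
    size = 1
    order = range(len(dims)) if fortran else reversed(range(len(dims)))
    for i in order:
        strides[i] = 0 if dims[i] == 1 else size
        size *= dims[i]
    return strides

def is_f_contiguous(dims, strides):
    # Mirrors CudaNdarray_is_f_contiguous: strides must look like
    # [1, dims[0], dims[0]*dims[1], ...], ignoring size-1 dims.
    size = 1
    for d, s in zip(dims, strides):
        if d == 1:
            continue
        if s != size:
            return False
        size *= d
    return True

assert alloc_strides((2, 3, 4)) == [12, 4, 1]               # C order
assert alloc_strides((2, 3, 4), fortran=True) == [1, 2, 6]  # F order
assert is_f_contiguous((2, 3, 4), [1, 2, 6])
assert not is_f_contiguous((2, 3, 4), [12, 4, 1])
```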
DllExport PyObject * CudaNdarray_IS_C_Contiguous(CudaNdarray * self);

DllExport int CudaNdarray_gemm(float alpha, const CudaNdarray * A, const CudaNdarray * B, float beta, CudaNdarray * C);
@@ -525,8 +567,9 @@ DllExport int CudaNdarray_inplace_elemwise(PyObject* py_self, PyObject * py_othe
// *arr may initially be NULL, a pointer to an ndarray of the wrong size,
// or a pointer to an ndarray of the right size. In the last case it will
// not change.
+// If fortran is non-zero, a fortran order is expected/created
DllExport int CudaNdarray_prep_output(CudaNdarray ** arr, int nd,
-                                     const int * dims);
+                                     const int * dims, int fortran = 0);

DllExport inline const char* ALWAYS_INLINE cublasGetErrorString(cublasStatus err){
    if(CUBLAS_STATUS_SUCCESS == err)
......
@@ -16,7 +16,7 @@ from theano.gof.cmodule import (std_libs, std_lib_dirs,
                                std_include_dirs, dlimport,
                                get_lib_extension)
from theano.gof.python25 import any
-from theano.misc.windows import call_subprocess_Popen
+from theano.misc.windows import output_subprocess_Popen

_logger = logging.getLogger("theano.sandbox.cuda.nvcc_compiler")
_logger.setLevel(logging.WARN)
@@ -98,12 +98,8 @@ nvcc_version = None
def is_nvcc_available():
    """Return True iff the nvcc compiler is found."""
    def set_version():
-       p = call_subprocess_Popen([nvcc_path, '--version'],
-                                 stdout=subprocess.PIPE,
-                                 stderr=subprocess.PIPE)
-       p.wait()
-       ver_line = decode(p.stdout.readlines()[-1])
+       p_out = output_subprocess_Popen([nvcc_path, '--version'])
+       ver_line = decode(p_out[0]).strip().split('\n')[-1]
        build, version = ver_line.split(',')[1].strip().split()
        assert build == 'release'
......
@@ -109,11 +109,13 @@ def test_careduce():
            ((4100,4,3),[1,2]),((5,4100,3),[1,2]),((5,4,4100),[1,2]),#011
            #((4100,4,3),[0,2]),((5,4100,3),[0,2]),((5,4,4100),[0,2]),#101 ##not implemented
            ((4100,4,3),[0,1,2]),((5,4100,3),[0,1,2]),((5,4,4100),[0,1,2]),#111
+           ((65,4,3),[0,1,2]),((5,65,3),[0,1,2]),((5,4,65),[0,1,2]),#111
            ((4100,4,3,2),[2,3]),((4,4100,3,2),[2,3]),((4,3,4100,2),[2,3]),((4,3,2,4100),[2,3]),#0011
            ((4100,4,3,2),[1,3]),((4,4100,3,2),[1,3]),((4,3,4100,2),[1,3]),((4,3,2,4100),[1,3]),#0101
            ((4100,4,3,2),[0,2,3]),((4,4100,3,2),[0,2,3]),((4,3,4100,2),[0,2,3]),#((4,3,2,4100),[0,2,3]),#1011
            ((4100,4,3,2),[1,2,3]),((4,4100,3,2),[1,2,3]),((4,3,4100,2),[1,2,3]),((4,3,2,4100),[1,2,3]),#0111
+           ((65,4,3,2),[1,2,3]),((4,65,3,2),[1,2,3]),((4,3,65,2),[1,2,3]),((4,3,2,65),[1,2,3]),#0111
            ((4100,2,3,4),[0,1,2,3]),((2,4100,3,4),[0,1,2,3]),((2,3,4100,4),[0,1,2,3]),((2,3,4,4100),[0,1,2,3]),((128,1,3,3),[0,1,2,3]),#1111
......
+import operator
import sys

import numpy
@@ -213,20 +214,29 @@ def test_huge_elemwise_fusion():
    """
    shape = (2, 3, 4, 5, 6)
    ttype = tensor.tensor(dtype='float32', broadcastable=(False,) * len(shape))
-   vars = [tensor.tanh(ttype) for x in range(7)]
-   f = pfunc(vars, [vars[0] - vars[1] - vars[2] - vars[3] - vars[4] -
-                    vars[5] - vars[6]], mode=mode_with_gpu)
+   gpu_ptr_size = theano.sandbox.cuda.opt.get_device_type_sizes()['gpu_ptr_size']
+   if gpu_ptr_size == 8:
+       nb_in = 7
+       len_topo = 10
+   elif gpu_ptr_size == 4:
+       nb_in = 8
+       len_topo = 11
+   else:
+       raise Exception("Unexpected value for gpu_ptr_size", gpu_ptr_size)
+   vars = [tensor.tanh(ttype) for x in range(nb_in)]
+   f = pfunc(vars, [reduce(operator.sub, vars)], mode=mode_with_gpu)
    topo = f.maker.fgraph.toposort()
    #theano.printing.debugprint(f)
    #for i, node in enumerate(topo):
    #    print >> sys.stdout, i, node
-   assert len(topo) == 10
+   assert len(topo) == len_topo
    assert sum([isinstance(node.op, cuda.GpuElemwise) for node in topo]) == 2
-   assert isinstance(topo[7].op.scalar_op, theano.scalar.basic.Sub)
-   assert isinstance(topo[8].op.scalar_op, theano.scalar.basic.Composite)
+   assert isinstance(topo[-3].op.scalar_op, theano.scalar.basic.Sub)
+   assert isinstance(topo[-2].op.scalar_op, theano.scalar.basic.Composite)
    #let debugmode catch errors
    gen = lambda: theano._asarray(numpy.random.rand(*shape), dtype='float32')
-   f(gen(), gen(), gen(), gen(), gen(), gen(), gen())
+   f(*[gen() for i in range(nb_in)])

    # Test the case where we can't put the computation on the gpu! There are
    # too many dimensions in the input to have 2 inputs to the op!
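The `reduce(operator.sub, vars)` rewrite above replaces the hand-written chain `vars[0] - vars[1] - ... - vars[6]` with a left fold over however many inputs the test builds; a quick stand-alone illustration:

```python
import operator
from functools import reduce  # a builtin in Python 2; imported on Python 3

vals = [100, 1, 2, 3]
# Left fold: ((100 - 1) - 2) - 3
assert reduce(operator.sub, vals) == 94
# A single element is returned unchanged.
assert reduce(operator.sub, [7]) == 7
```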
......
@@ -3,12 +3,12 @@ import os
import numpy

import theano
-from theano import Op, Type, Apply, Variable, Constant
+from theano import Op, Apply
from theano import tensor, scalar, config
from theano.scalar import Scalar
from theano.tensor.basic import Alloc
-from theano.gof.python25 import all, any
+from theano.gof.python25 import any
from theano.gof.utils import MethodNotDefined
from theano.compat import PY3
@@ -257,7 +257,7 @@ class GpuFromHost(Op):
    def R_op(self, inputs, eval_points):
        ev, = eval_points
-       if isintance(ev, GpuArrayType):
+       if isinstance(ev, GpuArrayType):
            return [host_from_gpu(ev)]
        else:
            return ev
@@ -317,7 +317,7 @@ class GpuFromCuda(Op):
    def R_op(self, inputs, eval_points):
        ev, = eval_points
-       if isintance(ev, GpuArrayType):
+       if isinstance(ev, GpuArrayType):
            return [cuda_from_gpu(ev)]
        else:
            return ev
@@ -651,6 +651,36 @@ class GpuAlloc(HideC, Alloc):
    def c_code_cache_version(self):
        return (2,)

+   def do_constant_folding(self, node):
+       for client in node.outputs[0].clients:
+           if client[0] == 'output':
+               # If the output is a constant, it will have to be deepcopied
+               # each time the function is called. So we do not fold.
+               return False
+           elif (# The following ops work inplace on their input id 0.
+                 client[1] == 0 and
+                 isinstance(client[0].op, (
+                     # Ops that will work inplace on the Alloc. So if they
+                     # get constant_folded, they would copy the
+                     # constant and this is less efficient.
+                     # Not doing the constant folding could also lower
+                     # the peak memory usage, as the "constant" won't
+                     # always exist.
+                     #theano.tensor.subtensor.AdvancedIncSubtensor,
+                     theano.sandbox.gpuarray.subtensor.GpuIncSubtensor,
+                     #theano.sandbox.gpuarray.subtensor.GpuAdvancedIncSubtensor1,
+                     theano.sandbox.gpuarray.blas.GpuGemm,
+                     theano.sandbox.gpuarray.blas.GpuGemv,
+                     #theano.sandbox.gpuarray.blas.GpuGer,  # not yet implemented
+                     ))):
+               return False
+           # If the client is a transfer, we don't want to fold. We
+           # let the moving opt finish before deciding what to do.
+           elif isinstance(client[0].op, HostFromGpu):
+               return False
+       return True

gpu_alloc = GpuAlloc()
......
@@ -200,13 +200,13 @@ from theano.gof import local_optimizer, LocalOptGroup
from theano.tensor.opt import in2out

-@local_optimizer([gpugemv_no_inplace])
+@local_optimizer([gpugemv_no_inplace], inplace=True)
def local_inplace_gpuagemv(node):
    if node.op == gpugemv_no_inplace:
        return [gpugemv_inplace(*node.inputs)]

-@local_optimizer([gpugemm_no_inplace])
+@local_optimizer([gpugemm_no_inplace], inplace=True)
def local_inplace_gpuagemm(node):
    if node.op == gpugemm_no_inplace:
        return [gpugemm_inplace(*node.inputs)]
......
@@ -1281,7 +1281,10 @@ class GpuCAReduceCuda(HideC, CAReduce):
                n_threads.z += 1;
            else
                break;
-       }""" % locals()
+       }
+       //Maximum for Fermi GPUs on that dimension.
+       n_threads.z = std::min(n_threads.z, (unsigned)64);
+       """ % locals()

        if len(self.reduce_mask) == 2:
            threads_y = ''
@@ -1601,6 +1604,8 @@ class GpuCAReduceCuda(HideC, CAReduce):
            n_threads.z += 1;
        }
        n_threads.z -= 1;
+       //Maximum for Fermi GPUs on that dimension.
+       n_threads.z = std::min(n_threads.z, (unsigned)64);
        dim3 n_blocks(1,1,1);
        %(makecall)s
@@ -1697,7 +1702,7 @@ class GpuCAReduceCuda(HideC, CAReduce):
        """ % locals()

    def c_code_cache_version_apply(self, node):
-       version = [8]  # the version corresponding to the c code in this Op
+       version = [9]  # the version corresponding to the c code in this Op
        # now we insert versions for the ops on which we depend...
        scalar_node = Apply(self.scalar_op,
......
@@ -341,17 +341,20 @@ def local_gpua_crossentropysoftmaxargmax1hotwithbias(node):
@op_lifter([tensor.nnet.CrossentropySoftmax1HotWithBiasDx])
def local_gpua_crossentropysoftmax1hotwithbiasdx(node):
    return GpuCrossentropySoftmax1HotWithBiasDx()

@register_opt()
@op_lifter([tensor.nnet.Softmax])
def local_gpua_softmax(node):
    return GpuSoftmax()

@register_opt()
@op_lifter([tensor.nnet.SoftmaxWithBias])
def local_gpua_softmaxwithbias(node):
    return GpuSoftmaxWithBias()

@register_opt()
@op_lifter([gpu_from_host, ConvOp])
def local_gpu_conv(node):
......
@@ -32,11 +32,13 @@ if not theano.sandbox.gpuarray.pygpu_activated:
from theano.sandbox.gpuarray.type import (GpuArrayType,
                                          gpuarray_shared_constructor)
-from theano.sandbox.gpuarray.basic_ops import (host_from_gpu, gpu_from_host,
-                                               gpu_alloc, gpu_from_cuda,
-                                               cuda_from_gpu, HostFromGpu,
-                                               GpuFromHost, GpuReshape,
-                                               GpuEye)
+from theano.sandbox.gpuarray.basic_ops import (
+    host_from_gpu, gpu_from_host,
+    gpu_alloc, GpuAlloc,
+    gpu_from_cuda,
+    cuda_from_gpu, HostFromGpu,
+    GpuFromHost, GpuReshape,
+    GpuEye)

from theano.tests import unittest_tools as utt
utt.seed_rng()
@@ -290,6 +292,13 @@ GpuAllocTester = makeTester(
)

class TestAlloc(theano.tensor.tests.test_basic.TestAlloc):
    dtype = "float32"
    mode = mode_with_gpu
    shared = staticmethod(gpuarray_shared_constructor)
    allocs = [GpuAlloc, GpuAlloc, T.Alloc]

def test_shape():
    x = GpuArrayType(dtype='float32', broadcastable=[False, False, False])()
    v = gpuarray.zeros((3, 4, 5), dtype='float32')
......
-import unittest

from theano import scalar, gof
-from theano.gof import FunctionGraph
from theano.gof.python25 import all, any
-from theano.tests.unittest_tools import SkipTest

from theano.tensor.tests.test_elemwise import (test_Broadcast, test_DimShuffle,
                                               test_CAReduce)
@@ -126,11 +122,13 @@ class test_GpuCAReduceCuda(test_GpuCAReduceCPY):
            ((4100,4,3),[1,2]),((5,4100,3),[1,2]),((5,4,4100),[1,2]),#011
            #((4100,4,3),[0,2]),((5,4100,3),[0,2]),((5,4,4100),[0,2]),#101 ##not implemented
            ((4100,4,3),[0,1,2]),((5,4100,3),[0,1,2]),((5,4,4100),[0,1,2]),#111
+           ((65,4,3),[0,1,2]),((5,65,3),[0,1,2]),((5,4,65),[0,1,2]),#111
            ((4100,4,3,2),[2,3]),((4,4100,3,2),[2,3]),((4,3,4100,2),[2,3]),((4,3,2,4100),[2,3]),#0011
            ((4100,4,3,2),[1,3]),((4,4100,3,2),[1,3]),((4,3,4100,2),[1,3]),((4,3,2,4100),[1,3]),#0101
            ((4100,4,3,2),[0,2,3]),((4,4100,3,2),[0,2,3]),((4,3,4100,2),[0,2,3]),#((4,3,2,4100),[0,2,3]),#1011
            ((4100,4,3,2),[1,2,3]),((4,4100,3,2),[1,2,3]),((4,3,4100,2),[1,2,3]),((4,3,2,4100),[1,2,3]),#0111
+           ((65,4,3,2),[1,2,3]),((4,65,3,2),[1,2,3]),((4,3,65,2),[1,2,3]),((4,3,2,65),[1,2,3]),#0111
            ((4100,2,3,4),[0,1,2,3]),((2,4100,3,4),[0,1,2,3]),((2,3,4100,4),[0,1,2,3]),((2,3,4,4100),[0,1,2,3]),((128,1,3,3),[0,1,2,3]),#1111
            #test pattern implemented by reshape
......
@@ -26,4 +26,6 @@ class G_subtensor(T_subtensor):
                dtype='float32',
                ignore_topo=(HostFromGpu, GpuFromHost,
                             DeepCopyOp))
+       # GPU opt can't run in fast_compile only.
+       self.fast_compile = False
        assert self.sub == GpuSubtensor
@@ -26,8 +26,10 @@ if cuda_available:
    from theano.sandbox.cuda import (CudaNdarrayType,
                                     float32_shared_constructor)

def matVecModM(A, s, m):
-   return numpy.int32(numpy.sum((numpy.int64(A)*s) % m, 1) % m)
+   assert A.dtype == 'int64'
+   return numpy.int32(numpy.sum((A*s) % m, 1) % m)

def multMatVect(v, A, m1, B, m2):
@@ -142,24 +144,30 @@ MASK2 = numpy.int32(65535)  #2^16 - 1
MULT2 = numpy.int32(21069)
NORM = 4.656612873077392578125e-10;  #1./2^31

-A1p0 = numpy.asarray([[0, 4194304, 129], [1, 0, 0], [0, 1, 0]])
-A2p0 = numpy.asarray([[32768, 0, 32769], [1, 0, 0], [0, 1, 0]])
+#A1p0 = numpy.asarray([[0, 4194304, 129], [1, 0, 0], [0, 1, 0]],
+#                     dtype='int64')
+#A2p0 = numpy.asarray([[32768, 0, 32769], [1, 0, 0], [0, 1, 0]],
+#                     dtype='int64')
A1p72 = numpy.asarray([[1516919229, 758510237, 499121365],
                       [1884998244, 1516919229, 335398200],
-                      [601897748, 1884998244, 358115744]])
+                      [601897748, 1884998244, 358115744]],
+                     dtype='int64')
A2p72 = numpy.asarray([[1228857673, 1496414766, 954677935],
                       [1133297478, 1407477216, 1496414766],
-                      [2002613992, 1639496704, 1407477216]])
+                      [2002613992, 1639496704, 1407477216]],
+                     dtype='int64')
A1p134 = numpy.asarray(
    [[1702500920, 1849582496, 1656874625],
     [828554832, 1702500920, 1512419905],
-    [1143731069, 828554832, 102237247]])
+    [1143731069, 828554832, 102237247]],
+   dtype='int64')
A2p134 = numpy.asarray(
    [[796789021, 1464208080, 607337906],
     [1241679051, 1431130166, 1464208080],
-    [1401213391, 1178684362, 1431130166]])
+    [1401213391, 1178684362, 1431130166]],
+   dtype='int64')

np_int32_vals = [numpy.int32(i) for i in (0, 7, 9, 15, 16, 22, 24)]
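`matVecModM` is the modular matrix-vector product at the heart of the MRG random-stream recurrence; the int64 dtype asserted above keeps the intermediate products from overflowing int32. A stand-alone sketch (helper renamed, and using `.astype` rather than the `numpy.int32(...)` call in the source):

```python
import numpy

def mat_vec_mod_m(A, s, m):
    # Row-wise dot product of A with s, with all arithmetic done mod m
    # in int64 so the intermediate products cannot overflow.
    assert A.dtype == 'int64'
    return (numpy.sum((A * s) % m, axis=1) % m).astype(numpy.int32)

A = numpy.asarray([[2, 0], [1, 3]], dtype='int64')
s = numpy.asarray([4, 5], dtype='int64')
# rows: (2*4 + 0*5) % 7 = 1, (1*4 + 3*5) % 7 = 5
assert mat_vec_mod_m(A, s, 7).tolist() == [1, 5]
```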
......
@@ -909,7 +909,22 @@ class UnaryScalarOp(ScalarOp):
                node.inputs[0].type != node.outputs[0].type):
            raise theano.gof.utils.MethodNotDefined()

-       dtype = node.inputs[0].dtype
+       dtype = node.inputs[0].type.dtype_specs()[1]
+       fct_call = self.c_code_contiguous_raw(dtype, 'n', 'x', 'z')
+       return """
+       {
+       npy_intp n = PyArray_SIZE(%(z)s);
+       %(dtype)s * x = (%(dtype)s*) PyArray_DATA(%(x)s);
+       %(dtype)s * z = (%(dtype)s*) PyArray_DATA(%(z)s);
+       %(fct_call)s;
+       }
+       """ % locals()
+
+   def c_code_contiguous_raw(self, dtype, n, i, o):
+       if not config.lib.amdlibm:
+           raise theano.gof.utils.MethodNotDefined()
+       if dtype.startswith('npy_'):
+           dtype = dtype[4:]
        if dtype == 'float32' and self.amd_float32 is not None:
            dtype = 'float'
            fct = self.amd_float32
@@ -918,12 +933,7 @@ class UnaryScalarOp(ScalarOp):
            fct = self.amd_float64
        else:
            raise theano.gof.utils.MethodNotDefined()
-       return """
-       npy_intp n = PyArray_SIZE(%(z)s);
-       %(dtype)s * x = (%(dtype)s*) PyArray_DATA(%(x)s);
-       %(dtype)s * z = (%(dtype)s*) PyArray_DATA(%(z)s);
-       %(fct)s(n, x, z);
-       """ % locals()
+       return "%(fct)s(%(n)s, %(i)s, %(o)s)" % locals()
class BinaryScalarOp(ScalarOp):
@@ -2964,7 +2974,40 @@ class Composite(ScalarOp):
        # We need to clone the graph as sometimes its nodes already
        # contain a reference to an fgraph. As we want the Composite
        # to be picklable, we can't have a reference to fgraph.
-       inputs, outputs = gof.graph.clone(inputs, outputs)
+
+       # Also, if there is a Composite in the inner graph, we want to
+       # remove it. In that case, we do a more complicated clone
+       # that will flatten the Composite. We don't need to do this
+       # recursively, as the way the fusion optimizer works, we have
+       # only 1 new Composite each time at the output.
+       if len(outputs) > 1 or not any([isinstance(var.owner.op, Composite)
+                                       for var in outputs]):
+           # No inner Composite
+           inputs, outputs = gof.graph.clone(inputs, outputs)
+       else:
+           # Inner Composite that we need to flatten
+           assert len(outputs) == 1
+           # 1. Create a new graph from inputs up to the
+           # Composite
+           res = theano.compile.rebuild_collect_shared(
+               inputs=inputs,
+               outputs=outputs[0].owner.inputs,
+               copy_inputs_over=False)  # Clone also the inputs
+           # 2. We continue this partial clone with the graph in
+           # the inner Composite
+           res2 = theano.compile.rebuild_collect_shared(
+               inputs=outputs[0].owner.op.inputs,
+               outputs=outputs[0].owner.op.outputs,
+               replace=dict(zip(outputs[0].owner.op.inputs, res[1]))
+           )
+           assert len(res2[1]) == len(outputs)
+           assert len(res[0]) == len(inputs)
+           assert res[0] != inputs
+           inputs, outputs = res[0], res2[1]
+           # The next assert is commented out just for speed.
+           #assert not any([isinstance(node.op, Composite) for node in
+           #                theano.gof.graph.ops(inputs, outputs)])
        self.inputs = copy(inputs)
        self.outputs = copy(outputs)
        self.inputs_type = tuple([input.type for input in inputs])
......
...@@ -68,19 +68,17 @@ class test_composite(unittest.TestCase): ...@@ -68,19 +68,17 @@ class test_composite(unittest.TestCase):
fn = gof.DualLinker().accept(g).make_function() fn = gof.DualLinker().accept(g).make_function()
assert fn(1.0, 2.0) == 1.5 assert fn(1.0, 2.0) == 1.5
# def test_sin(self):
# x = inputs()
# e = sin(x)
# C = Composite([x], [e])
# c = C.make_node(x)
# # print c.c_code(['x'], ['z'], dict(id = 0))
# g = FunctionGraph([x], [c.out])
# fn = gof.DualLinker().accept(g).make_function()
# assert fn(0) == 0
# assert fn(3.14159265358/2) == 1
# assert fn(3.14159265358) == 0
def test_flatten(self):
#Test that we flatten multiple Composite.
x, y, z = inputs()
C = Composite([x, y], [x + y])
CC = Composite([x, y], [C(x * y, y)])
assert not isinstance(CC.outputs[0].owner.op, Composite)
# Test with multiple outputs
CC = Composite([x, y, z], [C(x * y, y), C(x * z, y)])
#We don't flatten that case.
assert isinstance(CC.outputs[0].owner.op, Composite)
# WRITEME: Test for sin, pow, and other scalar ops.
def test_with_constants(self):
x, y, z = inputs()
......
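The `test_flatten` change above exercises the Composite-flattening clone described in the first hunk: an inner Composite at the output of the graph is inlined so only one flat Composite remains. As a rough illustration of the idea (a toy model, not Theano's API — `Node`, `Composite`, `substitute`, and `flatten` here are all hypothetical names):

```python
# Toy model of the flattening idea: a "composite" holds an inner
# expression graph, and a composite nested at the root of another one
# is inlined into a single flat graph (one level at a time, as in the
# fusion optimizer described above).
class Node:
    def __init__(self, op, inputs):
        self.op = op          # e.g. 'add', 'mul', or a Composite
        self.inputs = inputs  # child Nodes or placeholder names (str)

class Composite:
    def __init__(self, inputs, output):
        self.inputs = inputs  # placeholder names, e.g. ['x', 'y']
        self.output = output  # root Node of the inner graph

def substitute(node, mapping):
    """Clone `node`, replacing placeholder names via `mapping`."""
    if isinstance(node, str):
        return mapping[node]
    return Node(node.op, [substitute(i, mapping) for i in node.inputs])

def flatten(outer):
    """If the outer graph's root op is itself a Composite, inline it."""
    root = outer.output
    if isinstance(root.op, Composite):
        inner = root.op
        mapping = dict(zip(inner.inputs, root.inputs))
        return Composite(outer.inputs, substitute(inner.output, mapping))
    return outer

# C computes x + y; CC wraps C around (x * y, y), like the test above.
C = Composite(['x', 'y'], Node('add', ['x', 'y']))
CC = Composite(['x', 'y'], Node(C, [Node('mul', ['x', 'y']), 'y']))
flat = flatten(CC)
assert not isinstance(flat.output.op, Composite)  # inner Composite inlined
assert flat.output.op == 'add'
```

As in the real optimizer, only the root is flattened here; a graph with multiple Composite outputs would be left alone, which is what the second half of `test_flatten` asserts.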
Diff collapsed.
...@@ -173,12 +173,9 @@ SOMEPATH/Canopy_64bit/User/lib/python2.7/site-packages/numpy/distutils/system_in
warnings.warn('Specified path %s is invalid.' % d)
"""
#I'm not able to remove all printed stuff
with_context = warnings.catch_warnings(record=True)
with_context.__enter__()
try:
blas_info = numpy.distutils.system_info.get_info("blas_opt")
finally:
with_context.__exit__(None, None, None)
with warnings.catch_warnings(record=True):
numpy.distutils.system_info.system_info.verbosity = 0
blas_info = numpy.distutils.system_info.get_info("blas_opt")
# If we are in an EPD installation, mkl is available
if "EPD" in sys.version:
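The `blas_opt` hunk above replaces hand-driven `__enter__`/`__exit__` calls with a plain `with` statement. A minimal standard-library sketch showing the two forms are equivalent for recording warnings (`noisy` is a made-up stand-in for the numpy call):

```python
import warnings

def noisy():
    # Stand-in for numpy.distutils emitting a warning during probing.
    warnings.warn("Specified path /some/dir is invalid.")
    return 42

# Old style: drive the context manager by hand, with try/finally.
ctx = warnings.catch_warnings(record=True)
log_manual = ctx.__enter__()
try:
    warnings.simplefilter("always")  # make sure the warning is recorded
    result_manual = noisy()
finally:
    ctx.__exit__(None, None, None)

# New style: the with statement does the same enter/exit bookkeeping.
with warnings.catch_warnings(record=True) as log_with:
    warnings.simplefilter("always")
    result_with = noisy()

assert result_manual == result_with == 42
assert len(log_manual) == len(log_with) == 1
```

The `with` form is shorter and cannot forget the `__exit__` call on an exception path, which is exactly what the try/finally in the old code was guarding against.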
...@@ -1193,32 +1190,31 @@ def _beta_L_plus_alpha_M(beta, L, alpha, M, recurse_flip=True):
# it also might be the case that there is a dimshuffle between the +
# and the dot22. local_dot_to_dot22 in particular will put in such things.
if M.owner and isinstance(M.owner.op, T.DimShuffle):
MM = M.owner.inputs[0]
if tuple(M.owner.op.new_order) == (0,):
# it is making a column MM into a vector
if MM.owner and MM.owner.op == _dot22:
MMl, MMr = MM.owner.inputs
g = gemm_no_inplace(L.dimshuffle(0, 'x'),
alpha, MMl, MMr, beta)
rval = [g.dimshuffle(0)]
return rval, MM
if tuple(M.owner.op.new_order) == (1,):
# it is making a row MM into a vector
if MM.owner and MM.owner.op == _dot22:
MMl, MMr = MM.owner.inputs
g = gemm_no_inplace(L.dimshuffle('x', 0),
alpha, MMl, MMr, beta)
rval = [g.dimshuffle(1)]
return rval, MM
if tuple(M.owner.op.new_order) == ():
# it is making a row MM into a vector
if MM.owner and MM.owner.op == _dot22:
MMl, MMr = MM.owner.inputs
g = gemm_no_inplace(L.dimshuffle('x', 'x'),
alpha, MMl, MMr, beta)
rval = [g.dimshuffle()]
return rval, MM
if (M.owner and isinstance(M.owner.op, T.DimShuffle) and
M.owner.inputs[0].owner and
isinstance(M.owner.inputs[0].owner.op, Dot22)):
MM = M.owner.inputs[0]
if M.owner.op.new_order == (0,):
# it is making a column MM into a vector
MMl, MMr = MM.owner.inputs
g = gemm_no_inplace(L.dimshuffle(0, 'x'),
alpha, MMl, MMr, beta)
rval = [g.dimshuffle(0)]
return rval, MM
if M.owner.op.new_order == (1,):
# it is making a row MM into a vector
MMl, MMr = MM.owner.inputs
g = gemm_no_inplace(L.dimshuffle('x', 0),
alpha, MMl, MMr, beta)
rval = [g.dimshuffle(1)]
return rval, MM
if len(M.owner.op.new_order) == 0:
# it is making a row MM into a vector
MMl, MMr = MM.owner.inputs
g = gemm_no_inplace(L.dimshuffle('x', 'x'),
alpha, MMl, MMr, beta)
rval = [g.dimshuffle()]
return rval, MM
# this is False'd out because of inadequate testing.
# TODO see ticket #237
...@@ -1382,29 +1378,31 @@ def _gemm_from_factored_list(lst):
"""Returns None, or a list to replace node.outputs
"""
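The three `new_order` branches above correspond to dropping broadcastable axes of a matrix-shaped gemm result: `(0,)` turns a column into a vector, `(1,)` turns a row into a vector, and `()` turns a 1x1 matrix into a scalar. In numpy terms (an analogy only — numpy has no `dimshuffle`):

```python
import numpy as np

col = np.arange(3.0).reshape(3, 1)   # column matrix, shape (3, 1)
row = np.arange(3.0).reshape(1, 3)   # row matrix, shape (1, 3)
one = np.array([[7.0]])              # 1x1 matrix, shape (1, 1)

# dimshuffle (0,): keep axis 0, drop the length-1 axis 1 -> vector
vec_from_col = col.reshape(3)
# dimshuffle (1,): keep axis 1, drop the length-1 axis 0 -> vector
vec_from_row = row.reshape(3)
# dimshuffle (): drop both length-1 axes -> 0-d scalar
scalar = one.reshape(())

assert vec_from_col.shape == (3,)
assert vec_from_row.shape == (3,)
assert scalar.shape == ()
```

The optimizer goes the other way: it computes the gemm on the 2-d operands and then applies the matching `dimshuffle` to restore the vector or scalar shape the original expression had.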
# Make every pair in list have matching dtypes
# sM can be a tuple of 2 elements or a theano variable.
# We should not use __len__ as theano variables don't support
# it. I don't want to change this to isinstance(sM, tuple)
# as I'm not able to make a test that triggers this case.
def is_pair(sM):
try:
s, M = sM
return True
except Exception:
return False
lst2 = []
# Remove the tuple that can't be cast correctly.
# This can happen when we try to cast a complex to a real
for sM in lst:
if is_pair(sM):
# Make every pair in list have matching dtypes
# sM can be a tuple of 2 elements or a theano variable.
if isinstance(sM, tuple):
sm0, sm1 = sM
sm0 = T.as_tensor_variable(sm0)
if theano.scalar.upcast(sm0.dtype, sm1.dtype) == sm1.dtype:
lst2.append((T.cast(sm0, sm1.dtype), sM[1]))
lst = lst2
def item_to_var(t):
try:
s, M = t
except Exception:
return t
if s == 1:
return M
if s == -1:
return -M
return s * M
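`item_to_var` above folds a `(scale, value)` pair into a single value, using try/except unpacking because a Theano variable cannot be unpacked into two items (the same duck-typed trick `is_pair` uses). A plain-Python demonstration of the same function on ordinary numbers:

```python
def item_to_var(t):
    """Turn a (scale, value) pair into a plain value; pass others through."""
    try:
        s, M = t
    except Exception:
        return t          # not a pair: already a bare value
    if s == 1:
        return M          # scale 1 folds away
    if s == -1:
        return -M         # scale -1 becomes a negation
    return s * M          # general scale multiplies

assert item_to_var((1, 5.0)) == 5.0
assert item_to_var((-1, 5.0)) == -5.0
assert item_to_var((3, 2.0)) == 6.0
assert item_to_var(4.0) == 4.0   # non-pair passes through unchanged
```

Avoiding the `*M` multiplication for scales 1 and -1 matters in the real optimizer: it keeps the graph free of useless `mul` nodes that would block further gemm fusion.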
# Try every pair in the sM_list, trying to turn it into a gemm operation
for i in xrange(len(lst) - 1):
s_i, M_i = lst[i]
...@@ -1421,16 +1419,6 @@ def _gemm_from_factored_list(lst):
s_j, M_j)
#print 'GOT IT', gemm_of_sM_list
if gemm_of_sM_list:
def item_to_var(t):
try:
s, M = t
except Exception:
return t
if s == 1:
return M
if s == -1:
return -M
return s * M
assert len(gemm_of_sM_list) == 1
add_inputs = [item_to_var(input)
...@@ -1715,20 +1703,19 @@ def local_dot_to_dot22(node):
_logger.info('Not optimizing dot with inputs %s %s %s %s',
x, y, x.type, y.type)
@local_optimizer([gemm_no_inplace])
@local_optimizer([gemm_no_inplace], inplace=True)
def local_inplace_gemm(node):
if node.op == gemm_no_inplace:
return [gemm_inplace(*node.inputs)]
@local_optimizer([gemv_no_inplace])
@local_optimizer([gemv_no_inplace], inplace=True)
def local_inplace_gemv(node):
if node.op == gemv_no_inplace:
return [gemv_inplace(*node.inputs)]
@local_optimizer([ger])
@local_optimizer([ger], inplace=True)
def local_inplace_ger(node):
if node.op == ger:
return [ger_destructive(*node.inputs)]
......
...@@ -774,8 +774,7 @@ class Elemwise(OpenMPOp):
super(Elemwise, self).perform(node, inputs, output_storage)
maxsize = max(len(input.shape) for input in inputs)
for dims in izip(*[([(1, True)] * (maxsize - len(input.shape))
+ zip(input.shape, sinput.type.broadcastable))
for dims in izip(*[zip(input.shape, sinput.type.broadcastable)
for input, sinput in zip(inputs, node.inputs)]):
if max(d for d, b in dims) != 1 and (1, False) in dims:
# yes there may be more compact ways to write this code,
...@@ -808,34 +807,36 @@ class Elemwise(OpenMPOp):
out_shape.append(max(values))
out_shape = tuple(out_shape)
if not self.inplace_pattern:
for output, storage in izip(node.outputs, output_storage):
odat = storage[0]
if odat is not None:
if odat.shape != out_shape:
# It is unsafe to try to resize odat,
# we have to allocate output storage.
odat = None
if odat is None:
odat = numpy.ndarray(out_shape, dtype=output.type.dtype)
storage[0] = odat
else:
for i, (output, storage) in enumerate(
izip(node.outputs, output_storage)):
#i is an output idx
if i in self.inplace_pattern:
odat = inputs[self.inplace_pattern[i]]
else:
odat = storage[0]
if odat is not None:
if odat.shape != out_shape:
# It is unsafe to try to resize odat,
# we have to allocate output storage.
odat = None
if odat is None:
odat = numpy.ndarray(out_shape,
dtype=output.type.dtype)
storage[0] = odat
# Commented as we don't reuse outputs now.
#
# if not self.inplace_pattern:
# for output, storage in izip(node.outputs, output_storage):
# odat = storage[0]
# if odat is not None:
# if odat.shape != out_shape:
# # It is unsafe to try to resize odat,
# # we have to allocate output storage.
# odat = None
# if odat is None:
# odat = numpy.ndarray(out_shape, dtype=output.type.dtype)
# storage[0] = odat
# else:
# for i, (output, storage) in enumerate(
# izip(node.outputs, output_storage)):
# #i is an output idx
# if i in self.inplace_pattern:
# odat = inputs[self.inplace_pattern[i]]
# else:
# odat = storage[0]
# if odat is not None:
# if odat.shape != out_shape:
# # It is unsafe to try to resize odat,
# # we have to allocate output storage.
# odat = None
# if odat is None:
# odat = numpy.ndarray(out_shape,
# dtype=output.type.dtype)
# storage[0] = odat
ufunc_args = inputs # + output_storage
if self.nfunc and len(inputs) == self.nfunc_spec[1]:
...@@ -860,26 +861,25 @@ class Elemwise(OpenMPOp):
if nout == 1:
variables = [variables]
for variable, storage, nout in izip(variables, output_storage,
node.outputs):
if str(getattr(variable, "dtype", "")) == 'object':
# Since numpy 1.6, function created with numpy.frompyfunc
# always return an ndarray with dtype object
variable = numpy.asarray(variable, dtype=nout.dtype)
# The storage has been resized earlier.
if hasattr(variable, 'shape'):
assert storage[0].shape == variable.shape
else:
# If variable has not shape, then it is a scalar.
assert numpy.prod(storage[0].shape) == 1
storage[0][...] = variable
assert str(storage[0].dtype) != 'object'
# the following should be used instead of the previous loop,
# unfortunately it tends to segfault
# self.ufunc(*(ufunc_args+[s[0] for s in output_storage]))
i = 0
for variable, storage, nout in izip(variables, output_storage,
node.outputs):
if getattr(variable, "dtype", "") == 'object':
# Since numpy 1.6, function created with numpy.frompyfunc
# always return an ndarray with dtype object
variable = numpy.asarray(variable, dtype=nout.dtype)
if i in self.inplace_pattern:
odat = inputs[self.inplace_pattern[i]]
odat[...] = variable
storage[0] = odat
# Sometimes NumPy return a Python type.
elif not isinstance(variable, numpy.ndarray):
variable = numpy.asarray(variable, nout.dtype)
storage[0] = variable
else:
storage[0] = variable
i += 1
def infer_shape(self, node, i_shapes):
rval = []
......
...@@ -571,6 +571,8 @@ def repeat(x, repeats, axis=None):
:param axis: int, optional.
:see: :func:`tensor.tile <tensor.tile>`
.. versionadded:: 0.6
"""
return RepeatOp(axis=axis)(x, repeats)
......
...@@ -95,7 +95,7 @@ class SoftmaxWithBias(gof.Op):
return ['<iostream>', '<cmath>']
@staticmethod
def c_code_template():
def c_code_template(dtype):
# this implementation was lifted from
# /u/bergstrj/cvs/bergstrj/src/feb07/nn.cxx
...@@ -107,6 +107,10 @@ class SoftmaxWithBias(gof.Op):
#TODO: use this to accept float32 and int32: node.inputs[0].type.dtype_specs()[1]
init_decl = """
npy_intp* Nx = PyArray_DIMS(%(x)s);
npy_intp Sx = 0;
npy_intp Sb = 0;
npy_intp Ssm = 0;
if (PyArray_NDIM(%(x)s) != 2)
{
...@@ -151,6 +155,10 @@ class SoftmaxWithBias(gof.Op):
%(fail)s
}
}
Sx = PyArray_STRIDES(%(x)s)[1]/sizeof(dtype_%(x)s);
Sb = PyArray_STRIDES(%(b)s)[0]/sizeof(dtype_%(b)s);
Ssm = PyArray_STRIDES(%(sm)s)[1]/sizeof(dtype_%(sm)s);
""" """
begin_row_loop = """ begin_row_loop = """
...@@ -163,9 +171,7 @@ class SoftmaxWithBias(gof.Op):
const dtype_%(x)s* __restrict__ x_i = (dtype_%(x)s*)(PyArray_BYTES(%(x)s) + PyArray_STRIDES(%(x)s)[0] * i);
const dtype_%(b)s* __restrict__ b_i = (dtype_%(b)s*)(PyArray_BYTES(%(b)s));
dtype_%(sm)s* __restrict__ sm_i = (dtype_%(sm)s*)(PyArray_BYTES(%(sm)s) + PyArray_STRIDES(%(sm)s)[0] * i);
"""
inside_row_loop = """
npy_intp Sx = PyArray_STRIDES(%(x)s)[1]/sizeof(dtype_%(x)s);
npy_intp Sb = PyArray_STRIDES(%(b)s)[0]/sizeof(dtype_%(b)s);
npy_intp Ssm = PyArray_STRIDES(%(sm)s)[1]/sizeof(dtype_%(sm)s);
...@@ -182,6 +188,9 @@ class SoftmaxWithBias(gof.Op):
row_max = (row_ij > row_max) ? row_ij : row_max;
}
"""
inside_row_loop = """
for (j = 0; j < Nx[1]; ++j)
{
dtype_%(sm)s row_ij = x_i[j * Sx] + b_i[j * Sb];
...@@ -201,6 +210,42 @@ class SoftmaxWithBias(gof.Op):
"""
# Get the vectorized version of exp if it exists
try:
vec_exp = theano.scalar.exp.c_code_contiguous_raw(dtype,
"Nx[1]", "sm_i", "sm_i")
inside_row_loop_contig = """
for (j = 0; j < Nx[1]; ++j)
{
dtype_%%(sm)s row_ij = x_i[j * Sx] + b_i[j * Sb];
//std::cout << "2 " << j << " " << row_ij << " " << row_max << "\\n";
dtype_%%(sm)s sm_ij = row_ij - row_max;
//std::cout << "3 " << j << " " << sm_ij << "\\n";
sm_i[j * Ssm] = sm_ij;
}
%(vec_exp)s;
for (j = 0; j < Nx[1]; ++j)
{
sum += sm_i[j * Ssm];
}
//cblas_dscal(x.N, 1.0 / sum, &mat_at(s,i,0), s.n);
double sum_inv = 1.0 / sum;
for (j = 0; j < Nx[1]; ++j)
{
sm_i[j * Ssm] *= sum_inv;
}
""" % locals()
inside_row_loop = """
if(Ssm == 1){
%(inside_row_loop_contig)s
}else{
%(inside_row_loop)s
}
""" % locals()
except theano.gof.utils.MethodNotDefined:
pass
end_row_loop = """
}
"""
...@@ -210,12 +255,13 @@ class SoftmaxWithBias(gof.Op):
def c_code(self, node, name, inp, out, sub):
x, b = inp
sm, = out
code_template = ''.join(self.c_code_template())
code_template = ''.join(self.c_code_template(
node.inputs[0].type.dtype_specs()[1]))
return code_template % dict(locals(), **sub)
@staticmethod
def c_code_cache_version():
return (6,)
return (8,)
softmax_with_bias = SoftmaxWithBias()
...@@ -384,7 +430,7 @@ class Softmax(gof.Op):
return ['<iostream>', '<cmath>']
@staticmethod
def c_code_template():
def c_code_template(dtype):
# this implementation was lifted from
# /u/bergstrj/cvs/bergstrj/src/feb07/nn.cxx
...@@ -396,6 +442,8 @@ class Softmax(gof.Op):
#TODO: use this to accept float32 and int32: node.inputs[0].type.dtype_specs()[1]
init_decl = """
npy_intp* Nx = PyArray_DIMS(%(x)s);
npy_intp Sx1 = 0;
npy_intp Ssm1 = 0;
if (PyArray_NDIM(%(x)s) != 2)
{
...@@ -413,7 +461,7 @@ class Softmax(gof.Op):
|| (PyArray_DIMS(%(sm)s)[0] != PyArray_DIMS(%(x)s)[0])
|| (PyArray_DIMS(%(sm)s)[1] != PyArray_DIMS(%(x)s)[1]))
{
if (NULL != %(sm)s) Py_XDECREF(%(sm)s);
Py_XDECREF(%(sm)s);
%(sm)s = (PyArrayObject*)PyArray_SimpleNew(2, PyArray_DIMS(%(x)s),
type_num_%(x)s);
if(!%(sm)s) {
...@@ -422,6 +470,8 @@ class Softmax(gof.Op):
%(fail)s
}
}
Sx1 = PyArray_STRIDES(%(x)s)[1]/sizeof(dtype_%(x)s);
Ssm1 = PyArray_STRIDES(%(sm)s)[1]/sizeof(dtype_%(sm)s);
""" """
begin_row_loop = """ begin_row_loop = """
...@@ -433,11 +483,6 @@ class Softmax(gof.Op): ...@@ -433,11 +483,6 @@ class Softmax(gof.Op):
const dtype_%(x)s* __restrict__ x_i = (dtype_%(x)s*)(PyArray_BYTES(%(x)s) + PyArray_STRIDES(%(x)s)[0] * i); const dtype_%(x)s* __restrict__ x_i = (dtype_%(x)s*)(PyArray_BYTES(%(x)s) + PyArray_STRIDES(%(x)s)[0] * i);
dtype_%(sm) s* __restrict__ sm_i = (dtype_%(sm)s*)(PyArray_BYTES(%(sm)s) + PyArray_STRIDES(%(sm)s)[0] * i); dtype_%(sm) s* __restrict__ sm_i = (dtype_%(sm)s*)(PyArray_BYTES(%(sm)s) + PyArray_STRIDES(%(sm)s)[0] * i);
"""
inside_row_loop = """
npy_intp Sx = PyArray_STRIDES(%(x)s)[1]/sizeof(dtype_%(x)s);
npy_intp Ssm = PyArray_STRIDES(%(sm)s)[1]/sizeof(dtype_%(sm)s);
size_t row_max_j=0;
dtype_%(sm)s row_max = x_i[0];
...@@ -445,46 +490,82 @@ class Softmax(gof.Op):
// Get the maximum value of the row
for (j = 1; j < Nx[1]; ++j)
{
dtype_%(sm)s row_ij = x_i[j * Sx] ;
dtype_%(sm)s row_ij = x_i[j * Sx1] ;
//std::cout << "1 " << row_ij << "\\n";
row_max_j = (row_ij > row_max) ? j : row_max_j;
row_max = (row_ij > row_max) ? row_ij : row_max;
}
"""
inside_row_loop = """
for (j = 0; j < Nx[1]; ++j)
{
dtype_%(sm)s row_ij = x_i[j * Sx] ;
dtype_%(sm)s row_ij = x_i[j * Sx1] ;
//std::cout << "2 " << j << " " << row_ij << " " << row_max << "\\n";
dtype_%(sm)s sm_ij = exp(row_ij - row_max);
//std::cout << "3 " << j << " " << sm_ij << "\\n";
sum += sm_ij;
sm_i[j * Ssm] = sm_ij;
sm_i[j * Ssm1] = sm_ij;
}
//cblas_dscal(x.N, 1.0 / sum, &mat_at(s,i,0), s.n);
double sum_inv = 1.0 / sum;
for (j = 0; j < Nx[1]; ++j)
{
sm_i[j * Ssm] *= sum_inv;
sm_i[j * Ssm1] *= sum_inv;
}
"""
# Get the vectorized version of exp if it exists
try:
vec_exp = theano.scalar.exp.c_code_contiguous_raw(dtype,
"Nx[1]", "sm_i", "sm_i")
inside_row_loop_contig = """
for (j = 0; j < Nx[1]; ++j)
{
sm_i[j * Ssm1] = x_i[j * Sx1] - row_max;
}
%(vec_exp)s;
for (j = 0; j < Nx[1]; ++j)
{
sum += sm_i[j * Ssm1];
}
//cblas_dscal(x.N, 1.0 / sum, &mat_at(s,i,0), s.n);
double sum_inv = 1.0 / sum;
for (j = 0; j < Nx[1]; ++j)
{
sm_i[j * Ssm1] *= sum_inv;
}
""" % locals()
inside_row_loop = """
if(Ssm1 == 1){
%(inside_row_loop_contig)s
}else{
%(inside_row_loop)s
}
""" % locals()
except theano.gof.utils.MethodNotDefined:
pass
end_row_loop = """
}
"""
return (init_decl, begin_row_loop, inside_row_loop, end_row_loop)
def c_code(self, node, name, inp, out, sub):
x, = inp
sm, = out
code_template = ''.join(self.c_code_template())
code_template = ''.join(self.c_code_template(
node.inputs[0].type.dtype_specs()[1]))
return code_template % dict(locals(), **sub)
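The `c_code` methods above assemble the op's C source by joining `%`-style templates and filling the named placeholders from `dict(locals(), **sub)`. A minimal sketch of that mechanism (the variable names `input0`/`output0` are made up; the real names are supplied by Theano's compiler):

```python
# Two template fragments with named %(...)s placeholders, like
# init_decl / inside_row_loop in the op above.
init_decl = """
npy_intp* Nx = PyArray_DIMS(%(x)s);
"""
loop = """
for (j = 0; j < Nx[1]; ++j) { /* fill %(sm)s from %(x)s */ }
"""

x = "input0"    # hypothetical C variable names
sm = "output0"
sub = {"fail": "goto fail_label;"}  # extra substitutions, as in `sub`

# Join the fragments, then substitute locals plus `sub` in one pass.
code_template = ''.join((init_decl, loop))
code = code_template % dict(locals(), **sub)

assert "PyArray_DIMS(input0)" in code
assert "output0" in code
```

Splitting the template into named pieces is what lets `CrossentropySoftmaxArgmax1HotWithBias` below reuse `SoftmaxWithBias.c_code_template()` and splice its own code between the fragments.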
@staticmethod
def c_code_cache_version():
return (1,)
return (3,)
softmax = Softmax()
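The C templates above implement the usual numerically stable softmax: subtract the row maximum, exponentiate, sum, then scale by the inverse sum. The same computation in numpy form, mirroring the `row_max` / `exp` / `sum_inv` steps:

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with the max subtracted first, as in the C loops."""
    row_max = x.max(axis=1, keepdims=True)   # the row_max scan
    e = np.exp(x - row_max)                  # exp of shifted values
    return e / e.sum(axis=1, keepdims=True)  # multiply by sum_inv

x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])  # naive exp(1000) would overflow
sm = softmax_rows(x)

assert np.allclose(sm.sum(axis=1), 1.0)      # each row is a distribution
assert np.allclose(sm[1], [1.0 / 3] * 3)     # equal inputs -> uniform row
assert np.isfinite(sm).all()                 # no overflow thanks to the shift
```

Subtracting the row maximum changes nothing mathematically (it cancels in the normalization) but keeps every `exp` argument at or below zero, which is why the C code bothers with the extra max-finding pass.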
...@@ -863,7 +944,7 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
return ['<iostream>', '<cmath>']
@staticmethod
def c_code_template():
def c_code_template(dtype):
# this implementation was lifted from
# /u/bergstrj/cvs/bergstrj/src/feb07/nn.cxx
...@@ -874,7 +955,7 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
#TODO: use this to accept float32 and int32: node.inputs[0].type.dtype_specs()[1]
(init_decl, begin_row_loop, inside_row_loop, end_row_loop) = \
SoftmaxWithBias.c_code_template()
SoftmaxWithBias.c_code_template(dtype)
return (init_decl,
"""
if (PyArray_NDIM(%(y_idx)s) != 1)
...@@ -947,7 +1028,8 @@ class CrossentropySoftmaxArgmax1HotWithBias(gof.Op):
nll, sm, am = out
y_idx_type = node.inputs[2].type.dtype_specs()[1]
am_type = y_idx_type
code_template = ''.join(self.c_code_template())
dtype = node.inputs[0].type.dtype_specs()[1]
code_template = ''.join(self.c_code_template(dtype))
return code_template % dict(locals(), **sub)
......
Diff collapsed.
...@@ -1928,7 +1928,8 @@ class TestAlloc(unittest.TestCase):
#AdvancedIncSubtensor1
(some_matrix[arange(60)], 2),
#AdvancedIncSubtensor
(some_matrix[idx, idx], 1)]):
(some_matrix[idx, idx], 1)
]):
derp = sum(dot(subtensor, variables))
fobj = theano.function([some_vector], derp, mode=self.mode)
...@@ -1936,14 +1937,18 @@ class TestAlloc(unittest.TestCase):
fgrad = theano.function([some_vector], grad_derp,
mode=self.mode)
topo_obj = fobj.maker.fgraph.toposort()
#<= is needed as the GPU currently doesn't implement
#AdvancedIncSubtensor. Once it does, this can be
#replaced with ==.
assert numpy.sum([isinstance(node.op, alloc)
for node in topo_obj]) == 0
for node in topo_obj]) <= 1
topo_grad = fgrad.maker.fgraph.toposort()
#print subtensor
#theano.printing.debugprint(fgrad)
assert numpy.sum([isinstance(node.op, alloc)
for node in topo_grad]) == n_alloc
for node in topo_grad]) == n_alloc, (
alloc, subtensor, n_alloc, topo_grad)
fobj(test_params)
fgrad(test_params)
...@@ -6736,6 +6741,17 @@ class TestTensorInstanceMethods(unittest.TestCase):
# Test equivalent advanced indexing
assert_array_equal(X[:,indices].eval({X: x}), x[:,indices])
def test_cumsum(self):
X, _ = self.vars
x, _ = self.vals
assert_array_equal(X.cumsum().eval({X: x}), x.cumsum())
def test_cumprod(self):
X, _ = self.vars
x, _ = self.vals
assert_array_equal(X.cumprod().eval({X: x}), x.cumprod())
def test_norm():
x = theano.tensor.vector('x')
n = x.norm(2)
......
...@@ -1091,7 +1091,7 @@ class TestGemv(TestCase, unittest_tools.TestOptimizationMixin):
# Assert that the dot was optimized somehow
self.assertFunctionContains0(f, T.dot)
self.assertFunctionContains1(f, Gemv(False))
self.assertFunctionContains1(f, Gemv(True))
# Assert they produce the same output
assert numpy.allclose(f(), numpy.dot(v.get_value(), w.get_value()))
......
...@@ -164,7 +164,8 @@ class TensorType(Type):
" Theano C code does not support that.",
msg,
"object shape", data.shape,
"object strides", data.strides)
"object strides", data.strides,
"object dtype", data.dtype)
i = 0
for b in self.broadcastable:
......
...@@ -11,6 +11,7 @@ from theano.tensor.utils import hash_from_ndarray
from theano.tensor.type import TensorType
class AsTensorError(TypeError):
"""Raised when as_tensor_variable isn't able to create a
TensorVariable.
...@@ -509,13 +510,11 @@ class _tensor_py_operators:
def sort(self, axis=-1, kind='quicksort', order=None):
"""See `theano.tensor.sort`"""
from theano.tensor.sort import sort
return sort(self, axis, kind, order)
return theano.tensor.sort(self, axis, kind, order)
def argsort(self, axis=-1, kind='quicksort', order=None):
"""See `theano.tensor.argsort`"""
from theano.tensor.sort import argsort
return argsort(self, axis, kind, order)
return theano.tensor.argsort(self, axis, kind, order)
def clip(self, a_min, a_max):
"Clip (limit) the values in an array."
...@@ -529,16 +528,14 @@ class _tensor_py_operators:
def repeat(self, repeats, axis=None):
"""See `theano.tensor.repeat`"""
from theano.tensor.extra_ops import repeat
return repeat(self, repeats, axis)
return theano.tensor.extra_ops.repeat(self, repeats, axis)
def round(self, mode="half_away_from_zero"):
"""See `theano.tensor.round`"""
return theano.tensor.basic.round(self, mode)
def trace(self):
from theano.sandbox.linalg import trace
return trace(self)
return theano.sandbox.linalg.trace(self)
# TO TRUMP NUMPY OPERATORS
__array_priority__ = 1000
...@@ -549,6 +546,12 @@ class _tensor_py_operators:
def zeros_like(model, dtype=None):
return theano.tensor.basic.zeros_like(model, dtype=dtype)
def cumsum(self, axis=None):
return theano.tensor.extra_ops.cumsum(self, axis)
def cumprod(self, axis=None):
return theano.tensor.extra_ops.cumprod(self, axis)
class TensorVariable(_tensor_py_operators, Variable):
"""Subclass to add the tensor operators to the basic `Variable` class."""
......
...@@ -62,7 +62,7 @@ import sys
import time
import theano
from theano.misc.windows import call_subprocess_Popen
from theano.misc.windows import output_subprocess_Popen
def main(stdout=None, stderr=None, argv=None, theano_nose=None,
...@@ -271,19 +271,17 @@ def run(stdout, stderr, argv, theano_nose, batch_size, time_profile,
time.ctime(), test_id, data["ids"][test_id]))
f_rawlog.flush()
proc = call_subprocess_Popen(
p_out = output_subprocess_Popen(
([python, theano_nose, '-v', '--with-id']
+ [str(test_id)] + argv +
['--disabdocstring']),
['--disabdocstring']))
# the previous option calls a custom Nosetests plugin
# precluding automatic substitution of the doc string for
# the test name in display
# (see class 'DisabDocString' in file theano-nose)
stderr=subprocess.PIPE,
stdout=dummy_out.fileno())
# recovering and processing data from pipe
err = proc.stderr.read()
err = p_out[1]
# print the raw log
f_rawlog.write(err)
f_rawlog.flush()
......
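The change above swaps `call_subprocess_Popen` plus manual pipe reads for `output_subprocess_Popen`, which returns the captured output directly. With plain `subprocess`, the same effect looks roughly like this (a sketch, not the helper's actual implementation):

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Run cmd, wait for it, and return (stdout, stderr) as bytes."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    # communicate() reads both pipes concurrently; reading just one pipe
    # (as the old proc.stderr.read() did) can deadlock if the child fills
    # the other pipe's buffer.
    out, err = proc.communicate()
    return out, err

out, err = run_and_capture([sys.executable, '-c',
                            'import sys; sys.stderr.write("test log")'])
assert err == b"test log"
assert out == b""
```

Returning the finished output instead of a live `Popen` handle also guarantees the child has been waited on, so the batch runner cannot accumulate zombie test processes.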
...@@ -554,6 +554,52 @@ def test_disconnected_cost_grad():
except theano.gradient.DisconnectedInputError:
return
raise AssertionError("A disconnected gradient has been ignored.")
def test_subgraph_grad():
    # Tests that the grad method with no known_grads
    # matches what happens if you use successive subgraph_grads
    x = theano.tensor.fvector('x')
    t = theano.tensor.fvector('t')
    w1 = theano.shared(np.random.randn(3, 4))
    w2 = theano.shared(np.random.randn(4, 2))
    a1 = theano.tensor.tanh(theano.tensor.dot(x, w1))
    a2 = theano.tensor.tanh(theano.tensor.dot(a1, w2))
    cost2 = theano.tensor.sqr(a2 - t).sum()
    cost2 += theano.tensor.sqr(w2.sum())
    cost1 = theano.tensor.sqr(w1.sum())

    params = [[w2], [w1]]
    costs = [cost2, cost1]
    grad_ends = [[a1], [x]]

    inputs = [t, x]
    rng = np.random.RandomState([2012, 11, 15])
    values = [rng.randn(2), rng.randn(3)]
    values = [np.cast[ipt.dtype](value) for ipt, value in zip(inputs, values)]

    wrt = [w2, w1]
    cost = cost2 + cost1
    true_grads = theano.grad(cost, wrt)
    true_grads = theano.function(inputs, true_grads)
    true_grads = true_grads(*values)

    from theano.gof.python25 import OrderedDict
    next_grad = None
    param_grads = []
    for i in xrange(2):
        param_grad, next_grad = theano.subgraph_grad(
            wrt=params[i], end=grad_ends[i],
            start=next_grad, cost=costs[i]
        )
        next_grad = OrderedDict(zip(grad_ends[i], next_grad))
        param_grads.extend(param_grad)

    pgrads = theano.function(inputs, param_grads)
    pgrads = pgrads(*values)

    for true_grad, pgrad in zip(true_grads, pgrads):
        assert np.sum(np.abs(true_grad - pgrad)) < 0.00001
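The test above checks that gradients computed subgraph-by-subgraph agree with a single end-to-end `theano.grad` call. The underlying identity is just the chain rule applied in stages across a cut point; a plain NumPy sketch with scalar weights (the names here are illustrative, not from the test):

```python
import numpy as np

# Forward pass: x -> a1 = tanh(x * w1) -> cost = (a1 * w2) ** 2
x, w1, w2 = 0.5, 1.5, -2.0
a1 = np.tanh(x * w1)
cost = (a1 * w2) ** 2

# Stage 1: gradients of cost w.r.t. w2 and w.r.t. the cut point a1.
d_a1 = 2 * (a1 * w2) * w2        # dcost/da1, handed to the next stage
d_w2 = 2 * (a1 * w2) * a1        # dcost/dw2

# Stage 2: propagate d_a1 through the first subgraph to reach w1.
d_w1 = d_a1 * (1 - a1 ** 2) * x  # dcost/dw1 via the chain rule

# End-to-end check by finite differences on w1.
eps = 1e-6
num_w1 = ((np.tanh(x * (w1 + eps)) * w2) ** 2 - cost) / eps
assert abs(d_w1 - num_w1) < 1e-4
```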
class TestConsiderConstant(unittest.TestCase):
...
...@@ -1136,3 +1136,214 @@ class T_graphstructures(unittest.TestCase):
         assert e.owner.inputs[1] is mul_variable
         assert e.owner.inputs[1].owner.inputs[0] is y
         assert e.owner.inputs[1].owner.inputs[1] is z
class T_scan(unittest.TestCase):
    ## All tests here belong to
    ## http://deeplearning.net/software/theano/tutorial/loop.html
    ## Theano/doc/tutorial/loop.txt
    ## If you change anything here, update the tutorial as well!

    def test_elemwise(self):
        # defining the tensor variables
        X = T.matrix("X")
        W = T.matrix("W")
        b_sym = T.vector("b_sym")

        results, updates = theano.scan(lambda v: T.tanh(T.dot(v, W) + b_sym),
                                       sequences=X)
        compute_elementwise = theano.function(inputs=[X, W, b_sym],
                                              outputs=[results])

        # test values
        x = numpy.eye(2)
        w = numpy.ones((2, 2))
        b = numpy.ones((2,))
        b[1] = 2
        print "Scan results:", compute_elementwise(x, w, b)[0]

        # comparison with numpy
        print "Numpy results:", numpy.tanh(x.dot(w) + b)
    def test_sequence(self):
        # define tensor variables
        X = T.vector("X")
        W = T.matrix("W")
        b_sym = T.vector("b_sym")
        U = T.matrix("U")
        Y = T.matrix("Y")
        V = T.matrix("V")
        P = T.matrix("P")

        results, updates = theano.scan(
            lambda y, p, x_tm1: T.tanh(T.dot(x_tm1, W) +
                                       T.dot(y, U) + T.dot(p, V)),
            sequences=[Y, P[::-1]], outputs_info=[X])
        compute_seq = theano.function(inputs=[X, W, Y, U, P, V],
                                      outputs=[results])

        # test values
        x = numpy.zeros((2,))
        x[1] = 1
        w = numpy.ones((2, 2))
        y = numpy.ones((5, 2))
        y[0, :] = -3
        u = numpy.ones((2, 2))
        p = numpy.ones((5, 2))
        p[0, :] = 3
        v = numpy.ones((2, 2))
        print "Scan results:", compute_seq(x, w, y, u, p, v)[0]

        # comparison with numpy
        x_res = numpy.zeros((5, 2))
        x_res[0] = numpy.tanh(x.dot(w) + y[0].dot(u) + p[4].dot(v))
        for i in range(1, 5):
            x_res[i] = numpy.tanh(x_res[i - 1].dot(w)
                                  + y[i].dot(u) + p[4 - i].dot(v))
        print "Numpy results:", x_res
    def test_norm(self):
        # define tensor variable
        X = T.matrix("X")
        results, updates = theano.scan(lambda x_i: T.sqrt((x_i ** 2).sum()),
                                       sequences=[X])
        compute_norm_lines = theano.function(inputs=[X], outputs=[results])

        results, updates = theano.scan(lambda x_i: T.sqrt((x_i ** 2).sum()),
                                       sequences=[X.T])
        compute_norm_cols = theano.function(inputs=[X], outputs=[results])

        # test value
        x = numpy.diag(numpy.arange(1, 6), 1)
        print "Scan results:", compute_norm_lines(x)[0], \
            compute_norm_cols(x)[0]

        # comparison with numpy
        print "Numpy results:", numpy.sqrt((x ** 2).sum(1)), \
            numpy.sqrt((x ** 2).sum(0))
    def test_trace(self):
        # define tensor variable
        X = T.matrix("X")
        results, updates = theano.scan(
            lambda i, j, t_f: T.cast(X[i, j] + t_f, theano.config.floatX),
            sequences=[T.arange(X.shape[0]), T.arange(X.shape[1])],
            outputs_info=numpy.asarray(0., dtype=theano.config.floatX))
        result = results[-1]
        compute_trace = theano.function(inputs=[X], outputs=[result])

        # test value
        x = numpy.eye(5)
        x[0] = numpy.arange(5)
        print "Scan results:", compute_trace(x)[0]

        # comparison with numpy
        print "Numpy results:", numpy.diagonal(x).sum()
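The trace test works because `scan`, like `zip`, walks the index sequences in lockstep and stops at the shorter one, so for a rectangular matrix it would sum `min(m, n)` diagonal entries. The same accumulation in plain Python/NumPy:

```python
import numpy as np

# Same matrix as in the test: the identity with its first row replaced.
x = np.eye(5)
x[0] = np.arange(5)

# Walk the diagonal by zipping row and column indices, as the scan does;
# zip stops at the shorter sequence, so this also works when m != n.
trace = sum(x[i, j] for i, j in zip(range(x.shape[0]), range(x.shape[1])))
assert trace == np.diagonal(x).sum()
```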
    def test_taps(self):
        # define tensor variables
        X = T.matrix("X")
        W = T.matrix("W")
        b_sym = T.vector("b_sym")
        U = T.matrix("U")
        V = T.matrix("V")
        n_sym = T.iscalar("n_sym")

        results, updates = theano.scan(
            lambda x_tm2, x_tm1: T.dot(x_tm2, U) + T.dot(x_tm1, V)
                                 + T.tanh(T.dot(x_tm1, W) + b_sym),
            n_steps=n_sym,
            outputs_info=[dict(initial=X, taps=[-2, -1])])
        compute_seq2 = theano.function(inputs=[X, U, V, W, b_sym, n_sym],
                                       outputs=[results])

        # test values
        x = numpy.zeros((2, 2))
        # the initial value must be able to return x[-2]
        x[1, 1] = 1
        w = 0.5 * numpy.ones((2, 2))
        u = 0.5 * (numpy.ones((2, 2)) - numpy.eye(2))
        v = 0.5 * numpy.ones((2, 2))
        n = 10
        b = numpy.ones((2,))
        print "Scan results:", compute_seq2(x, u, v, w, b, n)

        # comparison with numpy
        x_res = numpy.zeros((10, 2))
        x_res[0] = x[0].dot(u) + x[1].dot(v) + numpy.tanh(x[1].dot(w) + b)
        x_res[1] = x[1].dot(u) + x_res[0].dot(v) \
            + numpy.tanh(x_res[0].dot(w) + b)
        for i in range(2, 10):
            x_res[i] = (x_res[i - 2].dot(u) + x_res[i - 1].dot(v)
                        + numpy.tanh(x_res[i - 1].dot(w) + b))
        print "Numpy results:", x_res
    def test_jacobian(self):
        # define tensor variables
        v = T.vector()
        A = T.matrix()
        y = T.tanh(T.dot(v, A))
        results, updates = theano.scan(lambda i: T.grad(y[i], v),
                                       sequences=[T.arange(y.shape[0])])
        compute_jac_t = theano.function([A, v], [results],
                                        allow_input_downcast=True)  # shape (d_out, d_in)

        # test values
        x = numpy.eye(5)[0]
        w = numpy.eye(5, 3)
        w[2] = numpy.ones((3,))
        print "Scan results:", compute_jac_t(w, x)[0]

        # compare with numpy
        print "Numpy results:", ((1 - numpy.tanh(x.dot(w)) ** 2) * w).T
    def test_accumulator(self):
        # define shared variables
        k = theano.shared(0)
        n_sym = T.iscalar("n_sym")
        results, updates = theano.scan(lambda: {k: (k + 1)}, n_steps=n_sym)
        accumulator = theano.function([n_sym], [], updates=updates,
                                      allow_input_downcast=True)

        print "Before 5 steps:", k.get_value()
        accumulator(5)
        print "After 5 steps:", k.get_value()
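The accumulator test relies on Theano's update semantics: `scan` returns an `updates` dictionary mapping the shared variable `k` to `k + 1`, and compiling with `updates=updates` makes every call advance `k` once per scan step. A plain-Python sketch of that state-carrying behavior (this `Shared` class is a hypothetical stand-in, not Theano's implementation):

```python
class Shared(object):
    """Hypothetical stand-in for a Theano shared variable:
    state that persists across compiled-function calls."""
    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value

def make_accumulator(k):
    # Mirrors theano.function([n_sym], [], updates=updates):
    # each call applies the k -> k + 1 update once per scan step.
    def accumulator(n):
        for _ in range(n):
            k.set_value(k.get_value() + 1)
    return accumulator

k = Shared(0)
accumulator = make_accumulator(k)
accumulator(5)
assert k.get_value() == 5
```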
    def test_random(self):
        # define tensor variables
        X = T.matrix("X")
        W = T.matrix("W")
        b_sym = T.vector("b_sym")

        # define shared random stream
        trng = T.shared_randomstreams.RandomStreams(1234)
        d = trng.binomial(size=W[1].shape)

        results, updates = theano.scan(lambda v: T.tanh(T.dot(v, W)
                                                        + b_sym) * d,
                                       sequences=X)
        compute_with_bnoise = theano.function(inputs=[X, W, b_sym],
                                              outputs=[results],
                                              updates=updates,
                                              allow_input_downcast=True)
        x = numpy.eye(10, 2)
        w = numpy.ones((2, 2))
        b = numpy.ones((2,))
        print compute_with_bnoise(x, w, b)