Commit 839fa93b authored by Frédéric Bastien, committed by GitHub

Merge branch 'master' into ipT_grad

 #!/bin/bash
 BUILDBOT_DIR=$WORKSPACE/nightly_build
-THEANO_PARAM="theano --with-timer --timer-top-n 10"
+THEANO_PARAM="theano --with-timer --timer-top-n 10 -v"
 export THEANO_FLAGS=init_gpu_device=gpu
 # CUDA
......
@@ -66,8 +66,7 @@ features:
 * tight integration with NumPy: a similar interface to NumPy's.
   numpy.ndarrays are also used internally in Theano-compiled functions.
-* transparent use of a GPU: perform data-intensive computations up to
-  140x faster than on a CPU (support for float32 only).
+* transparent use of a GPU: perform data-intensive computations much faster than on a CPU.
 * efficient symbolic differentiation: Theano can compute derivatives
   for functions of one or many inputs.
 * speed and stability optimizations: avoid nasty bugs when computing
......
@@ -7,7 +7,7 @@ evaluate mathematical expressions involving multi-dimensional
 arrays efficiently. Theano features:
 * **tight integration with NumPy** -- Use `numpy.ndarray` in Theano-compiled functions.
-* **transparent use of a GPU** -- Perform data-intensive calculations up to 140x faster than with CPU.(float32 only)
+* **transparent use of a GPU** -- Perform data-intensive computations much faster than on a CPU.
 * **efficient symbolic differentiation** -- Theano does your derivatives for functions with one or many inputs.
 * **speed and stability optimizations** -- Get the right answer for ``log(1+x)`` even when ``x`` is really tiny.
 * **dynamic C code generation** -- Evaluate expressions faster.
......
@@ -12,6 +12,7 @@ CentOS 6 Installation Instructions
 page <http://deeplearning.net/software/theano_versions/dev/install_centos6.html>`_.
 .. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
+.. |CompilerName| replace:: ``g++``
 .. include:: requirements.inc
......
@@ -9,6 +9,24 @@ Installation

 Stable Installation
 -------------------

+With ``conda``
+^^^^^^^^^^^^^^
+
+If you use conda, you can directly install both theano and pygpu. Libgpuarray
+will be automatically installed as a dependency.
+
+.. code-block:: bash
+
+    conda install theano pygpu
+
+With ``pip``
+^^^^^^^^^^^^
+
+If you use pip, you have to install Theano and libgpuarray separately.
+
+theano
+::::::
+
 Install the latest stable version of Theano with:

 .. raw:: html
@@ -27,23 +45,18 @@ Install the latest stable version of Theano with:

 If you encountered any trouble, head to the :ref:`troubleshooting` page.

-libgpuarray
-^^^^^^^^^^^
-
-It is recommanded that you don't use 0.8.2 for the new back-end. Use
-the dev version of Theano or 0.9rc3.
-
-For the stable version of Theano(0.8.2) you need a specific version of libgpuarray,
-that has been tagged ``v-9998``.
-Download it with:
-
-.. raw:: html
-
-   <div class='highlight'><pre>
-   git clone https://github.com/Theano/libgpuarray.git --tags
-   git checkout origin/v-9998
-   cd libgpuarray
-   </pre></div>
+The latest stable version of Theano is ``0.9.0`` (tagged with ``rel-0.9.0``).
+
+libgpuarray
+:::::::::::
+
+For the stable version of Theano you need a specific version of libgpuarray,
+that has been tagged ``v0.6.2``.
+Download it with::
+
+    git clone https://github.com/Theano/libgpuarray.git
+    cd libgpuarray
+    git checkout tags/v0.6.2 -b v0.6.2

 and then follow the `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
......
@@ -20,6 +20,7 @@ alternative instructions here.
 .. _theano-users: http://groups.google.com/group/theano-users?pli=1
 .. |PlatformCompiler| replace:: ``clang`` (the system version)
+.. |CompilerName| replace:: ``Clang``
 .. include:: requirements.inc
......
@@ -14,6 +14,7 @@ Ubuntu Installation Instructions
 .. _gpu_linux:
 .. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
+.. |CompilerName| replace:: ``g++``
 .. include:: requirements.inc
......
(Diff collapsed.)
@@ -153,7 +153,7 @@ For final releases, send the e-mail to the following mailing lists:
 * theano-users
 * theano-announce
 * numpy-discussion@scipy.org
-* scipy-user@scipy.org
+* scipy-user@python.org
 * G+, Scientific Python: https://plus.google.com/communities/108773711053400791849
 For release candidates, only e-mail:
......
@@ -219,6 +219,7 @@ TODO: Give examples on how to use these things! They are pretty complicated.
 It flip the kernel.
 .. autofunction:: theano.tensor.nnet.conv2d
+.. autofunction:: theano.tensor.nnet.conv2d_transpose
 .. autofunction:: theano.tensor.nnet.conv3d
 .. autofunction:: theano.sandbox.cuda.fftconv.conv2d_fft
 .. autofunction:: theano.tensor.nnet.Conv3D.conv3D
......
@@ -7,21 +7,28 @@ Requirements

 .. _BLAS: http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
 .. _Python: http://www.python.org/
+.. _LaTeX: http://www.latex-project.org/
+.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
+.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
+.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
+.. _pycuda: https://mathema.tician.de/software/pycuda/
+.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/

-Python_ >= 2.7 or >= 3.3
+Python_ == 2.7 or ( >= 3.3 and <= 3.5 )
     The development package (python-dev or
     python-devel on most Linux distributions) is recommended (see
     just below). Python 2.4 was supported up to and including the
     release 0.6. Python 2.6 was supported up to and including the
     release 0.8.2. Python 3 is supported past the 3.3 release.
-`NumPy <http://numpy.scipy.org/>`_ >= 1.9.1 < 1.11.1
+`NumPy <http://numpy.scipy.org/>`_ >= 1.9.1 <= 1.12
     Earlier versions could work, but we dont test it.
 `SciPy <http://scipy.org>`_ >= 0.14 < 0.17.1
     Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >=0.8 could work, but earlier versions have known bugs with sparse matrices.
 `BLAS`_ installation (with Level 3 functionality)
   * **Recommended**: MKL, which is free through Conda.
   * Alternatively, we suggest to install OpenBLAS, with the development headers (``-dev``, ``-devel``, depending on your Linux distribution).

 **Optional requirements**
@@ -42,10 +49,9 @@ Requirements
     **Highly recommended** Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
 `libgpuarray`_
-    Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`.)
+    Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`).
 `pycuda`_ and `skcuda`_
     Required for some extra operations on the GPU like fft and
     solvers. We use them to wrap cufft and cusolver. Quick install
     ``pip install pycuda scikit-cuda``. For cuda 8, the dev
@@ -63,7 +69,9 @@ Follow this `link <http://conda.pydata.org/miniconda.html>`__ to install Minicon

 .. note::

-    If you want fast compiled code (recommended), make sure you have g++ (Windows/Linux) or Clang (OS X) installed.
+    If you want fast compiled code (recommended), make sure you have |CompilerName| installed.
+
+.. install_requirements_and_optional_packages

 Install requirements and optional packages
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -109,9 +117,4 @@ Install and configure the GPU drivers (recommended)
 * add a ``cuda.root`` flag to :envvar:`THEANO_FLAGS`, as in ``THEANO_FLAGS='cuda.root=/path/to/cuda/root'``, or
 * add a [cuda] section to your .theanorc file containing the option ``root = /path/to/cuda/root``.

-.. _LaTeX: http://www.latex-project.org/
-.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
-.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
-.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
-.. _pycuda: https://mathema.tician.de/software/pycuda/
-.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/

 .. |PlatformCompiler| replace:: ``g++`` (Linux and Windows), ``clang`` (OS X)
+.. |CompilerName| replace:: ``g++`` (Windows/Linux) or ``Clang`` (OS X)
 .. include:: requirements.inc
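As a sketch of the ``[cuda]`` option mentioned in the hunk above, a minimal ``.theanorc`` would look like this (the path is illustrative, not a detected location):

```ini
[cuda]
root = /usr/local/cuda
```

The equivalent environment-variable form is ``THEANO_FLAGS='cuda.root=/usr/local/cuda'``.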
@@ -220,6 +220,36 @@ The ``compute_test_value`` mechanism works as follows:

 This feature is currently incompatible with ``Scan`` and also with ops
 which do not implement a ``perform`` method.

+It is also possible to override a variable's ``__repr__`` method to have it return ``tag.test_value``.
+
+.. testsetup:: printtestvalue
+
+    import theano
+    import theano.tensor as T
+
+.. testcode:: printtestvalue
+
+    x = T.scalar('x')
+
+    # Assigning test value
+    x.tag.test_value = 42
+
+    # Enable test value printing
+    theano.config.print_test_value = True
+    print(x.__repr__())
+
+    # Disable test value printing
+    theano.config.print_test_value = False
+    print(x.__repr__())
+
+Running the code above returns the following output:
+
+.. testoutput:: printtestvalue
+
+    x
+    array(42.0)
+    x
+
 "How do I Print an Intermediate Value in a Function?"
 -----------------------------------------------------
......
@@ -31,14 +31,36 @@ import logging
 import sys

+
+def has_handlers(logger):
+    # copied from Logger.hasHandlers() (introduced in Python 3.2)
+    _logger = logger
+    _has_handler = False
+    while _logger:
+        if _logger.handlers:
+            _has_handler = True
+            break
+        if not _logger.propagate:
+            break
+        else:
+            _logger = _logger.parent
+    return _has_handler
+
 theano_logger = logging.getLogger("theano")
 logging_default_handler = logging.StreamHandler()
 logging_default_formatter = logging.Formatter(
     fmt='%(levelname)s (%(name)s): %(message)s')
 logging_default_handler.setFormatter(logging_default_formatter)
-theano_logger.addHandler(logging_default_handler)
 theano_logger.setLevel(logging.WARNING)
+
+if has_handlers(theano_logger) is False:
+    theano_logger.addHandler(logging_default_handler)
+
+
+# Disable default log handler added to theano_logger when the module
+# is imported.
+def disable_log_handler(logger=theano_logger, handler=logging_default_handler):
+    if has_handlers(logger):
+        logger.removeHandler(handler)
+
 # Version information.
 from theano.version import version as __version__
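The hunk above follows a common library-logging pattern: attach a default handler only when the application has not configured one anywhere up the logger hierarchy, and expose a function to detach it again. A self-contained sketch of the same pattern (the logger name `mylib_demo` is illustrative, not part of Theano):

```python
import logging

def has_handlers(logger):
    # Walk up the logger hierarchy, as logging.Logger.hasHandlers()
    # does (that method only exists on Python >= 3.2).
    while logger:
        if logger.handlers:
            return True
        if not logger.propagate:
            break
        logger = logger.parent
    return False

lib_logger = logging.getLogger("mylib_demo")   # illustrative name
default_handler = logging.StreamHandler()

# Attach the library's default handler only when the embedding
# application has not already installed one.
if not has_handlers(lib_logger):
    lib_logger.addHandler(default_handler)

def disable_log_handler(logger=lib_logger, handler=default_handler):
    # Let the embedding application detach the library's default handler.
    if has_handlers(logger):
        logger.removeHandler(handler)
```

This keeps a library from double-logging when the application already configured `logging`, while still printing warnings by default in bare scripts.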
......
@@ -2302,6 +2302,7 @@ class GCC_compiler(Compiler):
             if status:
                 tf = tempfile.NamedTemporaryFile(
+                    mode='w',
                     prefix='theano_compilation_error_',
                     delete=False
                 )
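The added ``mode='w'`` matters because ``NamedTemporaryFile`` defaults to binary mode (``'w+b'``), where writing a ``str`` raises ``TypeError`` on Python 3. A quick sketch of the difference:

```python
import os
import tempfile

# Default mode is 'w+b': writing str fails on Python 3.
with tempfile.NamedTemporaryFile(delete=False) as tf_bin:
    try:
        tf_bin.write("compilation error text")
        wrote_str_in_binary_mode = True
    except TypeError:
        wrote_str_in_binary_mode = False
os.unlink(tf_bin.name)

# With mode='w', the same str write succeeds.
tf = tempfile.NamedTemporaryFile(mode='w',
                                 prefix='theano_compilation_error_',
                                 delete=False)
tf.write("compilation error text")
tf.close()
os.unlink(tf.name)
```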
......
@@ -1375,3 +1375,27 @@ def list_of_nodes(inputs, outputs):
         lambda o: [inp.owner for inp in o.inputs
                    if inp.owner and
                    not any(i in inp.owner.outputs for i in inputs)])
+
+
+def is_in_ancestors(l_node, f_node):
+    r"""
+    Goes up in the graph and returns True if the apply node f_node is found.
+
+    Use a stack implementation as the vm algo.
+    We suppose all nodes are not lazy
+    (i.e. for IfElse we suppose all inputs are computed)
+    """
+    computed = set()
+    todo = [l_node]
+    while todo:
+        cur = todo.pop()
+        if cur.outputs[0] in computed:
+            continue
+        if all([i in computed or i.owner is None for i in cur.inputs]):
+            computed.update(cur.outputs)
+            if cur is f_node:
+                return True
+        else:
+            todo.append(cur)
+            todo.extend(i.owner for i in cur.inputs if i.owner)
+    return False
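To illustrate the traversal the new helper performs, here is a self-contained sketch with minimal stand-ins for Theano's variables and apply nodes (the `Var`/`Node` classes are hypothetical mocks, not Theano API; the function body mirrors the hunk above):

```python
class Var:
    """Minimal stand-in for a Theano variable."""
    def __init__(self, owner=None):
        self.owner = owner  # the Node that produced this variable, if any

class Node:
    """Minimal stand-in for a Theano apply node with one output."""
    def __init__(self, inputs):
        self.inputs = inputs
        self.outputs = [Var(owner=self)]

def is_in_ancestors(l_node, f_node):
    # Same stack-based walk as in the diff: mark a node computed once
    # all its inputs are computed (or are graph inputs), and report
    # whether f_node is reached on the way.
    computed = set()
    todo = [l_node]
    while todo:
        cur = todo.pop()
        if cur.outputs[0] in computed:
            continue
        if all(i in computed or i.owner is None for i in cur.inputs):
            computed.update(cur.outputs)
            if cur is f_node:
                return True
        else:
            todo.append(cur)
            todo.extend(i.owner for i in cur.inputs if i.owner)
    return False

# Build a two-node chain: x -> n1 -> n2
x = Var()                    # graph input, no owner
n1 = Node([x])
n2 = Node([n1.outputs[0]])
```

With this chain, `n1` is an ancestor of `n2` but not the other way around.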
@@ -2089,13 +2089,7 @@ class TopoOptimizer(NavigatorOptimizer):
             if node is not current_node:
                 q.append(node)

-        def pruner(node):
-            if node is not current_node:
-                try:
-                    q.remove(node)
-                except ValueError:
-                    pass
-
-        u = self.attach_updater(fgraph, importer, pruner,
+        u = self.attach_updater(fgraph, importer, None,
                                 name=getattr(self, 'name', None))
         nb = 0
         try:
@@ -2105,6 +2099,8 @@ class TopoOptimizer(NavigatorOptimizer):
                     node = q.pop()
                 else:
                     node = q.popleft()
+                if node not in fgraph.apply_nodes:
+                    continue
                 current_node = node
                 nb += self.process_node(fgraph, node)
         loop_t = time.time() - t0
@@ -2217,17 +2213,13 @@ class OpKeyOptimizer(NavigatorOptimizer):
             if node.op == op:
                 q.append(node)

-        def pruner(node):
-            if node is not current_node and node.op == op:
-                try:
-                    q.remove(node)
-                except ValueError:
-                    pass
-
-        u = self.attach_updater(fgraph, importer, pruner,
+        u = self.attach_updater(fgraph, importer, None,
                                 name=getattr(self, 'name', None))
         try:
             while q:
                 node = q.pop()
+                if node not in fgraph.apply_nodes:
+                    continue
                 current_node = node
                 self.process_node(fgraph, node)
         finally:
......
@@ -73,7 +73,7 @@ def as_gpuarray_variable(x, context_name):
     # If we couldn't deal with transfers, then maybe it's a tensor
     if isinstance(x.type, tensor.TensorType):
-        return gpu_from_host(context_name)(x)
+        return GpuFromHost(context_name)(x)

     # Try _as_GpuArrayVariable if possible
     if hasattr(x, '_as_GpuArrayVariable'):
@@ -617,7 +617,7 @@ class HostFromGpu(Op):
     def grad(self, inputs, grads):
         gz, = grads
-        return [gpu_from_host(inputs[0].type.context_name)(gz)]
+        return [GpuFromHost(inputs[0].type.context_name)(gz)]

     def R_op(self, inputs, eval_points):
         ev, = eval_points
@@ -663,8 +663,8 @@ class GpuFromHost(Op):
     def grad(self, inputs, grads):
         gz, = grads
-        return [host_from_gpu(as_gpuarray_variable(
-            gz, context_name=self.context_name))]
+        return [as_gpuarray_variable(
+            gz, context_name=self.context_name).transfer('cpu')]

     def R_op(self, inputs, eval_points):
         ev, = eval_points
@@ -722,14 +722,6 @@ class GpuFromHost(Op):
         return (9,)


-# Caching GPUAlloc
-def gpu_from_host(ctx):
-    if ctx not in gpu_alloc.cache:
-        gpu_from_host.cache[ctx] = GpuFromHost(ctx)
-    return gpu_from_host.cache[ctx]
-gpu_from_host.cache = {}
-
-
 class GpuToGpu(Op):
     """
     Transfer data between GPUs.
@@ -953,15 +945,6 @@ class GpuAlloc(HideC, Alloc):
         return True


-# Caching GPUAlloc
-def gpu_alloc(ctx, memset_0=False):
-    key = (ctx, memset_0)
-    if key not in gpu_alloc.cache:
-        gpu_alloc.cache[key] = GpuAlloc(ctx, memset_0)
-    return gpu_alloc.cache[key]
-gpu_alloc.cache = {}
-
-
 class GpuAllocEmpty(HideC, AllocEmpty):
     """
     Allocate uninitialized memory on the GPU.
@@ -1048,14 +1031,6 @@ def empty_like(var):
     return GpuAllocEmpty(var.type.dtype, var.type.context_name)(*var.shape)


-def gpu_alloc_empty(ctx, dtype):
-    key = (dtype, ctx)
-    if key not in gpu_alloc_empty.cache:
-        gpu_alloc_empty.cache[key] = GpuAllocEmpty(dtype, ctx)
-    return gpu_alloc_empty.cache[key]
-gpu_alloc_empty.cache = {}
-
-
 class GpuContiguous(Op):
     """
     Return a C contiguous version of the input.
@@ -1132,7 +1107,7 @@ class GpuReshape(HideC, tensor.Reshape):
         ctx_name = infer_context_name(x)
         x = as_gpuarray_variable(x, context_name=ctx_name)
         shp = tensor.as_tensor_variable(shp)
-        res = host_from_gpu(x).reshape(shp, ndim=self.ndim)
+        res = x.transfer('cpu').reshape(shp, ndim=self.ndim)
         otype = GpuArrayType(dtype=res.dtype,
                              broadcastable=res.broadcastable,
                              context_name=ctx_name)
......
(Diff collapsed.)
@@ -2,13 +2,13 @@ from __future__ import absolute_import, print_function, division
 import os

 from theano import Apply, Op
 from theano.tensor.extra_ops import CumOp
-from .basic_ops import infer_context_name

 try:
     from pygpu import gpuarray
 except ImportError:
     pass

-from .basic_ops import (as_gpuarray_variable, GpuKernelBase, Kernel, GpuReshape)
+from .basic_ops import (as_gpuarray_variable, GpuKernelBase, Kernel, GpuReshape, infer_context_name)
 from .opt import register_opt, op_lifter, register_opt2
......
@@ -10,7 +10,7 @@ from theano.scalar import as_scalar, constant

 from . import opt
 from .basic_ops import (as_gpuarray_variable, GpuAllocEmpty,
-                        infer_context_name, gpu_alloc_empty)
+                        infer_context_name)
 from .type import gpu_context_type
 from .opt_util import alpha_merge, output_merge
@@ -158,7 +158,7 @@ def local_gpua_dot_to_gemm16(op, ctx_name, inputs, outputs):
     if (A.ndim == 2 and B.ndim == 2 and
             A.dtype == 'float16' and B.dtype == 'float16'):
         fgraph = getattr(outputs[0], 'fgraph', None)
-        C = gpu_alloc_empty(ctx_name, dtype='float16')(
+        C = GpuAllocEmpty('float16', ctx_name)(
             shape_i(A, 0, fgraph), shape_i(B, 1, fgraph))
         return Gemm16()(C, 1.0, A, B, 0.0)
......
@@ -44,8 +44,7 @@ from .basic_ops import (as_gpuarray_variable, infer_context_name,
                         HostFromGpu, GpuFromHost,
                         GpuSplit, GpuContiguous, gpu_contiguous,
                         GpuAlloc, GpuAllocEmpty, GpuReshape,
-                        GpuEye, gpu_join, GpuJoin, gpu_alloc_empty,
-                        gpu_alloc, gpu_from_host)
+                        GpuEye, gpu_join, GpuJoin)
 from .blas import (gpu_dot22, GpuGemm, GpuGer, GpuGemmBatch,
                    gpugemm_no_inplace, gpugemm_inplace,
                    gpugemmbatch_no_inplace,
@@ -61,9 +60,8 @@ from .blocksparse import (GpuSparseBlockGemv, GpuSparseBlockOuter,
 from .nnet import (gpu_crossentropy_softmax_1hot_with_bias_dx,
                    gpu_crossentropy_softmax_argmax_1hot_with_bias,
                    gpu_softmax_with_bias, gpu_softmax)
 from .elemwise import (GpuElemwise, GpuDimShuffle, GpuCAReduceCuda,
-                       GpuCAReduceCPY, gpu_ca_reduce_cuda, gpu_erfinv, gpu_erfcinv,
+                       GpuCAReduceCPY, gpu_erfinv, gpu_erfcinv,
                        max_inputs_to_GpuElemwise)
 from .subtensor import (GpuIncSubtensor, GpuSubtensor,
                         GpuAdvancedSubtensor,
@@ -165,14 +163,14 @@ gpu_optimizer.register('local_remove_all_assert',

 def safe_to_gpu(x, ctx_name):
     if isinstance(x.type, tensor.TensorType):
-        return gpu_from_host(ctx_name)(x)
+        return GpuFromHost(ctx_name)(x)
     else:
         return x

 def safe_to_cpu(x):
     if isinstance(x.type, GpuArrayType):
-        return host_from_gpu(x)
+        return x.transfer('cpu')
     else:
         return x
@@ -236,7 +234,7 @@ def op_lifter(OP, cuda_only=False):
             elif isinstance(new_op, (tuple, list)):
                 return [safe_to_cpu(o) for o in new_op]
             else:  # suppose it is a variable on the GPU
-                return [host_from_gpu(new_op)]
+                return [new_op.transfer('cpu')]
         return False
     local_opt.__name__ = maker.__name__
     return local_optimizer(OP)(local_opt)
@@ -269,7 +267,7 @@ class InputToGpuOptimizer(Optimizer):
                 continue
             try:
-                new_input = host_from_gpu(gpu_from_host(target)(input))
+                new_input = GpuFromHost(target)(input).transfer('cpu')
                 fgraph.replace_validate(input, new_input,
                                         "InputToGpuOptimizer")
             except TypeError:
@@ -546,7 +544,7 @@ def local_cut_gpu_transfers(node):
         # gpub ->
         if isinstance(n2.op, GpuToGpu):
-            return [host_from_gpu(n2.inputs[0])]
+            return [n2.inputs[0].transfer('cpu')]

     # ? -> gpua -> gpub
     elif isinstance(node.op, GpuToGpu):
@@ -600,14 +598,14 @@ def local_gpua_alloc2(node):
                 i.owner.op in [host_from_gpu, tensor.alloc]
                 for i in c.inputs[1:])
             for c, idx in node.outputs[0].clients)):
-        return [host_from_gpu(gpu_alloc(None)(*node.inputs))]
+        return [GpuAlloc(None)(*node.inputs).transfer('cpu')]


 @register_opt('fast_compile')
 @op_lifter([tensor.Alloc])
 @register_opt2([tensor.Alloc], 'fast_compile')
-def local_gpua_alloc(op, context_name, inputs, outputs):
-    return gpu_alloc(context_name)
+def local_gpuaalloc(op, context_name, inputs, outputs):
+    return GpuAlloc(context_name)(*inputs)


 @register_opt('fast_compile')
@@ -616,7 +614,7 @@ def local_gpua_alloc(op, context_name, inputs, outputs):
 def local_gpua_alloc_empty(op, context_name, inputs, outputs):
     # We use _props_dict() to make sure that the GPU op know all the
     # CPU op props.
-    return gpu_alloc_empty(context_name, **op._props_dict())
+    return GpuAllocEmpty(context_name=context_name, **op._props_dict())(*inputs)


 @register_opt()
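The ``_props_dict()`` idiom used in the hunk above can be sketched in isolation: an op declares its parameters in ``__props__``, and ``_props_dict()`` turns them into kwargs from which an equivalent op (here, a GPU counterpart) can be rebuilt without listing each prop by hand. The classes below are illustrative mocks, not Theano API:

```python
class OpBase:
    """Mock of the relevant bit of an Op base class (illustrative)."""
    __props__ = ()

    def _props_dict(self):
        # Map each declared prop name to its current value, mirroring
        # what a _props_dict()-style helper provides.
        return {p: getattr(self, p) for p in self.__props__}

class AllocEmptyLike(OpBase):
    """Stand-in for a CPU AllocEmpty-style op with one prop."""
    __props__ = ('dtype',)

    def __init__(self, dtype):
        self.dtype = dtype

# A counterpart op can be constructed with exactly the CPU op's props:
cpu_op = AllocEmptyLike('float16')
gpu_kwargs = cpu_op._props_dict()
```

The benefit is that adding a new prop to the CPU op automatically flows through to any op rebuilt this way.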
@@ -627,7 +625,7 @@ def local_gpualloc_memset_0(node):
     if (isinstance(inp, GpuArrayConstant) and
             inp.data.size == 1 and
             (np.asarray(inp.data) == 0).all()):
-        new_op = gpu_alloc(node.op.context_name, memset_0=True)
+        new_op = GpuAlloc(node.op.context_name, memset_0=True)
         return [new_op(*node.inputs)]
@@ -637,8 +635,8 @@ def local_gpua_alloc_empty_to_zeros(node):
     if isinstance(node.op, GpuAllocEmpty):
         context_name = infer_context_name(*node.inputs)
         z = np.asarray(0, dtype=node.outputs[0].dtype)
-        return [gpu_alloc(context_name)(as_gpuarray_variable(z, context_name),
-                                        *node.inputs)]
+        return [GpuAlloc(context_name)(as_gpuarray_variable(z, context_name),
+                                       *node.inputs)]
 optdb.register('local_gpua_alloc_empty_to_zeros',
                theano.tensor.opt.in2out(local_gpua_alloc_empty_to_zeros),
                # After move to gpu and merge2, before inplace.
@@ -918,7 +916,7 @@ def local_gpu_pdbbreakpoint_op(node):
     new_outputs = []
     for i in range(len(new_op_outputs)):
         if input_transfered[i]:
-            new_outputs.append(host_from_gpu(new_op_outputs[i]))
+            new_outputs.append(new_op_outputs[i].transfer('cpu'))
         else:
             new_outputs.append(new_op_outputs[i])
@@ -983,7 +981,7 @@ def local_gpua_subtensor(op, context_name, inputs, outputs):
                 for n, _ in outputs[0].clients]):
             return
         else:
-            return [host_from_gpu(gpu_x.owner.op(outputs[0]))]
+            return [gpu_x.owner.op(outputs[0]).transfer('cpu')]
     return GpuSubtensor(op.idx_list)
@@ -1234,7 +1232,7 @@ def local_gpua_dot22scalar(op, context_name, inputs, outputs):
     x, y, a = inputs
     x = as_gpuarray_variable(x, context_name)
     y = as_gpuarray_variable(y, context_name)
-    z = gpu_alloc_empty(context_name, dtype=x.dtype)(x.shape[0], y.shape[1])
+    z = GpuAllocEmpty(x.dtype, context_name)(x.shape[0], y.shape[1])
     return [gpugemm_no_inplace(z, a, x, y, 0)]
@@ -1804,10 +1802,10 @@ def local_gpu_elemwise_careduce(node):
             isinstance(node.inputs[0].owner.op.scalar_op, scalar.basic.Sqr)):
         op = node.op
         inp = node.inputs[0].owner.inputs[0]
-        return [gpu_ca_reduce_cuda(scalar_op=op.scalar_op,
-                                   axis=op.axis,
-                                   reduce_mask=op.reduce_mask,
-                                   pre_scalar_op=scalar.basic.sqr)(inp)]
+        return [GpuCAReduceCuda(scalar_op=op.scalar_op,
+                                axis=op.axis,
+                                reduce_mask=op.reduce_mask,
+                                pre_scalar_op=scalar.basic.sqr)(inp)]

 @local_optimizer(None)
......
@@ -8,7 +8,7 @@ from theano.gof import local_optimizer
 from theano.tensor import (DimShuffle, get_scalar_constant_value,
                            NotScalarConstantError)
-from .basic_ops import GpuFromHost, HostFromGpu, GpuAllocEmpty, GpuReshape, gpu_alloc_empty
+from .basic_ops import GpuFromHost, HostFromGpu, GpuAllocEmpty, GpuReshape
 from .elemwise import GpuDimShuffle, GpuElemwise

 _one = scal.constant(np.asarray(1.0, dtype='float32'))
@@ -324,7 +324,7 @@ def inplace_allocempty(op, idx):
         if (alloc.owner and
                 isinstance(alloc.owner.op, GpuAllocEmpty) and
                 len(alloc.clients) > 1):
-            alloc_op = gpu_alloc_empty(alloc.owner.op.context_name, dtype=alloc.owner.op.dtype)
+            alloc_op = GpuAllocEmpty(alloc.owner.op.dtype, alloc.owner.op.context_name)
             inputs[idx] = alloc_op(*alloc.owner.inputs)
         return maker(node, inputs)
     return opt
...
@@ -271,7 +271,7 @@ class GpuArrayType(Type):
         return data

     def filter_variable(self, other, allow_convert=True):
-        from theano.gpuarray.basic_ops import gpu_from_host
+        from theano.gpuarray.basic_ops import GpuFromHost

         if hasattr(other, '_as_GpuArrayVariable'):
             other = other._as_GpuArrayVariable(self.context_name)
@@ -303,7 +303,7 @@ class GpuArrayType(Type):
                      str(self.broadcastable)))
             other = other2

-        return gpu_from_host(self.context_name)(other)
+        return GpuFromHost(self.context_name)(other)

     @staticmethod
     def values_eq(a, b, force_same_dtype=True):
...
@@ -1712,6 +1712,9 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None,
         if max_abs_err > abs_tol and max_rel_err > rel_tol:
             raise verify_grad.E_grad(max_arg, max_err_pos,
+                                     analytic_grad[max_arg].shape,
+                                     analytic_grad[max_arg].flatten()[max_err_pos],
+                                     num_grad.gf[max_arg].flatten()[max_err_pos],
                                      max_abs_err, max_rel_err,
                                      abs_tol, rel_tol)
@@ -1727,10 +1730,14 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None,
 class GradientError(Exception):
     """This error is raised when a gradient is calculated, but incorrect."""

-    def __init__(self, arg, err_pos, abs_err, rel_err, abs_tol, rel_tol):
+    def __init__(self, arg, err_pos, shape, val1, val2,
+                 abs_err, rel_err, abs_tol, rel_tol):
         Exception.__init__(self)  # to be compatible with python2.4
         self.arg = arg
         self.err_pos = err_pos
+        self.shape = shape
+        self.val1 = val1
+        self.val2 = val2
         self.abs_err = abs_err
         self.rel_err = rel_err
         self.abs_tol = abs_tol
@@ -1741,10 +1748,13 @@ class GradientError(Exception):
         args_msg = ", ".join(str(a) for a in self.args)
         return """\
GradientError: numeric gradient and analytic gradient exceed tolerance:
-        At position %i of argument %i,
+        At position %i of argument %i with shape %s,
+        val1 = %f , val2 = %f
         abs. error = %f, abs. tolerance = %f
         rel. error = %f, rel. tolerance = %f
        Exception args: %s""" % (self.err_pos, self.arg,
+                                 self.shape,
+                                 self.val1, self.val2,
                                  self.abs_err, self.abs_tol,
                                  self.rel_err, self.rel_tol,
                                  args_msg)
...
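For reference, the extended `GradientError` message interpolates the new `shape`, `val1`, and `val2` fields with ordinary `%`-formatting. A standalone sketch with made-up values (outside Theano, field order matching the hunk above) renders like this:

```python
# Standalone rendering of the extended GradientError message.
# The numbers here are invented for illustration only.
shape = (2, 2)
msg = """\
GradientError: numeric gradient and analytic gradient exceed tolerance:
        At position %i of argument %i with shape %s,
        val1 = %f , val2 = %f
        abs. error = %f, abs. tolerance = %f
        rel. error = %f, rel. tolerance = %f""" % (
    3, 0, shape, 0.5, 0.75, 0.25, 0.0001, 0.333333, 0.0001)
print(msg)
```

The `%s` conversion of the shape tuple is what produces the human-readable `(2, 2)` in the report.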
@@ -26,7 +26,6 @@ from six import iteritems
 from six.moves import xrange

 from theano.compile import optdb
 from theano.tensor import opt
-from theano.scan_module.scan_utils import find_up
 from theano.scan_module.scan_utils import clone
@@ -578,7 +577,7 @@ class CondMerge(gof.Optimizer):
         merging_node = cond_nodes[0]
         for proposal in cond_nodes[1:]:
             if (proposal.inputs[0] == merging_node.inputs[0] and
-                    not find_up(proposal, merging_node)):
+                    not gof.graph.is_in_ancestors(proposal, merging_node)):
                 # Create a list of replacements for proposal
                 mn_ts = merging_node.inputs[1:][:merging_node.op.n_outs]
                 mn_fs = merging_node.inputs[1:][merging_node.op.n_outs:]
@@ -683,8 +682,8 @@ def cond_merge_random_op(main_node):
     merging_node = cond_nodes[0]
     for proposal in cond_nodes[1:]:
         if (proposal.inputs[0] == merging_node.inputs[0] and
-                not find_up(proposal, merging_node) and
-                not find_up(merging_node, proposal)):
+                not gof.graph.is_in_ancestors(proposal, merging_node) and
+                not gof.graph.is_in_ancestors(merging_node, proposal)):
             # Create a list of replacements for proposal
             mn_ts = merging_node.inputs[1:][:merging_node.op.n_outs]
             mn_fs = merging_node.inputs[1:][merging_node.op.n_outs:]
...
@@ -9,7 +9,7 @@ import theano
 y = theano.tensor.fvector()
 x = theano.shared(np.zeros(1, dtype='float32'))
 f1 = theano.function([y], updates={x: y})
-f2 = theano.function([], theano.sandbox.cuda.host_from_gpu(x))
+f2 = theano.function([], x.transfer('cpu'))
 print(f1.maker.fgraph.toposort())
 print(f2.maker.fgraph.toposort())
 for i in [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]:
...
 from __future__ import absolute_import, print_function, division
-from .ops import (cholesky, matrix_inverse, solve,
-                  diag, extract_diag, alloc_diag,
-                  det, psd, eig, eigh, eigvalsh,
-                  trace, spectral_radius_bound)
+from theano.tensor.slinalg import (cholesky, solve, eigvalsh)
+from theano.tensor.nlinalg import (matrix_inverse,
+                                   diag, extract_diag, alloc_diag,
+                                   det, eig, eigh,
+                                   trace)
+from theano.sandbox.linalg.ops import psd, spectral_radius_bound
 from __future__ import absolute_import, print_function, division
 import logging
-logger = logging.getLogger(__name__)
-import numpy
 from six import iteritems, integer_types
 from six.moves import xrange
 from theano.gof import Op, Apply
-from theano.tensor import as_tensor_variable, dot, DimShuffle, Dot
+from theano.tensor import DimShuffle, Dot
 from theano.tensor.blas import Dot22
 from theano import tensor
 import theano.tensor
 from theano.tensor.opt import (register_stabilize,
-                               register_specialize, register_canonicalize)
+                               register_specialize,
+                               register_canonicalize)
 from theano.gof import local_optimizer
 from theano.gof.opt import Optimizer
-from theano.gradient import DisconnectedType
-from theano.tensor.nlinalg import ( MatrixInverse,
-    matrix_inverse,
-    MatrixPinv,
-    pinv,
-    AllocDiag,
-    alloc_diag,
-    ExtractDiag,
-    extract_diag,
-    diag,
-    trace,
-    Det,
-    det,
-    Eig,
-    eig,
-    Eigh,
-    EighGrad,
-    eigh,
-    matrix_dot,
-    _zero_disconnected,
-    qr,
-    svd,
-    lstsq,
-    matrix_power,
-    norm
-    )
-from theano.tensor.slinalg import ( Cholesky,
-    cholesky,
-    CholeskyGrad,
-    Solve,
-    solve,
-    Eigvalsh,
-    EigvalshGrad,
-    eigvalsh
-    )
-try:
-    import scipy.linalg
-    imported_scipy = True
-except ImportError:
-    # some ops (e.g. Cholesky, Solve, A_Xinv_b) won't work
-    imported_scipy = False
+from theano.tensor.nlinalg import (MatrixInverse,
+                                   matrix_inverse,
+                                   extract_diag,
+                                   trace,
+                                   det)
+from theano.tensor.slinalg import (Cholesky,
+                                   cholesky,
+                                   Solve,
+                                   solve,
+                                   imported_scipy)
+
+logger = logging.getLogger(__name__)

 class Hint(Op):
@@ -212,8 +180,6 @@ class HintsFeature(object):
 class HintsOptimizer(Optimizer):
     """
     Optimizer that serves to add HintsFeature as an fgraph feature.
     """

     def __init__(self):
@@ -310,8 +276,8 @@ def tag_solve_triangular(node):
                 return [Solve('lower_triangular')(A, b)]
             else:
                 return [Solve('upper_triangular')(A, b)]
-    if (A.owner and isinstance(A.owner.op, DimShuffle)
-            and A.owner.op.new_order == (1, 0)):
+    if (A.owner and isinstance(A.owner.op, DimShuffle) and
+            A.owner.op.new_order == (1, 0)):
         A_T, = A.owner.inputs
         if A_T.owner and isinstance(A_T.owner.op, type(cholesky)):
             if A_T.owner.op.lower:
@@ -423,6 +389,5 @@ def spectral_radius_bound(X, log2_exponent):
     XX = X
     for i in xrange(log2_exponent):
         XX = tensor.dot(XX, XX)
-    return tensor.pow(
-        trace(XX),
-        2 ** (-log2_exponent))
+    return tensor.pow(trace(XX),
+                      2 ** (-log2_exponent))
@@ -163,4 +163,4 @@ def test_matrix_inverse_solve():
     b = theano.tensor.dmatrix('b')
     node = matrix_inverse(A).dot(b).owner
     [out] = inv_as_solve.transform(node)
     assert isinstance(out.owner.op, Solve)
@@ -29,8 +29,7 @@ from theano.gpuarray.basic_ops import GpuKernelBase, Kernel, infer_context_name,
 from theano.gpuarray.type import GpuArrayType
 from theano.gpuarray.fp16_help import write_w
 from theano.gpuarray.opt import (register_opt as register_gpua,
-                                 register_opt2,
-                                 host_from_gpu as host_from_gpua)
+                                 register_opt2)

 if theano.sandbox.cuda.cuda_available:
     from theano.sandbox.cuda import (CudaNdarrayType,
                                      float32_shared_constructor)
@@ -1621,7 +1620,7 @@ def local_gpua_mrg_graph(op, context_name, inputs, outputs):
                  op.output_type.ndim,
                  op.output_type.dtype,
                  inputs[1])
-        return [outs[0], host_from_gpua(outs[1])]
+        return [outs[0], outs[1].transfer('cpu')]

 @register_gpua('fast_compile')
...
@@ -70,7 +70,7 @@ from theano.gof.opt import pre_constant_merge, pre_greedy_local_optimizer
 from theano.scan_module import scan_op
 from theano.scan_module import scan_utils
-from theano.scan_module.scan_utils import equal_computations, find_up, scan_args
+from theano.scan_module.scan_utils import equal_computations, scan_args

 __docformat__ = 'restructedtext en'
 __authors__ = ("Razvan Pascanu "
@@ -1605,7 +1605,7 @@ class ScanSaveMem(gof.Optimizer):
                     nw_pos = compress_map[idx]
                     old_new += [(o, new_outs[nw_pos])]
         # Check if the new outputs depend on the old scan node
-        old_scan_is_used = [scan_utils.find_up(new.owner, node)
+        old_scan_is_used = [gof.graph.is_in_ancestors(new.owner, node)
                             for old, new in old_new]
         if any(old_scan_is_used):
             return False
@@ -1829,19 +1829,21 @@ class ScanMerge(gof.Optimizer):
         except tensor.NotScalarConstantError:
             pass

+        if nsteps != rep_nsteps:
+            return False
+
         # Check to see if it is an input of a different node
         for nd in set_nodes:
-            if find_up(node, nd) or find_up(nd, node):
+            if gof.graph.is_in_ancestors(node, nd) or gof.graph.is_in_ancestors(nd, node):
                 return False

         if not node.op.as_while:
-            return nsteps == rep_nsteps
+            return True
         cond = node.op.outputs[-1]
         rep_cond = rep.op.outputs[-1]
-        same_cond = scan_utils.equal_computations([cond], [rep_cond],
-                                                  node.op.inputs,
-                                                  rep.op.inputs)
-        return same_cond and (nsteps == rep_nsteps)
+        return scan_utils.equal_computations([cond], [rep_cond],
+                                             node.op.inputs,
+                                             rep.op.inputs)

     def apply(self, fgraph):
         # Collect all scan nodes ordered according to toposort
...
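The `ScanMerge` hunk above moves the `nsteps != rep_nsteps` test to an early return, so the (comparatively expensive) condition-equality check only runs for while-scans with matching step counts. A schematic of that control flow, with toy stand-ins invented for illustration rather than Theano's actual classes:

```python
# Schematic of the restructured merge-eligibility check (toy stand-ins,
# not Theano's ScanMerge). conds_equal plays the role of the
# equal_computations test on the two while-scan conditions.
def can_merge(nsteps, rep_nsteps, as_while, conds_equal):
    if nsteps != rep_nsteps:      # early exit, as in the new code
        return False
    if not as_while:              # plain scans need nothing more
        return True
    return conds_equal()          # only evaluated for while-scans

calls = []
def conds_equal():
    # Record that the expensive check actually ran.
    calls.append(1)
    return True
```

With mismatched step counts, or for non-while scans, `conds_equal` is never invoked; only a while-scan with equal step counts reaches it.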
@@ -152,7 +152,7 @@ def traverse(out, x, x_copy, d, visited=None):
         return d
     visited.add(out)
     from theano.sandbox import cuda
-    from theano.gpuarray.basic_ops import gpu_from_host, host_from_gpu
+    from theano.gpuarray.basic_ops import GpuFromHost, host_from_gpu
     from theano.gpuarray import pygpu_activated
     from theano.gpuarray.type import GpuArrayType
     if out == x:
@@ -160,7 +160,7 @@ def traverse(out, x, x_copy, d, visited=None):
             d[out] = cuda.gpu_from_host(x_copy)
         else:
             assert isinstance(x.type, GpuArrayType)
-            d[out] = gpu_from_host(x.type.context_name)(x_copy)
+            d[out] = GpuFromHost(x.type.context_name)(x_copy)
         return d
     elif out.owner is None:
         return d
@@ -876,10 +876,13 @@ class Validator(object):
         if out.owner is None:
             if isinstance(out, tensor.TensorConstant):
-                if hasattr(out, 'fgraph'):
+                if hasattr(out, 'fgraph') or getattr(out, 'cached', False):
                     # If out have an fgraph, we aren't sure if it
                     # is from the inner graph or outer graph, so
                     # clone it.
+                    # As it will be used as is in an FunctionGraph
+                    # (won't be cloned later), it can't be a
+                    # cached variable
                     cloned_out = out.clone()
                     self.valid.add(cloned_out)
                     self.invalid.add(out)
@@ -1113,20 +1116,6 @@ def compress_outs(op, not_required, inputs):
     return (op_inputs, op_outputs, info, node_inputs, map_old_new)

-def find_up(l_node, f_node):
-    r"""
-    Goes up in the graph and returns True if a node in nodes is found.
-    """
-    if isinstance(l_node, gof.Apply):
-        l_outs = l_node.outputs
-    else:
-        l_outs = l_node
-    l_ins = gof.graph.inputs(l_outs)
-    nodes = gof.graph.io_toposort(l_ins, l_outs)
-    return f_node in nodes

 def reconstruct_graph(inputs, outputs, tag=None):
     """
     Different interface to clone, that allows you to pass inputs.
...
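The removed `find_up(l_node, f_node)` answered "does `f_node` appear in the graph above `l_node`?", which the callers now get directly from `gof.graph.is_in_ancestors`. A self-contained toy version of that ancestry check (the `Node` class and the traversal below are invented for illustration; this is not Theano's `gof.graph` implementation):

```python
# Toy sketch of the ancestry check that replaced find_up.
# Node and the traversal are stand-ins, not Theano's classes.
class Node:
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)   # the nodes this one depends on

def is_in_ancestors(l_node, f_node):
    # Walk upward from l_node through .inputs; True if f_node is reached.
    stack, seen = [l_node], set()
    while stack:
        n = stack.pop()
        if n is f_node:
            return True
        if id(n) in seen:
            continue
        seen.add(id(n))
        stack.extend(n.inputs)
    return False

# A tiny chain a -> b -> c (c depends on b, which depends on a).
a = Node('a')
b = Node('b', [a])
c = Node('c', [b])
```

As with the removed helper (which checked membership in a toposort of the subgraph), a node counts as its own ancestor here.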
@@ -332,7 +332,7 @@ def make_gpu_optimizer(op, to_gpu):
                 new_inp[idx] = cuda.gpu_from_host(new_inp[idx])
             result_node = op()(*new_inp)
             copy_stack_trace(node.outputs[0], result_node)
-            transfer_node = cuda.host_from_gpu(result_node)
+            transfer_node = result_node.transfer('cpu')
             copy_stack_trace(node.outputs[0], transfer_node)
             return [transfer_node]
         if node.op == cuda.gpu_from_host:
...
@@ -8,7 +8,7 @@ __docformat__ = 'restructedtext en'
 from collections import OrderedDict

-import numpy
+import numpy as np

 import theano
 import theano.tensor as T
@@ -17,12 +17,12 @@ import theano.tensor as T
 def gen_data():
     # generate the dataset
-    train_set = (numpy.asarray(numpy.random.rand(10000, 784), dtype='float32'),
-                 numpy.asarray(numpy.random.rand(10000)*10, dtype='int64'))
-    valid_set = (numpy.asarray(numpy.random.rand(10000, 784), dtype='float32'),
-                 numpy.asarray(numpy.random.rand(10000)*10, dtype='int64'))
-    test_set = (numpy.asarray(numpy.random.rand(10000, 784), dtype='float32'),
-                numpy.asarray(numpy.random.rand(10000)*10, dtype='int64'))
+    train_set = (np.asarray(np.random.rand(10000, 784), dtype='float32'),
+                 np.asarray(np.random.rand(10000)*10, dtype='int64'))
+    valid_set = (np.asarray(np.random.rand(10000, 784), dtype='float32'),
+                 np.asarray(np.random.rand(10000)*10, dtype='int64'))
+    test_set = (np.asarray(np.random.rand(10000, 784), dtype='float32'),
+                np.asarray(np.random.rand(10000)*10, dtype='int64'))

     def shared_dataset(data_xy):
         """ Function that loads the dataset into shared variables
@@ -33,8 +33,8 @@ def gen_data():
         variable) would lead to a large decrease in performance.
         """
         data_x, data_y = data_xy
-        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
-        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
+        shared_x = theano.shared(np.asarray(data_x, dtype=theano.config.floatX))
+        shared_y = theano.shared(np.asarray(data_y, dtype=theano.config.floatX))
         # When storing data on the GPU it has to be stored as floats
         # therefore we will store the labels as ``floatX`` as well
         # (``shared_y`` does exactly that). But during our computations
@@ -79,7 +79,7 @@ class LogisticRegression(object):
         """
         # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
-        self.W = theano.shared(value=numpy.zeros((n_in, n_out), dtype=theano.config.floatX),
+        self.W = theano.shared(value=np.zeros((n_in, n_out), dtype=theano.config.floatX),
                                name=name_prefix+'W')

         # compute vector of class-membership probabilities in symbolic form
@@ -129,7 +129,7 @@ class HiddenLayer(object):
         Hidden unit activation is given by: tanh(dot(input,W) + b)

-        :type rng: numpy.random.RandomState
+        :type rng: np.random.RandomState
         :param rng: a random number generator used to initialize weights

         :type input: theano.tensor.dmatrix
@@ -151,9 +151,9 @@ class HiddenLayer(object):
         # from -6./sqrt(n_in+n_hidden) and 6./sqrt(n_in+n_hidden)
         # the output of uniform if converted using asarray to dtype
         # theano.config.floatX so that the code is runable on GPU
-        W_values = numpy.asarray( rng.uniform( \
-              low=-numpy.sqrt(6./(n_in+n_out)), \
-              high=numpy.sqrt(6./(n_in+n_out)), \
+        W_values = np.asarray( rng.uniform( \
+              low=-np.sqrt(6./(n_in+n_out)), \
+              high=np.sqrt(6./(n_in+n_out)), \
               size=(n_in, n_out)), dtype=theano.config.floatX)
         self.W = theano.shared(value=W_values, name=name_prefix+'W')
@@ -176,7 +176,7 @@ class MLP(object):
     def __init__(self, rng, input, n_in, n_hidden, n_out):
         """Initialize the parameters for the multilayer perceptron

-        :type rng: numpy.random.RandomState
+        :type rng: np.random.RandomState
         :param rng: a random number generator used to initialize weights

         :type input: theano.tensor.TensorType
@@ -265,7 +265,7 @@ def test_mlp():
     y = T.ivector('y')  # the labels are presented as 1D vector of
                         # [int] labels

-    rng = numpy.random.RandomState(1234)
+    rng = np.random.RandomState(1234)

     # construct the MLP class
     classifier = MLP( rng=rng, input=x, n_in=28*28, n_hidden=500, n_out=10)
...
This source diff could not be displayed because it is too large.
Diff collapsed.
Diff collapsed.