Commit 0a7a4c06 authored by Chinnadhurai Sankar

Merge branch 'master' of git://github.com/Theano/Theano

@@ -37,3 +37,4 @@ Theano.suo
 .ipynb_checkpoints
 .pydevproject
 .ropeproject
+core
\ No newline at end of file
@@ -10,15 +10,14 @@ Related Projects:
 https://github.com/Theano/Theano/wiki/Related-projects
-We recommend you look at the documentation on the website, since it
-will be more current than the documentation included with the package.
-If you really wish to build the documentation yourself, you will need
-sphinx. Issue the following command:
+It is recommended that you look at the documentation on the website, as it
+will be more current than the documentation included with the package.
+In order to build the documentation yourself, you will need sphinx. Issue
+the following command:
 python ./doc/scripts/docgen.py
 Documentation is built into html/
-The PDF of the documentation is html/theano.pdf
+The PDF of the documentation can be found at html/theano.pdf
 DIRECTORY LAYOUT
@@ -31,7 +30,7 @@ Theano (current directory) is the distribution directory.
 * tensor depends upon scalar
 * sparse depends upon tensor
 * sandbox can depend on everything else
-* Theano/examples are copies of the example on the wiki
+* Theano/examples are copies of the example found on the wiki
 * Theano/benchmark and Theano/examples are in the distribution, but not in
   the Python package
 * Theano/bin contains executable scripts that are copied to the bin folder
@@ -39,4 +38,4 @@ Theano (current directory) is the distribution directory.
 * Tests are distributed and are part of the package, i.e. fall in
   the appropriate submodules
 * Theano/doc contains files and scripts used to generate the documentation
-* Theano/html is the place where the documentation will be generated
+* Theano/html is where the documentation will be generated
@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
 Testing GPU Ops
 ^^^^^^^^^^^^^^^
-Ops to be executed on the GPU should inherit from the
-``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
+When using the old GPU backend, Ops to be executed on the GPU should inherit
+from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
 Theano to distinguish them. Currently, we use this to test if the
 NVIDIA driver works correctly with our sum reduction code on the GPU.
......
@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add
 If you want GPU-related tests to run on a specific GPU device, and not
 the default one, you should use :attr:`~config.init_gpu_device`.
-For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=gpu1``.
+For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=cuda1``.
 See :ref:`libdoc_config` for more information on how to change these
 configuration options.
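The flag string changed above follows the ``THEANO_FLAGS`` comma-separated ``key=value`` syntax. A minimal sketch of that decomposition (plain Python, not Theano's actual configuration parser):

```python
def parse_flags(flags):
    """Split a THEANO_FLAGS-style string into a dict (simplified sketch)."""
    result = {}
    for item in flags.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result

flags = parse_flags("device=cpu,init_gpu_device=cuda1")
print(flags)  # {'device': 'cpu', 'init_gpu_device': 'cuda1'}
```

Real Theano flags also support nested sections (for example ``dnn.conv.algo_bwd_data``); this sketch only illustrates the top-level splitting.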
@@ -508,25 +508,25 @@ Any one of them is enough.
 :ref:`Ubuntu instructions <install_ubuntu_gpu>`.
+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
 Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
 computer, and set the default floating point computations to float32.
-For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'``.
+For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
 You can also set these options in the .theanorc file's ``[global]`` section:
 .. code-block:: cfg
     [global]
-    device = gpu
+    device = cuda
     floatX = float32
 Note that:
-* If your computer has multiple GPUs and you use 'device=gpu', the driver
-  selects the one to use (usually gpu0).
-* You can use the program nvida-smi to change this policy.
-* You can choose one specific GPU by specifying 'device=gpuX', with X the
+* If your computer has multiple GPUs and you use 'device=cuda', the driver
+  selects the one to use (usually cuda0).
+* You can use the program ``nvidia-smi`` to change this policy.
+* You can choose one specific GPU by specifying 'device=cudaX', with X the
   corresponding GPU index (0, 1, 2, ...)
 * By default, when ``device`` indicates preference for GPU computations,
   Theano will fall back to the CPU if there is a problem with the GPU.
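The ``[global]`` section above is plain INI syntax, so its effect can be previewed with the standard library. A sketch (hypothetical file contents, not Theano's own config loader):

```python
import configparser

# A .theanorc-style file as discussed above (illustrative contents only).
THEANORC = """\
[global]
device = cuda
floatX = float32
"""

config = configparser.ConfigParser()
config.read_string(THEANORC)

# Theano reads these keys to pick the compute device and the default dtype.
device = config.get("global", "device")
floatX = config.get("global", "floatX")
print(device, floatX)  # cuda float32
```

Theano's real loader also merges ``THEANO_FLAGS`` on top of the file, with the environment taking precedence.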
@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats:
 toggle your GPU on, which can be done with
 `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
 Once your setup is complete, head to :ref:`using_gpu` to find how to verify
 everything is working properly.
......
@@ -43,7 +43,7 @@ For Ubuntu 11.10 through 14.04:
 sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git
 sudo pip install Theano
 On 14.04, this will install Python 2 by default. If you want to use Python 3:
 .. code-block:: bash
@@ -104,30 +104,30 @@ For Ubuntu 11.04:
 The development version of Theano supports Python 3.3 and
 probably supports Python 3.2, but we do not test on it.
 Bleeding Edge Installs
 ----------------------
 If you would like, instead, to install the bleeding edge Theano (from github)
 such that you can edit and contribute to Theano, replace the `pip install Theano`
 command with:
 .. code-block:: bash
     git clone git://github.com/Theano/Theano.git
     cd Theano
     python setup.py develop --user
     cd ..
 VirtualEnv
 ----------
 If you would like to install Theano in a VirtualEnv, you will want to pass the
 `--system-site-packages` flag when creating the VirtualEnv so that it will pick up
 the system-provided `Numpy` and `SciPy`.
 .. code-block:: bash
     virtualenv --system-site-packages -p python2.7 theano-env
     source theano-env/bin/activate
     pip install Theano
@@ -208,7 +208,7 @@ Updating Bleeding Edge Installs
 Change to the Theano directory and run:
 .. code-block:: bash
     git pull
@@ -303,7 +303,7 @@ Test GPU configuration
 .. code-block:: bash
-    THEANO_FLAGS=floatX=float32,device=gpu python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
+    THEANO_FLAGS=floatX=float32,device=cuda python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
 .. note::
......
@@ -423,16 +423,16 @@ Create a test file containing:
     print("NP time: %f[s], theano time: %f[s] (times should be close when run on CPU!)" %(
         np_end-np_start, t_end-t_start))
     print("Result difference: %f" % (np.abs(AB-tAB).max(), ))
 .. testoutput::
    :hide:
    :options: +ELLIPSIS
    NP time: ...[s], theano time: ...[s] (times should be close when run on CPU!)
    Result difference: ...
 .. code-block:: none
     NP time: 1.480863[s], theano time: 1.475381[s] (times should be close when run on CPU!)
     Result difference: 0.000000
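The check above times two implementations of the same product and compares their results. The same pattern, sketched without NumPy or Theano on a small dot product (both functions here are pure Python; they stand in for the "NumPy" and "Theano" sides):

```python
import time

def dot_loop(a, b):
    # Straightforward generator-expression implementation.
    return sum(x * y for x, y in zip(a, b))

def dot_alt(a, b):
    # Equivalent formulation; stands in for the "other backend" here.
    return sum(map(lambda p: p[0] * p[1], zip(a, b)))

a = [float(i) for i in range(1000)]
b = [float(i % 7) for i in range(1000)]

t0 = time.time()
r1 = dot_loop(a, b)
t1 = time.time()
r2 = dot_alt(a, b)
t2 = time.time()

print("loop time: %f[s], alt time: %f[s]" % (t1 - t0, t2 - t1))
print("Result difference: %f" % abs(r1 - r2))
```

Because both versions sum the same products in the same order, the reported difference is exactly zero, mirroring the "Result difference: 0.000000" expected on CPU.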
@@ -445,6 +445,8 @@ routine for matrix multiplication)
 Configure Theano for GPU use
 ############################
+Install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_ if you have not already done so.
 Theano can be configured with a ``.theanorc`` text file (or
 ``.theanorc.txt``, whichever is easier for you to create under
 Windows). It should be placed in the directory pointed to by the
@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file:
 .. code-block:: cfg
     [global]
-    device = gpu
+    device = cuda
     floatX = float32
     [nvcc]
@@ -498,7 +500,7 @@ within an MSYS shell if you installed Nose manually as described above.
 Compiling a faster BLAS
 ~~~~~~~~~~~~~~~~~~~~~~~
 If you installed Python through WinPython or EPD, Theano will automatically
 link with the MKL library, so you should not need to compile your own BLAS.
 .. note::
......
Diff collapsed.
@@ -1414,7 +1414,7 @@ Mathematical
 .. function:: abs_(a)
-   Returns a variable representingthe absolute of a, ie ``|a|``.
+   Returns a variable representing the absolute value of a, i.e. ``|a|``.
 .. note:: Can also be accessed with ``abs(a)``.
......
@@ -32,6 +32,7 @@ Optimization FAST_RUN FAST_COMPILE
 ========================================================= ========= ============ =============
 :term:`merge`                                             x         x
 :term:`constant folding<constant folding>`                x         x
+:term:`GPU transfer`                                      x         x
 :term:`shape promotion<shape promotion>`                  x
 :term:`fill cut<fill cut>`                                x
 :term:`inc_subtensor srlz.<inc_subtensor serialization>`  x
@@ -52,7 +53,6 @@ Optimization FAST_RUN FAST_COMPILE
 :term:`inplace_elemwise`                                  x
 :term:`inplace_random`                                    x
 :term:`elemwise fusion`                                   x
-:term:`GPU transfer`                                      x
 :term:`local_log_softmax`                                 x         x
 :term:`local_remove_all_assert`
 ========================================================= ========= ============ =============
......
@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
 hints that give more flexibility to the compilation and optimization of the
 graph.
-For GPU graphs, this borrowing can have a major speed impact. See the following code:
-
-.. code-block:: python
-
-    from theano import function, config, shared, sandbox, tensor, Out
-    import numpy
-    import time
-
-    vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
-    iters = 1000
-
-    rng = numpy.random.RandomState(22)
-    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-    f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
-    f2 = function([],
-                  Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
-                      borrow=True))
-
-    t0 = time.time()
-    for i in range(iters):
-        r = f1()
-    t1 = time.time()
-    no_borrow = t1 - t0
-
-    t0 = time.time()
-    for i in range(iters):
-        r = f2()
-    t1 = time.time()
-
-    print(
-        "Looping %s times took %s seconds without borrow "
-        "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
-    )
-    if numpy.any([isinstance(x.op, tensor.Elemwise) and
-                  ('Gpu' not in type(x.op).__name__)
-                  for x in f1.maker.fgraph.toposort()]):
-        print('Used the cpu')
-    else:
-        print('Used the gpu')
-
-Which produces this output:
-
-.. code-block:: none
-
-    $ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
-    Using gpu device 0: GeForce GTX 275
-    Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
-    Used the gpu
-
 *Take home message:*
 When an input *x* to a function is not needed after the function
@@ -317,4 +271,3 @@ requirement. When a return value *y* is large (in terms of memory
 footprint), and you only need to read from it once, right away when
 it's returned, then consider marking it with an ``Out(y,
 borrow=True)``.
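The borrow semantics summarized above trade a defensive copy for aliased storage. A toy sketch with plain lists standing in for device buffers (not Theano's actual ``Out`` class; names are illustrative):

```python
import copy

class Result:
    """Toy container: with borrow=True the caller gets the internal buffer itself."""
    def __init__(self, buffer, borrow=False):
        # Without borrow, the caller owns a private deep copy.
        # With borrow, the caller aliases the producer's storage: no copy cost,
        # but the value may change when the producer reuses its buffer.
        self.value = buffer if borrow else copy.deepcopy(buffer)

internal = [1.0, 2.0, 3.0]

safe = Result(internal, borrow=False)  # private copy: later writes don't leak
fast = Result(internal, borrow=True)   # alias: no copy, but shared storage

internal[0] = 99.0  # the producer reuses its buffer on the next call

print(safe.value[0], fast.value[0])  # 1.0 99.0
```

This is why the take-home message restricts ``borrow=True`` to values you read once, right away: a later call can silently overwrite the aliased storage.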
@@ -168,8 +168,8 @@ Linkers
 =======
 A mode is composed of 2 things: an optimizer and a linker. Some modes,
-like ``NanGuardMode`` and ``DebugMode``, add logic around the optimizer and
-linker. ``NanGuardMode`` and ``DebugMode`` use their own linker.
+like ``NanGuardMode`` and ``DebugMode``, add logic around the
+optimizer and linker. ``DebugMode`` uses its own linker.
 You can select which linker to use with the Theano flag :attr:`config.linker`.
 Here is a table to compare the different linkers.
@@ -183,7 +183,7 @@ c|py [#cpy1]_ yes yes "+++" Try C code. If none exis
 c|py_nogc     no   yes  "++"      As c|py, but without gc
 c             no   yes  "+"       Use only C code (if none available for an op, raise an error)
 py            yes  yes  "+++"     Use only Python code
-NanGuardMode  no   no   "++++"    Check if nodes generate NaN
+NanGuardMode  yes  yes  "++++"    Check if nodes generate NaN
 DebugMode     no   yes  VERY HIGH Make many checks on what Theano computes
 ============= ========= ================= ========= ===
......
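A rough sketch of the per-value test that the ``NanGuardMode`` row above refers to (pure Python with ``math``; the function name and the ``1e10`` threshold are illustrative, not Theano's implementation):

```python
import math

def check_value(x, nan_is_error=True, inf_is_error=True, big_is_error=True):
    """Return a list of problems detected in a scalar value (sketch)."""
    problems = []
    if nan_is_error and math.isnan(x):
        problems.append("NaN detected")
    if inf_is_error and math.isinf(x):
        problems.append("Inf detected")
    # Infinity also exceeds the magnitude threshold, so it trips both checks.
    if big_is_error and not math.isnan(x) and abs(x) > 1e10:
        problems.append("Big value detected")
    return problems

nan_problems = check_value(float("nan"))
inf_problems = check_value(float("inf"))
ok_problems = check_value(1.0)
print(nan_problems, inf_problems, ok_problems)
```

In the real mode, a non-empty problem list is turned into an error or a warning according to ``config.NanGuardMode.action``.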
Diff collapsed.
@@ -81,7 +81,7 @@ single name and a single device.
 It is often the case that multi-gpu operation requires or assumes
 that all the GPUs involved are equivalent. This is not the case
 for this implementation. Since the user has the task of
-distrubuting the jobs across the different device a model can be
+distributing the jobs across the different device a model can be
 built on the assumption that one of the GPU is slower or has
 smaller memory.
@@ -140,5 +140,5 @@ is a example.
 cv = gv.transfer('cpu')
 Of course you can mix transfers and operations in any order you
 choose. However you should try to minimize transfer operations
-because they will introduce overhead any may reduce performance.
+because they will introduce overhead that may reduce performance.
@@ -73,7 +73,7 @@ def contains_nan(arr, node=None):
     elif arr.size == 0:
         return False
     elif cuda.cuda_available and isinstance(arr, cuda.CudaNdarray):
-        if (hasattr(theano.sandbox, 'rng_mrg') and
+        if (node and hasattr(theano.sandbox, 'rng_mrg') and
             isinstance(
                 node.op,
                 # It store ints in float container
@@ -119,7 +119,7 @@ def contains_inf(arr, node=None):
     elif arr.size == 0:
         return False
     elif cuda.cuda_available and isinstance(arr, cuda.CudaNdarray):
-        if (hasattr(theano.sandbox, 'rng_mrg') and
+        if (node and hasattr(theano.sandbox, 'rng_mrg') and
             isinstance(
                 node.op,
                 # It store ints in float container
@@ -215,7 +215,7 @@ class NanGuardMode(Mode):
         assert nan_is_error or inf_is_error or big_is_error
         compile_gpu_func(nan_is_error, inf_is_error, big_is_error)
-        def do_check_on(var, nd, f, is_input):
+        def do_check_on(var, nd):
             """
             Checks `var` for NaNs / Infs. If detected, raises an exception
             and / or prints information about `nd`, `f`, and `is_input` to
@@ -227,11 +227,6 @@ class NanGuardMode(Mode):
                 The value to be checked.
             nd : theano.gof.Apply
                 The Apply node being executed.
-            f : callable
-                The thunk for the apply node.
-            is_input : bool
-                If True, `var` is an input to `nd`.
-                If False, it is an output.
             """
             error = False
@@ -262,17 +257,13 @@ class NanGuardMode(Mode):
                 print('Big value detected', file=sio)
                 error = True
             if error:
-                if not is_input:
-                    print("NanGuardMode found an error in the"
-                          " output of a node in this variable:", file=sio)
+                if nd:
+                    print("NanGuardMode found an error in the "
+                          "output of a node in this variable:", file=sio)
                     print(theano.printing.debugprint(nd, file='str'), file=sio)
                 else:
-                    print("NanGuardMode found an error in an"
-                          " input of this node.", file=sio)
-                    print('Node:', file=sio)
-                    print(nd, file=sio)
-                    print("The input variable that cause problem:", file=sio)
-                    print(theano.printing.debugprint(nd, file='str'), file=sio)
+                    print("NanGuardMode found an error in an input of the "
+                          "graph.", file=sio)
                 msg = sio.getvalue()
                 if config.NanGuardMode.action == 'raise':
                     raise AssertionError(msg)
@@ -283,36 +274,16 @@ class NanGuardMode(Mode):
                 elif config.NanGuardMode.action == 'warn':
                     logger.error(msg)
-        def nan_check(i, node, fn):
-            """
-            Runs `fn` while checking its inputs and outputs for NaNs / Infs.
-
-            Parameters
-            ----------
-            i :
-                Currently ignored.
-                TODO: determine why it is here or remove).
-            node : theano.gof.Apply
-                The Apply node currently being executed.
-            fn : callable
-                The thunk to execute for this Apply node.
-            """
-            inputs = fn.inputs
-            for x, var in zip(inputs, node.inputs):
-                # If the input is the result of computation, then we
-                # don't need to check it. It is already done after the
-                # computation.
-                if (var.owner is None and
-                        getattr(var.tag, 'nan_guard_mode_check', True)):
-                    do_check_on(x[0], node, fn, True)
-            fn()
-            outputs = fn.outputs
-            for x, var in zip(outputs, node.outputs):
+        def nan_check(node, thunk, storage_map, compute_map):
+            for var in node.outputs:
                 if getattr(var.tag, 'nan_guard_mode_check', True):
-                    do_check_on(x[0], node, fn, False)
+                    do_check_on(storage_map[var][0], node)
+
+        def nan_check_input(var, value):
+            if getattr(var.tag, 'nan_guard_mode_check', True):
+                do_check_on(value, None)
-        wrap_linker = theano.gof.WrapLinker([theano.gof.OpWiseCLinker()],
-                                            nan_check)
+        wrap_linker = theano.gof.vm.VM_Linker(callback=nan_check,
+                                              callback_input=nan_check_input)
         super(NanGuardMode, self).__init__(wrap_linker,
                                            optimizer=self.provided_optimizer)
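The rewrite above moves checking into linker callbacks fed by a ``storage_map``. A toy model of that callback protocol (plain dicts and functions, no Theano; ``run_graph`` and the node format are invented for illustration):

```python
# Each "node" computes outputs from inputs; storage_map holds one-element
# cells so results can be shared by reference, as in Theano's VMs.
def run_graph(nodes, storage_map, callback=None, callback_input=None):
    # Graph inputs are variables produced by no node: check them once, up front.
    if callback_input:
        produced = {v for n in nodes for v in n["outputs"]}
        for var, cell in storage_map.items():
            if var not in produced and cell[0] is not None:
                callback_input(var, cell[0])
    for node in nodes:
        args = [storage_map[v][0] for v in node["inputs"]]
        for var, val in zip(node["outputs"], node["fn"](*args)):
            storage_map[var][0] = val
        # After each thunk, hand the node and storage to the callback.
        if callback:
            callback(node, None, storage_map, None)

checked = []
storage = {"x": [2.0], "y": [None]}
nodes = [{"inputs": ["x"], "outputs": ["y"], "fn": lambda x: (x * x,)}]
run_graph(
    nodes, storage,
    callback=lambda node, thunk, sm, cm: checked.extend(node["outputs"]),
    callback_input=lambda var, value: checked.append(var),
)
print(storage["y"][0], checked)  # 4.0 ['x', 'y']
```

This mirrors why the new ``nan_check`` only inspects outputs: inputs with no owner are covered once by ``callback_input`` instead of being re-checked at every node that consumes them.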
@@ -84,10 +84,15 @@ def _atexit_print_fn():
                 cum_attr[key] = val
             if cum.optimizer_profile and ps.optimizer_profile:
-                merge = cum.optimizer_profile[0].merge_profile(
-                    cum.optimizer_profile[1],
-                    ps.optimizer_profile[1])
-                cum.optimizer_profile = (cum.optimizer_profile[0], merge)
+                try:
+                    merge = cum.optimizer_profile[0].merge_profile(
+                        cum.optimizer_profile[1],
+                        ps.optimizer_profile[1])
+                    cum.optimizer_profile = (cum.optimizer_profile[0], merge)
+                except Exception as e:
+                    print("Got an exception while merging profile")
+                    print(e)
+                    cum.optimizer_profile = None
             else:
                 cum.optimizer_profile = None
......
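The guard added above wraps the merge in try/except so a bad profile cannot crash the atexit printer. The same defensive pattern, sketched generically on dict-based profiles (a hypothetical helper, not Theano's ``merge_profile``):

```python
def merge_profiles(cum, new):
    """Merge two profile dicts; on any failure, drop the cumulative profile
    rather than crash the caller (mirrors the guard above; sketch only)."""
    try:
        merged = dict(cum)
        for key, val in new.items():
            merged[key] = merged.get(key, 0) + val
        return merged
    except Exception as e:
        print("Got an exception while merging profile")
        print(e)
        return None

good = merge_profiles({"opt_a": 1.5}, {"opt_a": 0.5, "opt_b": 2.0})
bad = merge_profiles({"opt_a": 1.5}, None)  # merging fails -> None
print(good, bad)
```

Swallowing the exception is deliberate here: profile reporting is best-effort diagnostics, and losing one merged profile is preferable to aborting program shutdown.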
@@ -104,10 +104,9 @@ class DeviceParam(ConfigParam):
 AddConfigVar(
     'device',
-    ("Default device for computations. If gpu*, change the default to try "
-     "to move computation to it and to put shared variable of float32 "
-     "on it. Do not use upper case letters, only lower case even if "
-     "NVIDIA use capital letters."),
+    ("Default device for computations. If cuda* or opencl*, change the "
+     "default to try to move computation to the GPU. Do not use upper case "
+     "letters, only lower case even if NVIDIA uses capital letters."),
     DeviceParam('cpu', allow_override=False),
     in_c_key=False)
@@ -273,7 +272,8 @@ def safe_no_dnn_workmem_bwd(workmem):
     return True
 AddConfigVar('dnn.conv.workmem_bwd',
-             "This flag is deprecated; use dnn.conv.algo_bwd.",
+             "This flag is deprecated; use `dnn.conv.algo_bwd_filter` "
+             "and `dnn.conv.algo_bwd_data` instead.",
              ConfigParam('', allow_override=False,
                          filter=safe_no_dnn_workmem_bwd),
              in_c_key=False)
@@ -651,8 +651,8 @@ AddConfigVar('warn.ignore_bug_before',
              "bugs found after that version. "
              "Warning for specific bugs can be configured with specific "
              "[warn] flags."),
-             EnumStr('0.7', 'None', 'all', '0.3', '0.4', '0.4.1', '0.5', '0.7',
-                     '0.8',
+             EnumStr('0.7', 'None', 'all', '0.3', '0.4', '0.4.1', '0.5', '0.6',
+                     '0.7', '0.8', '0.8.1', '0.8.2',
                      allow_override=False),
              in_c_key=False)
......
@@ -165,6 +165,9 @@ def raise_with_op(node, thunk=None, exc_info=None, storage_map=None):
     detailed_err_msg += ("Inputs shapes: %s" % shapes +
                          "\nInputs strides: %s" % strides +
                          "\nInputs values: %s" % scalar_values)
+    if theano.config.exception_verbosity == 'high':
+        detailed_err_msg += "\nInputs type_num: %s" % str(
+            [getattr(getattr(i[0], 'dtype', ''), 'num', '') for i in thunk.inputs])
     if hasattr(node.op, '__input_name__'):
         detailed_err_msg += "\nInputs name: %s\n" % str(node.op.__input_name__)
Diff collapsed.
@@ -244,16 +244,26 @@ class EquilibriumDB(DB):
         optimization application. This could result in less fgraph iterations,
         but this doesn't mean it will be faster globally.
+    tracks_on_change_inputs
+        If True, we will re-apply local opt on nodes whose inputs
+        changed during local optimization application. This could
+        result in less fgraph iterations, but this doesn't mean it
+        will be faster globally.
     Notes
     -----
     We can put LocalOptimizer and Optimizer as EquilibriumOptimizer
-    suppor both.
+    support both.
+
+    It is probably not a good idea to have ignore_newtrees=False and
+    tracks_on_change_inputs=True
     """
-    def __init__(self, ignore_newtrees=True):
+    def __init__(self, ignore_newtrees=True, tracks_on_change_inputs=False):
         super(EquilibriumDB, self).__init__()
         self.ignore_newtrees = ignore_newtrees
+        self.tracks_on_change_inputs = tracks_on_change_inputs
         self.__final__ = {}
         self.__cleanup__ = {}
@@ -281,6 +291,7 @@ class EquilibriumDB(DB):
             opts,
             max_use_ratio=config.optdb.max_use_ratio,
             ignore_newtrees=self.ignore_newtrees,
+            tracks_on_change_inputs=self.tracks_on_change_inputs,
             failure_callback=opt.NavigatorOptimizer.warn_inplace,
             final_optimizers=final_opts,
             cleanup_optimizers=cleanup_opts)
......
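``EquilibriumDB`` builds optimizers that re-apply local rewrites until nothing changes. The fixpoint idea can be sketched on a toy term rewriter (illustrative only; Theano's optimizer works on ``fgraph`` nodes, not tuples):

```python
def equilibrium(expr, rules, max_iters=100):
    """Apply rewrite rules repeatedly until a full pass changes nothing."""
    for _ in range(max_iters):
        changed = False
        for rule in rules:
            new_expr = rule(expr)
            if new_expr != expr:
                expr, changed = new_expr, True
        if not changed:
            return expr  # equilibrium reached
    raise RuntimeError("no equilibrium within max_iters")

# Toy rules over tuple expressions: ("add", x, 0) -> x and ("mul", x, 1) -> x.
def drop_add_zero(e):
    if isinstance(e, tuple) and e[0] == "add" and e[2] == 0:
        return drop_add_zero(e[1])
    return e

def drop_mul_one(e):
    if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1:
        return drop_mul_one(e[1])
    return e

result = equilibrium(("mul", ("add", "x", 0), 1), [drop_add_zero, drop_mul_one])
print(result)  # x
```

One rewrite can expose another (removing the ``mul`` reveals the ``add``), which is why the loop re-runs every rule until a whole pass is a no-op; ``tracks_on_change_inputs`` is about revisiting exactly such newly exposed nodes sooner.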
@@ -332,7 +332,7 @@ class Stack(VM):
     def __init__(self, nodes, thunks, pre_call_clear,
                  storage_map, compute_map, fgraph, allow_gc,
-                 dependencies=None, callback=None):
+                 dependencies=None, callback=None, callback_input=None):
         super(Stack, self).__init__(nodes, thunks, pre_call_clear)
         self.allow_gc = allow_gc
@@ -345,6 +345,7 @@ class Stack(VM):
         self.compute_map = compute_map
         self.node_idx = node_idx = {}
         self.callback = callback
+        self.callback_input = callback_input
         ords = fgraph.orderings()
@@ -411,6 +412,8 @@ class Stack(VM):
         for k in self.storage_map:
             compute_map[k][0] = (k.owner is None)
+            if self.callback_input and compute_map[k][0]:
+                self.callback_input(k, self.storage_map[k][0])
         # apply_stack contains nodes
         if output_subset is not None:
@@ -684,6 +687,11 @@ class VM_Linker(link.LocalLinker):
         A callable object to call after each call to a thunk within
         the virtual machine. It will be called with four arguments called
         'node', 'thunk', 'storage_map', and 'compute_map'.
+    callback_input
+        A callable object to call on each input to the graph
+        (variables with no owner). This includes constants and shared
+        variables values. It will be called with two arguments:
+        'var', 'value'.
     lazy
         Useful only when use_cloop is False. When lazy is None, use the
         theano flag vm.lazy value. Then if we have a None (default) we auto
@@ -700,8 +708,8 @@ class VM_Linker(link.LocalLinker):
     """
     def __init__(self, allow_gc=None, use_cloop=False, callback=None,
-                 lazy=None, schedule=None, c_thunks=None,
-                 allow_partial_eval=None):
+                 callback_input=None, lazy=None, schedule=None,
+                 c_thunks=None, allow_partial_eval=None):
         # Note: if more parameters are added to __init__, make sure to forward
         # them in the "type(self)(...)" call in the "accept" method below.
         if allow_gc is None:
@@ -710,6 +718,7 @@ class VM_Linker(link.LocalLinker):
         self.allow_gc = allow_gc
         self.use_cloop = use_cloop
         self.callback = callback
+        self.callback_input = callback_input
         self.lazy = lazy
         self.c_thunks = c_thunks
         self.allow_partial_eval = allow_partial_eval
@@ -760,9 +769,11 @@ class VM_Linker(link.LocalLinker):
                 allow_gc=self.allow_gc,
                 use_cloop=self.use_cloop,
                 callback=self.callback,
+                callback_input=self.callback_input,
                 lazy=self.lazy,
                 schedule=self.schedule,
                 c_thunks=self.c_thunks,
+                allow_partial_eval=self.allow_partial_eval
             ).accept(fgraph, no_recycling)
         self.fgraph = fgraph
         self.no_recycling = no_recycling
@@ -829,16 +840,17 @@ class VM_Linker(link.LocalLinker):
         pre_call_clear = [storage_map[v] for v in self.no_recycling]
-        if (self.callback is not None or
+        if (self.callback is not None or self.callback_input is not None or
                 (config.profile and config.profile_memory) or
-                getattr(self, 'allow_partial_eval', False)):
-            if self.use_cloop and self.callback is not None:
+                self.allow_partial_eval):
+            if self.use_cloop and (self.callback is not None or
+                                   self.callback_input is not None):
                 logger.warn('CVM does not support callback, using Stack VM.')
if self.use_cloop and config.profile_memory: if self.use_cloop and config.profile_memory:
warnings.warn( warnings.warn(
'CVM does not support memory profile, using Stack VM.') 'CVM does not support memory profile, using Stack VM.')
if self.use_cloop and getattr(self, 'allow_partial_eval', False): if self.use_cloop and self.allow_partial_eval:
warnings.warn( warnings.warn(
'CVM does not support partial evaluation yet, ' 'CVM does not support partial evaluation yet, '
'using Stack VM.') 'using Stack VM.')
...@@ -849,7 +861,8 @@ class VM_Linker(link.LocalLinker): ...@@ -849,7 +861,8 @@ class VM_Linker(link.LocalLinker):
storage_map, compute_map, storage_map, compute_map,
self.fgraph, self.allow_gc, self.fgraph, self.allow_gc,
dependencies=deps, dependencies=deps,
callback=self.callback) callback=self.callback,
callback_input=self.callback_input)
elif self.use_cloop: elif self.use_cloop:
# create a map from nodes to ints and vars to ints # create a map from nodes to ints and vars to ints
nodes_idx = {} nodes_idx = {}
...@@ -1046,7 +1059,7 @@ class VM_Linker(link.LocalLinker): ...@@ -1046,7 +1059,7 @@ class VM_Linker(link.LocalLinker):
if lazy is None: if lazy is None:
lazy = not all([(not th.lazy) for th in thunks]) lazy = not all([(not th.lazy) for th in thunks])
if not (lazy or (config.profile and config.profile_memory) or if not (lazy or (config.profile and config.profile_memory) or
self.use_cloop or self.callback): self.use_cloop or self.callback or self.callback_input):
for pair in itervalues(reallocated_info): for pair in itervalues(reallocated_info):
storage_map[pair[1]] = storage_map[pair[0]] storage_map[pair[1]] = storage_map[pair[0]]
...@@ -1088,3 +1101,7 @@ class VM_Linker(link.LocalLinker): ...@@ -1088,3 +1101,7 @@ class VM_Linker(link.LocalLinker):
self.__dict__.update(d) self.__dict__.update(d)
if not hasattr(self, 'c_thunks'): if not hasattr(self, 'c_thunks'):
self.c_thunks = True self.c_thunks = True
if not hasattr(self, 'allow_partial_eval'):
self.allow_partial_eval = None
if not hasattr(self, 'callback_input'):
self.callback_input = None
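The hunks above add a `callback_input` hook to the Stack VM: before execution, every graph input (a variable with no owner, which covers constants and shared variable values) is passed to the callback as `(var, value)`. A rough stand-alone sketch of that behavior, using hypothetical stand-in classes rather than Theano's real `Variable` and storage map:

```python
# Minimal stand-ins for a Theano variable and the VM's storage_map,
# used only to illustrate the callback_input contract.

class Var:
    def __init__(self, name, owner=None):
        self.name = name
        self.owner = owner  # None marks a graph input

def run_input_callbacks(storage_map, callback_input):
    """Call callback_input(var, value) for each owner-less variable."""
    seen = []
    for var, cell in storage_map.items():
        if var.owner is None:          # same test the Stack VM uses
            callback_input(var, cell[0])
            seen.append(var.name)
    return seen

x = Var("x")                  # graph input (no owner)
y = Var("y", owner=object())  # intermediate result, has an owner
storage_map = {x: [42], y: [None]}

logged = []
names = run_input_callbacks(storage_map,
                            lambda v, val: logged.append((v.name, val)))
```

Only `x` reaches the callback; `y` is skipped because it is produced by a node. Note that, as the warning in the diff states, the CVM path does not support this hook, so supplying it forces the Stack VM.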
@@ -42,7 +42,7 @@ register_transfer(transfer)

 def init_dev(dev, name=None):
     v = pygpu.gpuarray.api_version()
-    expected = -9998
+    expected = -9997
     if v[0] != expected:
         raise RuntimeError("Wrong major API version for gpuarray:", v[0],
                            "Make sure Theano and libgpuarray/pygpu "
@@ -50,6 +50,15 @@ def init_dev(dev, name=None):
     if v[1] < 0:
         raise RuntimeError("Wrong minor API version for gpuarray:", v[1],
                            "Please update libgpuarray/pygpu.")
+    if len(v) < 3:
+        vpy = -1
+    else:
+        vpy = v[2]
+    vpye = 0
+    if vpy < vpye:
+        print("Wrong python API version for gpuarray:", vpy, "expected:", vpye,
+              "Some python ops may not work correctly and/or crash. "
+              "Consider updating pygpu.", file=sys.stderr)
     global pygpu_activated
     if dev not in init_dev.devmap:
         ctx = pygpu.init(dev,
......
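This hunk makes `init_dev` perform a three-level check: a hard failure on the major API version, a hard failure on a negative minor version, and (new here) a soft warning when the optional third "python API" entry is missing or too old. A hypothetical pure-Python rendering of that logic, with illustrative default values rather than pygpu's real ones:

```python
def check_api_version(v, expected_major=-9997, expected_py=0):
    """Mirror the major/minor/python-API checks; return soft warnings."""
    warnings = []
    if v[0] != expected_major:
        raise RuntimeError("Wrong major API version for gpuarray: %d" % v[0])
    if v[1] < 0:
        raise RuntimeError("Wrong minor API version for gpuarray: %d" % v[1])
    # A tuple without a third entry means the python API is "too old".
    vpy = v[2] if len(v) >= 3 else -1
    if vpy < expected_py:
        warnings.append("python API too old: %d < %d" % (vpy, expected_py))
    return warnings
```

The asymmetry is deliberate: a wrong python-side version only degrades some Python ops, so it warns instead of raising.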
@@ -259,14 +259,14 @@ class GpuKernelBase(object):
         int types[%(numargs)u] = {%(types)s};
         const char *bcode = %(bvar)s;
         size_t sz = sizeof(%(bvar)s);
-        if (GpuKernel_init(&%(ovar)s, %(ctx)s->ops, %(ctx)s->ctx, 1, &bcode, &sz,
+        if (GpuKernel_init(&%(ovar)s, %(ctx)s->ctx, 1, &bcode, &sz,
                            "%(kname)s", %(numargs)u, types, GA_USE_BINARY, NULL)
             != GA_NO_ERROR) {
-          if ((err = GpuKernel_init(&%(ovar)s, %(ctx)s->ops, %(ctx)s->ctx, 1,
+          if ((err = GpuKernel_init(&%(ovar)s, %(ctx)s->ctx, 1,
                                     &%(cname)s, NULL, "%(kname)s", %(numargs)u,
                                     types, %(flags)s, NULL)) != GA_NO_ERROR) {
             PyErr_Format(PyExc_RuntimeError, "GpuKernel_init error %%d: %%s",
-                         err, Gpu_error(%(ctx)s->ops, %(ctx)s->ctx, err));
+                         err, gpucontext_error(%(ctx)s->ctx, err));
             %(fail)s
           }
         }
@@ -310,7 +310,7 @@ class GpuKernelBase(object):
             The node that we need the cache version for.
         """
-        return (3, self.get_params(node).bin_id)
+        return (4, self.get_params(node).bin_id)


 class HostFromGpu(Op):
@@ -529,15 +529,22 @@ class GpuToGpu(Op):
     def c_code(self, node, name, inputs, outputs, sub):
         return """
         Py_XDECREF(%(out)s);
-        %(out)s = pygpu_transfer(%(inp)s, %(ctx)s, 0);
+        %(out)s = pygpu_empty(%(inp)s->ga.nd,
+                              %(inp)s->ga.dimensions,
+                              %(inp)s->ga.typecode,
+                              GpuArray_IS_C_CONTIGUOUS(&(%(inp)s->ga)) ? GA_C_ORDER:GA_F_ORDER,
+                              %(ctx)s, Py_None);
         if (%(out)s == NULL) {
             %(fail)s
         }
+        if (pygpu_transfer(%(out)s, %(inp)s)) {
+            %(fail)s
+        }
         """ % {'inp': inputs[0], 'ctx': sub['params'],
                'out': outputs[0], 'fail': sub['fail']}

     def c_code_cache_version(self):
-        return (0,)
+        return (1,)


 class GpuAlloc(HideC, Alloc):
......
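The `GpuToGpu` hunk replaces a single allocate-and-copy call with an explicit two-step sequence: allocate an empty array on the destination context, then transfer into it, checking each step separately. A hypothetical pure-Python analogue of that pattern (lists standing in for GPU buffers):

```python
def transfer(dst, src):
    """Copy src into the preallocated dst; return 0 on success."""
    if len(dst) != len(src):
        return 1  # nonzero return signals failure, as in the C code
    dst[:] = src
    return 0

def to_other_context(src):
    out = [None] * len(src)   # step 1: allocate an empty destination
    if transfer(out, src):    # step 2: copy, with its own error check
        raise RuntimeError("transfer failed")
    return out
```

Splitting the steps lets the destination's shape, typecode, and memory order be chosen up front (the diff preserves C or Fortran order explicitly), and gives each failure mode its own error path.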
@@ -24,16 +24,9 @@ int APPLY_SPECIFIC(blockgemv)(PyGpuArrayObject *o, PyGpuArrayObject *W,
   size_t *offW = NULL;
   size_t *offInp = NULL;
   size_t *offOut = NULL;
-  gpuarray_blas_ops *blas_ops;
   int err;

-  err = ctx->ops->property(ctx->ctx, NULL, NULL,
-                           GA_CTX_PROP_BLAS_OPS, &blas_ops);
-  if (err != GA_NO_ERROR) {
-    PyErr_SetString(PyExc_RuntimeError, "Can't get blas ops");
-    return -1;
-  }
-  err = blas_ops->setup(ctx->ctx);
+  err = gpublas_setup(ctx->ctx);
   if (err != GA_NO_ERROR) {
     PyErr_SetString(PyExc_RuntimeError, "Can't setup blas");
     return -1;
@@ -93,29 +86,29 @@ int APPLY_SPECIFIC(blockgemv)(PyGpuArrayObject *o, PyGpuArrayObject *W,
   }

   if (out->ga.typecode == GA_FLOAT) {
-    err = blas_ops->sgemvBatch(cb_fortran, transA,
+    err = gpublas_sgemvBatch(cb_fortran, transA,
                              PyGpuArray_DIMS(out)[2],
                              PyGpuArray_DIMS(h)[2], 1,
                              W_list, offW, lda,
                              inp_list, offInp, PyGpuArray_STRIDES(h)[2] / gpuarray_get_elsize(h->ga.typecode),
                              1, out_list, offOut, PyGpuArray_STRIDES(out)[2] / gpuarray_get_elsize(out->ga.typecode),
                              PyGpuArray_DIMS(out)[1] * PyGpuArray_DIMS(h)[1] * PyGpuArray_DIMS(out)[0], 0);
   } else if (out->ga.typecode == GA_DOUBLE) {
-    err = blas_ops->dgemvBatch(cb_fortran, transA,
+    err = gpublas_dgemvBatch(cb_fortran, transA,
                              PyGpuArray_DIMS(out)[2],
                              PyGpuArray_DIMS(h)[2], 1,
                              W_list, offW, lda,
                              inp_list, offInp, PyGpuArray_STRIDES(h)[2] / gpuarray_get_elsize(h->ga.typecode),
                              1, out_list, offOut, PyGpuArray_STRIDES(out)[2] / gpuarray_get_elsize(out->ga.typecode),
                              PyGpuArray_DIMS(out)[1] * PyGpuArray_DIMS(h)[1] * PyGpuArray_DIMS(out)[0], 0);
   } else if (out->ga.typecode == GA_HALF) {
-    err = blas_ops->sgemvBatch(cb_fortran, transA,
+    err = gpublas_sgemvBatch(cb_fortran, transA,
                              PyGpuArray_DIMS(out)[2],
                              PyGpuArray_DIMS(h)[2], 1,
                              W_list, offW, lda,
                              inp_list, offInp, PyGpuArray_STRIDES(h)[2] / gpuarray_get_elsize(h->ga.typecode),
                              1, out_list, offOut, PyGpuArray_STRIDES(out)[2] / gpuarray_get_elsize(out->ga.typecode),
                              PyGpuArray_DIMS(out)[1] * PyGpuArray_DIMS(h)[1] * PyGpuArray_DIMS(out)[0], 0);
   } else {
     err = GA_INVALID_ERROR;
   }
......
@@ -12,16 +12,9 @@ int APPLY_SPECIFIC(blockger)(PyGpuArrayObject *o, PyGpuArrayObject *x,
   size_t *offOut = NULL;
   size_t *offX = NULL;
   size_t *offY = NULL;
-  gpuarray_blas_ops *blas_ops;
   int err;

-  err = ctx->ops->property(ctx->ctx, NULL, NULL,
-                           GA_CTX_PROP_BLAS_OPS, &blas_ops);
-  if (err != GA_NO_ERROR) {
-    PyErr_SetString(PyExc_RuntimeError, "Can't get blas ops");
-    return -1;
-  }
-  err = blas_ops->setup(ctx->ctx);
+  err = gpublas_setup(ctx->ctx);
   if (err != GA_NO_ERROR) {
     PyErr_SetString(PyExc_RuntimeError, "Can't setup blas");
     return -1;
@@ -84,26 +77,26 @@ int APPLY_SPECIFIC(blockger)(PyGpuArrayObject *o, PyGpuArrayObject *x,
   ssize_t str_out = PyGpuArray_STRIDES(out)[2] / gpuarray_get_elsize(out->ga.typecode);

   if (out->ga.typecode == GA_FLOAT) {
-    err = blas_ops->sgerBatch(cb_fortran,
+    err = gpublas_sgerBatch(cb_fortran,
                             PyGpuArray_DIMS(y)[2], PyGpuArray_DIMS(x)[2],
                             *(float *)PyArray_GETPTR1(alpha, 0),
                             y_list, offY, str_y, x_list, offX, str_x,
                             o_list, offOut, str_out,
                             PyGpuArray_DIMS(x)[0] * PyGpuArray_DIMS(x)[1] * PyGpuArray_DIMS(y)[1], 0);
   } else if (out->ga.typecode == GA_DOUBLE) {
-    err = blas_ops->dgerBatch(cb_fortran,
+    err = gpublas_dgerBatch(cb_fortran,
                             PyGpuArray_DIMS(y)[2], PyGpuArray_DIMS(x)[2],
                             *(double *)PyArray_GETPTR1(alpha, 0),
                             y_list, offY, str_y, x_list, offX, str_x,
                             o_list, offOut, str_out,
                             PyGpuArray_DIMS(x)[0] * PyGpuArray_DIMS(x)[1] * PyGpuArray_DIMS(y)[1], 0);
   } else if (out->ga.typecode == GA_HALF) {
-    err = blas_ops->hgerBatch(cb_fortran,
+    err = gpublas_hgerBatch(cb_fortran,
                             PyGpuArray_DIMS(y)[2], PyGpuArray_DIMS(x)[2],
                             *(float *)PyArray_GETPTR1(alpha, 0),
                             y_list, offY, str_y, x_list, offX, str_x,
                             o_list, offOut, str_out,
                             PyGpuArray_DIMS(x)[0] * PyGpuArray_DIMS(x)[1] * PyGpuArray_DIMS(y)[1], 0);
   } else {
     err = GA_INVALID_ERROR;
   }
......
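Both BLAS kernels above dispatch on the array typecode to choose a batched routine, with half precision reusing a single-precision call in `blockgemv` (and `hgerBatch` in `blockger` when available). A hypothetical table-driven version of that dispatch, in Python for brevity:

```python
# Typecode names are illustrative; libgpuarray uses integer GA_* codes.
GA_FLOAT, GA_DOUBLE, GA_HALF = "float32", "float64", "float16"

def pick_batch_routine(typecode, routines):
    """Map a typecode to a batched BLAS routine name.

    `routines` holds the 's' (float32) and 'd' (float64) entries, plus
    an optional 'h' entry; float16 falls back to the float32 routine
    when no native half-precision call exists.
    """
    table = {GA_FLOAT: routines["s"],
             GA_DOUBLE: routines["d"],
             GA_HALF: routines.get("h", routines["s"])}
    if typecode not in table:
        raise ValueError("unsupported typecode")  # GA_INVALID_ERROR
    return table[typecode]
```

The final `else { err = GA_INVALID_ERROR; }` branch in the C code corresponds to the `ValueError` here: any typecode outside the table is rejected rather than silently computed.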
@@ -125,7 +125,7 @@ def dnn_available(context_name):
         ctx = get_context(context_name)
-        if not ctx.kind == 'cuda':
+        if not ctx.kind == b'cuda':
             dnn_available.msg = "Not on a CUDA device."
             return False
@@ -1493,7 +1493,7 @@ def local_dnn_convi_output_merge(node, *inputs):
     return [GpuDnnConvGradI(algo=node.op.algo)(*inputs)]

-@register_opt('cudnn')
+@register_opt('cudnn', 'fast_compile')
 @op_lifter([Pool])
 def local_pool_dnn_alternative(node, ctx_name):
     if not dnn_available(ctx_name):
@@ -1509,7 +1509,7 @@ def local_pool_dnn_alternative(node, ctx_name):
     return dnn_pool(gpu_contiguous(img), ds, stride=stride, pad=pad, mode=mode)

-@register_opt('cudnn')
+@register_opt('cudnn', 'fast_compile')
 @op_lifter([MaxPoolGrad])
 def local_pool_dnn_grad_stride(node, ctx_name):
     if not dnn_available(ctx_name):
@@ -1533,7 +1533,7 @@ def local_pool_dnn_grad_stride(node, ctx_name):
                               pad)

-@register_opt('cudnn')
+@register_opt('cudnn', 'fast_compile')
 @op_lifter([AveragePoolGrad])
 def local_avg_pool_dnn_grad_stride(node, ctx_name):
     if not dnn_available(ctx_name):
@@ -1556,7 +1556,7 @@ def local_avg_pool_dnn_grad_stride(node, ctx_name):
     return GpuDnnPoolGrad(mode=mode)(gpu_contiguous(inp), cg, cg, ds, st, pad)

-@register_opt('cudnn')
+@register_opt('cudnn', 'fast_compile')
 @local_optimizer([GpuSoftmax])
 def local_softmax_dnn(node):
     if isinstance(node.op, GpuSoftmax):
@@ -1569,7 +1569,7 @@ def local_softmax_dnn(node):
         return [out]

-@register_opt('cudnn')
+@register_opt('cudnn', 'stabilize')
 @local_optimizer([GpuElemwise])
 def local_log_softmax_dnn(node):
     # This looks for GpuDnnSoftmax so we know that we have cudnn.
@@ -1586,7 +1586,7 @@ def local_log_softmax_dnn(node):
     return [new_softmax(softmax_node.inputs[0])]

-@register_opt('cudnn')
+@register_opt('cudnn', 'fast_compile')
 @op_lifter([LogSoftmax])
 def local_logsoftmax_to_dnn(node, ctx_name):
     # Transform the input in the format expected by GpuDnnSoftmax
@@ -1624,7 +1624,7 @@ class NoCuDNNRaise(Optimizer):
 gpu_seqopt.register("NoCuDNNRaise", NoCuDNNRaise(), 0, 'cudnn')

-@register_opt('cudnn')
+@register_opt('cudnn', 'fast_compile')
 @op_lifter([SoftmaxGrad])
 def local_softmax_dnn_grad(node, ctx_name):
     if not dnn_available(ctx_name):
......
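The dnn.py hunks change no optimizer bodies, only the tags each one is registered under, adding 'fast_compile' (and 'stabilize' for the log-softmax rewrite) so these cuDNN substitutions also run in the lighter optimization modes that select by tag. A toy tag-based registry showing why the extra tag is all that is needed (names are stand-ins for Theano's real machinery):

```python
# A minimal registry: each optimizer is stored with its tag set, and an
# optimization mode picks every optimizer carrying its tag.
REGISTRY = []

def register_opt(*tags):
    def wrap(fn):
        REGISTRY.append((fn.__name__, frozenset(tags)))
        return fn
    return wrap

@register_opt('cudnn', 'fast_compile')
def local_pool_dnn_alternative(node, ctx_name):
    return None  # body elided; only the registration matters here

def opts_for_mode(tag):
    return [name for name, tags in REGISTRY if tag in tags]
```

With only the 'cudnn' tag, `opts_for_mode('fast_compile')` would come back empty and the rewrite would be skipped under fast_compile; the added tag opts it in without touching its logic.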
@@ -105,7 +105,7 @@ APPLY_SPECIFIC(conv_fwd)(PyGpuArrayObject *input, PyGpuArrayObject *kerns,
     algo = choice.algo;
 #else
     size_t free;
-    int err2 = c->ops->property(c->ctx, NULL, NULL, GA_CTX_PROP_FREE_GMEM, &free);
+    int err2 = gpucontext_property(c->ctx, GA_CTX_PROP_FREE_GMEM, &free);
     if (err2 != GA_NO_ERROR) {
       PyErr_Format(PyExc_RuntimeError, "Error when trying to find the "
@@ -234,7 +234,7 @@ APPLY_SPECIFIC(conv_fwd)(PyGpuArrayObject *input, PyGpuArrayObject *kerns,
    * to place a nice get_work_mem() function in.
    */
   if (worksize != 0) {
-    workspace = c->ops->buffer_alloc(c->ctx, worksize, NULL, 0, NULL);
+    workspace = gpudata_alloc(c->ctx, worksize, NULL, 0, NULL);
     if (workspace == NULL) {
       PyErr_SetString(PyExc_RuntimeError,
                       "Could not allocate working memory");
@@ -258,7 +258,7 @@ APPLY_SPECIFIC(conv_fwd)(PyGpuArrayObject *input, PyGpuArrayObject *kerns,
               APPLY_SPECIFIC(output), PyGpuArray_DEV_DATA(*output));

   if (worksize != 0)
-    c->ops->buffer_release(workspace);
+    gpudata_release(workspace);

   cuda_record(input->ga.data, GPUARRAY_CUDA_WAIT_READ);
   cuda_record(kerns->ga.data, GPUARRAY_CUDA_WAIT_READ);
......
@@ -106,7 +106,7 @@ APPLY_SPECIFIC(conv_gi)(PyGpuArrayObject *kerns, PyGpuArrayObject *output,
     algo = choice.algo;
 #else
     size_t free;
-    int err2 = c->ops->property(c->ctx, NULL, NULL, GA_CTX_PROP_FREE_GMEM, &free);
+    int err2 = gpucontext_property(c->ctx, GA_CTX_PROP_FREE_GMEM, &free);
     if (err2 != GA_NO_ERROR) {
       PyErr_Format(PyExc_RuntimeError, "Error when trying to find the "
@@ -204,7 +204,7 @@ APPLY_SPECIFIC(conv_gi)(PyGpuArrayObject *kerns, PyGpuArrayObject *output,
   }

   if (worksize != 0) {
-    workspace = c->ops->buffer_alloc(c->ctx, worksize, NULL, 0, NULL);
+    workspace = gpudata_alloc(c->ctx, worksize, NULL, 0, NULL);
     if (workspace == NULL) {
       PyErr_SetString(PyExc_RuntimeError,
                       "Could not allocate working memory");
@@ -227,7 +227,7 @@ APPLY_SPECIFIC(conv_gi)(PyGpuArrayObject *kerns, PyGpuArrayObject *output,
               APPLY_SPECIFIC(input), PyGpuArray_DEV_DATA(*input));

   if (worksize != 0)
-    c->ops->buffer_release(workspace);
+    gpudata_release(workspace);

   cuda_record(kerns->ga.data, GPUARRAY_CUDA_WAIT_READ);
   cuda_record(output->ga.data, GPUARRAY_CUDA_WAIT_READ);
......
@@ -107,7 +107,7 @@ APPLY_SPECIFIC(conv_gw)(PyGpuArrayObject *input, PyGpuArrayObject *output,
     algo = choice.algo;
 #else
     size_t free;
-    int err2 = c->ops->property(c->ctx, NULL, NULL, GA_CTX_PROP_FREE_GMEM, &free);
+    int err2 = gpucontext_property(c->ctx, GA_CTX_PROP_FREE_GMEM, &free);
     if (err2 != GA_NO_ERROR) {
       PyErr_Format(PyExc_RuntimeError, "Error when trying to find the "
@@ -192,7 +192,7 @@ APPLY_SPECIFIC(conv_gw)(PyGpuArrayObject *input, PyGpuArrayObject *output,
   }

   if (worksize != 0) {
-    workspace = c->ops->buffer_alloc(c->ctx, worksize, NULL, 0, NULL);
+    workspace = gpudata_alloc(c->ctx, worksize, NULL, 0, NULL);
     if (workspace == NULL) {
       PyErr_SetString(PyExc_RuntimeError, "Could not allocate working memory");
       cuda_exit(c->ctx);
@@ -214,7 +214,7 @@ APPLY_SPECIFIC(conv_gw)(PyGpuArrayObject *input, PyGpuArrayObject *output,
               APPLY_SPECIFIC(kerns), PyGpuArray_DEV_DATA(*kerns));

   if (worksize != 0)
-    c->ops->buffer_release(workspace);
+    gpudata_release(workspace);

   cuda_record(input->ga.data, GPUARRAY_CUDA_WAIT_READ);
   cuda_record(output->ga.data, GPUARRAY_CUDA_WAIT_READ);
......
@@ -199,7 +199,7 @@ class GpuElemwise(HideC, Elemwise):
                 typecode=o.type.typecode)
         res += """
-        ge = GpuElemwise_new(%(ctx)s->ops, %(ctx)s->ctx, %(support)s, %(kop)s, %(nargs)s, args, %(nd)s, 0);
+        ge = GpuElemwise_new(%(ctx)s->ctx, %(support)s, %(kop)s, %(nargs)s, args, %(nd)s, 0);
         if (ge == NULL) {
           PyErr_SetString(PyExc_RuntimeError, "Could not initialize elemwise support");
           %(fail)s
@@ -360,7 +360,7 @@ class GpuElemwise(HideC, Elemwise):
     def c_code_cache_version(self):
         ver = self.scalar_op.c_code_cache_version()
         if ver:
-            return (6, ver)
+            return (7, ver)
         else:
             return ver
@@ -554,7 +554,7 @@ class GpuCAReduceCuda(GpuKernelBase, HideC, CAReduceDtype):
     def make_node(self, x):
         x = as_gpuarray_variable(x, infer_context_name(x))
-        if x.type.context.kind != 'cuda':
+        if x.type.context.kind != b'cuda':
             raise TypeError("GpuCAReduceCuda doesn't work for non-cuda devices")
         ret = super(GpuCAReduceCuda, self).make_node(x)
         self = copy.copy(self)
......
@@ -26,11 +26,8 @@ class GpuCumsum(GpuKernelBase, Op):
     def __init__(self, axis):
         self.axis = axis

-    def __str__(self):
-        return "%s{%s}" % (self.__class__.__name__, self.axis)
-
-    def c_code_cache_version_apply(self, node):
-        return (1,)
+    def c_code_cache_version(self):
+        return (3,)

     def c_headers(self):
         return ['<numpy_compat.h>', '<gpuarray/types.h>', '<gpuarray_helper.h>']
@@ -221,7 +218,7 @@ class GpuCumsum(GpuKernelBase, Op):
         return kernels

     def c_code(self, node, nodename, inp, out, sub):
-        if node.inputs[0].type.context.kind != 'cuda':
+        if node.inputs[0].type.context.kind != b'cuda':
             raise NotImplementedError("cuda only")
         x, = inp
         z, = out
@@ -249,17 +246,17 @@ class GpuCumsum(GpuKernelBase, Op):
         size_t max_threads_dim0;
         size_t max_grid_size1;
         size_t max_grid_size2;
         int err;
-        err = %(ctx)s->ops->property(%(ctx)s->ctx, NULL, NULL, GA_CTX_PROP_MAXLSIZE0, &max_threads_dim0);
+        err = gpucontext_property(%(ctx)s->ctx, GA_CTX_PROP_MAXLSIZE0, &max_threads_dim0);
        if (err != GA_NO_ERROR){
            PyErr_SetString(PyExc_RuntimeError, "Could not fetch max_threads_dims0");
            %(fail)s;
        }
-        err = %(ctx)s->ops->property(%(ctx)s->ctx, NULL, NULL, GA_CTX_PROP_MAXGSIZE1, &max_grid_size1);
+        err = gpucontext_property(%(ctx)s->ctx, GA_CTX_PROP_MAXGSIZE1, &max_grid_size1);
        if (err != GA_NO_ERROR){
            PyErr_SetString(PyExc_RuntimeError, "Could not fetch max_grid_size1");
            %(fail)s;
        }
-        err = %(ctx)s->ops->property(%(ctx)s->ctx, NULL, NULL, GA_CTX_PROP_MAXGSIZE2, &max_grid_size2);
+        err = gpucontext_property(%(ctx)s->ctx, GA_CTX_PROP_MAXGSIZE2, &max_grid_size2);
        if (err != GA_NO_ERROR){
            PyErr_SetString(PyExc_RuntimeError, "Could not fetch max_grid_size2");
            %(fail)s;
......
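A recurring change in these hunks is querying device limits through the flat `gpucontext_property()` entry point instead of the old `ctx->ops->property(...)` vtable call, with each failed query turned into a Python error. A hypothetical Python rendering of that "fetch limits or fail" pattern (the dict stands in for the GPU context):

```python
GA_NO_ERROR = 0

def gpucontext_property(props, key):
    """Return (err, value); a nonzero err means the query failed."""
    if key not in props:
        return (1, None)
    return (GA_NO_ERROR, props[key])

def fetch_limits(props):
    """Fetch every required launch limit, failing fast on the first miss."""
    limits = {}
    for key in ("MAXLSIZE0", "MAXGSIZE1", "MAXGSIZE2"):
        err, val = gpucontext_property(props, key)
        if err != GA_NO_ERROR:
            raise RuntimeError("Could not fetch " + key)
        limits[key] = val
    return limits
```

As in the C code, every property is checked individually before any kernel launch parameters are computed from it.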
@@ -117,7 +117,7 @@ int gemm16(PyGpuArrayObject *C, float alpha,
   if (48 < n128 && n128 <= 64) {
     n64 = n / 64;
     if (nprocs == 0)
-      if (A->ga.ops->property(A->context->ctx, NULL, NULL,
+      if (gpucontext_property(A->context->ctx,
                               GA_CTX_PROP_NUMPROCS, &nprocs)) {
         nprocs = 0;
         res = 1;
......
@@ -243,7 +243,7 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
         return kernels

     def c_code(self, node, name, inp, out, sub):
-        if node.inputs[0].type.context.kind != 'cuda':
+        if node.inputs[0].type.context.kind != b'cuda':
             raise NotImplementedError("cuda only")
         dtype_ten4 = node.inputs[0].dtype
         dtype_neib_shape = node.inputs[1].dtype
......
@@ -105,7 +105,7 @@ class Gemm16(COp):
         return """
         bcode = bin_%(name)s;
         sz = sizeof(bin_%(name)s);
-        if (GpuKernel_init(&k_%(name)s, c->ops, c->ctx, 1, &bcode, &sz,
+        if (GpuKernel_init(&k_%(name)s, c->ctx, 1, &bcode, &sz,
                            "hgemm_%(name)s", 13, types, GA_USE_BINARY, NULL)
             != GA_NO_ERROR) {
           PyErr_SetString(PyExc_RuntimeError, "Could not initialize kernel %(name)s");
......
...@@ -189,7 +189,7 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op): ...@@ -189,7 +189,7 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
flags=flags, objvar=k_var)] flags=flags, objvar=k_var)]
def c_code(self, node, nodename, inp, out, sub): def c_code(self, node, nodename, inp, out, sub):
if node.inputs[0].type.context.kind != 'cuda': if node.inputs[0].type.context.kind != b'cuda':
raise NotImplementedError('cuda only') raise NotImplementedError('cuda only')
typecode_x = pygpu.gpuarray.dtype_to_typecode(node.inputs[0].dtype)
typecode_b = pygpu.gpuarray.dtype_to_typecode(node.inputs[1].dtype)
@@ -375,7 +375,7 @@ class GpuCrossentropySoftmax1HotWithBiasDx(GpuKernelBase, Op):
        return ['<numpy_compat.h>', '<gpuarray/types.h>']
    def c_code(self, node, nodename, inp, out, sub):
-       if node.inputs[0].type.context.kind != 'cuda':
+       if node.inputs[0].type.context.kind != b'cuda':
            raise NotImplementedError("cuda only")
        typecode_dx = pygpu.gpuarray.dtype_to_typecode(node.outputs[0].dtype)
        itemsize_dnll = numpy.dtype(node.inputs[0].dtype).itemsize
@@ -584,7 +584,7 @@ class GpuSoftmax(GpuKernelBase, Op):
        return ['<numpy_compat.h>', '<gpuarray/types.h>']
    def c_code(self, node, nodename, inp, out, sub):
-       if node.inputs[0].type.context.kind != 'cuda':
+       if node.inputs[0].type.context.kind != b'cuda':
            raise NotImplementedError("cuda only")
        dtype_x = node.inputs[0].dtype
        work_x = work_dtype(dtype_x)
@@ -783,7 +783,7 @@ class GpuSoftmaxWithBias(GpuKernelBase, Op):
        return ['<numpy_compat.h>', '<gpuarray/types.h>']
    def c_code(self, node, nodename, inp, out, sub):
-       if node.inputs[0].type.context.kind != 'cuda':
+       if node.inputs[0].type.context.kind != b'cuda':
            raise NotImplementedError('cuda only')
        dtype_x = node.inputs[0].dtype
        dtype_b = node.inputs[1].dtype
...
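The repeated `'cuda'` → `b'cuda'` change matters because in Python 3 a `bytes` value never compares equal to a `str`, so a check against a text literal silently fails when the context's `kind` attribute is actually bytes. A minimal sketch (the sample `kind` value is an assumption for illustration):

```python
# In Python 3, bytes and str never compare equal, so a check against a
# text literal silently fails when the attribute is actually bytes.
kind = b'cuda'  # e.g. what a C-backed library might return

assert kind != 'cuda'                   # str comparison: always False in Python 3
assert kind == b'cuda'                  # bytes comparison: matches
assert kind.decode('ascii') == 'cuda'   # or normalize to str first
```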
@@ -33,12 +33,16 @@ from .basic_ops import (as_gpuarray_variable, infer_context_name,
                        GpuSplit, GpuContiguous, gpu_contiguous,
                        GpuAlloc, GpuAllocEmpty, GpuReshape,
                        GpuEye, gpu_join, GpuJoin)
-from .blas import (gpu_dot22, GpuGemv, GpuGemm, GpuGer, GpuGemmBatch,
-                   gpugemm_no_inplace, gpugemmbatch_no_inplace)
-from .blocksparse import GpuSparseBlockGemv, GpuSparseBlockOuter
-from .nnet import (GpuCrossentropySoftmaxArgmax1HotWithBias,
-                   GpuCrossentropySoftmax1HotWithBiasDx,
-                   GpuSoftmaxWithBias, GpuSoftmax)
+from .blas import (gpu_dot22, GpuGemm, GpuGer, GpuGemmBatch,
+                   gpugemm_no_inplace, gpugemm_inplace, gpugemmbatch_no_inplace,
+                   gpugemv_no_inplace, gpugemv_inplace)
+from .blocksparse import (GpuSparseBlockGemv, GpuSparseBlockOuter,
+                          gpu_sparse_block_outer, gpu_sparse_block_outer_inplace,
+                          gpu_sparse_block_gemv, gpu_sparse_block_gemv_inplace)
+from .nnet import (gpu_crossentropy_softmax_1hot_with_bias_dx,
+                   gpu_crossentropy_softmax_argmax_1hot_with_bias,
+                   gpu_softmax_with_bias, gpu_softmax)
from .elemwise import (GpuElemwise, GpuDimShuffle, GpuCAReduceCuda,
                       GpuCAReduceCPY)
from .subtensor import (GpuIncSubtensor, GpuSubtensor,
@@ -49,6 +53,7 @@ from .opt_util import alpha_merge, output_merge
_logger = logging.getLogger("theano.gpuarray.opt")
gpu_optimizer = EquilibriumDB()
gpu_cut_copies = EquilibriumDB()
@@ -146,7 +151,7 @@ def op_lifter(OP, cuda_only=False):
        # Check if we should replace
        if (not replace or
            (cuda_only and
-            get_context(context_name).kind != 'cuda')):
+            get_context(context_name).kind != b'cuda')):
            return False
        # tag the inputs with the context in case
@@ -643,7 +648,7 @@ def local_gpua_advanced_subtensor(node, context_name):
def local_gpua_advanced_incsubtensor(node, context_name):
    context = get_context(context_name)
    # This is disabled on non-cuda contexts
-   if context.kind != 'cuda':
+   if context.kind != b'cuda':
        return None
    x, y, ilist = node.inputs
@@ -674,12 +679,12 @@ def local_gpua_careduce(node, context_name):
    if isinstance(node.op.scalar_op, (scalar.Add, scalar.Mul,
                                      scalar.Maximum, scalar.Minimum)):
        ctx = get_context(context_name)
-       if ctx.kind == 'opencl':
+       if ctx.kind == b'opencl':
            op = GpuCAReduceCPY
            if node.op.scalar_op not in [scalar.add, scalar.mul]:
                # We don't support yet all reduction with cpy code.
                return
-       elif ctx.kind == 'cuda':
+       elif ctx.kind == b'cuda':
            op = GpuCAReduceCuda
        else:
            return False
@@ -711,18 +716,14 @@ def local_gpua_careduce(node, context_name):
            assert reduce_mask[a] == 0
            reduce_mask[a] = 1
-       shape_of = node.fgraph.shape_feature.shape_of
-       x_shape = shape_of[x]
-       new_in_shp = [x_shape[0]]
+       new_in_shp = [shape_i(x, 0)]
        new_mask = [reduce_mask[0]]
        for i in xrange(1, x.type.ndim):
            if reduce_mask[i] == reduce_mask[i - 1]:
-               new_in_shp[-1] *= x_shape[i]
+               new_in_shp[-1] *= shape_i(x, i)
            else:
                new_mask.append(reduce_mask[i])
-               new_in_shp.append(x_shape[i])
+               new_in_shp.append(shape_i(x, i))
        new_axis = []
        for idx, m in enumerate(new_mask):
            if m == 1:
@@ -744,8 +745,12 @@ def local_gpua_careduce(node, context_name):
            greduce(gpu_reshaped_x))
        if reduce_reshaped_x.ndim != node.outputs[0].ndim:
+           out_shp = []
+           for i in range(x.ndim):
+               if i not in node.op.axis:
+                   out_shp.append(shape_i(x, i))
            unreshaped_reduce = reduce_reshaped_x.reshape(
-               tensor.stack(shape_of[node.outputs[0]]))
+               tensor.stack(out_shp))
        else:
            unreshaped_reduce = reduce_reshaped_x
        return [unreshaped_reduce]
@@ -754,13 +759,19 @@ def local_gpua_careduce(node, context_name):
@register_opt('fast_compile')
@op_lifter([tensor.blas.Gemv, tensor.blas_c.CGemv])
def local_gpua_gemv(node, context_name):
-   return GpuGemv(inplace=node.op.inplace)
+   if node.op.inplace:
+       return gpugemv_inplace
+   else:
+       return gpugemv_no_inplace
@register_opt('fast_compile')
@op_lifter([tensor.blas.Gemm])
def local_gpua_gemm(node, context_name):
-   return GpuGemm(inplace=node.op.inplace)
+   if node.op.inplace:
+       return gpugemm_inplace
+   else:
+       return gpugemm_no_inplace
@register_opt('fast_compile')
@@ -834,7 +845,7 @@ def local_gpua_dot22scalar(node, context_name):
    x = as_gpuarray_variable(x, context_name)
    y = as_gpuarray_variable(y, context_name)
    z = GpuAllocEmpty(x.dtype, context_name)(x.shape[0], y.shape[1])
-   return [GpuGemm(inplace=False)(z, a, x, y, 0)]
+   return [gpugemm_no_inplace(z, a, x, y, 0)]
@register_opt('fast_compile')
@@ -846,25 +857,25 @@ def local_gpua_eye(node, context_name):
@register_opt('fast_compile')
@op_lifter([tensor.nnet.CrossentropySoftmaxArgmax1HotWithBias], cuda_only=True)
def local_gpua_crossentropysoftmaxargmax1hotwithbias(node, context_name):
-   return GpuCrossentropySoftmaxArgmax1HotWithBias()
+   return gpu_crossentropy_softmax_argmax_1hot_with_bias
@register_opt('fast_compile')
@op_lifter([tensor.nnet.CrossentropySoftmax1HotWithBiasDx], cuda_only=True)
def local_gpua_crossentropysoftmax1hotwithbiasdx(node, context_name):
-   return GpuCrossentropySoftmax1HotWithBiasDx()
+   return gpu_crossentropy_softmax_1hot_with_bias_dx
@register_opt('fast_compile')
@op_lifter([tensor.nnet.Softmax], cuda_only=True)
def local_gpua_softmax(node, context_name):
-   return GpuSoftmax()
+   return gpu_softmax
@register_opt('fast_compile')
@op_lifter([tensor.nnet.SoftmaxWithBias], cuda_only=True)
def local_gpua_softmaxwithbias(node, context_name):
-   return GpuSoftmaxWithBias()
+   return gpu_softmax_with_bias
@register_opt('fast_compile')
@@ -889,20 +900,26 @@ theano.tensor.nnet.conv2d()
@register_opt('fast_compile')
@op_lifter([SparseBlockGemv])
def local_lift_sparseblockgemv(node, context_name):
-   return GpuSparseBlockGemv(node.op.inplace)
+   if node.op.inplace:
+       return gpu_sparse_block_gemv_inplace
+   else:
+       return gpu_sparse_block_gemv
@register_opt('fast_compile')
@op_lifter([SparseBlockOuter])
def local_lift_sparseblockouter(node, context_name):
-   return GpuSparseBlockOuter(node.op.inplace)
+   if node.op.inplace:
+       return gpu_sparse_block_outer_inplace
+   else:
+       return gpu_sparse_block_outer
@register_inplace()
@local_optimizer([GpuSparseBlockGemv], inplace=True)
def local_inplace_sparseblockgemv(node):
    if isinstance(node.op, GpuSparseBlockGemv) and not node.op.inplace:
-       return [GpuSparseBlockGemv(inplace=True)(*node.inputs)]
+       return [gpu_sparse_block_gemv_inplace(*node.inputs)]
@register_inplace()
...
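The careduce rewrite above collapses runs of consecutive dimensions that share the same reduce flag into a single dimension, reduces, then reshapes the result back to the expected output shape. A hedged NumPy sketch of the same idea (the function name and mask convention are mine, not Theano's):

```python
import numpy as np

def collapsed_sum(x, reduce_mask):
    """Sum over axes where reduce_mask[i] == 1, by first merging
    consecutive axes with the same mask value into one axis."""
    new_shp = [x.shape[0]]
    new_mask = [reduce_mask[0]]
    for i in range(1, x.ndim):
        if reduce_mask[i] == reduce_mask[i - 1]:
            new_shp[-1] *= x.shape[i]        # merge with previous axis
        else:
            new_mask.append(reduce_mask[i])
            new_shp.append(x.shape[i])
    y = x.reshape(new_shp)
    axes = tuple(i for i, m in enumerate(new_mask) if m == 1)
    out = y.sum(axis=axes)
    # restore the expected output shape (the non-reduced input dims)
    out_shp = [s for s, m in zip(x.shape, reduce_mask) if m == 0]
    return out.reshape(out_shp)

x = np.arange(24.0).reshape(2, 3, 4)
# Axes 1 and 2 share the flag, so they are merged into one axis of size 12.
assert np.allclose(collapsed_sum(x, [0, 1, 1]), x.sum(axis=(1, 2)))
```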
@@ -18,7 +18,7 @@ from theano.tests import unittest_tools as utt
from ..type import (GpuArrayType, get_context,
                    gpuarray_shared_constructor)
from ..basic_ops import (
-   host_from_gpu, HostFromGpu, GpuFromHost, GpuReshape,
+   host_from_gpu, HostFromGpu, GpuFromHost, GpuReshape, GpuToGpu,
    GpuAlloc, GpuAllocEmpty, GpuContiguous,
    gpu_join, GpuJoin, GpuSplit, GpuEye, gpu_contiguous)
from ..subtensor import GpuSubtensor
@@ -182,6 +182,21 @@ def test_transfer_cpu_gpu():
    assert numpy.all(fv == av)
+def test_transfer_gpu_gpu():
+    g = GpuArrayType(dtype='float32', broadcastable=(False, False),
+                     context_name=test_ctx_name)()
+    av = numpy.asarray(rng.rand(5, 4), dtype='float32')
+    gv = gpuarray.array(av, context=get_context(test_ctx_name))
+    mode = mode_with_gpu.excluding('cut_gpua_host_transfers', 'local_cut_gpua_host_gpua')
+    f = theano.function([g], GpuToGpu(test_ctx_name)(g), mode=mode)
+    topo = f.maker.fgraph.toposort()
+    assert len(topo) == 1
+    assert isinstance(topo[0].op, GpuToGpu)
+    fv = f(gv)
+    assert GpuArrayType.values_eq(fv, gv)
def test_transfer_strided():
    # This is just to ensure that it works in theano
    # libgpuarray has a much more comprehensive suit of tests to
...
@@ -197,7 +197,7 @@ class test_GpuCAReduceCuda(test_GpuCAReduceCPY):
    def setUp(self):
        super(test_GpuCAReduceCuda, self).setUp()
-       if get_context(test_ctx_name).kind != 'cuda':
+       if get_context(test_ctx_name).kind != b'cuda':
            raise SkipTest("Cuda specific tests")
@@ -212,7 +212,7 @@ class T_gpureduce_dtype(test_elemwise.T_reduce_dtype):
                  'float32', 'float64']
    def setUp(self):
-       if get_context(test_ctx_name).kind != 'cuda':
+       if get_context(test_ctx_name).kind != b'cuda':
            raise SkipTest("Cuda specific tests")
...
@@ -24,7 +24,7 @@ class TestGpuCumsum(theano.tensor.tests.test_extra_ops.TestCumsumOp):
    def setUp(self):
        super(TestGpuCumsum, self).setUp()
        test_ctx = get_context(test_ctx_name)
-       if test_ctx.kind != 'cuda':
+       if test_ctx.kind != b'cuda':
            raise SkipTest("Cuda specific tests")
        self.max_threads_dim0 = test_ctx.maxlsize0
        self.max_grid_size1 = test_ctx.maxgsize2
...
@@ -125,7 +125,7 @@ def test_reduce():
    topo = f.maker.fgraph.toposort()
    ops = [type(node.op) for node in topo]
-   if kind == 'opencl' and method in ["max", "min"]:
+   if kind == b'opencl' and method in ["max", "min"]:
        assert not(GpuCAReduceCuda in ops or GpuCAReduceCPY in ops)
    else:
        assert GpuCAReduceCuda in ops or GpuCAReduceCPY in ops
...
@@ -56,3 +56,32 @@ def test_advinc_subtensor1():
    rep = xval.copy()
    rep[[0, 2]] += yval
    assert numpy.allclose(rval, rep)
+def test_incsub_f16():
+    shp = (3, 3)
+    shared = gpuarray_shared_constructor
+    xval = numpy.arange(numpy.prod(shp), dtype='float16').reshape(shp) + 1
+    yval = numpy.empty((2,) + shp[1:], dtype='float16')
+    yval[:] = 2
+    x = shared(xval, name='x')
+    y = tensor.tensor(dtype='float16',
+                      broadcastable=(False,) * len(shp),
+                      name='y')
+    expr = tensor.advanced_inc_subtensor1(x, y, [0, 2])
+    f = theano.function([y], expr, mode=mode_with_gpu)
+    assert sum([isinstance(node.op, GpuAdvancedIncSubtensor1)
+                for node in f.maker.fgraph.toposort()]) == 1
+    rval = f(yval)
+    rep = xval.copy()
+    rep[[0, 2]] += yval
+    assert numpy.allclose(rval, rep)
+    expr = tensor.inc_subtensor(x[1:], y)
+    f = theano.function([y], expr, mode=mode_with_gpu)
+    assert sum([isinstance(node.op, GpuIncSubtensor)
+                for node in f.maker.fgraph.toposort()]) == 1
+    rval = f(yval)
+    rep = xval.copy()
+    rep[1:] += yval
+    assert numpy.allclose(rval, rep)
@@ -301,20 +301,14 @@ class GpuArrayType(Type):
            raise NotImplementedError(
                "GpuArrayType.values_eq_approx() don't implemented the"
                " allow_remove_inf and allow_remove_nan parameter")
-       if a.dtype == 'float16' or b.dtype == 'float16':
-           an = numpy.asarray(a)
-           bn = numpy.asarray(b)
-           return tensor.TensorType.values_eq_approx(
-               an, bn, allow_remove_inf=allow_remove_inf,
-               allow_remove_nan=allow_remove_nan, rtol=rtol, atol=atol)
        atol_, rtol_ = theano.tensor.basic._get_atol_rtol(a, b)
        if rtol is not None:
            rtol_ = rtol
        if atol is not None:
            atol_ = atol
        res = elemwise2(a, '', b, a, odtype=numpy.dtype('bool'),
-                       op_tmpl="res[i] = (fabs(%%(a)s - %%(b)s) <"
-                               "(%(atol_)s + %(rtol_)s * fabs(%%(b)s)))" %
+                       op_tmpl="res = (fabs(a - b) <"
+                               "(%(atol_)s + %(rtol_)s * fabs(b)))" %
                        locals())
        ret = numpy.asarray(res).all()
        if ret:
...
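The `op_tmpl` above implements the usual element-wise tolerance test `|a - b| < atol + rtol * |b|` on the GPU, reduced with `all()`. The same check written out in NumPy terms (a sketch of the formula, not the Theano code path):

```python
import numpy as np

def values_eq_approx(a, b, rtol=1e-5, atol=1e-8):
    # Element-wise |a - b| < atol + rtol * |b|, reduced with all(),
    # mirroring the kernel template res = (fabs(a - b) < atol + rtol * fabs(b)).
    return bool(np.all(np.abs(a - b) < atol + rtol * np.abs(b)))

a = np.array([1.0, 2.0, 3.0])
assert values_eq_approx(a, a + 1e-9)       # within tolerance
assert not values_eq_approx(a, a + 1.0)    # clearly outside tolerance
```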
@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000,
    t0 = 0
    t1 = -1
+   f()  # Ignore first function call to get representative time.
    if execute:
        sync = (hasattr(theano, "sandbox") and
                hasattr(theano.sandbox, "cuda") and
                theano.sandbox.cuda.cuda_available)
+       sync2 = (hasattr(theano, "gpuarray") and
+                theano.gpuarray.pygpu_activated)
        t0 = time.time()
        for i in range(iters):
            f()
        if sync:
            theano.sandbox.cuda.synchronize()
+       if sync2:
+           c.get_value(borrow=True, return_internal_type=True).sync()
        t1 = time.time()
    return t1 - t0, impl
@@ -244,6 +249,7 @@ if __name__ == "__main__":
cuda version 7.5 7.0 6.5
gpu
+M40 0.47s
k80 0.96s
K6000/NOECC 0.69s
K40 0.88s
...
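The benchmark changes follow the standard GPU-timing pattern: call the function once before starting the clock, so compilation and first-call overhead are excluded, and synchronize before reading the end time, so asynchronously queued kernels are actually counted. A generic sketch, with `f` and `sync` standing in for the compiled function and the backend's synchronize call:

```python
import time

def bench(f, iters, sync=None):
    f()  # warm-up: exclude compilation / first-call overhead
    t0 = time.time()
    for _ in range(iters):
        f()
    if sync is not None:
        sync()  # wait for queued async GPU work before stopping the clock
    return time.time() - t0

elapsed = bench(lambda: sum(range(1000)), iters=10)
assert elapsed >= 0.0
```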
@@ -2526,7 +2526,8 @@ if True:
        out = as_cuda_ndarray_variable(out.dimshuffle(0, 1))
        return [out]
-@register_opt('cudnn')
+@register_opt('cudnn', 'stabilize', 'fast_compile')
+# We put fast_compile as otherwise it won't be on the GPU.
@local_optimizer([GpuElemwise, LogSoftmax])
def local_log_softmax_dnn(node):
    # The log-softmax implementation is only available starting at cuDNN V3
...
@@ -14,6 +14,7 @@ from . import dnn
import theano
from theano import scalar as scal
from theano import config, tensor, gof
+from theano.compile.ops import shape_i
import theano.ifelse
import theano.tensor.signal.pool
import theano.tensor.nnet
@@ -900,18 +901,14 @@ def local_gpu_careduce(node):
            # to make them a single dimension, do the reduction, and
            # then reshape to get them back.
-           shape_of = node.fgraph.shape_feature.shape_of
-           x_shape = shape_of[x]
-           new_in_shp = [x_shape[0]]
+           new_in_shp = [shape_i(x, 0)]
            new_mask = [reduce_mask[0]]
            for i in xrange(1, x.type.ndim):
                if reduce_mask[i] == reduce_mask[i - 1]:
-                   new_in_shp[-1] *= x_shape[i]
+                   new_in_shp[-1] *= shape_i(x, i)
                else:
                    new_mask.append(reduce_mask[i])
-                   new_in_shp.append(x_shape[i])
+                   new_in_shp.append(shape_i(x, i))
            new_greduce = GpuCAReduce(new_mask, scalar_op)
            new_x = x.reshape(tensor.stack(new_in_shp))
@@ -936,8 +933,11 @@ def local_gpu_careduce(node):
            # Restore the expected shape of the output
            if rval.ndim != out.ndim:
-               rval = rval.reshape(
-                   tensor.stack(shape_of[out]))
+               out_shp = []
+               for i in range(x.ndim):
+                   if i not in node.op.axis:
+                       out_shp.append(shape_i(x, i))
+               rval = rval.reshape(tensor.stack(out_shp))
            if rval.type == out.type:
                return [rval]
...
@@ -4,6 +4,7 @@ which refered to theano.sandbox.gpuarray."""
import warnings
from theano.gpuarray import *
-message = "theano.sandbox.gpuarray has been moved to theano.gpuarray." + \
-          " Please update your code and pickles."
+message = ("theano.sandbox.gpuarray has been moved to theano.gpuarray. "
+           "Please update your code and pickles. If the warning persists, "
+           "clear theano's cache ('$theano/bin/theano-cache clear').")
warnings.warn(message)
@@ -2543,7 +2543,7 @@ class Log2(UnaryScalarOp):
        else:
            return [x.zeros_like()]
-       return gz / (x * math.log(2.0)),
+       return gz / (x * numpy.asarray(math.log(2.0)).astype(x.dtype)),
    def c_code(self, node, name, inputs, outputs, sub):
        (x,) = inputs
...
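The `Log2` gradient change avoids an accidental upcast: in Theano's symbolic graph a bare Python float constant can promote a float32 expression to float64, while first casting the constant to the input's dtype keeps the gradient in the input's precision. A NumPy sketch of the dtype-preserving cast:

```python
import math
import numpy as np

x = np.ones(3, dtype='float32')

# Cast the Python-float constant to x's dtype before using it, so the
# gradient expression cannot be promoted to a wider float type.
c = np.asarray(math.log(2.0)).astype(x.dtype)
assert c.dtype == np.float32
assert (x * c).dtype == np.float32
```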
@@ -202,7 +202,7 @@ def remove_constants_and_unused_inputs_scan(node):
        # DEBUG CHECK
        nwScan = scan_op.Scan(nw_inner, op_outs, nw_info)
        nw_outs = nwScan(*nw_outer, **dict(return_list=True))
-       return dict([("remove", [node])] + list(zip(node.outputs, nw_outs)))
+       return OrderedDict([("remove", [node])] + list(zip(node.outputs, nw_outs)))
    else:
        return False
@@ -2072,8 +2072,8 @@ def scan_merge_inouts(node):
        new_outer_out_mit_mot.append(outer_omm)
    na.outer_out_mit_mot = new_outer_out_mit_mot
    if remove:
-       return dict([("remove", remove)] +
-                   list(zip(node.outputs, na.outer_outputs)))
+       return OrderedDict([("remove", remove)] +
+                          list(zip(node.outputs, na.outer_outputs)))
    return na.outer_outputs
...
@@ -612,14 +612,14 @@ def get_scalar_constant_value(orig_v, elemwise=True,
        return numpy.asarray(v)
    if isinstance(v, numpy.ndarray):
-       return numpy_scalar(v)
+       return numpy_scalar(v).copy()
    if isinstance(v, Constant):
        if getattr(v.tag, 'unique_value', None) is not None:
            data = v.tag.unique_value
        else:
            data = v.data
-       return numpy_scalar(data)
+       return numpy_scalar(data).copy()
    if not only_process_constants and getattr(v, 'owner', None):
        if isinstance(v.owner.op, (Alloc, DimShuffle, Rebroadcast,
@@ -649,7 +649,7 @@ def get_scalar_constant_value(orig_v, elemwise=True,
                     for i in v.owner.inputs]
            ret = [[None]]
            v.owner.op.perform(v.owner, const, ret)
-           return ret[0][0]
+           return ret[0][0].copy()
        elif elemwise and isinstance(v.owner.op, Elemwise):
            if isinstance(v.owner.op.scalar_op, scal.Second):
                # We don't need both input to be constant for second
@@ -662,13 +662,13 @@ def get_scalar_constant_value(orig_v, elemwise=True,
                     for i in v.owner.inputs]
            ret = [[None]]
            v.owner.op.perform(v.owner, const, ret)
-           return ret[0][0]
+           return ret[0][0].copy()
        elif (isinstance(v.owner.op, theano.tensor.subtensor.Subtensor) and
              v.ndim == 0):
            if isinstance(v.owner.inputs[0], TensorConstant):
                cdata = tuple(v.owner.op.get_constant_idx(v.owner.inputs))
                try:
-                   return v.owner.inputs[0].data.__getitem__(cdata)
+                   return v.owner.inputs[0].data.__getitem__(cdata).copy()
                except IndexError:
                    raise IndexError(
                        str(tuple(v.owner.op.idx_list)) +
@@ -1399,8 +1399,6 @@ class MaxAndArgmax(Op):
        %(axis_code)s
        %(max)s = (PyArrayObject*)PyArray_Max(%(x)s, axis, NULL);
        if(%(max)s == NULL){
-           PyErr_SetString(PyExc_ValueError,
-               "MaxAndArgmax, max failed");
            %(fail)s;
        }
        if(!PyArray_CheckExact(%(max)s)){
@@ -1412,7 +1410,6 @@ class MaxAndArgmax(Op):
        %(argmax)s = (PyArrayObject*)PyArray_ArgMax(%(x)s, axis, NULL);
        if(%(argmax)s == NULL){
-           PyErr_SetString(PyExc_ValueError, "MaxAndArgmax, argmax failed");
            Py_CLEAR(%(max)s);
            %(fail)s;
        }
@@ -1434,7 +1431,7 @@ class MaxAndArgmax(Op):
        return ret % locals()
    def c_code_cache_version(self):
-       return (3,)
+       return (4,)
    def infer_shape(self, node, shapes):
        ishape, axis_shape = shapes
...
@@ -152,6 +152,7 @@ from theano.tensor import basic as T
from theano.tensor.blas_headers import blas_header_text
from theano.tensor.blas_headers import blas_header_version
from theano.tensor.opt import in2out, local_dimshuffle_lift
+from theano.tensor.type import values_eq_approx_remove_inf_nan
_logger = logging.getLogger('theano.tensor.blas')
@@ -1435,7 +1436,8 @@ class GemmOptimizer(Optimizer):
            if new_node is not node:
                nodelist.append(new_node)
-       u = theano.gof.opt.Updater(on_import, None, None)
+       u = theano.gof.opt.Updater(on_import, None, None,
+                                  name="GemmOptimizer")
        fgraph.attach_feature(u)
        while did_something:
            nb_iter += 1
@@ -1465,6 +1467,7 @@ class GemmOptimizer(Optimizer):
            if new_outputs:
                new_outputs, old_dot22 = new_outputs
                assert len(new_outputs) == len(node.outputs)
+               new_outputs[0].tag.values_eq_approx = values_eq_approx_remove_inf_nan
                try:
                    fgraph.replace_all_validate_remove(
                        list(zip(node.outputs, new_outputs)),
...
@@ -726,3 +726,62 @@ def norm(x, ord):
        raise ValueError(0)
    elif ndim > 2:
        raise NotImplementedError("We don't support norm witn ndim > 2")
+class TensorInv(Op):
+    """
+    Class wrapper for tensorinv() function;
+    Theano utilization of numpy.linalg.tensorinv;
+    """
+    _numop = staticmethod(numpy.linalg.tensorinv)
+    __props__ = ('ind',)
+
+    def __init__(self, ind=2):
+        self.ind = ind
+
+    def make_node(self, a):
+        a = as_tensor_variable(a)
+        out = a.type()
+        return Apply(self, [a], [out])
+
+    def perform(self, node, inputs, outputs):
+        (a,) = inputs
+        (x,) = outputs
+        x[0] = self._numop(a, self.ind)
+
+    def infer_shape(self, node, shapes):
+        sp = shapes[0][self.ind:] + shapes[0][:self.ind]
+        return [sp]
+
+def tensorinv(a, ind=2):
+    """
+    Does not run on GPU;
+    Theano utilization of numpy.linalg.tensorinv;
+
+    Compute the 'inverse' of an N-dimensional array.
+    The result is an inverse for `a` relative to the tensordot operation
+    ``tensordot(a, b, ind)``, i. e., up to floating-point accuracy,
+    ``tensordot(tensorinv(a), a, ind)`` is the "identity" tensor for the
+    tensordot operation.
+
+    Parameters
+    ----------
+    a : array_like
+        Tensor to 'invert'. Its shape must be 'square', i. e.,
+        ``prod(a.shape[:ind]) == prod(a.shape[ind:])``.
+    ind : int, optional
+        Number of first indices that are involved in the inverse sum.
+        Must be a positive integer, default is 2.
+
+    Returns
+    -------
+    b : ndarray
+        `a`'s tensordot inverse, shape ``a.shape[ind:] + a.shape[:ind]``.
+
+    Raises
+    ------
+    LinAlgError
+        If `a` is singular or not 'square' (in the above sense).
+    """
+    return TensorInv(ind)(a)
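Since the new `tensorinv` wraps `numpy.linalg.tensorinv` directly, its behavior can be checked against NumPy: the result is the inverse with respect to `tensordot(a, b, ind)`, recovering the identity up to floating point.

```python
import numpy as np

# A 'square' tensor: prod(shape[:ind]) == prod(shape[ind:]) with ind=2.
rng = np.random.RandomState(42)
a = rng.rand(4, 6, 8, 3)          # 4*6 == 8*3 == 24
ainv = np.linalg.tensorinv(a, ind=2)

# ainv has shape a.shape[ind:] + a.shape[:ind] ...
assert ainv.shape == (8, 3, 4, 6)

# ... and tensordot(ainv, a, ind) is the identity tensor.
ident = np.tensordot(ainv, a, axes=2)
expected = np.eye(24).reshape(8, 3, 8, 3)
assert np.allclose(ident, expected)
```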
...@@ -413,6 +413,7 @@ log1msigm_to_softplus = gof.PatternSub( ...@@ -413,6 +413,7 @@ log1msigm_to_softplus = gof.PatternSub(
values_eq_approx=values_eq_approx_remove_inf, values_eq_approx=values_eq_approx_remove_inf,
skip_identities_fn=_skip_mul_1) skip_identities_fn=_skip_mul_1)
log1pexp_to_softplus = gof.PatternSub( log1pexp_to_softplus = gof.PatternSub(
(tensor.log1p, (tensor.log1p,
(tensor.exp, 'x')), (tensor.exp, 'x')),
...@@ -420,12 +421,20 @@ log1pexp_to_softplus = gof.PatternSub( ...@@ -420,12 +421,20 @@ log1pexp_to_softplus = gof.PatternSub(
values_eq_approx=values_eq_approx_remove_inf, values_eq_approx=values_eq_approx_remove_inf,
allow_multiple_clients=True) allow_multiple_clients=True)
log1p_neg_sigmoid = gof.PatternSub(
(tensor.log1p,
(tensor.neg, (sigmoid, 'x'))),
(tensor.neg, (softplus, 'x')),
values_eq_approx=values_eq_approx_remove_inf,
allow_multiple_clients=True)
 opt.register_stabilize(logsigm_to_softplus, name='logsigm_to_softplus')
 opt.register_stabilize(log1msigm_to_softplus, name='log1msigm_to_softplus')
 opt.register_stabilize(log1pexp_to_softplus, name='log1pexp_to_softplus')
+opt.register_stabilize(log1p_neg_sigmoid, name='log1p_neg_sigmoid')

-def is_1pexp(t):
+def is_1pexp(t, only_process_constants=True):
     """
     Returns
@@ -437,8 +446,9 @@ def is_1pexp(t):
     """
     if t.owner and t.owner.op == tensor.add:
         scalars, scalar_inputs, nonconsts = \
-            opt.scalarconsts_rest(t.owner.inputs)
-        # scalar_inputs are potentially dimshuffled and fill'd scalars
+            opt.scalarconsts_rest(t.owner.inputs,
+                                  only_process_constants=only_process_constants)
+        # scalar_inputs are potentially dimshuffled and filled with scalars
         if len(nonconsts) == 1:
             maybe_exp = nonconsts[0]
             if maybe_exp.owner and maybe_exp.owner.op == tensor.exp:
@@ -947,7 +957,7 @@ def local_inv_1_plus_exp(node):
     inv_arg = node.inputs[0]
     if inv_arg.owner and inv_arg.owner.op == tensor.add:
         scalars, scalar_inputs, nonconsts = \
-            opt.scalarconsts_rest(inv_arg.owner.inputs)
+            opt.scalarconsts_rest(inv_arg.owner.inputs, only_process_constants=True)
         # scalar_inputs are potentially dimshuffled and fill'd scalars
         if len(nonconsts) == 1:
             if nonconsts[0].owner and nonconsts[0].owner.op == tensor.exp:
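The stabilization rewrites registered above replace expressions such as log(sigmoid(x)) with softplus forms. A minimal standalone sketch of why this matters (the helper names below are illustrative, not Theano's): the naive expression overflows inside exp() for large negative inputs, while the softplus form stays finite.

```python
import math

def softplus(x):
    # numerically stable softplus: log(1 + exp(x)) without overflow
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

def log_sigmoid_naive(x):
    # overflows in exp() once -x is large (e.g. x = -800)
    return math.log(1.0 / (1.0 + math.exp(-x)))

def log_sigmoid_stable(x):
    # the identity the rewrite exploits: log(sigmoid(x)) == -softplus(-x)
    return -softplus(-x)
```

At x = -800 the naive form raises OverflowError, while the stable form returns -800 as expected; for moderate x the two agree to machine precision.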
@@ -356,7 +356,6 @@ class T_sigmoid_opts(unittest.TestCase):
         f = theano.function([x], s, mode=mode)
         assert hasattr(f.maker.fgraph.outputs[0].tag, 'trace')
         topo = f.maker.fgraph.toposort()
-        assert len(topo) > 1
         assert not any([n.op == sigmoid for n in topo])
         ux_v = f([[-50, -10, -4, -1, 0, 1, 4, 10, 50]])
@@ -467,15 +466,17 @@ class T_sigmoid_utils(unittest.TestCase):
         try:
             x = tensor.vector('x')
             exp = tensor.exp
-            assert is_1pexp(1 + exp(x)) == (False, x)
-            assert is_1pexp(exp(x) + 1) == (False, x)
-            for neg, exp_arg in imap(is_1pexp, [(1 + exp(-x)), (exp(-x) + 1)]):
+            assert is_1pexp(1 + exp(x), False) == (False, x)
+            assert is_1pexp(exp(x) + 1, False) == (False, x)
+            for neg, exp_arg in imap(lambda x:
+                                     is_1pexp(x, only_process_constants=False),
+                                     [(1 + exp(-x)), (exp(-x) + 1)]):
                 assert not neg and theano.gof.graph.is_same_graph(exp_arg, -x)
-            assert is_1pexp(1 - exp(x)) is None
-            assert is_1pexp(2 + exp(x)) is None
-            assert is_1pexp(exp(x) + 2) is None
-            assert is_1pexp(exp(x) - 1) is None
-            assert is_1pexp(-1 + exp(x)) is None
-            assert is_1pexp(1 + 2 * exp(x)) is None
+            assert is_1pexp(1 - exp(x), False) is None
+            assert is_1pexp(2 + exp(x), False) is None
+            assert is_1pexp(exp(x) + 2, False) is None
+            assert is_1pexp(exp(x) - 1, False) is None
+            assert is_1pexp(-1 + exp(x), False) is None
+            assert is_1pexp(1 + 2 * exp(x), False) is None
         finally:
             config.warn.identify_1pexp_bug = backup
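The updated test pins the new keyword argument through a lambda so that `imap` can still drive the loop. Pre-binding the keyword with functools.partial expresses the same thing; the `is_1pexp_demo` stand-in below is hypothetical and only mirrors the helper's signature.

```python
from functools import partial

def is_1pexp_demo(t, only_process_constants=True):
    # hypothetical stand-in mirroring the signature of the helper in the diff
    return (t, only_process_constants)

# bind the keyword once instead of wrapping every call in a lambda
check = partial(is_1pexp_demo, only_process_constants=False)
results = list(map(check, [1, 2]))
```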
@@ -186,8 +186,12 @@ class Pool(Op):
         if st is None:
             st = ds
         r, c = imgshape[-2:]
-        r += padding[0] * 2
-        c += padding[1] * 2
+        r = tensor.extract_constant(r)
+        c = tensor.extract_constant(c)
+        if padding[0]:
+            r += padding[0] * 2
+        if padding[1]:
+            c += padding[1] * 2
         if ignore_border:
             if ds[0] == st[0]:
@@ -216,7 +220,7 @@ class Pool(Op):
             elif st[0] >= ds[0]:
                 nr = (r - 1) // st[0] + 1
             else:
-                nr = max(0, (r - 1 - ds[0]) // st[0] + 1) + 1
+                nr = max(0, (r - 1 - ds[0] + st[0]) // st[0]) + 1
         if isinstance(c, theano.Variable):
             nc = tensor.switch(tensor.ge(st[1], ds[1]),
@@ -226,7 +230,7 @@ class Pool(Op):
             elif st[1] >= ds[1]:
                 nc = (c - 1) // st[1] + 1
             else:
-                nc = max(0, (c - 1 - ds[1] + st[1]) // st[1]) + 1
+                nc = max(0, (c - 1 - ds[1] + st[1]) // st[1]) + 1
         rval = list(imgshape[:-2]) + [nr, nc]
         return rval
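Under Python's floor division the rewritten nr/nc expression is algebraically identical to the old one, since (a + st) // st == a // st + 1. The same formula is also emitted as C code later in this file, where integer division truncates toward zero, and there the two forms can differ by one when the numerator is negative; that is plausibly what motivated the rewrite. A standalone check, with `c_div` emulating C semantics:

```python
def floor_div(a, b):
    # Python's //: floor division, rounds toward negative infinity
    return a // b

def c_div(a, b):
    # emulate C integer division, which truncates toward zero instead
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def nr_before(r, ds, st, div):
    # output row count, expression before this change
    return max(0, div(r - 1 - ds, st) + 1) + 1

def nr_after(r, ds, st, div):
    # output row count, expression after this change
    return max(0, div(r - 1 - ds + st, st)) + 1
```

For example, with r=4, ds=4, st=2 the numerator r - 1 - ds is -1; the old expression evaluated with C-style division overcounts the rows, while the new one does not.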
@@ -257,10 +261,10 @@ class Pool(Op):
         self.mode = mode

     def make_node(self, x):
-        if x.type.ndim != 4:
-            raise TypeError()
         # TODO: consider restricting the dtype?
         x = tensor.as_tensor_variable(x)
+        if x.type.ndim != 4:
+            raise TypeError()
         # If the input shape are broadcastable we can have 0 in the output shape
         broad = x.broadcastable[:2] + (False, False)
         out = tensor.TensorType(x.dtype, broad)
@@ -274,6 +278,9 @@ class Pool(Op):
                 'Pool requires 4D input for now')
         z_shape = self.out_shape(x.shape, self.ds, self.ignore_border, self.st,
                                  self.padding)
+        if not self.ignore_border:
+            assert z_shape[2] > 0
+            assert z_shape[3] > 0
         if (z[0] is None) or (z[0].shape != z_shape):
             z[0] = numpy.empty(z_shape, dtype=x.dtype)
         zz = z[0]
@@ -403,7 +410,7 @@ class Pool(Op):
                 }
                 else
                 {
-                    z_r = std::max(0, (r - 1 - %(ds0)s) / %(st0)s + 1) + 1;
+                    z_r = std::max(0, (r - 1 - %(ds0)s + %(st0)s) / %(st0)s) + 1;
                 }
                 // decide how many columns the output has
                 if (%(st1)s >= %(ds1)s)
@@ -412,8 +419,10 @@ class Pool(Op):
                 }
                 else
                 {
-                    z_c = std::max(0, (c - 1 - %(ds1)s) / %(st1)s + 1) + 1;
+                    z_c = std::max(0, (c - 1 - %(ds1)s + %(st1)s) / %(st1)s) + 1;
                 }
+                assert(z_r > 0);
+                assert(z_c > 0);
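The C snippets above are Python %-format templates later rendered with a mapping of parameter names, so a mistyped key (e.g. `%(st0)s` where `%(st1)s` is meant) silently substitutes the row stride for the column stride without any error. A minimal illustration of that templating mechanism (the parameter values are made up):

```python
# hypothetical parameter values, for illustration only
params = {'ds1': 3, 'st0': 2, 'st1': 1}

# every %(name)s placeholder is replaced by str(params[name])
ccode = "z_c = std::max(0, (c - 1 - %(ds1)s + %(st1)s) / %(st1)s) + 1;"
rendered = ccode % params
```

A wrong key would still render successfully as long as it exists in the mapping, which is why such substitution bugs only surface as wrong output shapes at run time.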
                 }
             // memory allocation of z if necessary
             if ((!%(z)s)
@@ -522,7 +531,7 @@ class Pool(Op):
         return ccode % locals()

     def c_code_cache_version(self):
-        return (0, 6, 8, 3)
+        return (0, 6, 8, 4)

 class PoolGrad(Op):
@@ -632,12 +641,12 @@ class MaxPoolGrad(PoolGrad):
     def make_node(self, x, maxout, gz):
         # make_node should only be called by the grad function of
         # Pool, so these asserts should not fail.
-        assert isinstance(x, Variable) and x.ndim == 4
-        assert isinstance(maxout, Variable) and maxout.ndim == 4
-        assert isinstance(gz, Variable) and gz.ndim == 4
         x = tensor.as_tensor_variable(x)
         maxout = tensor.as_tensor_variable(maxout)
         gz = tensor.as_tensor_variable(gz)
+        assert isinstance(x, Variable) and x.ndim == 4
+        assert isinstance(maxout, Variable) and maxout.ndim == 4
+        assert isinstance(gz, Variable) and gz.ndim == 4
         return Apply(self, [x, maxout, gz], [x.type()])
@@ -814,10 +823,10 @@ class AveragePoolGrad(PoolGrad):
     def make_node(self, x, gz, dummy=None):
         # make_node should only be called by the grad function of
         # Pool, so these asserts should not fail.
-        assert isinstance(x, Variable) and x.ndim == 4
-        assert isinstance(gz, Variable) and gz.ndim == 4
         x = tensor.as_tensor_variable(x)
         gz = tensor.as_tensor_variable(gz)
+        assert isinstance(x, Variable) and x.ndim == 4
+        assert isinstance(gz, Variable) and gz.ndim == 4
         return Apply(self, [x, gz], [x.type()])
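The make_node reorderings above all follow the same convert-then-validate pattern: call as_tensor_variable first, so raw inputs such as nested lists are wrapped before their rank is checked. A NumPy sketch of the idea (`to_tensor` is a stand-in, not Theano's API):

```python
import numpy as np

def to_tensor(x):
    # stand-in for tensor.as_tensor_variable: coerce input to an array
    return np.asarray(x)

def make_node_sketch(x):
    x = to_tensor(x)       # convert first...
    assert x.ndim == 4     # ...then validate the rank, as in the reordered code
    return x

# a plain nested list has no .ndim attribute, so checking rank before
# conversion would raise AttributeError instead of a clean failure
```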