Commit a69b6045 authored by Olivier Delalleau

Merged -- solved conflicts in .hgignore and doc/install.txt

......@@ -8,6 +8,14 @@ syntax: glob
*.so
*.sw?
*~
*.aux
*.log
*.nav
*.out
*.pdf
*.snm
*.toc
*.vrb
.noseids
Theano.egg-info
\#*\#
......
......@@ -5,6 +5,81 @@
Release Notes
=============
Theano 0.3.1 (2011-02-21)
=========================
Deprecation:
* The theano shared variable attribute `value` is deprecated; use `get_value()` or `set_value()` instead!
See http://deeplearning.net/software/theano/tutorial/aliasing.html
Bugs fixed:
* The random number generator in theano/sandbox/rng_mrg.py did not always return the same sequence of numbers on the CPU and GPU.
* In some cases, there was a (possibly large) fraction of non-random garbage in the returned sequence.
* In python mode (not the default mode), when the input of an elemwise operation was an empty ndarray, we were not returning an empty ndarray.
* Scan cached the number of steps. This caused no problem because each time you called scan, the number of steps would get refreshed.
The problem was when you called ScanGrad, which would use the cached number of steps without refreshing it.
To be affected by this bug, one would have to compile two graphs, one containing a Scan and the other the corresponding GradScan, and
call the first function to cache the number of steps, and then call the second function with a different number of steps.
* In GpuConv, errors in conv_patch_stack_reduce when the entire kernel doesn't fit into shared memory.
The error was not found before as the impact was less than the relative tolerance of 1e-3. Now the relative tolerance is 1e-5.
Crash fixed:
* Fixed a case where taking the gradient of a DimShuffle raised an exception that made Theano crash.
* Compilation crash for GpuElemwise with tensor with high number of dimensions (~6 or more).
* Disabled a C code generator that made gcc crash on complex types.
* Crash in optimization when an Op has no input.
* Output shape is now computed correctly for matrix-vector multiplication on GPU.
* In Scan, when using numbers as inputs, not symbolic variables.
* In GradScan, when there is only 1 input in the Scan.
* In GpuSum, bug in calculation of n_blocks for the 10 pattern. (Sum on the row of a matrix)
* Some segfault at exit with GPU code.
Optimization:
* New SpecifyShape op that allows passing more shape info in the graph.
* Speed up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
* Remove join of only 1 element.
* During optimization, consider one more case in get_constant_value.
GPU:
* cuda_shared.value = X now works inplace!
* cuda_shared_var.set_value(new_ndarray) will overwrite the old value inplace in the most common case.
* Allow creating a CudaNdarraySharedVariable from a CudaNdarray.
* New init_gpu_device theano flag.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would exceed the 256-byte limit on parameters to a GPU function).
* Fixed: a CPU join of only 1 element was not moved to the GPU.
New features:
* tensor.reshape now makes dimensions of length 1 broadcastable.
* tensor.prod now implements the gradient.
* DebugMode now warns if an Op declared itself as returning a view of the input but did not do so.
* This behaviour is a problem because it can block other Ops from being inplace on the same inputs, which can lower the reuse of memory.
* Sparse.structured_dot now works when both matrices are sparse.
* Sparse type is now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* New 3D convolution ops, with CPU and GPU implementations.
* New colors in pydotprint.
Documentation:
* Documented lib.amdlibm and (new) init_gpu_device config variables.
* A new page on the memory aliasing contract of Theano (it was written for 0.3, but an error hid it on the web page).
* Revision to the Windows installation instructions.
* The cuda documentation is now generated on the web server.
* Better documentation of .theanorc and its sections.
Unit tests:
* Stop usage of deprecated functions or syntax in the unit tests.
* Better testing of GPU convolution nets.
* Make more tests able to use different random seeds.
* Tests of sparse now use default mode, not a hard-coded one.
* Remove some tests of unimplemented features.
Other:
* The name of the compiledir now includes the Python version, to make life easier for people with many Python versions.
* Added theano.tensor.std as a shortcut to sqrt(var(input=input, axis=axis)).
* Whitespace, tabulation and indentation clean-up in the code.
* Better detection of memory sharing between variables.
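The ``theano.tensor.std`` shortcut mentioned above is just the square root of the variance. The same relation, in a plain-Python sketch (stdlib only; these helpers are illustrative, not Theano code):

```python
import math

def var(xs):
    # population variance: mean of squared deviations from the mean
    m = sum(xs) / float(len(xs))
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std(xs):
    # theano.tensor.std is described as sqrt(var(input=input, axis=axis));
    # same identity here, for a flat list of numbers
    return math.sqrt(var(xs))
```

For example, ``std([0, 2])`` gives ``1.0``, since the variance of ``[0, 2]`` is ``1.0``.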
Theano 0.3 (2010-11-23)
=======================
......
......@@ -3,108 +3,71 @@ Modifications in the trunk since the last release
Partial list of what is in the trunk since the last release
--------------------------------------------------
Deprecation:
* tag.shape attribute deprecated (#633)
* FAST_RUN_NOGC mode deprecated
* CudaNdarray_new_null is deprecated in favour of CudaNdarray_New
Bugs fixed:
* Bugfix in CudaNdarray.__iadd__: when it is not implemented, return the error.
* Typo fixed in tensor/opt.py
* THEANO_FLAGS='optimizer=None' now works as expected
* Fixed memory leak in error handling on GPU-to-host copy
* Fix relating specifically to Python 2.7 on Mac OS X
* infer_shape can now handle Python longs
* Fixed behaviour of pydotprint's max_label_size option
Crash fixed:
* Work around a bug in gcc 4.3.0 that makes the compilation of 2d convolution crash.
Optimization:
* Optimize 4 patterns of subtensor followed by subtensor.
* Gemm inplace optimization on the GPU re-enabled
GPU:
* Move to the GPU fused elemwise ops that have dtypes other than float32 in them
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to
float32 afterwards.
* Implemented CudaNdarray.ndim to have the same interface as ndarray.
* Fixed slowdown caused by multiple chained views on CudaNdarray objects
* CudaNdarray_alloc_contiguous changed so as to never try to free
memory on a view: new "base" property
* Safer decref behaviour in CudaNdarray in case of failed allocations
* New GPU implementation of tensor.basic.outer
New features:
* ProfileMode
* profile the scan overhead
* simple hook system to add profilers
* reordered the output to be in order from more general to more specific
* var[vector of indices] now works (the grad works recursively, the direct grad
works inplace, and it works on the GPU)
* limitation: works only on the outermost dimension.
* test_value implementation to allow quick debugging at graph creation time
* cuda.root inferred if nvcc is on the path, otherwise defaults to
/usr/local/cuda
* Better graph printing for graphs involving a scan subgraph
Documentation:
* Better commenting of cuda_ndarray.cu
* Fixes in the scan documentation: add missing declarations/print statements
* Better error message on failed __getitem__
* Updated documentation on profile mode
Unit tests:
* More strict float comparison by default
* Reuse the subtensor tests of tensor for GPU tensors (more GPU tests)
* Tests that check for aliased function inputs and assure appropriate copying
(#374)
* Better test of copies in CudaNdarray
* New tests relating to the new base pointer requirements
Other:
* Correctly set the broadcast flag to True in the output var of a Reshape op
when we receive an int 1 in the new shape. (This may have fixed a bug.)
* pydotprint: high contrast mode is now the default
* More compact printing (ignore leading "Composite" in op names)
(To be continued...)
......@@ -6,7 +6,7 @@ from theano.gof.cc import get_module_cache
if len(sys.argv) == 1:
    print config.compiledir
elif sys.argv[1] in ('clear',):
    get_module_cache().clear(unversioned_min_age=-1, clear_base_files=True)
else:
    print 'command "%s" not recognized' % sys.argv[1]
    print 'Type "theano-cache" to print the cache location'
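A note on the membership test above: a subtle Python pitfall worth spelling out (plain Python, unrelated to Theano itself) is that parentheses alone do not make a tuple, so ``in ('clear')`` performs a substring test rather than tuple membership:

```python
# ('clear') is just a parenthesized string, so `in` performs a
# substring test; ('clear',) is a one-element tuple, so `in` performs
# membership testing.
assert ('clear') == 'clear'          # no tuple here
assert 'c' in ('clear')              # substring match: surprising but True
assert 'c' not in ('clear',)         # tuple membership: False for 'c'
assert 'clear' in ('clear',)         # the intended check
```

The trailing comma, not the parentheses, is what creates the tuple.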
......
......@@ -51,9 +51,9 @@ copyright = '2008--2011, LISA lab'
# other places throughout the built documents.
#
# The short X.Y version.
version = '0.4'
# The full version, including alpha/beta/rc tags.
release = '0.4.0rc3'
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
......
.. _developer:
==============================================
Theano Design and Implementation Documentation
==============================================
.. toctree::
......
......@@ -7,7 +7,7 @@ Tensor
This file describes the design of theano.tensor.
Elemwise grad and R_op
======================
Here's another straightforward example, though a bit more elaborate
than adding two numbers together. Let's say that you want to compute
......
all:
pdflatex presentation.tex
import numpy
import theano

class DoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

x = theano.tensor.matrix()
f = theano.function([x], DoubleOp()(x))
inp = numpy.random.rand(5, 5)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print inp
print out
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
print "Initial model:"
print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability of having a one
prediction = p_1 > 0.5                     # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum()     # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates={w: w - 0.1*gw, b: b - 0.1*gb},
    name="train")
predict = theano.function(inputs=[x], outputs=prediction,
                          name="predict")

for i in range(training_steps):
    pred, err = train(D[0], D[1])
print "Final model:"
print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])

# Print the graph used in the slides
theano.printing.pydotprint(predict,
                           outfile="pics/logreg_pydotprint_predic.png",
                           var_with_name_simple=True)
theano.printing.pydotprint_variables(prediction,
                                     outfile="pics/logreg_pydotprint_prediction.png",
                                     var_with_name_simple=True)
theano.printing.pydotprint(train,
                           outfile="pics/logreg_pydotprint_train.png",
                           var_with_name_simple=True)
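The probability and cross-entropy used in the example above can be written out in plain Python for a single training example (a stdlib sketch of the same formulas, not Theano code; the helper names are ours):

```python
import math

def p_1(z):
    # sigmoid: probability of the target being one, given pre-activation z
    return 1.0 / (1.0 + math.exp(-z))

def xent(y, p):
    # binary cross-entropy between target y (0 or 1) and predicted probability p
    return -y * math.log(p) - (1 - y) * math.log(1 - p)
```

For instance, ``p_1(0.0)`` is ``0.5``, and ``xent(1, 0.5)`` equals ``log(2)``, about ``0.693``.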
This diff is collapsed.
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < size) {
        o0[i] = i0[i]*2;
    }
}""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk

x = theano.tensor.fmatrix()
f = theano.function([x], PyCUDADoubleOp()(x))
xv = numpy.ones((4, 5), dtype="float32")
assert numpy.allclose(f(xv), xv*2)
print numpy.asarray(f(xv))
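The grid size in the thunk above is computed as ``ceil(size / 512.)``. The same computation in integer arithmetic, as a stand-alone sketch (the helper name is ours, not part of PyCUDA):

```python
def grid_1d(size, block=512):
    # number of thread blocks needed so that block * n_blocks >= size;
    # (size + block - 1) // block is the integer ceiling division,
    # equivalent to int(ceil(size / float(block)))
    return ((size + block - 1) // block, 1)
```

For a (4, 5) input, ``size`` is 20, so one block of 512 threads suffices: ``grid_1d(20)`` gives ``(1, 1)``, while ``grid_1d(513)`` gives ``(2, 1)``.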
import theano
a = theano.tensor.vector("a") # declare variable
b = a + a**10 # build symbolic expression
f = theano.function([a], b) # compile function
print f([0,1,2])
# prints `array([0,2,1026])`
theano.printing.pydotprint_variables(b, outfile="pics/f_unoptimized.png", var_with_name_simple=True)
theano.printing.pydotprint(f, outfile="pics/f_optimized.png", var_with_name_simple=True)
......@@ -557,7 +557,7 @@ used within a MinGW Shell (not available if you only installed Python(x,y)).
You do not need to do the following now, because it is not usually needed, but if
later on, when running Theano, you see an error message that looks like:
*error: 'assert' was not declared in this scope*
then you will have to add another section:
.. code-block:: cfg
......
......@@ -144,7 +144,7 @@ import theano and print the config variable, as in:
.. attribute:: floatX
String value: either 'float64' or 'float32'
Default: 'float64'
......@@ -152,6 +152,50 @@ import theano and print the config variable, as in:
and similar functions. It also sets the default theano bit width for
arguments passed as Python floating-point numbers.
.. attribute:: cast_policy
String value: either 'numpy+floatX' or 'custom'
Default: 'custom'
This specifies how data types are implicitly figured out in Theano, e.g. for
constants or in the results of arithmetic operations. The 'custom' value
corresponds to a set of custom rules originally used in
Theano (which can be partially customized, see e.g. the in-code help of
``tensor.NumpyAutocaster``), and will be deprecated in the future.
The 'numpy+floatX' setting attempts to mimic the numpy casting rules,
although it prefers to use float32 numbers instead of float64 when
``config.floatX`` is set to 'float32' and the user uses data that is not
explicitly typed as float64 (e.g. regular Python floats).
Note that 'numpy+floatX' is not currently behaving exactly as planned (it
is a work-in-progress), and thus you should consider it as experimental.
At the moment it behaves differently from numpy in the following
situations:
* Depending on the value of ``config.int_division``, the resulting type
of a division of integer types with the ``/`` operator may not match
that of numpy.
* On mixed scalar / array operations, numpy tries to prevent the scalar
from upcasting the array's type unless it is of a fundamentally
different type. Theano does not attempt to do the same at this point,
so you should be careful that scalars may upcast arrays when they
would not when using numpy. This behavior should change in the near
future.
.. attribute:: int_division
String value: either 'int', 'floatX' or 'raise'
Default: 'int'
Specifies what to do when one tries to compute ``x / y``, where both ``x`` and
``y`` are of integer types (possibly unsigned). 'int' means an integer is
returned (as in Python 2.X), but this behavior is deprecated. 'floatX'
returns a number of type given by ``config.floatX``. 'raise' is the safest
choice (and will become default in a future release of Theano) and raises
an error when one tries to do such an operation, enforcing the use of the
integer division operator (``//``) (if a float result is intended, either
cast one of the arguments to a float, or use ``x.__truediv__(y)``).
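The three behaviors can be sketched in plain Python (the helper below is illustrative only; it is not Theano's implementation, and plain ``float`` stands in for whatever ``config.floatX`` selects):

```python
def int_division(x, y, policy='int'):
    # 'int': floor division, as Python 2's `/` on integers (deprecated behavior)
    if policy == 'int':
        return x // y
    # 'floatX': produce a floating-point result
    if policy == 'floatX':
        return float(x) / y
    # 'raise': refuse implicit integer division; the caller must use `//`
    # or cast an argument to float explicitly
    if policy == 'raise':
        raise TypeError("integer division of %r / %r: use // or cast to float"
                        % (x, y))
    raise ValueError("unknown policy: %r" % policy)
```

For example, ``int_division(7, 2, 'int')`` gives ``3``, ``int_division(7, 2, 'floatX')`` gives ``3.5``, and the 'raise' policy refuses the operation outright.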
.. attribute:: mode
String value: 'Mode', 'ProfileMode', 'DebugMode', 'FAST_RUN', 'FAST_COMPILE'
......@@ -385,3 +429,23 @@ import theano and print the config variable, as in:
means using the default, defined by :attr:`config.numpy.seterr_all`.
This flag's value cannot be modified during the program execution.
.. attribute:: config.compute_test_value
String Value: ``'off'``, ``'ignore'``, ``'warn'``, ``'raise'``.
Default: ``'off'``
Setting this attribute to something other than ``'off'`` activates a
debugging mechanism, where Theano executes the graph on-the-fly, as it is
being built. This allows the user to spot errors early on (such as
dimension mis-match), **before** optimizations are applied.
Theano will execute the graph using the Constants and/or shared variables
provided by the user. Purely symbolic variables (e.g. x = T.dmatrix()) can be
augmented with test values, by writing to their ``'tag.test_value'``
attribute (e.g. x.tag.test_value = numpy.random.rand(5,4)).
``'warn'`` will result in a UserWarning being raised when some Op inputs
do not contain an appropriate test value. ``'raise'`` will instead raise
an Exception when a problem is encountered during this debugging phase.
......@@ -14,6 +14,7 @@
:maxdepth: 1
cuda/index
linalg
.. ../../../../theano/sandbox/linalg/ops.py
.. ../../../../theano/sandbox/linalg
.. _libdoc_linalg:
===================================================================
:mod:`sandbox.linalg` -- Linear Algebra Ops
===================================================================
.. module:: sandbox.linalg
:platform: Unix, Windows
:synopsis: Linear Algebra Ops
.. moduleauthor:: LISA
API
===
.. automodule:: theano.sandbox.linalg.ops
:members:
......@@ -728,11 +728,11 @@ row of a matrix x:
Index-assignment is *not* supported. If you want to do something like ``a[5]
= b`` or ``a[5]+=b``, see :func:`set_subtensor` and :func:`inc_subtensor` below.
.. autofunction:: theano.tensor.basic.set_subtensor
.. autofunction:: theano.tensor.basic.inc_subtensor
.. _tensor_operator_support:
......
......@@ -112,7 +112,7 @@ Misc
----
The sparse equivalent of dmatrix is csc_matrix and csr_matrix.
:api:`Dot` vs. :api:`StructuredDot`
----------------------------------------
Often when you use a sparse matrix it is because there is a meaning to the
......
......@@ -17,6 +17,132 @@ Isolating the problem/Testing Theano compiler
You can run your Theano function in DebugMode (:ref:`using_debugmode`). This tests the Theano optimizations and helps to find where NaN, inf and other problems come from.
Interactive Debugger
--------------------
As of v.0.4.0, Theano has a new mechanism by which graphs are executed
on-the-fly, before a theano.function is ever compiled. Since optimizations
haven't been applied at this stage, it is easier for the user to locate the
source of a bug. This functionality is enabled through the config flag
``theano.config.compute_test_value``. Its use is best shown through the
following example.
.. code-block:: python
# compute_test_value is 'off' by default, meaning this feature is inactive
theano.config.compute_test_value = 'off'
# configure shared variables
W1val = numpy.random.rand(2,10,10).astype(theano.config.floatX)
W1 = theano.shared(W1val, 'W1')
W2val = numpy.random.rand(15,20).astype(theano.config.floatX)
W2 = theano.shared(W2val, 'W2')
# input which will be of shape (5,10)
x = T.matrix('x')
# transform the shared variable in some way. Theano does not
# know off hand that the matrix func_of_W1 has shape (20,10)
func_of_W1 = W1.dimshuffle(2,0,1).flatten(2).T
# source of error: dot product of 5x10 with 20x10
h1 = T.dot(x,func_of_W1)
# do more stuff
h2 = T.dot(h1,W2.T)
# compile and call the actual function
f = theano.function([x], h2)
f(numpy.random.rand(5,10))
Running the above code generates the following error message:
.. code-block:: bash
Definition in:
File "/u/desjagui/workspace/PYTHON/theano/gof/opt.py", line 1102, in apply
lopt_change = self.process_node(env, node, lopt)
File "/u/desjagui/workspace/PYTHON/theano/gof/opt.py", line 882, in process_node
replacements = lopt.transform(node)
File "/u/desjagui/workspace/PYTHON/Theano/theano/tensor/blas.py", line 1030, in local_dot_to_dot22
return [_dot22(*node.inputs)]
File "/u/desjagui/workspace/PYTHON/Theano/theano/gof/op.py", line 324, in __call__
self.add_tag_trace(node)
For the full definition stack trace set the Theano flags traceback.limit to -1
Traceback (most recent call last):
File "test.py", line 29, in <module>
f(numpy.random.rand(5,10))
File "/u/desjagui/workspace/PYTHON/theano/compile/function_module.py", line 596, in __call__
self.fn()
File "/u/desjagui/workspace/PYTHON/theano/gof/link.py", line 288, in streamline_default_f
raise_with_op(node)
File "/u/desjagui/workspace/PYTHON/theano/gof/link.py", line 284, in streamline_default_f
thunk()
File "/u/desjagui/workspace/PYTHON/Theano/theano/gof/cc.py", line 1111, in execute
raise exc_type, exc_value, exc_trace
ValueError: ('Shape mismatch: x has 10 cols but y has 20 rows',
_dot22(x, <TensorType(float64, matrix)>), [_dot22.0],
_dot22(x, InplaceDimShuffle{1,0}.0), 'Sequence id of Apply node=4')
Needless to say the above is not very informative and does not provide much in
the way of guidance. However, by instrumenting the code ever so slightly, we
can get Theano to give us the exact source of the error.
.. code-block:: python
# enable on-the-fly graph computations
theano.config.compute_test_value = 'warn'
...
# input which will be of shape (5,10)
x = T.matrix('x')
# provide Theano with a default test-value
x.tag.test_value = numpy.random.rand(5,10)
In the above, we're tagging the symbolic matrix ``x`` with a special test
value. This allows Theano to evaluate symbolic expressions on-the-fly (by
calling the ``perform`` method of each Op), as they are being defined. Sources
of error can thus be identified with much more precision and much earlier in
the compilation pipeline. For example, running the above code yields the
following error message, which properly identifies line 23 as the culprit.
.. code-block:: bash
Traceback (most recent call last):
File "test2.py", line 23, in <module>
h1 = T.dot(x,func_of_W1)
File "/u/desjagui/workspace/PYTHON/Theano/theano/gof/op.py", line 360, in __call__
node.op.perform(node, input_vals, output_storage)
File "/u/desjagui/workspace/PYTHON/Theano/theano/tensor/basic.py", line 4458, in perform
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: ('matrices are not aligned', (5, 10), (20, 10))
The compute_test_value mechanism works as follows:
* Theano Constants and SharedVariables are used as is; there is no need to instrument them.
* A Theano ``Variable`` (i.e. ``dmatrix``, ``vector``, etc.) should be
given a special test value through the attribute ``tag.test_value``.
* Theano automatically instruments intermediate results. As such, any quantity
derived from ``x`` will be given a `tag.test_value` automatically.
`compute_test_value` can take the following values:
* ``off``: default behavior. This debugging mechanism is inactive.
* ``raise``: compute test values on the fly. Any variable for which a test
value is required, but not provided by the user, is treated as an error. An
exception is raised accordingly.
* ``warn``: idem, but a warning is issued instead of an Exception.
* ``ignore``: silently ignore the computation of intermediate test values, if a
variable is missing a test value.
.. note::
This feature is currently not compatible with ``Scan``, nor with Ops
that do not implement a ``perform`` method.
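The spirit of this mechanism can be sketched without Theano. The toy classes below are ours, not the real API; they only illustrate how carrying a test value along with each variable lets shape errors surface at graph-construction time instead of at run time:

```python
class Sym(object):
    # toy symbolic variable; `shape` plays the role of tag.test_value
    def __init__(self, name, shape=None):
        self.name = name
        self.shape = shape

def dot(a, b, compute_test_value='raise'):
    # with the mechanism on, shapes are checked while the "graph" is being
    # built, so the error points at this call site, not at execution time
    if compute_test_value != 'off' and a.shape and b.shape:
        if a.shape[1] != b.shape[0]:
            raise ValueError('matrices are not aligned: %s %s'
                             % (a.shape, b.shape))
    out_shape = None
    if a.shape and b.shape:
        out_shape = (a.shape[0], b.shape[1])
    return Sym('dot(%s, %s)' % (a.name, b.name), out_shape)
```

Here ``dot(Sym('x', (5, 10)), Sym('w', (20, 10)))`` raises immediately, mirroring the ``(5, 10)`` vs ``(20, 10)`` mismatch in the traceback above, while a well-aligned pair builds a node whose test shape is propagated automatically.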
How do I print an intermediate value in a Function/Method?
----------------------------------------------------------
......
......@@ -46,9 +46,9 @@ AUTHOR = "LISA laboratory, University of Montreal"
AUTHOR_EMAIL = "theano-dev@googlegroups.com"
PLATFORMS = ["Windows", "Linux", "Solaris", "Mac OS-X", "Unix"]
MAJOR = 0
MINOR = 4
MICRO = 0
SUFFIX = "rc3" # Should be blank except for rc's, betas, etc.
ISRELEASED = False
VERSION = '%d.%d.%d%s' % (MAJOR, MINOR, MICRO, SUFFIX)
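With the new values above, the interpolation at the end produces the release-candidate string:

```python
# same format string as in setup.py, with the values from this diff
MAJOR, MINOR, MICRO, SUFFIX = 0, 4, 0, "rc3"
VERSION = '%d.%d.%d%s' % (MAJOR, MINOR, MICRO, SUFFIX)
print(VERSION)
```

which prints ``0.4.0rc3``; with ``SUFFIX = ""`` it would collapse to plain ``0.4.0``.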
......@@ -105,12 +105,13 @@ if not release:
a = open(filename, 'w')
try:
    a.write(cnt % {'version': VERSION,
                   'full_version': FULL_VERSION,
                   'hg_revision': HG_REVISION,
                   'isrelease': str(ISRELEASED)})
except Exception, e:
    print e
finally:
    a.close()
......
......@@ -937,9 +937,14 @@ class FunctionMaker(object):
optimizer, linker = mode.optimizer, copy.copy(mode.linker)
# optimize the env
compute_test_value_orig = theano.config.compute_test_value
try:
    theano.config.compute_test_value = "off"
    start_optimizer = time.time()
    optimizer(env)
    end_optimizer = time.time()
finally:
    theano.config.compute_test_value = compute_test_value_orig
mode.optimizer_time += end_optimizer - start_optimizer
_logger.debug('Optimizing took %f seconds' % (end_optimizer - start_optimizer))
......
......@@ -411,7 +411,10 @@ class ProfileMode(Mode):
apply_time, op_cimpl, message, outputs_size,
other_time)
if not outputs_size:
    print """\nProfile of Theano intermediate memory disabled.
To enable it, set the Theano flag ProfileMode.profile_memory to True."""
else:
    fct_memory = {}  # env -> dict(node -> (outputs size))
    var_mem = {}
    for node, val in outputs_size.items():
......@@ -421,6 +424,7 @@ class ProfileMode(Mode):
var_mem[out]=v
print
print "Profile of Theano functions memory:"
print "(This checks only the output of each apply node. It doesn't check the temporary memory used by the op in the apply node.)"
nb_skipped = 0
for env, nodes_mem in fct_memory.iteritems():
    size_sum = sum([sum(val) for key, val in nodes_mem.iteritems()])
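The summation in the loop above, reduced to its essentials (a sketch with made-up data; the helper name and dict interface are ours, not the profiler's actual structures):

```python
def total_output_bytes(nodes_mem):
    # nodes_mem maps each apply node to a tuple of its outputs' sizes;
    # the per-function total is the sum over all nodes and all outputs
    return sum(sum(sizes) for sizes in nodes_mem.values())
```

For example, two nodes with output sizes ``(4, 8)`` and ``(16,)`` give a total of 28 bytes.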
......
This diff is collapsed.
# For flag of bool type, we consider the string 'False','false' and '0' as False
# and the string 'True', 'true', '1' as true.
# We also accept the bool type as its corresponding value!
......@@ -7,6 +7,8 @@ import ConfigParser
import logging
import warnings
import theano
_logger = logging.getLogger('theano.config')
class TheanoConfigWarning(Warning):
......@@ -103,6 +105,21 @@ def _config_print(thing, buf):
print >> buf, " Value: ", cv.val
print >> buf, ""
def get_config_md5():
    """
    Return a string md5 of the current config options. It should be such that
    we can safely assume that two different config setups will lead to two
    different strings.

    We only take into account config options for which `in_c_key` is True.
    """
    all_opts = sorted([c for c in _config_var_list if c.in_c_key],
                      key=lambda cv: cv.fullname)
    return theano.gof.cc.hash_from_code('\n'.join(
        ['%s = %s' % (cv.fullname, cv.val) for cv in all_opts]))
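The idea behind ``get_config_md5`` can be reproduced with only the standard library. In this sketch, ``hashlib.md5`` stands in for ``theano.gof.cc.hash_from_code``, and the plain dict interface is our simplification (the real code filters on ``in_c_key``):

```python
import hashlib

def config_md5(options):
    # options: {fullname: value}; sorting by name makes the digest
    # independent of insertion order, as in the real implementation
    lines = ['%s = %s' % (name, options[name]) for name in sorted(options)]
    return hashlib.md5('\n'.join(lines).encode('utf8')).hexdigest()
```

Two configurations that differ in any option value hash differently, while the order in which options were set does not matter.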
class TheanoConfigParser(object):
#properties are installed by AddConfigVar
_i_am_a_config_class = True
......@@ -110,6 +127,7 @@ class TheanoConfigParser(object):
sio = StringIO.StringIO()
_config_print(self.__class__, sio)
return sio.getvalue()
# N.B. all instances of TheanoConfigParser give access to the same properties.
config = TheanoConfigParser()
......@@ -124,17 +142,27 @@ config = TheanoConfigParser()
# - The subtrees provide the same interface as the root
# - ConfigParser subclasses control get/set of config properties to guard against craziness.
def AddConfigVar(name, doc, configparam, root=config):
def AddConfigVar(name, doc, configparam, root=config, in_c_key=True):
"""Add a new variable to theano.config
:type name: string of the form "[section0.[section1.[etc]]].option"
:param name: the full name for this configuration variable.
:type doc: string
:param doc: What does this variable specify?
:type configparam: ConfigParam instance
:param configparam: an object for getting and setting this configuration parameter
:param configparam: an object for getting and setting this configuration parameter
:type root: object
:param root: used for recursive calls -- don't provide an argument for this parameter.
:param root: used for recursive calls -- do not provide an argument for this parameter.
:type in_c_key: boolean
:param in_c_key: If True, then whenever this config option changes, the
key associated to compiled C modules also changes, i.e. it may trigger a
compilation of these modules (this compilation will only be partial if it
turns out that the generated C code is unchanged). Set this option to False
only if you are confident this option should not affect C code compilation.
:returns: None
"""
......@@ -155,11 +183,13 @@ def AddConfigVar(name, doc, configparam, root=config):
newroot = getattr(root, sections[0])
if not getattr(newroot, '_i_am_a_config_class', False) or isinstance(newroot, type):
raise TypeError('Internal config nodes must be config class instances', newroot)
return AddConfigVar('.'.join(sections[1:]), doc, configparam, root=newroot)
return AddConfigVar('.'.join(sections[1:]), doc, configparam,
root=newroot, in_c_key=in_c_key)
else:
if hasattr(root, name):
raise AttributeError('This name is already taken', configparam.fullname)
configparam.doc = doc
configparam.in_c_key = in_c_key
configparam.__get__() # trigger a read of the value from config files and env vars
setattr(root.__class__, sections[0], configparam)
_config_var_list.append(configparam)
......@@ -171,12 +201,16 @@ class ConfigParam(object):
So the value should be the same during all the execution
"""
self.default = default
self.filter=filter
self.filter = filter
self.allow_override = allow_override
# N.B. --
# self.fullname # set by AddConfigVar
# self.doc # set by AddConfigVar
# Check that default is a valid value
if self.filter:
self.filter(self.default)
def __get__(self, *args):
#print "GETTING PARAM", self.fullname, self, args
if not hasattr(self, 'val'):
......@@ -203,6 +237,13 @@ class EnumStr(ConfigParam):
def __init__(self, default, *options, **kwargs):
self.default = default
self.all = (default,) + options
# All options should be strings
for val in self.all:
if not isinstance(val, str):
raise ValueError('Valid values for an EnumStr parameter '
'should be strings', val, type(val))
def filter(val):
if val in self.all:
return val
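The EnumStr validation above can be sketched as a small standalone class. `EnumStrSketch` is a hypothetical name; it only mirrors the behaviour described in this hunk (first value is the default, all values must be strings, and only listed values pass the filter):

```python
class EnumStrSketch(object):
    # Minimal sketch of the EnumStr behaviour above.
    def __init__(self, default, *options):
        self.all = (default,) + options
        # All options should be strings.
        for val in self.all:
            if not isinstance(val, str):
                raise ValueError('Valid values for an EnumStr parameter '
                                 'should be strings', val, type(val))
        self.val = default

    def filter(self, val):
        # Accept only one of the declared option strings.
        if val in self.all:
            return val
        raise ValueError('Invalid value %r (valid options: %r)'
                         % (val, self.all))
```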
......@@ -248,7 +289,7 @@ def BoolParam(default, is_valid=None, allow_override=True):
def is_valid_bool(s):
if s in ['False', 'false', '0', 'True', 'true', '1', False, True]:
return True
else:
else:
return False
if is_valid is None:
is_valid = is_valid_bool
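The boolean-flag convention described in the comment at the top of this file (accept the strings 'False'/'false'/'0' and 'True'/'true'/'1' as well as real bools) can be sketched as a conversion helper. `to_bool` is a hypothetical name, not a function from the Theano codebase:

```python
def to_bool(s):
    # Sketch of the boolean-flag convention above: map accepted spellings
    # to a Python bool, reject anything else.
    if s in ['False', 'false', '0', False]:
        return False
    if s in ['True', 'true', '1', True]:
        return True
    raise ValueError('Not a recognized boolean flag value', s)
```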
......
"""Apply subclass for use with Tensors that implement shape propagation via variable.tag.shape
This is not used much currently. It appears in some cases, but I'm not sure whether it works or whether it is used by default.
It could help the current system detect problems earlier, when constructing the graph instead of during optimization.
"""
import sys
from theano import gof
def ishape(v):
try:
return (True, v.tag.shape)
except AttributeError:
return (False, (None,)*v.type.ndim)
class Apply(gof.Apply):
def __init__(self, op, inputs, outputs):
super(Apply, self).__init__(op, inputs, outputs)
if not inputs:
return
# if any input has any shape info, then propagate it
try:
provided, ishapes = zip(*[ishape(i) for i in inputs])
except AttributeError:
# i.type.ndim didn't make sense for some i
return
if not any(provided):
# no input had a tag.shape
return
try:
infer_shape = op.infer_shape
except AttributeError:
# op has no infer_shape, that's fine
return
try:
oshapes = infer_shape(self, ishapes)
except NotImplementedError:
return
for o, oshp in zip(outputs, oshapes):
o.tag.shape = oshp
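The propagation logic above can be sketched without the graph machinery. Both names below are hypothetical: `dot_infer_shape` is an `infer_shape` in the style the code expects (input shapes in, output shapes out), and `propagate` mirrors the "only propagate when at least one input provided shape info" rule:

```python
def dot_infer_shape(ishapes):
    # Hypothetical infer_shape for a matrix product: output shape is
    # (rows of A, columns of B).
    (a_shape, b_shape) = ishapes
    return [(a_shape[0], b_shape[1])]

def propagate(ishapes_with_flags, infer_shape):
    # Mirror of the Apply logic above: each entry is (provided, shape);
    # skip propagation entirely when no input carried shape information.
    provided, ishapes = zip(*ishapes_with_flags)
    if not any(provided):
        return None
    return infer_shape(ishapes)
```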
......@@ -7,6 +7,7 @@ from copy import copy
import re #for set_compiledir
import os, sys, StringIO
if sys.version_info[:2] >= (2,5):
import hashlib
def hash_from_code(msg):
......@@ -16,6 +17,13 @@ else:
def hash_from_code(msg):
return md5.new(msg).hexdigest()
def hash_from_file(file_path):
"""Return the MD5 hash of a file."""
return hash_from_code(open(file_path, 'rb').read())
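The two helpers above can be sketched with `hashlib` alone (the `md5`-module fallback in the diff is only needed on Python < 2.5). The `_sketch` suffixes mark these as illustrative stand-ins, not the real functions; the file variant also closes its file handle explicitly, unlike the one-liner above:

```python
import hashlib

def hash_from_code_sketch(msg):
    # hashlib-based equivalent of hash_from_code above.
    if not isinstance(msg, bytes):
        msg = msg.encode('utf-8')
    return hashlib.md5(msg).hexdigest()

def hash_from_file_sketch(file_path):
    # Same idea as hash_from_file above, but closing the file explicitly.
    f = open(file_path, 'rb')
    try:
        return hash_from_code_sketch(f.read())
    finally:
        f.close()
```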
import theano
from theano.gof.python25 import all
from theano import config
......@@ -43,6 +51,7 @@ import cmodule
import logging
_logger=logging.getLogger("theano.gof.cc")
_logger.setLevel(logging.WARN)
def info(*args):
_logger.info(' '.join(str(a) for a in args))
def debug(*args):
......@@ -791,7 +800,7 @@ class CLinker(link.Linker):
The key returned by this function is of the form (version, signature)
The signature has the following form:
{{{
'CLinker.cmodule_key', compilation args, libraries,
'CLinker.cmodule_key', compilation args, libraries, config md5,
(op0, input_signature0, output_signature0),
(op1, input_signature1, output_signature1),
...
......@@ -858,10 +867,16 @@ class CLinker(link.Linker):
constant_ids = dict()
op_pos = {} # Apply -> topological position
# first we put the header, compile_args, library names into the signature
# First we put the header, compile_args, library names and config md5
# into the signature.
sig = ['CLinker.cmodule_key'] # will be cast to tuple on return
if compile_args is not None: sig.append(tuple(compile_args))
if libraries is not None: sig.append(tuple(libraries))
# IMPORTANT: The 'md5' prefix is used to isolate the compilation
# parameters from the rest of the key. If you want to add more key
# elements, they should be before this md5 hash if and only if they
# can lead to a different compiled file with the same source code.
sig.append('md5:' + theano.configparser.get_config_md5())
# technically this should only be appended for gcc-compiled Ops
# and the flags of other compilers should be inserted here... but it's not clear how to
......@@ -943,11 +958,30 @@ class CLinker(link.Linker):
def compile_cmodule(self, location=None):
"""
This method is a callback for `ModuleCache.module_from_key`.
Compile the module and return it.
"""
# Go through all steps of the compilation process.
for step_result in self.compile_cmodule_by_step(location=location):
pass
# And return the output of the last step, which should be the module
# itself.
return step_result
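The drain-the-generator pattern used by `compile_cmodule` above can be sketched in isolation. Both functions below are hypothetical placeholders: the generator first yields the "source code" and last yields the "module" (here plain strings), and the consumer keeps only the last step's output:

```python
def compile_by_step_sketch():
    # Sketch of the compile_cmodule_by_step protocol above: first yield
    # the source code, last yield the compiled module (placeholders here).
    src_code = '// generated C source'
    yield src_code
    module = 'compiled module for: ' + src_code
    yield module

def last_step(gen):
    # Mirror of compile_cmodule above: drain the generator and return
    # the output of its last step.
    for step_result in gen:
        pass
    return step_result
```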
def compile_cmodule_by_step(self, location=None):
"""
This method is a callback for `ModuleCache.module_from_key`.
It is a generator (thus the 'by step'), so that:
- it first yields the module's C code
- it last yields the module itself
- it may yield other intermediate outputs in-between if needed
in the future (but this is not currently the case)
"""
if location is None:
location = cmodule.dlimport_workdir(config.compiledir)
mod = self.build_dynamic_module()
src_code = mod.code()
yield src_code
get_lock()
try:
debug("LOCATION", location)
......@@ -955,7 +989,7 @@ class CLinker(link.Linker):
libs = self.libraries()
preargs = self.compile_args()
if c_compiler.__name__=='nvcc_module_compile_str' and config.lib.amdlibm:
#this lib don't work correctly with nvcc in device code.
# This lib does not work correctly with nvcc in device code.
if '<amdlibm.h>' in mod.includes:
mod.includes.remove('<amdlibm.h>')
if '-DREPLACE_WITH_AMDLIBM' in preargs:
......@@ -965,7 +999,7 @@ class CLinker(link.Linker):
try:
module = c_compiler(
module_name=mod.name,
src_code = mod.code(),
src_code=src_code,
location=location,
include_dirs=self.header_dirs(),
lib_dirs=self.lib_dirs(),
......@@ -977,8 +1011,7 @@ class CLinker(link.Linker):
finally:
release_lock()
return module
yield module
def build_dynamic_module(self):
"""Return a cmodule.DynamicModule instance full of the code for our env.
......@@ -1041,10 +1074,10 @@ class CLinker(link.Linker):
except KeyError:
key = None
if key is None:
#if we can't get a key, then forget the cache mechanism
# If we can't get a key, then forget the cache mechanism.
module = self.compile_cmodule()
else:
module = get_module_cache().module_from_key(key=key, fn=self.compile_cmodule, keep_lock=keep_lock)
module = get_module_cache().module_from_key(key=key, fn=self.compile_cmodule_by_step, keep_lock=keep_lock)
vars = self.inputs + self.outputs + self.orphans
# List of indices that should be ignored when passing the arguments
......@@ -1174,54 +1207,21 @@ class OpWiseCLinker(link.LocalLinker):
else:
post_thunk_old_storage = None
compute_map = {}
for k in storage_map:
compute_map[k] = [k.owner is None]
thunks = []
for node in order:
# Make sure we use the C version of the code whenever
# possible
node._op_use_c_code = True
thunks += [node.op.make_thunk(node,
storage_map,
compute_map,
no_recycling)]
for node_idx, node in enumerate(order):
node_input_storage = [storage_map[r] for r in node.inputs]
node_output_storage = [storage_map[r] for r in node.outputs]
debug('Compiling node %i of graph' % node_idx)
thunk = None
# If the op does not override the c_code method, we don't try
# to generate a cthunk! Otherwise we won't find it in the compilation cache
# and will try to compile it. That would take the lock even when we don't need it!
if node.op.c_code.im_func is not op.Op.c_code.im_func:
try:
e = Env(*graph.clone(node.inputs, node.outputs))
if self.allow_gc:
# if we allow garbage collection of intermediate nodes
# we must forbid this C implementation from caching its own
# reference to its output
node_no_recycling = e.outputs
else:
node_no_recycling = [r for r, r2 in zip(e.outputs, node.outputs) if r2 in no_recycling]
cl = CLinker().accept(e, node_no_recycling)
debug('Trying CLinker.make_thunk')
thunk, node_input_filters, node_output_filters = cl.make_thunk(
input_storage = node_input_storage,
output_storage = node_output_storage,
keep_lock=getattr(get_lock,"n_lock",0) != orig_n_lock)
assert callable(thunk)
thunk.inputs = node_input_storage
thunk.outputs = node_output_storage
thunks.append(thunk)
do_python_thunk = False
except (NotImplementedError, utils.MethodNotDefined):
thunk = None
if thunk is None:
if self.fallback_on_perform:
debug('Falling back on perform')
p = node.op.perform
# default arguments are stored in the closure of `thunk`
def thunk(p=p, i=node_input_storage, o=node_output_storage,n=node):
return p(n, [x[0] for x in i], o)
#thunk = lambda p = p, i = node_input_storage, o = node_output_storage, n = node: p(n, [x[0] for x in i], o)
thunk.inputs = node_input_storage
thunk.outputs = node_output_storage
thunk.perform = p
thunks.append(thunk)
else:
raise NotImplementedError("We were not able to use c_code or perform for this node", node)
if self.allow_gc:
post_thunk_old_storage.append([storage_map[input]
......
Diff collapsed.
......@@ -6,7 +6,6 @@ from type import Type
import sys, traceback
from copy import copy
from theano.gof.python25 import all
import numpy
__excepthook = sys.excepthook
def thunk_hook(type, value, trace):
......@@ -329,7 +328,7 @@ class LocalLinker(Linker):
# 3. output storage
# 4. thunks: list of nodes' functions in the order they will be run by the function in (1)
# 5. order: list of nodes, in the order they will be run by the function in (1)
raise MethodNotDefined("make_all", type(self), self.__class__.__name__)
raise utils.MethodNotDefined("make_all", type(self), self.__class__.__name__)
def gc_helper(node_list):
"""
......@@ -391,10 +390,23 @@ class PerformLinker(LocalLinker):
order = list(env.toposort())
no_recycling = self.no_recycling
thunks = []
input_storage, output_storage, storage_map = map_storage(env, order, input_storage, output_storage)
compute_map = {}
for k in storage_map:
compute_map[k] = [k.owner is None]
thunks = []
for node in order:
# Make sure we don't use the C version of the code, but rather
# only the Python version
node.op._op_use_c_code = False
thunks += [node.op.make_thunk(node,
storage_map,
compute_map,
no_recycling)]
computed, last_user = gc_helper(order)
if self.allow_gc:
post_thunk_old_storage = []
......@@ -402,18 +414,6 @@ class PerformLinker(LocalLinker):
post_thunk_old_storage = None
for node in order:
node_input_storage = tuple(storage_map[input] for input in node.inputs)
node_output_storage = tuple(storage_map[output] for output in node.outputs)
p = node.op.perform
# Thunk is meant to be called without arguments.
# The arguments are given in the lambda expression so that they are saved in the lambda expression.
# Using the closure in a simple way didn't work.
thunk = lambda p = p, i = node_input_storage, o = node_output_storage, n = node: p(n, [x[0] for x in i], o)
thunk.inputs = node_input_storage
thunk.outputs = node_output_storage
thunk.perform = p
thunks.append(thunk)
if self.allow_gc:
post_thunk_old_storage.append([storage_map[input]
for input in node.inputs
......
......@@ -2,16 +2,20 @@
The `Op` class is the base interface for all operations
compatible with `gof`'s :doc:`graph` routines.
"""
__docformat__ = "restructuredtext en"
from .. import config
from theano import config
import graph
import numpy
import utils
import warnings
import logging
from theano import config
from env import Env
import graph
import cc
class CLinkerObject(object):
......@@ -323,45 +327,64 @@ class PureOp(object):
"""
node = self.make_node(*inputs, **kwargs)
self.add_tag_trace(node)
if config.compute_test_value:
if config.compute_test_value != 'off':
# avoid circular import
from ..compile.sharedvalue import SharedVariable
from theano.compile.sharedvalue import SharedVariable
run_perform = True
# build test input-values
input_vals = []
for ins in inputs:
for i, ins in enumerate(node.inputs):
if isinstance(ins, graph.Constant):
input_vals.append(ins.value)
elif isinstance(ins,numpy.ndarray):
input_vals.append(ins)
elif isinstance(ins,SharedVariable):
input_vals.append(ins.get_value(borrow=True))
input_vals.append(ins.get_value(borrow=True, return_internal_type=True))
elif isinstance(ins,graph.Variable) and hasattr(ins.tag, 'test_value'):
input_vals.append(ins.tag.test_value)
# ensure that the test value is correct
input_vals.append(ins.type.filter(ins.tag.test_value))
else:
# no test-value was specified, act accordingly
if config.compute_test_value == 'warn':
raise Warning('Cannot compute test value: input %s of Op %s missing default value')
warnings.warn('Warning, Cannot compute test value: input %i (%s) of Op %s missing default value' % (i, ins, node), stacklevel=2)
run_perform = False
elif config.compute_test_value == 'err':
raise ValueError('Cannot compute test value: input %s of Op %s missing default value')
else:
elif config.compute_test_value == 'raise':
raise ValueError('Cannot compute test value: input %i (%s) of Op %s missing default value' % (i, ins, node))
elif config.compute_test_value == 'ignore':
# silently skip test
run_perform = False
else:
raise ValueError('%s is invalid for option config.compute_test_value' % config.compute_test_value)
# if all inputs have test-values, run the actual op
if run_perform:
# Original values should not be destroyed:
# copy the values of the inputs in destroy_map
destroyed_inputs_idx = []
if getattr(node.op, 'destroy_map', None):
for i_pos_list in node.op.destroy_map.itervalues():
destroyed_inputs_idx.extend(i_pos_list)
for i in destroyed_inputs_idx:
input_vals[i] = input_vals[i].copy()
# compute output value once with test inputs to validate graph
output_storage = [[None] * len(node.outputs)]
node.op.perform(node, input_vals, output_storage)
# add 'test_value' to output tags, so that downstream ops can use these
# numerical values as inputs to their perform method.
for (outval, node_output) in zip(output_storage, node.outputs):
node_output.tag.test_value = outval[0]
output_storage = [[None]] * len(node.outputs)
try:
node.op.perform(node, input_vals, output_storage)
# add 'test_value' to output tags, so that downstream ops can use these
# numerical values as inputs to their perform method.
for (outval, node_output) in zip(output_storage, node.outputs):
node_output.tag.test_value = outval[0]
except utils.MethodNotDefined, e:
# This case happens when the perform method is not defined
# for a certain Op.
#TODO: use the c_thunk?
if config.compute_test_value == 'warn':
warnings.warn('Warning, in compute_test_value: %s' % str(e), stacklevel=2)
elif config.compute_test_value == 'raise':
raise
if self.default_output is not None:
return node.outputs[self.default_output]
......@@ -405,4 +428,82 @@ class PureOp(object):
class Op(utils.object2, PureOp, CLinkerOp):
"""Convenience class to bundle `PureOp` and `CLinkerOp`"""
pass
def __new__(cls, *args, **kwargs):
# this function exists to silently and transparently ensure that all
# existing Ops get a _op_use_c_code attribute
obj = object.__new__(cls, *args, **kwargs)
if not hasattr(obj, '_op_use_c_code'):
obj._op_use_c_code = True
return obj
def __init__(self, use_c_code=True):
self._op_use_c_code = use_c_code
def make_thunk(self, node, storage_map, compute_map, no_recycling):
"""
:param node: something previously returned by self.make_node
:param storage_map: dict variable -> one-element-list where a computed
value for this variable may be found.
:param compute_map: dict variable -> one-element-list where a boolean
value will be found. The boolean indicates whether the
variable's storage_map container contains a valid value (True)
or if it has not been computed yet (False).
:param no_recycling: list of variables for which it is forbidden to
reuse memory allocated by a previous call.
:note: If the thunk consults the storage_map on every call, it is safe
for it to ignore the no_recycling argument, because elements of the
no_recycling list will have a value of None in the storage map. If
the thunk can potentially cache return values (like CLinker does),
then it must not do so for variables in the no_recycling list.
"""
logger = logging.getLogger('theano.Op')
node_input_storage = [storage_map[r] for r in node.inputs]
node_output_storage = [storage_map[r] for r in node.outputs]
node_input_compute = [compute_map[r] for r in node.inputs]
node_output_compute = [compute_map[r] for r in node.outputs]
#logger.debug('Compiling node %i of graph' % node_idx)
if self._op_use_c_code:
try:
e = Env(*graph.clone(node.inputs, node.outputs))
e_no_recycling = [new_o
for (new_o, old_o) in zip(e.outputs, node.outputs)
if old_o in no_recycling]
cl = cc.CLinker().accept(e,
no_recycling=e_no_recycling)
logger.debug('Trying CLinker.make_thunk')
fill_storage, node_input_filters, node_output_filters = cl.make_thunk(
input_storage = node_input_storage,
output_storage = node_output_storage)
def rval():
fill_storage()
for o in node.outputs:
compute_map[o][0] = True
rval.cthunk = fill_storage.cthunk
rval.inputs = node_input_storage
rval.outputs = node_output_storage
rval.lazy = False
return rval
except (NotImplementedError, utils.MethodNotDefined):
logger.debug('Falling back on perform')
# condition: either there was no c_code, or it failed
p = node.op.perform
# default arguments are stored in the closure of `rval`
def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
r = p(n, [x[0] for x in i], o)
for o in node.outputs:
compute_map[o][0] = True
return r
rval.inputs = node_input_storage
rval.outputs = node_output_storage
rval.perform = p
rval.lazy = False
return rval
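The perform-based fallback above can be sketched without the graph types. `make_perform_thunk` is a hypothetical helper: storage cells are one-element lists as described in the `make_thunk` docstring, and the outputs' compute flags flip to True after a run:

```python
def make_perform_thunk(perform, input_storage, output_storage, output_compute):
    # Sketch of the perform fallback above: read input cells, let
    # `perform` fill the output cells, then mark outputs as computed.
    def thunk():
        perform([cell[0] for cell in input_storage], output_storage)
        for flag in output_compute:
            flag[0] = True
    thunk.inputs = input_storage
    thunk.outputs = output_storage
    thunk.lazy = False
    return thunk
```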
......@@ -2,97 +2,193 @@ import numpy
import unittest
import theano
import warnings
from theano import config
from theano import tensor as T
from theano.tensor.basic import _allclose
from theano.scan_module import scan
class TestComputeTestValue(unittest.TestCase):
def test_variable_only(self):
theano.config.compute_test_value = True
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
x = T.matrix('x')
x.tag.test_value = numpy.random.rand(3,4)
y = T.matrix('y')
y.tag.test_value = numpy.random.rand(4,5)
# should work
z = T.dot(x,y)
x = T.matrix('x')
x.tag.test_value = numpy.random.rand(3,4).astype(config.floatX)
y = T.matrix('y')
y.tag.test_value = numpy.random.rand(4,5).astype(config.floatX)
# this test should fail
y.tag.test_value = numpy.random.rand(6,5)
self.assertRaises(ValueError, T.dot, x, y)
# should work
z = T.dot(x,y)
assert hasattr(z.tag, 'test_value')
f = theano.function([x,y], z)
assert _allclose(f(x.tag.test_value, y.tag.test_value),
z.tag.test_value)
def test_compute_flag(self):
# this test should fail
y.tag.test_value = numpy.random.rand(6,5).astype(config.floatX)
self.assertRaises(ValueError, T.dot, x, y)
finally:
theano.config.compute_test_value = orig_compute_test_value
x = T.matrix('x')
y = T.matrix('y')
y.tag.test_value = numpy.random.rand(4,5)
# should skip computation of test value
theano.config.compute_test_value = False
z = T.dot(x,y)
# should fail one or another when flag is set
theano.config.compute_test_value = 'warn'
self.assertRaises(Warning, T.dot, x, y)
theano.config.compute_test_value = 'err'
self.assertRaises(ValueError, T.dot, x, y)
def test_compute_flag(self):
orig_compute_test_value = theano.config.compute_test_value
try:
x = T.matrix('x')
y = T.matrix('y')
y.tag.test_value = numpy.random.rand(4,5).astype(config.floatX)
# should skip computation of test value
theano.config.compute_test_value = 'off'
z = T.dot(x,y)
assert not hasattr(z.tag, 'test_value')
# should fail when asked by user
theano.config.compute_test_value = 'raise'
self.assertRaises(ValueError, T.dot, x, y)
# test that a warning is raised if required
theano.config.compute_test_value = 'warn'
warnings.simplefilter('error', UserWarning)
self.assertRaises(UserWarning, T.dot, x, y)
finally:
theano.config.compute_test_value = orig_compute_test_value
warnings.resetwarnings()
def test_string_var(self):
theano.config.compute_test_value = True
x = T.matrix('x')
x.tag.test_value = numpy.random.rand(3,4)
y = T.matrix('y')
y.tag.test_value = numpy.random.rand(4,5)
z = theano.shared(numpy.random.rand(5,6))
# should work
out = T.dot(T.dot(x,y), z)
def f(x,y,z):
return T.dot(T.dot(x,y),z)
# this test should fail
z.set_value(numpy.random.rand(7,6))
self.assertRaises(ValueError, f, x, y, z)
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
x = T.matrix('x')
x.tag.test_value = numpy.random.rand(3,4).astype(config.floatX)
y = T.matrix('y')
y.tag.test_value = numpy.random.rand(4,5).astype(config.floatX)
z = theano.shared(numpy.random.rand(5,6).astype(config.floatX))
# should work
out = T.dot(T.dot(x,y), z)
assert hasattr(out.tag, 'test_value')
tf = theano.function([x,y], out)
assert _allclose(
tf(x.tag.test_value, y.tag.test_value),
out.tag.test_value)
def f(x,y,z):
return T.dot(T.dot(x,y),z)
# this test should fail
z.set_value(numpy.random.rand(7,6).astype(config.floatX))
self.assertRaises(ValueError, f, x, y, z)
finally:
theano.config.compute_test_value = orig_compute_test_value
def test_shared(self):
theano.config.compute_test_value = True
x = T.matrix('x')
x.tag.test_value = numpy.random.rand(3,4)
y = theano.shared(numpy.random.rand(4,6), 'y')
# should work
z = T.dot(x,y)
# this test should fail
y.set_value(numpy.random.rand(5,6))
self.assertRaises(ValueError, T.dot, x, y)
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
x = T.matrix('x')
x.tag.test_value = numpy.random.rand(3,4).astype(config.floatX)
y = theano.shared(numpy.random.rand(4,6).astype(config.floatX), 'y')
# should work
z = T.dot(x,y)
assert hasattr(z.tag, 'test_value')
f = theano.function([x], z)
assert _allclose(f(x.tag.test_value), z.tag.test_value)
# this test should fail
y.set_value(numpy.random.rand(5,6).astype(config.floatX))
self.assertRaises(ValueError, T.dot, x, y)
finally:
theano.config.compute_test_value = orig_compute_test_value
def test_ndarray(self):
theano.config.compute_test_value = True
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
x = numpy.random.rand(2,3)
y = theano.shared(numpy.random.rand(3,6), 'y')
# should work
z = T.dot(x,y)
x = numpy.random.rand(2,3).astype(config.floatX)
y = theano.shared(numpy.random.rand(3,6).astype(config.floatX), 'y')
# this test should fail
x = numpy.random.rand(2,4)
self.assertRaises(ValueError, T.dot, x, y)
# should work
z = T.dot(x,y)
assert hasattr(z.tag, 'test_value')
f = theano.function([], z)
assert _allclose(f(), z.tag.test_value)
def test_constant(self):
theano.config.compute_test_value = True
# this test should fail
x = numpy.random.rand(2,4).astype(config.floatX)
self.assertRaises(ValueError, T.dot, x, y)
finally:
theano.config.compute_test_value = orig_compute_test_value
x = T.constant(numpy.random.rand(2,3))
y = theano.shared(numpy.random.rand(3,6), 'y')
# should work
z = T.dot(x,y)
# this test should fail
x = T.constant(numpy.random.rand(2,4))
self.assertRaises(ValueError, T.dot, x, y)
def test_constant(self):
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
x = T.constant(numpy.random.rand(2,3), dtype=config.floatX)
y = theano.shared(numpy.random.rand(3,6).astype(config.floatX), 'y')
# should work
z = T.dot(x,y)
assert hasattr(z.tag, 'test_value')
f = theano.function([], z)
assert _allclose(f(), z.tag.test_value)
# this test should fail
x = T.constant(numpy.random.rand(2,4), dtype=config.floatX)
self.assertRaises(ValueError, T.dot, x, y)
finally:
theano.config.compute_test_value = orig_compute_test_value
def test_incorrect_type(self):
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
x = T.fmatrix('x')
# Incorrect dtype (float64) for test_value
x.tag.test_value = numpy.random.rand(3,4)
y = T.dmatrix('y')
y.tag.test_value = numpy.random.rand(4,5)
self.assertRaises(TypeError, T.dot, x, y)
finally:
theano.config.compute_test_value = orig_compute_test_value
def notest_scan(self):
"""
Do not run this test as the compute_test_value mechanism is known not to work with Scan.
TODO: fix scan to work with compute_test_value
"""
orig_compute_test_value = theano.config.compute_test_value
try:
theano.config.compute_test_value = 'raise'
k = T.iscalar("k")
A = T.vector("A")
k.tag.test_value = 3
A.tag.test_value = numpy.random.rand(5)
def fx(prior_result, A):
return prior_result * A
# Symbolic description of the result
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
outputs_info=T.ones_like(A),
non_sequences=A,
n_steps=k)
# We only care about A**k, but scan has provided us with A**1 through A**k.
# Discard the values that we don't care about. Scan is smart enough to
# notice this and not waste memory saving them.
final_result = result[-1]
assert hasattr(final_result.tag, 'test_value')
finally:
theano.config.compute_test_value = orig_compute_test_value
......@@ -10,7 +10,10 @@ ls ${COMPILEDIR}|wc -l
FLAGS=warn.argmax_pushdown_bug=False,warn.gpusum_01_011_0111_bug=False,warn.sum_sum_bug=False,warn.sum_div_dimshuffle_bug=False,compiledir=${COMPILEDIR}
export PYTHONPATH=${ROOT_CWD}:$PYTHONPATH
cd ${ROOT_CWD}
cd ${ROOT_CWD}/Theano
hg summary
cd ..
echo "executing nosetests with mode=FAST_COMPILE"
THEANO_FLAGS=${FLAGS},mode=FAST_COMPILE ${NOSETESTS} Theano
echo "nb element in the compiledir:"
......
......@@ -106,20 +106,22 @@ class PycudaElemwiseSourceModuleOp(Op):
otype = CudaNdarrayType(broadcastable=[False]*_inputs[0].type.ndim)
assert self.nout == 1
# TODO: replace the scalar op with the proper c_code!
fct_name = "pycuda_elemwise_%s"%str(self.scalar_op)
out_node = Apply(self, _inputs, [otype() for o in xrange(self.nout)])
in_name = ["i"+str(id) for id in range(len(inputs))]
out_name = ["o"+str(id) for id in range(self.nout)]
c_code = self.scalar_op.c_code(out_node, "some_name", tuple([n+"[i]"for n in in_name]), tuple(n+"[i]"for n in out_name), {})
c_code_param = ", ".join([var.type.dtype_specs()[1]+" *"+name for var,name in zip(inputs,in_name) + zip(out_node.outputs,out_name)])
c_code_param = ", ".join([var.type.dtype_specs()[1]+" *"+name for var,name in zip(inputs,in_name) + zip(out_node.outputs,out_name)]+["int size"])
mod = SourceModule("""
#include<Python.h>
#include <numpy/arrayobject.h>
__global__ void %s(%s)
{
int i = threadIdx.x + threadIdx.y*blockDim.x;
%s
int i = (blockIdx.x+blockIdx.y*gridDim.x)*(blockDim.x*blockDim.y);
i += threadIdx.x + threadIdx.y*blockDim.x;
if(i<size){
%s
}
}
"""%(fct_name,c_code_param,c_code))
self.pycuda_fct = mod.get_function(fct_name)
......@@ -131,7 +133,16 @@ class PycudaElemwiseSourceModuleOp(Op):
z, = out
if z[0] is None or z[0].shape!=inputs[0].shape:
z[0] = theano.sandbox.cuda.CudaNdarray.zeros(inputs[0].shape)
self.pycuda_fct(inputs[0],inputs[1],z[0], block=(inputs[0].shape[0],inputs[0].shape[1],1))
if inputs[0].shape != inputs[1].shape:
raise TypeError("PycudaElemwiseSourceModuleOp: inputs don't have the same shape!")
if inputs[0].size > 512:
grid = (int(numpy.ceil(inputs[0].size / 512.)),1)
block = (512,1,1)
else:
grid = (1,1)
block = (inputs[0].shape[0],inputs[0].shape[1],1)
self.pycuda_fct(inputs[0], inputs[1], z[0], numpy.intc(inputs[1].size), block=block, grid=grid)
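The launch-configuration choice above can be sketched as a pure-Python helper. `choose_launch_config` is a hypothetical name: one 2-D block when the array fits into 512 threads, otherwise a 1-D grid of full 512-thread blocks, relying on the kernel's `if(i<size)` guard for the last partial block:

```python
import math

def choose_launch_config(shape, max_threads=512):
    # Sketch of the grid/block logic above for a 2-D array.
    size = shape[0] * shape[1]
    if size > max_threads:
        # 1-D grid of full blocks; the kernel guards against i >= size.
        grid = (int(math.ceil(size / float(max_threads))), 1)
        block = (max_threads, 1, 1)
    else:
        # The whole array fits in one 2-D block.
        grid = (1, 1)
        block = (shape[0], shape[1], 1)
    return grid, block
```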
class PycudaElemwiseKernelOp(Op):
......
......@@ -24,23 +24,27 @@ else:
mode_without_gpu = theano.compile.mode.get_default_mode().excluding('gpu')
def test_pycuda_elemwise_source_module():
x=T.fmatrix('x')
y=T.fmatrix('y')
f=theano.function([x,y],x*y, mode=mode_with_gpu)
print f.maker.env.toposort()
f2 = theano.function([x,y],x*y, mode=mode_with_gpu.including("local_pycuda_gpu_elemwise"))
print f2.maker.env.toposort()
for shape in [(5,5), (10,49), (50,49),(500,501),(5000,5001)]:
for op in [theano.scalar.basic.mul, theano.scalar.basic.add]:
x=T.fmatrix('x')
y=T.fmatrix('y')
pycuda_op = PycudaElemwiseSourceModuleOp(op)
elemwise_op = theano.tensor.Elemwise(op)
f=theano.function([x,y], elemwise_op(x,y), mode=mode_with_gpu)
f2 = theano.function([x,y], theano.sandbox.cuda.host_from_gpu(pycuda_op(x,y)))
f3 = theano.function([x,y], elemwise_op(x,y),
mode=mode_with_gpu.including("local_pycuda_gpu_elemwise"))
assert any([ isinstance(node.op, theano.sandbox.cuda.GpuElemwise) for node in f.maker.env.toposort()])
assert any([ isinstance(node.op, PycudaElemwiseSourceModuleOp) for node in f2.maker.env.toposort()])
assert any([ isinstance(node.op, theano.sandbox.cuda.GpuElemwise) for node in f.maker.env.toposort()])
assert any([ isinstance(node.op, PycudaElemwiseSourceModuleOp) for node in f2.maker.env.toposort()])
assert any([ isinstance(node.op, PycudaElemwiseSourceModuleOp) for node in f3.maker.env.toposort()])
val1 = numpy.asarray(numpy.random.rand(5,5), dtype='float32')
val2 = numpy.asarray(numpy.random.rand(5,5), dtype='float32')
#val1 = numpy.ones((5,5))
#val2 = numpy.arange(25).reshape(5,5)
assert (f(val1,val2) == f2(val1,val2)).all()
print f(val1,val2)
print f2(val1,val2)
val1 = numpy.asarray(numpy.random.rand(*shape), dtype='float32')
val2 = numpy.asarray(numpy.random.rand(*shape), dtype='float32')
assert (f(val1,val2) == f2(val1,val2)).all()
assert (f(val1,val2) == f3(val1,val2)).all()
#print f(val1,val2)
#print f2(val1,val2)
def test_pycuda_elemwise_kernel():
x=T.fmatrix('x')
......
......@@ -392,8 +392,10 @@ default_colorCodes = {'GpuFromHost' : 'red',
def pydotprint(fct, outfile=None,
compact=True, format='png', with_ids=False,
high_contrast=True, cond_highlight = None, colorCodes = None,
max_label_size=50, scan_graphs = False):
high_contrast=True, cond_highlight=None, colorCodes=None,
max_label_size=50, scan_graphs=False,
var_with_name_simple=False
):
"""
Print, to a file in PNG format, the graph of ops of a compiled Theano function.
......@@ -493,14 +495,20 @@ def pydotprint(fct, outfile=None,
return var_str[var]
if var.name is not None:
varstr = 'name='+var.name+" "+str(var.type)
if var_with_name_simple:
varstr = var.name
else:
varstr = 'name='+var.name+" "+str(var.type)
elif isinstance(var,gof.Constant):
dstr = 'val='+str(numpy.asarray(var.data))
if '\n' in dstr:
dstr = dstr[:dstr.index('\n')]
varstr = '%s [%s]'% (dstr, str(var.type))
varstr = '%s %s'% (dstr, str(var.type))
elif var in input_update and input_update[var].variable.name is not None:
varstr = input_update[var].variable.name+" "+str(var.type)
if var_with_name_simple:
varstr = input_update[var].variable.name
else:
varstr = input_update[var].variable.name+" "+str(var.type)
else:
#a var id is needed as otherwise var with the same type will be merged in the graph.
varstr = str(var.type)
......@@ -667,7 +675,8 @@ def pydotprint_variables(vars,
format='png',
depth=-1,
high_contrast=True, colorCodes=None,
max_label_size=50):
max_label_size=50,
var_with_name_simple=False):
''' Identical to pydotprint, except that it starts from a variable instead
of a compiled function. '''
......@@ -692,12 +701,15 @@ def pydotprint_variables(vars,
return var_str[var]
if var.name is not None:
varstr = 'name='+var.name
if var_with_name_simple:
varstr = var.name
else:
varstr = 'name='+var.name+" "+str(var.type)
elif isinstance(var,gof.Constant):
dstr = 'val='+str(var.data)
if '\n' in dstr:
dstr = dstr[:dstr.index('\n')]
varstr = '%s [%s]'% (dstr, str(var.type))
varstr = '%s %s'% (dstr, str(var.type))
else:
#a var id is needed as otherwise var with the same type will be merged in the graph.
varstr = str(var.type)
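The same labelling branch appears in both pydotprint and pydotprint_variables: with the new var_with_name_simple flag only the bare variable name is shown, otherwise the name is combined with the type, and unnamed variables fall back to the type string. A minimal pure-Python sketch of that rule (var_label is a hypothetical helper, not part of the patch):

```python
def var_label(name, type_str, simple=False):
    # Sketch of the pydotprint node-label rule added in this commit:
    # var_with_name_simple=True -> just the name; otherwise name + type.
    if name is not None:
        return name if simple else 'name=' + name + ' ' + type_str
    # unnamed variables are labelled by their type alone
    return type_str
```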
......
......@@ -154,8 +154,11 @@ outdated!""")
import cuda_ndarray
def use(device, force=False, default_to_move_computation_to_gpu = True,
move_shared_float32_to_gpu = True):
def use(device,
force=False,
default_to_move_computation_to_gpu=True,
move_shared_float32_to_gpu=True,
enable_cuda=True):
global cuda_enabled, cuda_initialization_error_message
if force and not cuda_available and device.startswith('gpu'):
raise EnvironmentError("You forced use of device %s, but CUDA initialization failed "
......@@ -191,7 +194,9 @@ def use(device, force=False, default_to_move_computation_to_gpu = True,
if move_shared_float32_to_gpu:
handle_shared_float32(True)
use.device_number = device
cuda_enabled = True
if enable_cuda:
cuda_enabled = True
print >> sys.stderr, "Using gpu device %d: %s" % (active_device_number(), active_device_name())
except (EnvironmentError, ValueError), e:
_logger.error(("ERROR: Not using GPU."
......@@ -251,4 +256,5 @@ elif config.init_gpu_device:
use(device=config.init_gpu_device,
force=config.force_device,
default_to_move_computation_to_gpu=False,
move_shared_float32_to_gpu=False)
move_shared_float32_to_gpu=False,
enable_cuda=False)
......@@ -2049,6 +2049,7 @@ CudaNdarray_gpu_shutdown(PyObject* _unused, PyObject* _unused_args) {
PyObject *
CudaNdarray_from_gpu_pointer(PyObject* _unused, PyObject* args)
{
int verbose = 0;
PyObject *gpu_ptr = NULL;
PyObject *shapes = NULL;
PyObject *strides = NULL;
......@@ -2062,7 +2063,7 @@ CudaNdarray_from_gpu_pointer(PyObject* _unused, PyObject* args)
if (! PyArg_ParseTuple(args, "OOOO", &gpu_ptr, &shapes, &strides, &base))
return NULL;
printf("In CudaNdarray_from_gpu_pointer\n");
if (verbose) printf("In CudaNdarray_from_gpu_pointer\n");
if (!PyLong_Check(gpu_ptr))
{
PyErr_Format(PyExc_Exception, "CudaNdarray_from_gpu_pointer: The gpu pointer is not a long");
......@@ -2133,7 +2134,7 @@ CudaNdarray_from_gpu_pointer(PyObject* _unused, PyObject* args)
Py_DECREF(dim_);
Py_DECREF(strd_);
}
printf("CudaNdarray_from_gpu_pointer normal return\n");
if (verbose) printf("CudaNdarray_from_gpu_pointer normal return\n");
return rval;
}
......@@ -2188,7 +2189,7 @@ CudaNdarray_Dot(PyObject* _unused, PyObject* args)
}
static PyObject *
filter(PyObject* __unsed_self, PyObject *args) // args = (data, broadcastable, strict)
filter(PyObject* __unsed_self, PyObject *args) // args = (data, broadcastable, strict, storage)
{
/*
* TODO: DOC what this function should do in the various cases of
......@@ -2282,10 +2283,10 @@ filter(PyObject* __unsed_self, PyObject *args) // args = (data, broadcastable, s
Py_DECREF(rval);
rval = NULL;
}
Py_DECREF(data);
Py_DECREF(py_data);
Py_DECREF(broadcastable);
}
Py_DECREF(data);
Py_DECREF(py_data);
Py_DECREF(broadcastable);
return (PyObject*)rval;
}
}
......@@ -2490,6 +2491,11 @@ CudaNdarray_new_nd(int nd)
return (PyObject *) rval;
}
/**
* Initialize 'self' as a view of 'base', with memory storage 'data'
*/
int CudaNdarray_set_device_data(CudaNdarray * self, float * data, PyObject * base)
{
if (self->data_allocated)
......@@ -2503,12 +2509,15 @@ int CudaNdarray_set_device_data(CudaNdarray * self, float * data, PyObject * bas
}
}
// Get the original base object (base.base.base...)
// TODO: check that base is indeed a CudaNdarray?
PyObject * orig_base = base;
while (((CudaNdarray*) orig_base)->base)
// base is not always a CudaNdarray. It can be a GpuArray from pycuda, ...
if (orig_base && CudaNdarray_Check(orig_base))
{
// base_base is itself a view
orig_base = ((CudaNdarray*) orig_base)->base;
while (((CudaNdarray*) orig_base)->base)
{
// base_base is itself a view
orig_base = ((CudaNdarray*) orig_base)->base;
}
}
//N.B. XDECREF and XINCREF are no-ops for NULL pointers
if (self->base != orig_base)
......
......@@ -26,7 +26,7 @@ typedef float real;
#endif
#ifndef SHARED_SIZE
#ifndef SHARED_SIZE
#define SHARED_SIZE (16*1024)
#endif
......@@ -48,10 +48,10 @@ static T ceil_intdiv(T a, T b)
/**
* struct CudaNdarray
*
* This is a Python type.
* This is a Python type.
*
*/
struct CudaNdarray
struct CudaNdarray
{
PyObject_HEAD
......@@ -65,40 +65,46 @@ struct CudaNdarray
/* Type-specific fields go here. */
//GpuTensorType::VoidTensor * vt;
int nd; //the number of dimensions of the tensor
// Client should access host_structure via CudaNdarray_HOST_DIMS / CudaNdarray_HOST_STRIDES macros
// Client should access host_structure via CudaNdarray_HOST_DIMS / CudaNdarray_HOST_STRIDES macros
int * host_structure; //dim0, dim1, ... stride0, stride1, ...
int data_allocated; //the number of bytes allocated for devdata
//device pointers (allocated by cudaMalloc)
int dev_structure_fresh;
//dev_structure should be accessed via macros, otherwise may not be synchronized
int * dev_structure; //dim0, dim1, ..., stride0, stride1, ...
//dev_structure should be accessed via macros, otherwise may not be synchronized
int * dev_structure; //dim0, dim1, ..., stride0, stride1, ...
real* devdata; //pointer to data element [0,..,0].
};
/*
* Return a CudaNdarray whose 'nd' dimensions are all 0.
*/
PyObject *
PyObject *
CudaNdarray_New(int nd=-1);
/**
* Return 1 for a CudaNdarray (or subclass), otherwise 0
*/
int
int
CudaNdarray_Check(const PyObject * ob);
/**
* Return 1 if ob is exactly a CudaNdarray (no subclass), otherwise 0
*/
int
int
CudaNdarray_CheckExact(const PyObject * ob);
/**
* Return true for a C-contiguous CudaNdarray, else false
*/
bool
CudaNdarray_is_c_contiguous(const CudaNdarray * self);
/****
* Returns the number of elements necessary in host_structure and dev_structure for a given number of dimensions.
*/
int
int
cnda_structure_size(int nd)
{
// dim0, dim1, ...
......@@ -107,23 +113,23 @@ cnda_structure_size(int nd)
return nd + nd + nd;
}
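cnda_structure_size reserves room for three int vectors per dimension (dims, strides, log2dims), and the HOST_* accessors below index into that one flat buffer at offsets 0, nd and 2*nd. The layout can be sketched in Python (function and variable names here are illustrative, not part of the C API):

```python
def structure_size(nd):
    # dim0..dimN, stride0..strideN, log2dim0..log2dimN
    return nd + nd + nd

def host_dims(buf, nd):
    return buf[0:nd]          # CudaNdarray_HOST_DIMS

def host_strides(buf, nd):
    return buf[nd:2 * nd]     # CudaNdarray_HOST_STRIDES

def host_log2dims(buf, nd):
    return buf[2 * nd:3 * nd] # CudaNdarray_HOST_LOG2DIMS

# example flat host_structure for a 2-d array of shape (5, 3)
buf = [5, 3, 12, 4, 3, 2]
```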
const int *
const int *
CudaNdarray_HOST_DIMS(const CudaNdarray * self)
{
return self->host_structure;
}
const int *
const int *
CudaNdarray_HOST_STRIDES(const CudaNdarray * self)
{
return self->host_structure + self->nd;
}
const int *
const int *
CudaNdarray_HOST_LOG2DIMS(const CudaNdarray * self)
{
return self->host_structure + 2*self->nd;
}
void
void
cnda_mark_dev_structure_dirty(CudaNdarray * self)
{
self->dev_structure_fresh = 0;
......@@ -190,7 +196,7 @@ CudaNdarray_Equal(CudaNdarray *cnda1, CudaNdarray *cnda2)
*
* Does not sync structure to host.
*/
void
void
CudaNdarray_set_dim(CudaNdarray * self, int idx, int d)
{
if ((idx >= self->nd) || (idx < 0) || (d < 0))
......@@ -206,7 +212,7 @@ CudaNdarray_set_dim(CudaNdarray * self, int idx, int d)
cnda_mark_dev_structure_dirty(self);
}
}
void
void
CudaNdarray_set_stride(CudaNdarray * self, int idx, int s)
{
if ((idx >= self->nd) || (idx < 0))
......@@ -225,7 +231,7 @@ CudaNdarray_set_stride(CudaNdarray * self, int idx, int s)
*
* This means: recalculate the log2dims and transfer structure to the card
*/
int
int
cnda_copy_structure_to_device(CudaNdarray * self)
{
cublasSetVector(cnda_structure_size(self->nd), sizeof(int), self->host_structure, 1, self->dev_structure, 1);
......@@ -239,7 +245,7 @@ cnda_copy_structure_to_device(CudaNdarray * self)
return 0;
}
const int *
const int *
CudaNdarray_DEV_DIMS(CudaNdarray * self)
{
if (!self->dev_structure_fresh)
......@@ -249,7 +255,7 @@ CudaNdarray_DEV_DIMS(CudaNdarray * self)
}
return self->dev_structure;
}
const int *
const int *
CudaNdarray_DEV_STRIDES(CudaNdarray * self)
{
if (!self->dev_structure_fresh)
......@@ -259,7 +265,7 @@ CudaNdarray_DEV_STRIDES(CudaNdarray * self)
}
return self->dev_structure + self->nd;
}
const int *
const int *
CudaNdarray_DEV_LOG2DIMS(CudaNdarray * self)
{
if (!self->dev_structure_fresh)
......@@ -269,7 +275,7 @@ CudaNdarray_DEV_LOG2DIMS(CudaNdarray * self)
}
return self->dev_structure + 2*self->nd;
}
float *
float *
CudaNdarray_DEV_DATA(const CudaNdarray * self)
{
return self->devdata;
......@@ -278,7 +284,7 @@ CudaNdarray_DEV_DATA(const CudaNdarray * self)
/**
* Return the number of elements in the ndarray (product of the dimensions)
*/
int
int
CudaNdarray_SIZE(const CudaNdarray *self)
{
if (self->nd == -1) return 0;
......@@ -289,7 +295,7 @@ CudaNdarray_SIZE(const CudaNdarray *self)
}
return size;
}
static PyObject *
static PyObject *
CudaNdarray_SIZE_Object(const CudaNdarray *self, void *closure)
{
return PyInt_FromLong(CudaNdarray_SIZE(self));
......@@ -320,7 +326,7 @@ int CudaNdarray_set_nd(CudaNdarray * self, const int nd)
}
self->dev_structure = NULL;
}
if (self->host_structure)
if (self->host_structure)
{
free(self->host_structure);
self->host_structure = NULL;
......@@ -386,29 +392,41 @@ int CudaNdarray_alloc_contiguous(CudaNdarray *self, const int nd, const inttype
size = size * dim[i];
}
if (self->data_allocated != size)
if (CudaNdarray_is_c_contiguous(self) && (self->data_allocated == size))
{
if (device_free(self->devdata))
{
// Does this ever happen?? Do we need to set data_allocated or devdata to 0?
return -1;
}
assert(size>0);
self->devdata = (float*)device_malloc(size*sizeof(real));
if (!self->devdata)
{
CudaNdarray_set_nd(self,-1);
self->data_allocated = 0;
self->devdata = 0;
return -1;
}
if (0)
fprintf(stderr,
"Allocated devdata %p (self=%p)\n",
self->devdata,
self);
self->data_allocated = size;
return 0;
}
// The structure of self will be reused with newly allocated memory.
// If self was a view, we should remove the reference to its base.
// (If base was already NULL, the following has no effect.)
Py_XDECREF(self->base);
self->base = NULL;
// If self is a view, do not try to free its memory
if (self->data_allocated && device_free(self->devdata))
{
self->devdata = NULL;
self->data_allocated = 0;
return -1;
}
assert(size>0);
self->devdata = (float*)device_malloc(size*sizeof(real));
if (!self->devdata)
{
CudaNdarray_set_nd(self,-1);
self->data_allocated = 0;
self->devdata = 0;
return -1;
}
if (0)
fprintf(stderr,
"Allocated devdata %p (self=%p)\n",
self->devdata,
self);
self->data_allocated = size;
return 0;
}
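The rewritten CudaNdarray_alloc_contiguous above first checks whether the current buffer can be reused (already C-contiguous and of the right size) and only frees and reallocates otherwise, also dropping the base reference so that a former view comes to own its fresh memory. The control flow can be sketched in Python (alloc/free stand in for device_malloc/device_free; the dict fields are illustrative):

```python
def alloc_contiguous(state, size, alloc, free):
    # Sketch of the reuse-or-reallocate policy in this hunk.
    # state holds 'data', 'allocated', 'contiguous' and 'base'.
    if state['contiguous'] and state['allocated'] == size:
        return 0  # structure reused with the existing memory
    # self will own newly allocated memory, so it is no longer a view
    state['base'] = None
    # only free memory that self itself allocated
    if state['allocated']:
        free(state['data'])
    state['data'] = alloc(size)
    state['allocated'] = size
    state['contiguous'] = True
    return 0
```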
......@@ -416,7 +434,7 @@ int CudaNdarray_alloc_contiguous(CudaNdarray *self, const int nd, const inttype
* Return a CudaNdarray whose 'nd' dimensions are set to dims, and allocated.
*/
template<typename inttype>
PyObject *
PyObject *
CudaNdarray_NewDims(int nd, const inttype * dims)
{
CudaNdarray * rval = (CudaNdarray*)CudaNdarray_New();
......@@ -440,7 +458,7 @@ CudaNdarray_NewDims(int nd, const inttype * dims)
int CudaNdarray_set_device_data(CudaNdarray * self, float * data, PyObject * base);
int CudaNdarray_set_device_data(CudaNdarray * self, float * data, CudaNdarray * base)
{
return CudaNdarray_set_device_data(self, data, (PyObject *) base);
return CudaNdarray_set_device_data(self, data, (PyObject *) base);
}
/**
......@@ -475,10 +493,10 @@ int CudaNdarray_CopyFromCudaNdarray(CudaNdarray * self, CudaNdarray * other, boo
/**
* Transfer the contents of CudaNdarray `self` to a new numpy ndarray.
*/
PyObject *
PyObject *
CudaNdarray_CreateArrayObj(CudaNdarray * self);
PyObject *
PyObject *
CudaNdarray_ZEROS(int n, int * dims);
/**
......@@ -499,7 +517,7 @@ int CudaNdarray_dimshuffle(CudaNdarray * self, unsigned int len, const int * pat
void fprint_CudaNdarray(FILE * fd, const CudaNdarray *self)
{
fprintf(fd, "CudaNdarray <%p, %p> nd=%i dev_structure_fresh=%d data_allocated=%d\n",
self, self->devdata, self->nd, self->dev_structure_fresh, self->data_allocated);
self, self->devdata, self->nd, self->dev_structure_fresh, self->data_allocated);
fprintf(fd, "\tHOST_DIMS: ");
for (int i = 0; i < self->nd; ++i)
{
......@@ -510,23 +528,23 @@ void fprint_CudaNdarray(FILE * fd, const CudaNdarray *self)
{
fprintf(fd, "%i\t", CudaNdarray_HOST_STRIDES(self)[i]);
}
int data=0;
fprintf(fd, "\n\tDEV_DIMS: ");
for (int i = 0; i < self->nd; ++i)
{
cublasGetVector(1, sizeof(int),
self->dev_structure+i, 1,
&data, 1);
fprintf(fd, "%i\t", data);
self->dev_structure+i, 1,
&data, 1);
fprintf(fd, "%i\t", data);
}
fprintf(fd, "\n\tDEV_STRIDES: ");
for (int i = 0; i < self->nd; ++i)
{
cublasGetVector(1, sizeof(int),
self->dev_structure + self->nd+i, 1,
&data, 1);
fprintf(fd, "%i \t", data);
self->dev_structure + self->nd+i, 1,
&data, 1);
fprintf(fd, "%i \t", data);
}
fprintf(fd, "\n");
}
......
......@@ -37,7 +37,7 @@ def get_str_list_logical_scalar(node, value_str='ii_i%i_value', data_str='ii_i%i
class NaiveAlgo(object):
verbose = 0 # 1, 2 or 3 for more verbose output.
cache_version = ()
cache_version = ('debug', 14, verbose)
cache_version = (14, verbose)
def __init__(self, scalar_op, sync=True, inplace_pattern={}):
"""
......@@ -56,7 +56,7 @@ class NaiveAlgo(object):
print >> sio, "// Input ", ipos, str(i.type)
for ipos, i in enumerate(node.outputs):
print >> sio, "// Output ", ipos, str(i.type)
print >> sio, "static __global__ void kernel_%s_%s_%s_%s(unsigned int numEls" %(self.scalar_op.__class__.__name__,nodename, id(self), nd)
print >> sio, "static __global__ void kernel_%s_%s_%s(unsigned int numEls" % (self.scalar_op.__class__.__name__,nodename, nd)
if (nd):
print >> sio, "\t,", ", ".join("const int dim%i" % i for i in xrange(nd))
#declare inputs
......@@ -159,10 +159,9 @@ class NaiveAlgo(object):
print >> sio, "// Input ", ipos, str(i.type)
for ipos, i in enumerate(node.outputs):
print >> sio, "// Output ", ipos, str(i.type)
print >> sio, "static __global__ void kernel_%s_%s_%s_%s(unsigned int numEls" %(
print >> sio, "static __global__ void kernel_%s_%s_%s(unsigned int numEls" %(
self.scalar_op.__class__.__name__,
nodename,
id(self),
'tiling%i'%nd)
if (nd):
print >> sio, "\t,", ", ".join("const int dim%i" % i for i in xrange(nd))
......@@ -262,10 +261,9 @@ class NaiveAlgo(object):
print >> sio, "// Input ", ipos, str(i.type)
for ipos, i in enumerate(node.outputs):
print >> sio, "// Output ", ipos, str(i.type)
print >> sio, "static __global__ void kernel_%s_%s_%s_%s(unsigned int numEls" %(
print >> sio, "static __global__ void kernel_%s_%s_%s(unsigned int numEls" %(
self.scalar_op.__class__.__name__,
nodename,
id(self),
'tiling%i_less_registers'%nd)
if (nd):
print >> sio, "\t,", ", ".join("const int dim%i" % i for i in xrange(nd))
......@@ -472,7 +470,6 @@ class NaiveAlgo(object):
nd = node.outputs[0].type.ndim
nb_inputs = len(node.inputs)
nb_outputs = len(node.outputs)
id_self = id(self)
d = dict()
#input_params and output_params go into the function declaration/definition
input_params = ", ".join("const float * i%i_data, const int * i%i_str"%(ipos, ipos)
......@@ -512,7 +509,7 @@ class NaiveAlgo(object):
""" %locals()
if self.verbose:
print >> sio, """
std::cerr << "calling kernel_%(scalar_op)s_%(nodename)s_%(id_self)s w numEls" << numEls << " dims"<< d << "\\n";
std::cerr << "calling kernel_%(scalar_op)s_%(nodename)s w numEls" << numEls << " dims"<< d << "\\n";
""" %locals()
print >> sio, 'std::cerr << ' + " << ' ' << ".join(['" "']+list("dims[%i]"%di
for di in xrange(nd)) + ["'\\n';"])
......@@ -693,7 +690,7 @@ nd_collapse_[i]=0;
print >> sio, 'std::cerr << " local_ostr %(ipos)s: " <<'%locals()+' << " " << '.join(["local_ostr[%(ipos)s][%(x)s]"%locals() for x in range(nd)])+'<<"\\n";'
def launch_Ccontiguous(nodename, id_self, scalar_op, sync=True):
def launch_Ccontiguous(nodename, scalar_op, sync=True):
kernel_call_args = ["numEls"]
for ipos in xrange(len(node.inputs)):
kernel_call_args.append("i%i_data"%ipos)
......@@ -736,7 +733,7 @@ nd_collapse_[i]=0;
else:
print >> sio, " return 0; " %locals()
def launch_General(nodename, id_self, scalar_op, force_nd, sync=True):
def launch_General(nodename, scalar_op, force_nd, sync=True):
# kernel_call_args are used to invoke the cuda kernel
local="local_"
kernel_call_args = ["numEls"]
......@@ -769,7 +766,7 @@ nd_collapse_[i]=0;
if (threads_per_block * n_blocks < numEls)
threads_per_block = std::min(numEls/n_blocks, (unsigned int)NUM_VECTOR_OP_THREADS_PER_BLOCK);
kernel_%(scalar_op)s_%(nodename)s_%(id_self)s_%(force_nd)s<<<n_blocks, threads_per_block>>>(%(kernel_call_args)s);
kernel_%(scalar_op)s_%(nodename)s_%(force_nd)s<<<n_blocks, threads_per_block>>>(%(kernel_call_args)s);
""" %locals()
if sync:
print >> sio, """
......@@ -791,11 +788,11 @@ nd_collapse_[i]=0;
print >> sio, "if(numEls==0) return 0;"
print >> sio, "switch (nd_collapse==0?0:min(%(nd)s,nd_collapse)) {"%locals()
print >> sio, "case 0: {"
launch_Ccontiguous(nodename, id_self, scalar_op, self.sync)
launch_Ccontiguous(nodename, scalar_op, self.sync)
print >> sio, " } break;"
for i in range(1, nd+1):
print >> sio, "case "+str(i)+": {"
launch_General(nodename, id_self, scalar_op, i, self.sync)
launch_General(nodename, scalar_op, i, self.sync)
print >> sio, " } break;"
print >> sio, "}"#end case
......
......@@ -553,7 +553,7 @@ def local_gpu_advanced_incsubtensor1(node):
gpu_from_host(y), *coords)]
# Should not execute for GpuAdvancedIncSubtensor1
if node.op.__class__ is tensor.AdvancedSubtensor1:
if node.op.__class__ is tensor.AdvancedSubtensor1 and node.inputs[0].dtype=="float32":
x, y = node.inputs[0:2]
coords = node.inputs[2:]
go_gpu = False
......@@ -585,7 +585,7 @@ def local_gpu_incsubtensor(node):
gpu_from_host(x),
gpu_from_host(y),
*coords)]
if type(node.op) == tensor.IncSubtensor:
if type(node.op) == tensor.IncSubtensor and node.inputs[0].dtype=="float32":
x, y = node.inputs[0:2]
assert isinstance(x.type, tensor.TensorType)
assert isinstance(y.type, tensor.TensorType)
......
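Both optimizers in this hunk gained the same guard: an IncSubtensor or AdvancedSubtensor1 node is only moved to the GPU when its input dtype is float32, since CudaNdarray stores float32 data only. The predicate amounts to (should_move_to_gpu is a hypothetical name for illustration):

```python
def should_move_to_gpu(op_matches, dtype):
    # CudaNdarray supports only float32; anything else stays on the CPU.
    return op_matches and dtype == "float32"
```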
......@@ -318,11 +318,11 @@ def test_elemwise3():
a = tcn.shared_constructor(theano._asarray(numpy.random.rand(*shape), dtype='float32'), 'a')
b = tensor.fvector()
print b.type
print tensor.constant(1).type
print (1 + b).type
print (1 + b**a).type
print tensor.exp((1 + b**a)).type
f = pfunc([b], [], updates=[(a, (a+b).dimshuffle([2,0,3,1]) * tensor.exp(1 +
fone = tensor.constant(1, dtype='float32')
print (fone + b).type
print (fone + b**a).type
print tensor.exp((fone + b**a)).type
f = pfunc([b], [], updates=[(a, (a+b).dimshuffle([2,0,3,1]) * tensor.exp(fone +
b**a).dimshuffle([2,0,3,1]))], mode=mode_with_gpu)
has_elemwise = False
for i, node in enumerate(f.maker.env.toposort()):
......
......@@ -61,7 +61,7 @@ class Kouh2008(object):
dtype = x_list[0].dtype
n_terms = len(x_list)
def shared_uniform(low, high, size, name):
def shared_uniform(low, high, size, name):
return _shared_uniform(rng, low, high, size, dtype, name)
use_softmax_w = True
......@@ -86,7 +86,7 @@ class Kouh2008(object):
raise ValueError('exponent range must have low <= high')
p_unbounded = shared_uniform(low=-0.1, high=0.1, size=(n_out,), name='p')
q_unbounded = shared_uniform(low=-0.1, high=0.1, size=(n_out,), name='q')
q_unbounded = shared_uniform(low=-0.1, high=0.1, size=(n_out,), name='q')
r_unbounded = shared_uniform(low=-0.1, high=0.1, size=(n_out,), name='r')
k_unbounded = shared_uniform(low=-0.2, high=0.2, size=(n_out,), name='k') # biases
......@@ -122,7 +122,7 @@ class Kouh2008(object):
"""Return a KouhLayer instance with random parameters
The parameters are drawn on a range [typically] suitable for fine-tuning by gradient
descent.
descent.
:param input: a tensor of shape (n_examples, n_in)
......@@ -137,7 +137,7 @@ class Kouh2008(object):
many 'simple cell' responses.
:param eps: this amount is added to the softplus of filter responses as a baseline
firing rate (that prevents a subsequent error from ``pow(0, p)``)
firing rate (that prevents a subsequent error from ``pow(0, p)``)
:returns: KouhLayer instance with freshly-allocated random weights.
......@@ -149,7 +149,7 @@ class Kouh2008(object):
dtype = input.dtype
_logger.debug('dtype %s' % dtype)
def shared_uniform(low, high, size, name):
def shared_uniform(low, high, size, name):
return _shared_uniform(rng, low, high, size, dtype, name)
f_list = [shared_uniform(low=-2.0/numpy.sqrt(n_in), high=2.0/numpy.sqrt(n_in), size=(n_in, n_out), name='f_%i'%i)
......@@ -232,7 +232,7 @@ class Config(object):
if dtype2=='floatX':
import theano.config as c
dtype2 = c.config.get('scalar.floatX')
rng_seed = 23498
n_hid = 300
......@@ -273,7 +273,7 @@ if 0:
# Skip test if cuda_ndarray is not available.
from nose.plugins.skip import SkipTest
import theano.sandbox.cuda as cuda_ndarray
if cuda_ndarray.cuda_enabled == False:
if not cuda_ndarray.cuda_available:
raise SkipTest('Optional package cuda disabled')
import theano.sandbox.cuda
theano.sandbox.cuda.use()
......
......@@ -5,7 +5,7 @@ import numpy
from nose.plugins.skip import SkipTest
from theano.compile.pfunc import pfunc
from theano import tensor
from theano import config, tensor
import theano
import theano.sandbox.cuda as cuda
......@@ -35,18 +35,28 @@ def test_no_shared_var_graph():
assert any(isinstance(x.op, cuda.HostFromGpu) for x in l)
def test_int_pow():
a = CudaNdarrayType([False])()
# This is to ensure that '4' does not upcast to float64.
if config.cast_policy == 'numpy+floatX':
floatX_backup = config.floatX
config.floatX = 'float32'
f = theano.function([a], (a*4).sum(), mode=mode_with_gpu)
try:
a = CudaNdarrayType([False])()
op_names = [n.op.__class__.__name__ for n in f.maker.env.toposort()]
assert op_names == ['GpuSum', 'GpuElemwise', 'HostFromGpu']
f = theano.function([a], (a*4).sum(), mode=mode_with_gpu)
f = theano.function([a], tensor.pow(a,4).sum(), mode=mode_with_gpu)
op_names = [n.op.__class__.__name__ for n in f.maker.env.toposort()]
assert op_names == ['GpuElemwise', 'GpuSum', 'HostFromGpu']
op_names = [n.op.__class__.__name__ for n in f.maker.env.toposort()]
assert op_names == ['GpuSum', 'GpuElemwise', 'HostFromGpu']
#theano.printing.debugprint(f)
f = theano.function([a], tensor.pow(a,4).sum(), mode=mode_with_gpu)
op_names = [n.op.__class__.__name__ for n in f.maker.env.toposort()]
assert op_names == ['GpuElemwise', 'GpuSum', 'HostFromGpu']
#theano.printing.debugprint(f)
finally:
if config.cast_policy == 'numpy+floatX':
config.floatX = floatX_backup
def test_gpualloc():
'''
......@@ -144,7 +154,8 @@ def test_opt_gpujoin_joinvectors_elemwise_then_minusone():
def test_print_op():
""" Test that print ops don't block gpu optimization"""
b = tensor.fmatrix()
f = theano.function([b],theano.printing.Print()(b)*2, mode=mode_with_gpu)
ftwo = tensor.constant(2, dtype='float32')
f = theano.function([b],theano.printing.Print()(b) * ftwo, mode=mode_with_gpu)
#theano.printing.debugprint(f)
#print f.maker.env.toposort()
#[GpuFromHost(<TensorType(float32, matrix)>), <theano.printing.Print object at 0x3581210>(GpuFromHost.0), GpuElemwise{mul}(CudaNdarray{[[ 2.]]}, <theano.printing.Print object at 0x3581210>.0), HostFromGpu(GpuElemwise{mul}.0)]
......
from ops import (cholesky, matrix_inverse, solve,
diag, extract_diag, alloc_diag,
det, PSD_hint,
trace, spectral_radius_bound)