Commit f6351e1a authored by James Bergstra

merge w conflict in cuda_ndarray.cu

......@@ -8,6 +8,15 @@ syntax: glob
*.so
*.sw?
*~
*.aux
*.log
*.nav
*.out
*.pdf
*.snm
*.toc
*.vrb
.noseids
Theano.egg-info
\#*\#
build
......
......@@ -5,6 +5,81 @@
Release Notes
=============
Theano 0.3.1 (2011-02-21)
=========================
Deprecation:
* The theano shared variable attribute `value` is deprecated, use `get_value()` or `set_value()`!
See http://deeplearning.net/software/theano/tutorial/aliasing.html
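As a toy illustration of why the accessors are safer than raw attribute access, a hypothetical stand-in (not Theano's actual implementation) can show the copy-on-read behavior that `get_value()` provides by default:

```python
import copy

class Shared(object):
    """Toy stand-in for a Theano shared variable (illustration only)."""
    def __init__(self, value):
        self._value = value

    def get_value(self, borrow=False):
        # By default, return a copy so callers cannot alias internal storage.
        return self._value if borrow else copy.deepcopy(self._value)

    def set_value(self, new_value, borrow=False):
        self._value = new_value if borrow else copy.deepcopy(new_value)

s = Shared([1, 2, 3])
v = s.get_value()
v[0] = 99                # mutating the returned copy...
print(s.get_value())     # ...leaves the shared storage untouched: [1, 2, 3]
```

The `borrow` flag here mimics the aliasing contract described on the linked page: only an explicit `borrow=True` exposes the internal storage.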
Bugs fixed:
* The random number generator in theano/sandbox/rng_mrg.py did not always return the same sequence of numbers on the CPU and GPU.
* In some cases, there was a (possibly large) fraction of non-random garbage in the returned sequence.
* In Python mode (not the default mode), when the input of an elemwise operation was an empty ndarray, we did not return an empty ndarray.
* Scan cached the number of steps. This caused no problem on its own, because each call to Scan refreshed the cached value.
The problem arose when calling ScanGrad, which used the cached number of steps without refreshing it.
To be affected by this bug, one would have to compile two graphs, one containing a Scan and the other the corresponding GradScan,
call the first function to cache the number of steps, and then call the second function with a different number of steps.
* In GpuConv, errors in conv_patch_stack_reduce when the entire kernel doesn't fit into shared memory.
The error was not found before because its impact was less than the relative tolerance of 1e-3. Now the relative tolerance is 1e-5.
Crash fixed:
* Removed an exception that made Theano crash when taking the gradient of DimShuffle in some particular cases.
* Compilation crash for GpuElemwise with tensors with a high number of dimensions (~6 or more).
* Disabled a C code generator that made gcc crash on complex types.
* Crash in optimization when an Op has no input.
* Output shape is now computed correctly for matrix-vector multiplication on the GPU.
* In Scan, when using numbers as inputs instead of symbolic variables.
* In GradScan, when there is only 1 input in the Scan.
* In GpuSum, bug in the calculation of n_blocks for the 10 pattern (sum over the rows of a matrix).
* Some segfaults at exit with GPU code.
Optimization:
* New SpecifyShape op that allows passing more shape information in the graph.
* Sped up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
* Removed joins of only 1 element.
* During optimization, consider one more case in get_constant_value.
GPU:
* cuda_shared.value = X now works inplace!
* cuda_shared_var.set_value(new_ndarray) will overwrite the old value inplace in the most common case.
* Allow creating a CudaNdarraySharedVariable from a CudaNdarray.
* New init_gpu_device Theano flag.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would exceed the 256-byte limit on parameters to a GPU function).
* A CPU join of only 1 element is now moved to the GPU.
New features:
* tensor.reshape now makes dimensions of length 1 broadcastable.
* tensor.prod now implements the gradient.
* DebugMode now warns if an Op declared itself as returning a view of the input but did not do so.
* This behaviour is a problem because it can block other Ops from operating inplace on the same inputs, which can lower memory reuse.
* Sparse.structured_dot now works when both matrices are sparse.
* Sparse type is now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* New 3D convolution ops, with CPU and GPU implementations.
* New colors in pydotprint.
Documentation:
* Documented lib.amdlibm and (new) init_gpu_device config variables.
* A new page (written for 0.3, but an error hid it on the web page) on the memory aliasing contract of Theano.
* Revision to the Windows installation instructions.
* The cuda documentation is now generated on the web server.
* Better documentation of .theanorc and its sections.
Unit tests:
* Stop usage of deprecated functions or syntax in the unit tests.
* Better testing of GPU convolution nets.
* Make more tests able to use different random seeds.
* Tests of sparse now use default mode, not a hard-coded one.
* Remove some tests of unimplemented features.
Other:
* The name of the compiledir now includes the Python version, to make life easier for people with many Python versions.
* Added theano.tensor.std as a shortcut to sqrt(var(input=input, axis=axis)).
* Whitespace, tabulation and indentation clean-up in the code.
* Better detection of memory sharing between variables.
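The `theano.tensor.std` shortcut above wraps the standard identity std = sqrt(var). The NumPy counterparts obey the same relation, which serves as a quick sanity check (NumPy is used here only for illustration):

```python
import numpy

x = numpy.array([[1., 2.], [3., 4.]])
# std(x, axis) == sqrt(var(x, axis)), the identity theano.tensor.std wraps
assert numpy.allclose(numpy.std(x, axis=0),
                      numpy.sqrt(numpy.var(x, axis=0)))
print(numpy.std(x, axis=0))  # [1. 1.]
```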
Theano 0.3 (2010-11-23)
=======================
......
Modifications in the trunk since the last release
In trunk since 0.3.1 release
----------------------------
GPU:
* Move to the GPU fused elemwise ops that have dtypes other than float32 in them
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to float32 afterwards.
Theano 0.3.1 (2011-02-21)
----------------------------
Theano 0.4.0rc4 (2011-06-13)
--------------------------------------------------
Deprecation:
* The theano shared variable attribute `value` is deprecated, use `get_value()` or `set_value()`!
See http://deeplearning.net/software/theano/tutorial/aliasing.html
* tag.shape attribute deprecated (#633)
* CudaNdarray_new_null is deprecated in favour of CudaNdarray_New
* Dividing integers with / is deprecated: use // for integer division, or
cast one of the integers to a float type if you want a float result (you may
also change this behavior with config.int_division).
* Removed (already deprecated) sandbox/compile module
* Removed (already deprecated) incsubtensor and setsubtensor functions,
inc_subtensor and set_subtensor are to be used instead.
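The integer-division deprecation above can be made concrete with plain Python (the `config.int_division` flag itself is Theano-specific; this sketch only shows the two behaviors being distinguished):

```python
# `//` is explicit integer (floor) division; casting an operand gives a float
print(7 // 2)        # 3
print(float(7) / 2)  # 3.5
print(-7 // 2)       # floor division rounds toward -inf: -4
```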
Bugs fixed:
* The random number generator in theano/sandbox/rng_mrg.py did not always return the same sequence of numbers on the CPU and GPU.
* In some cases, there was a (possibly large) fraction of non-random garbage in the returned sequence.
* In Python mode (not the default mode), when the input of an elemwise operation was an empty ndarray, we did not return an empty ndarray.
* Scan cached the number of steps. This caused no problem on its own, because each call to Scan refreshed the cached value.
The problem arose when calling ScanGrad, which used the cached number of steps without refreshing it.
To be affected by this bug, one would have to compile two graphs, one containing a Scan and the other the corresponding GradScan,
call the first function to cache the number of steps, and then call the second function with a different number of steps.
* In GpuConv, errors in conv_patch_stack_reduce when the entire kernel doesn't fit into shared memory.
The error was not found before because its impact was less than the relative tolerance of 1e-3. Now the relative tolerance is 1e-5.
* CudaNdarray.__{iadd,idiv}__ now return an error when the operation is not implemented.
* THEANO_FLAGS='optimizer=None' now works as expected
* Fixed memory leak in error handling on GPU-to-host copy
* Fix relating specifically to Python 2.7 on Mac OS X
* infer_shape can now handle Python longs
* Trying to compute x % y with one or more arguments being complex now
raises an error.
* The output of random samples computed with uniform(..., dtype=...) is
guaranteed to be of the specified dtype instead of potentially being of a
higher-precision dtype.
* The perform() method of DownsampleFactorMax did not give the right result
when reusing output storage. This happened only when using the Theano flag
'linker=c|py_nogc' or manually specifying the mode as 'c|py_nogc'.
Crash fixed:
* Removed an exception that made Theano crash when taking the gradient of DimShuffle in some particular cases.
* Compilation crash for GpuElemwise with tensors with a high number of dimensions (~6 or more).
* Disabled a C code generator that made gcc crash on complex types.
* Crash in optimization when an Op has no input.
* Output shape is now computed correctly for matrix-vector multiplication on the GPU.
* In Scan, when using numbers as inputs instead of symbolic variables.
* In GradScan, when there is only 1 input in the Scan.
* In GpuSum, bug in the calculation of n_blocks for the 10 pattern (sum over the rows of a matrix).
* Some segfaults at exit with GPU code.
* Worked around a bug in gcc 4.3.0 that made the compilation of 2d convolution
crash.
* Some optimizations crashed when the "ShapeOpt" optimization was disabled.
Optimization:
* New SpecifyShape op that allows passing more shape information in the graph.
* Sped up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
* Removed joins of only 1 element.
* During optimization, consider one more case in get_constant_value.
* Optimized all cases of a subtensor followed by a subtensor.
GPU:
* cuda_shared.value = X now works inplace!
* cuda_shared_var.set_value(new_ndarray) will overwrite the old value inplace in the most common case.
* Allow creating a CudaNdarraySharedVariable from a CudaNdarray.
* New init_gpu_device Theano flag.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would exceed the 256-byte limit on parameters to a GPU function).
* A CPU join of only 1 element is now moved to the GPU.
* Move to the GPU fused elemwise ops that have dtypes other than float32 in them
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to
float32 afterwards.
* Implemented CudaNdarray.ndim to have the same interface as ndarray.
* Fixed slowdown caused by multiple chained views on CudaNdarray objects
* CudaNdarray_alloc_contiguous changed so as to never try to free
memory on a view: new "base" property
* Safer decref behaviour in CudaNdarray in case of failed allocations
* New GPU implementation of tensor.basic.outer
* Multinomial random variates now available on GPU
New features:
* tensor.reshape now makes dimensions of length 1 broadcastable.
* tensor.prod now implements the gradient.
* DebugMode now warns if an Op declared itself as returning a view of the input but did not do so.
* This behaviour is a problem because it can block other Ops from operating inplace on the same inputs, which can lower memory reuse.
* Sparse.structured_dot now works when both matrices are sparse.
* Sparse type is now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* New 3D convolution ops, with CPU and GPU implementations.
* New colors in pydotprint.
* ProfileMode
* profiles the Scan overhead
* simple hook system to add profilers
* reordered the output to go from more general to more specific
* DebugMode now checks Ops with different patterns of preallocated memory,
configured by config.DebugMode.check_preallocated_output.
* var[vector of indices] now works (grad works recursively, the direct grad
works inplace, GPU works).
* Limitation: works only on the outermost dimension.
* New way to test the graph as we build it, which allows easily finding the
source of shape mismatch errors:
`http://deeplearning.net/software/theano/tutorial/debug_faq.html#interactive-debugger`__
* cuda.root inferred if nvcc is on the path, otherwise defaults to
/usr/local/cuda
* Better graph printing for graphs involving a scan subgraph
* Casting behavior can be controlled through config.cast_policy,
new (experimental) mode.
* Smarter C module cache, avoiding erroneous usage of the wrong C
implementation when some options change, and avoiding recompiling the
same module multiple times in some situations.
* The "theano-cache clear" command now clears the cache more thoroughly.
* More extensive linear algebra ops (CPU only) that wrap scipy.linalg
now available in the sandbox.
* CUDA devices 4 - 16 should now be available if present.
* infer_shape support for the View op, better infer_shape support in Scan
* infer_shape supported in all cases of subtensor
* tensor.grad now gives an error by default when computing the gradient
wrt a node that is disconnected from the cost (not in the graph, or
no continuous path from that op to the cost).
* New tensor.isnan and isinf functions.
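The new `tensor.isnan` and `tensor.isinf` functions operate elementwise, like their NumPy namesakes. A NumPy sketch of the expected semantics:

```python
import numpy

x = numpy.array([0.0, numpy.inf, -numpy.inf, numpy.nan])
print(numpy.isnan(x))  # [False False False  True]
print(numpy.isinf(x))  # [False  True  True False]
```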
Documentation:
* Documented lib.amdlibm and (new) init_gpu_device config variables.
* A new page (written for 0.3, but an error hid it on the web page) on the memory aliasing contract of Theano.
* Revision to the Windows installation instructions.
* The cuda documentation is now generated on the web server.
* Better documentation of .theanorc and its sections.
* Better commenting of cuda_ndarray.cu
* Fixes in the scan documentation: add missing declarations/print statements
* Better error message on failed __getitem__
* Updated documentation on profile mode
* Better documentation of testing on Windows
* Better documentation of the 'run_individual_tests' script
Unit tests:
* Stop usage of deprecated functions or syntax in the unit tests.
* Better testing of GPU convolution nets.
* Make more tests able to use different random seeds.
* Tests of sparse now use default mode, not a hard-coded one.
* Remove some tests of unimplemented features.
* More strict float comparison by default
* Reused the tensor subtensor tests for GPU tensors (more GPU tests)
* Tests that check for aliased function inputs and assure appropriate copying
(#374)
* Better test of copies in CudaNdarray
* New tests relating to the new base pointer requirements
* Better scripts to run tests individually or in batches
* Some tests are now run whenever cuda is available and not just when it has
been enabled before
* Tests display less pointless warnings.
Other:
* The name of the compiledir now includes the Python version, to make life easier for people with many Python versions.
* Added theano.tensor.std as a shortcut to sqrt(var(input=input, axis=axis)).
* Whitespace, tabulation and indentation clean-up in the code.
* Better detection of memory sharing between variables.
* Correctly set the broadcast flag to True in the output variable of
a Reshape op when we receive an int 1 in the new shape.
* pydotprint: high contrast mode is now the default, option to print
more compact node names.
* pydotprint: now truncates labels that are too long.
* More compact printing (ignore leading "Composite" in op names)
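Both broadcast-flag entries above (`tensor.reshape` and the Reshape-op fix) concern dimensions of length 1, which broadcast against larger dimensions. A NumPy sketch of the rule Theano follows:

```python
import numpy

row = numpy.arange(3).reshape(1, 3)  # length-1 leading dim broadcasts
col = numpy.arange(2).reshape(2, 1)  # length-1 trailing dim broadcasts
print((row + col).shape)  # (2, 3)
```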
......@@ -188,10 +188,10 @@ def execs_timeit_2vector(exprs, fname=None):
pylab.subplots_adjust(wspace=0.25, hspace=0.25)
#legend=[]
#plot = fig.add_subplot(1,len(exprs),idx)
speedup = [t["numpy"].min()/t["numexpr"].min() for t in time]
pylab.semilogx(nb_calls, speedup, linewidth=1.0, linestyle='--', color='r')
speedup = [t["numpy"].min()/t["theano"].min() for t in time]
pylab.semilogx(nb_calls, speedup, linewidth=1.0, color='b')
pylab.grid(True)
if (idx == 2) or (idx == 3):
......
#!/usr/bin/env python
import logging
import os
import sys

from theano import config
from theano.gof.cc import get_module_cache

_logger = logging.getLogger('theano.bin.theano-cache')
_logger.setLevel(logging.WARN)

if len(sys.argv) == 1:
    print config.compiledir
elif sys.argv[1] == 'clear':
    # We skip the refresh on module cache creation because the refresh will
    # be done when calling clear afterwards.
    cache = get_module_cache(init_args=dict(do_refresh=False))
    cache.clear(unversioned_min_age=-1, clear_base_files=True,
                delete_if_problem=True)
    # Print a warning if some cached modules were not removed, so that the
    # user knows to delete them manually to fully clear the cache.
    items = [item for item in sorted(os.listdir(cache.dirname))
             if item.startswith('tmp')]
    if items:
        _logger.warning('There remain elements in the cache dir that you may '
                        'need to erase manually. The cache dir is:\n %s' %
                        config.compiledir)
        _logger.debug('Remaining elements (%s): %s' %
                      (len(items), ', '.join(items)))
else:
    print 'command "%s" not recognized' % sys.argv[1]
    print 'Type "theano-cache" to print the cache location'
......
......@@ -51,9 +51,9 @@ copyright = '2008--2011, LISA lab'
# other places throughout the built documents.
#
# The short X.Y version.
version = '0.3.1'
version = '0.4'
# The full version, including alpha/beta/rc tags.
release = '0.3.1'
release = '0.4.0rc4'
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
......
.. _developer:
======================
==============================================
Theano Design and Implementation Documentation
======================
==============================================
.. toctree::
......
......@@ -7,7 +7,7 @@ Tensor
This file describes the design of theano.tensor.
Elemwise grad and R_op
=================
======================
Here's another straightforward example, though a bit more elaborate
than adding two numbers together. Let's say that you want to compute
......
......@@ -74,7 +74,8 @@ following methods:
A function Mode may allow ``output_storage`` elements to persist between
evaluations, or it may reset ``output_storage`` cells to hold a value of
``None``. It can also pre-allocate some memory for the Op to use.
This feature can allow ``perform`` to reuse memory between
calls, for example.
This method must be determined by the inputs. That is to say, if
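The ``output_storage`` contract described above can be sketched in plain Python. This is an illustrative toy, not Theano's actual code: a ``perform``-style function that doubles its input, reusing a preallocated buffer when the Mode preserved one:

```python
import numpy

def perform(inputs, output_storage):
    """Toy perform() that doubles its input, reusing preallocated output."""
    x, = inputs
    z = output_storage[0]   # a one-element cell, as in Theano's convention
    if z[0] is None or z[0].shape != x.shape:
        z[0] = numpy.empty_like(x)  # allocate only when the cell is unusable
    numpy.multiply(x, 2, out=z[0])

cell = [None]
perform((numpy.ones(3),), [cell])
buf = cell[0]
perform((numpy.zeros(3),), [cell])
print(cell[0])         # [0. 0. 0.]
print(cell[0] is buf)  # True: the buffer was reused across calls
```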
......
......@@ -108,7 +108,7 @@ case if ``borrow`` was True, the thunk would be allowed to reuse (or
The compile cache is based upon the C++ code of the graph to be compiled.
So, if you change compilation configuration variables, such as
:attr:`config.blas.ldflags`, you will need to manually remove your compile cache,
using ``Theano/bin/theano-compiledir clear``
using ``Theano/bin/theano-cache clear``
Theano also implements a lock mechanism that prevents
multiple compilations within the same compilation directory (to avoid
......
......@@ -32,16 +32,27 @@ default values.
reference to ``value`` (i.e. casting prohibited).
If ``strict`` is False, then casting may happen, but downcasting should
only be used in two situations:
* if ``allow_downcast`` is True
* if ``allow_downcast`` is ``None`` and the default behavior for this
type allows downcasting for the given ``value`` (this behavior is
type-dependent, you may decide what your own type does by default)
We need to define ``filter`` with three arguments. The second argument
must be called ``strict`` (Theano often calls it by keyword) and must
have a default value of ``False``. The third argument must be called
``allow_downcast`` and must have a default value of ``None``.
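A minimal sketch of the ``strict``/``allow_downcast`` logic described above, using NumPy dtypes. The function name and the float32 default are hypothetical, not Theano's implementation:

```python
import numpy

def filter_value(value, strict=False, allow_downcast=None, dtype='float32'):
    """Toy sketch of a Type.filter method (not Theano's actual code)."""
    arr = numpy.asarray(value)
    if arr.dtype == numpy.dtype(dtype):
        return arr
    if strict:
        # strict mode: the value must already be of the right dtype
        raise TypeError("strict mode forbids casting")
    if numpy.can_cast(arr.dtype, dtype) or allow_downcast:
        return arr.astype(dtype)
    raise TypeError("refusing to downcast %s to %s" % (arr.dtype, dtype))

print(filter_value([1, 2], allow_downcast=True).dtype)  # float32
```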
.. method:: filter_inplace(value, storage, strict=False, allow_downcast=None)
If filter_inplace is defined, it will be called instead of
filter(). This allows reusing previously allocated memory. As
of this writing, it is used only when we transfer new data to a
shared variable on the GPU.
``storage`` will be the old value, i.e. the old numpy array,
CudaNdarray, etc.
.. method:: is_valid_value(value)
Returns True iff the value is compatible with the Type. If
......
all: presentation.pdf
presentation.pdf: presentation.tex pics/f_optimized.png pics/logreg_pydotprint_prediction.png
# pics/f_unoptimized.png pics/logreg_pydotprint_predic.png pics/logreg_pydotprint_train.png
pdflatex presentation.tex
pics/f_optimized.png: simple_example.py
python simple_example.py
pics/logreg_pydotprint_prediction.png: logreg_example.py
python logreg_example.py
#pics/f_unoptimized.png: simple_example.py
# python simple_example.py
import numpy
import theano


class DoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

x = theano.tensor.matrix()
f = theano.function([x], DoubleOp()(x))
inp = numpy.random.rand(5, 5)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print inp
print out
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))  # Probability of having a one
prediction = p_1 > 0.5                 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum()     # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w-0.01*gw, b: b-0.01*gb},
        name="train")
predict = theano.function(inputs=[x], outputs=prediction,
        name="predict")

if any([n.op.__class__.__name__ == 'Gemv' for n in train.maker.env.toposort()]):
    print 'Used the cpu'
elif any([n.op.__class__.__name__ == 'GpuGemm' for n in train.maker.env.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.env.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])

# Print the graph used in the slides
theano.printing.pydotprint(predict,
        outfile="pics/logreg_pydotprint_predic.png",
        var_with_name_simple=True)
theano.printing.pydotprint_variables(prediction,
        outfile="pics/logreg_pydotprint_prediction.png",
        var_with_name_simple=True)
theano.printing.pydotprint(train,
        outfile="pics/logreg_pydotprint_train.png",
        var_with_name_simple=True)