Merge pull request #1457 from nouiz/gpu_min_max

Gpu min max, more pattern implemented

3d41cfaa · lamblin · ae91b627 · d7bad6de · 3d41cfaa · 3d41cfaa
--- a/NEWS_DEV.txt
+++ b/NEWS_DEV.txt
+.. _NEWS:
+===================
+DRAFT Release Notes
+===================
+git log -p rel-0.6rc3... |grep Merge|grep '#' |cut -f 8 -d ' ' | replace "#" "* https://github.com/Theano/Theano/pull/"
+Done up to PR 1456
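The shell pipeline above turns merge commits into pull-request links. A minimal Python sketch of the same idea follows, assuming the log text is already available as a string. The function name `merge_lines_to_links` and the sample log are hypothetical, and the regular expression is narrower than the `grep`/`cut` filter: it only accepts lines of the form "Merge pull request #N".

```python
import re

PULL_URL = "https://github.com/Theano/Theano/pull/"


def merge_lines_to_links(log_text):
    """Return one '* <url>' bullet per 'Merge pull request #N' line."""
    links = []
    for line in log_text.splitlines():
        match = re.search(r"Merge pull request #(\d+)", line)
        if match:
            links.append("* " + PULL_URL + match.group(1))
    return links


# Hypothetical excerpt of `git log` output, for illustration only.
sample_log = """\
commit 3d41cfaa
    Merge pull request #1457 from nouiz/gpu_min_max

    Gpu min max, more pattern implemented
"""

print("\n".join(merge_lines_to_links(sample_log)))
```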
+Theano Development version
+==========================
+Highlights:
+ * Python 3.3 compatibility with buildbot.
+ * Full advanced indexing support.
+ * Better Windows 64 bits support.
+ * New profiler.
+ * Better error messages that help debugging.
+Installation:
+ * Canopy support (direct link to MKL):
+   * On Linux and Mac OSX (Frédéric B., Robert Kern)
+   * On Windows (Edward Shi, Frédéric B.)
+ * Anaconda instructions (Pascal L., Frederic B.)
+ * Doc Ubuntu 13.04 (Frederic B.)
+Committers for this rc3 only:
+Bug fix:
+ * Fix wrong result of GpuDownsampleFactorMaxGrad on Mac OSX. (Pascal L.)
+ * Auto-Detect and work around a bug in BLAS on MacOS X (Pascal L.)
+ * Work around a bug in MacOS X. If 2 compiled modules had the same name, the OS or Python did not always return the right one, even when we used the right handle to it. (Pascal L.)
+   Use this hash in the Python module, and in %(nodename)s, so that different helper functions in the support code for different Ops will always have different names.
+ * Fix infinite loop related to Scan on the GPU. (Pascal L.)
+ * Fix ConstructSparseFromList.infer_shape. (Pascal L., reported by Rami Al-Rfou')
+ * (introduced in the development version after 0.6rc3 release) (Frederic B.)
+   Reductions that upcast the input over no axis (e.g. calling theano.sum() on a scalar when the original dtype isn't float64 or [u]int64) produced bad results, as we didn't upcast the inputs in the code; we just copied them.
+ * Fix some cases of theano.clone() when we give it a replacement of x that is a function of x. (Razvan P., reported by Akio Takano)
+New Features:
+ * Python 3.3 compatible (abalkin, Gabe Schwartz, Frederic B.)
+ * A new profiler (Frederic B.)
+   The new profiler can now profile memory with the Theano flag profile_memory=True.
+   The ProfileMode can no longer profile memory and prints a message about it.
+   We now raise an error if we try to profile when the GPU is enabled and the environment variable
+   that forces the driver to synchronize kernel launches is not set correctly.
+   Otherwise the profile information is useless.
+   The new profiler supports enabling/disabling the garbage collection.
+ * Adds tensor.tri, tensor.triu, and tensor.tril functions that wrap Numpy equivalents (Jeremiah Lowin)
+ * Adds tensor.nonzero, tensor.flatnonzero functions that wrap Numpy equivalents (Jeremiah Lowin)
+ * Adds tensor.nonzero_values to get around lack of advanced indexing for nonzero elements (Jeremiah Lowin)
+ * Make {inc,set}_subtensor work on output of take. (Pascal L.)
+ * When device=cpu and force_device=True, force that we disable the gpu. (Frederic B.)
+ * Better Windows 64 bits support for indexing/reshaping (Pascal L.)
+ * Full advanced indexing support (John Salvatier, seberg)
+ * Add theano.tensor.stacklist(). Recursively stack lists of tensors to maintain similar structure (Matthew R.)
+ * Add Theano flag value: on_opt_error=pdb (Olivier D.)
+ * GpuSoftmax[WithBias] for bigger rows. (Frederic B.)
+ * Make Erfinv work on the GPU (Guillaume Desjardins, Pascal L.)
+ * Add "theano-cache basecompiledir purge" (Pascal L.)
+   This purges all the compiledirs that are in the base compiledir.
+ * A_tensor_variable.zeros_like() now supports the dtype parameter (Pascal L.)
+ * More stable reduce operations by default (Pascal L.)
+   Add an accumulator dtype to CAReduceDtype (acc_dtype).
+   By default, acc_dtype is float64 for float32 inputs;
+   the result is then cast to the specified output dtype (float32 for float32 inputs).
+ * Test default blas flag before using it (Pascal L.)
+   This makes it work correctly by default if no blas library is installed.
+ * Add cuda.unuse() to help tests that need to enable/disable the GPU (Fred)
+ * Add theano.tensor.nnet.ultra_fast_sigmoid and the opt(disabled by default) local_ultra_fast_sigmoid. (Frederic B.)
+ * Add theano.tensor.nnet.hard_sigmoid and the opt(disabled by default) local_hard_sigmoid. (Frederic B.)
+ * Add class theano.compat.python2x.Counter() (Mehdi Mirza)
+ * Allow a_cuda_ndarray += another_cuda_ndarray for 6d tensor. (Frederic B.)
+ * Make the op ExtractDiag work on the GPU. (Frederic B.)
+ * New op theano.tensor.chi2sf (Ethan Buchman) TODO ??? LICENSES????
+ * Lift Flatten/Reshape toward input on unary elemwise. (Frederic B.)
+   This makes the "log(1-sigmoid) -> softplus" stability optimization apply even with a flatten/reshape in the middle.
+ * Make MonitorMode use the default optimizers config and allow it to change used optimizers (Frederic B.)
+ * Add support for ScalarOp.c_support_code in GpuElemwise. (Frederic B.)
+ * Also make the Psi function run on GPU. (Frederic B.)
+ * Make tensor.outer(x,y) work when ndim != 1 as numpy.outer.
+ * Kron op: Speed up/generalize/GPU friendly. (Frederic B.)
+   (It is not an op anymore, but reuses existing ops.)
+ * Add GPU max for pattern (0, 1) and add all GPU max patterns to GPU min. (Frederic B.)
+ * Add GpuEye (Frederic B.)
+ * Make GpuCrossentropySoftmaxArgmax1HotWithBias and GpuCrossentropySoftmax1HotWithBiasDx work for bigger inputs (Frederic B., reported by Ryan Price)
+ * Finish and move out of sandbox theano.sparse.basic.true_dot (Nicolas Bouchard, Frederic B.)
+   Also document all sparse dot variants.
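The "More stable reduce operations" item above describes accumulating float32 inputs in float64 and casting only the final result. The pure-Python sketch below illustrates why that helps. It does not use Theano: `to_float32` and both sum helpers are hypothetical names, and float32 arithmetic is emulated by rounding through `struct`.

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def sum_float32_accumulator(values):
    """Sum while rounding the running total to float32 after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + v)
    return total


def sum_float64_accumulator(values):
    """Accumulate in float64, then cast only the final result to float32."""
    total = 0.0
    for v in values:
        total += v
    return to_float32(total)


count = 100000
x = to_float32(0.1)          # the float32 value closest to 0.1
values = [x] * count
reference = count * x        # reference result computed in float64

err32 = abs(sum_float32_accumulator(values) - reference)
err64 = abs(sum_float64_accumulator(values) - reference)
print("float32 accumulator error:", err32)
print("float64 accumulator error:", err64)
```

The float64 accumulator rounds once at the end, so its error stays within the float32 rounding of the final value, while the float32 accumulator adds a rounding error at every step.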
+Interface Deprecation (a warning is printed):
+ * The mode ProfileMode is now deprecated, use the Theano flag profile=True instead.
+ * New theano.sparse_grad() interface to get the sparse grad of a_tensor[an_int_vector]. (Frederic B.)
+   This can speed up the sparse computation when a small fraction of a_tensor is taken.
+   Deprecate the old interface for this. (Frederic B.)
+Interface Change:
+ * Add -m32 or -m64 in the module cache key and add the python bitwidth in the compiledir path. (Pascal L.)
+ * mrg.normal now has a mandatory size parameter. It was crashing with the default value of None. (Olivier D.)
+ * Remove the deprecated passing of multiple modes to a Theano function. (Frederic B.)
+New Interface (reuse existing functionality):
+ * Add hostname as a var in compiledir_format (Frederic B.)
+New debug feature:
+Speed-ups:
+ * Faster GpuAdvancedIncSubtensor1 on Fermi GPU (and up) on matrix. (Vivek Kulkarni)
+ * Faster GPUAdvancedIncSubtensor1 in some cases on all GPU (Vivek Kulkarni)
+ * Implemented c_code for AdvancedSubtensor1 (abalkin)
+ * Add the equivalent of -march=native to g++ command line. (Frederic B., Pascal L.)
+ * Speed up compilation with Scan (Jan Schlüter)
+ * Add MakeVector.c_code (Fred)
+ * Add Shape.c_code (Fred)
+ * Optimize Elemwise when all the inputs are Fortran-ordered (Frederic B.)
+   We now generate a Fortran-ordered output and use vectorizable code.
+ * Add ScalarOp.c_code_contiguous interface and do a default version. (Frederic B.)
+   This could optimize elemwise by helping the compiler generate SIMD instructions.
+ * Use ScalarOp.c_code_contiguous with amdlibm. (Frederic B.)
+   This speeds up exp, pow, sin, cos, log, log2, log10 and sigmoid when the input is contiguous in memory.
+ * A fix that removes a local_setsubtensor_of_allocs optimization warning and enables the optimization in that case. (Frederic B., reported by John Salvatier)
+ * Make inv_as_solve optimization work (Matthew Rocklin)
+Crash fixes:
+ * AdvancedSubtensor1: allow broadcasted index vector. (Frederic B., reported by Jeremiah Lowin)
+ * Fix compute_test_value for ifelse (Olivier D., reported by Bitton Tenessi)
+ * Fix import error with some versions of NumPy (Olivier D.)
+ * Fix Scan grad exception (Razvan P., reported by Nicolas BL)
+ * Fix compute_test_value for a non_sequence when calling the gradient of Scan (Pascal L., reported by Bitton Tenessi).
+ * Crash fix in Scan following interface change in 0.6rc2 (Razvan P.)
+ * Crash fix on Scan (Razvan P.)
+ * Fix crash in Scan gradient related to compute_test_value (Frederic B., reported by Bitton Tenessi)
+ * Fix a scan optimization warning/error depending on Theano flags (Frederic B.)
+ * Fixed crash for unimplemented elemwise gradient (Olivier D., reported by Michael McNeil Forbes)
+ * Fix crash in the elemwise python code for some big shapes with powers of 2. (Sina Honari, Pascal L.)
+ * Fix compile and import errors on Windows (bbudescu)
+ * Fix GPU compilation on Windows (XterNalz)
+ * Fix local_abs_merge optimization crash (Pascal L, reported by Jeremiah Lowin)
+ * Fix import theano crash when g++ isn't there (OD)
+ * Fix crash related to rebuild of Theano graph (Pascal L., reported by Divine Eguzouwa)
+ * Fix crash during compilation (David Warde-Farley)
+ * Crash fix in the grad of GPU op in corner case (Pascal L.)
+ * Crash fix on MacOS X (Robert Kern)
+ * theano.misc.gnumpy_utils.garray_to_cudandarray() now sets strides correctly for dimensions of 1. (Frederic B., reported by Justin Bayer)
+ * Fix crash during optimization with consecutive sums and some combinations of axes (Frederic B., reported by Çağlar Gülçehre)
+ * Fix crash with keepdims and negative axis (Frederic B., reported by David W.-F.)
+ * Fix crash of theano.[sparse.]dot(x,y) when x or y is a vector. (Frederic B., reported by Zsolt Bitvai)
+ * Fix opt crash/disabled with ifelse on the gpu (Frederic B, reported by Ryan Price)
+Others:
+ * Theano flags are now evaluated lazily, only if requested (Frederic B.)
+ * Fix test when g++ is not available (Frederic B.)
+ * Typo/pep8 (Olivier D., Frederic B.)
+ * Update doc (Ben McCann)
+ * Doc compatibility guide (abalkin)
+ * Doc the MPI and load op (Frederic B.)
+ * Add manual instructions for OpenBLAS on Ubuntu (Jianri Li)
+ * Doc fixes (Yaroslav Halchenko)
+ * Better/more error message (Frederic B., Pascal L., Ian Goodfellow)
+ * More doc (Frederic B.)
+ * Fix Error reporting with GpuConv (Frederic B., reported by Heng Luo and Nicolas Pinto)
+ * Update BLAS compilation doc on windows to use OpenBLAS (Olivier D.)
+ * The infer_shape tester method now warns if the shape values could hide errors. (Frederic B.)
+ * Travis-CI now tests with SciPy the parts that need it (Frederic B.)
+ * Export some functions that work on CudaNdarray for Windows (Frederic B.)
+ * If the user specifies an -arch=sm_* value in the Theano flags for the GPU, don't add one (Frederic B., Pascal L.)
+ * If a C thunk returns an error, check if a Python exception is set. Otherwise, set a default one (Pascal L.)
+ * Crash fix introduced in the development version (Wei LI)
+ * Added BLAS benchmark result (Frederic B., Ben McCann)
+ * Fix code comment (Hannes Schulz)
+ * More stable tests (Frederic B.)
+ * Add utt.assert_allclose(a, b) to get better error messages. (Frederic B.)
+ * Better error message with compute_test_value (Frederic, reported by John Salvatier)
+ * Stochastic order behavior fix (Frederic B.)
+ * Simpler initial graph for subtensor infer shape (Olivier D.)
+   The optimizer was already doing this simplification, but this allows better reading of the graph before optimization.
+ * Better detection of non-aligned ndarray (Frederic B.)
+ * Update MRG multinomial gradient to the new interface (Mehdi Mirza)
+ * Implement Image2Neibs.perform() to help debug (Frederic B.)
+ * Remove Theano flags from the compilation key (Frederic B.)
+ * Make theano-nose work on executable *.py files. (Alistair Muldal)
+ * Make theano-nose work with older nose versions (Frederic B.)
+ * Add extra debug info in verify_grad() (Frederic B.)
+Todo for the final release:
+ * update the NEWS.txt file.
+=============
+Release Notes
+=============
+Theano 0.6rc3 (February 14th, 2013)
+===================================
+Highlights:
+ * Windows related fixes.
+ * Speed-ups.
+ * Crash fixes.
+ * A few small interface changes.
+ * GPU memory leak fix.
+ * A few corner cases fixes without incidence.
+ * More Theano determinism
+ * tensor.{dot,tensordot} more complete/faster/GPU friendly.
+ * tensor.tensordot now supports Rop/Lop
+ * tensor.dot supports n-dimensional inputs like NumPy
+ * To support more NumPy syntax:
+     * Add theano.tensor.take()
+     * Add a_tensor_variable.{sort,dot,std,argmin,argmax,argsort,clip,conj,conjugate,repeat,round,trace,real,imag,take}
+Committers for this rc3 only:
+Frederic Bastien
+Ian Goodfellow
+Pascal Lamblin
+Jeremiah Lowin
+abalkin
+Olivier Delalleau
+Razvan Pascanu
+Rami Al-Rfou'
+Vivek Kulkarni
+Guillaume Desjardins
+David Warde-Farley
+Eric Hunsberger
+Amir Elaguizy
+James Bergstra
+Bug fix:
+ * Fix memory leak on the GPU in some corner cases with the Theano flags `allow_gc=False`. (Frederic B., reported by Jonas Gehring)
+ * Fix copy of random state between graphs. (Guillaume D.)
+   http://deeplearning.net/software/theano/tutorial/examples.html#copying-random-state-between-theano-graphs
+ * Fix wrong dtype in sandbox.linalg.ExtractDiag with shape of 0. (Frederic B., reported by abalkin)
+ * Correctly support arrays with more than 2*10e32 elements in AdvancedSubtensor1. (Abalkin)
+ * Fix wrong broadcast dimensions of output of Repeat op. (Abalkin)
+   We were using the input's broadcasting pattern in some cases when we shouldn't.
+ * Fix theano.sandbox.linalg.eigh grad that didn't always return the right dtype. (Frederic B., Olivier D.)
+New Features:
+ * More Theano determinism (Ian G., Olivier D., Pascal L.)
+     * Add and use a new class OrderedSet.
+     * theano.grad is now deterministic.
+     * Warn when the user uses a (non ordered) dictionary and this causes non-determinism in Theano.
+     * The Updates class was non-deterministic; replaced it with the OrderedUpdates class.
+ * tensor.tensordot now supports Rop/Lop (Jeremiah Lowin)
+   This removes the classes TensorDot and TensorDotGrad. The Dot/Elemwise ops are used instead.
+ * tensor.dot supports n-dimensional inputs like NumPy (Jeremiah Lowin)
+   Works on the GPU too.
+ * The Theano flag `nvcc.flags` now accepts `-ftz=true`, `--prec-div=false` and `--prec-sqrt=false` as values. (Frederic B.)
+   To enable all of them, use the Theano flag `nvcc.flags=--use_fast_math`.
+ * New op theano.sparse.ConstructSparseFromList (Rami Al-Rfou', Vivek Kulkarni)
+ * Make Theano work with Anaconda on Windows. (Pascal L.)
+ * Add tensor_var.diagonal and theano.tensor.{diag,diagonal}. (abalkin)
+ * AdvancedSubtensor1 can now have a sparse gradient. (Rami Al-Rfou', Vivek Kulkarni)
+ * Implemented GpuContiguous.grad. (Ian G.)
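The determinism items above rely on containers that iterate in a fixed order. The sketch below shows the general idea of an ordered set built on `collections.OrderedDict`. `MiniOrderedSet` is a hypothetical illustration and is not Theano's OrderedSet class.

```python
from collections import OrderedDict


class MiniOrderedSet:
    """Hypothetical minimal ordered set: unique items, insertion order kept."""

    def __init__(self, iterable=()):
        self._items = OrderedDict()
        for item in iterable:
            self.add(item)

    def add(self, item):
        # Re-adding an existing item keeps its original position.
        self._items[item] = None

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


names = ["w2", "b1", "w1", "b1", "w2"]
ordered = list(MiniOrderedSet(names))
print(ordered)
```

Iterating over such a container visits items in the same order on every run, which a plain `set` does not guarantee.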
+Interface Deprecation (a warning is printed):
+ * theano.misc.strutil.renderString -> render_string (Ian G.)
+ * Print a warning when using dictionary and this makes Theano non-deterministic.
+Interface Change:
+ * Raise an error when theano.shared called with a theano variable. (Frederic B.)
+ * Don't print warning for bug before Theano 0.5 by default. (Frederic B.)
+ * Theano functions now always have a field name, default to None. (Frederic B.)
+ * The fgraph of a Theano function (fct.fgraph) has a copy of the Theano function name field. (Ian G.)
+   This is needed to allow the fgraph to know it.
+ * In the grad method, when asked to raise an error if there is no path between the variables, we didn't always raise one. (Ian G.)
+   We returned the mathematically right answer 0 in those cases.
+ * get_constant_value() was renamed get_scalar_constant_value() and raises a new exception tensor.basic.NotScalarConstantError. (Ian G.)
+ * theano.function raises an error when trying to replace inputs with the 'given' parameter. (Olivier D.)
+   This was doing nothing, the error message explains what the user probably wants to do.
+New Interface (reuse existing functionality):
+ * tensor_var.sort() as a shortcut for theano.tensor.sort. (Jeremiah Lowin)
+   We were already doing this for argsort.
+ * Add theano.tensor.take() and a_tensor_var.take() to support NumPy syntax. (abalkin) 
+ * Add a_tensor_variable.{dot,std,argmin,argmax,argsort,clip,conj,conjugate,repeat,round,trace,real,imag}. (abalkin)
+New debug feature:
+ * DebugMode prints more info when there is an error. (Frederic B.)
+ * Better profiling of test time with `theano-nose --time-profile`. (Frederic B.)
+ * Detection of infinite loop with global optimizer. (Pascal L.)
+ * DebugMode.check_preallocated_output now also works on Theano function output. (Pascal L.)
+ * DebugMode will now complain when the strides of CudaNdarray of dimensions of 1 are not 0. (Frederic B.)
+Speed-ups:
+ * c_code for SpecifyShape op. (Frederic B.)
+ * cross-entropy optimization now works when specify_shape is used. (Pascal L.)
+ * The Scan optimizations ScanSaveMem and PushOutDot1 are applied more frequently. (Razvan P., reported by Abalkin)
+   A skipped optimization warning was printed.
+ * dot(vector, vector) now faster with some BLAS implementations. (Eric Hunsberger)
+   OpenBLAS and possibly others didn't call {s,d}dot internally when we called {s,d}gemv.
+   MKL was doing this.
+ * Compilation speed up: Take the compiledir lock only for ops that generate c_code. (Frederic B)
+ * More scan optimizations (Razvan P.)
+     * Opt to make RNN fast in Theano.
+     * Optimize some cases of dot, by moving them outside of Scan.
+     * Move some sequences outside of scan too.
+     * Merge more scan inputs, mostly byproduct of other Scan optimizations.
+ * c_code for theano.sparse.AddSD. (Rami Al-Rfou',  Vivek Kulkarni)
+Crash Fixes:
+ * Fix crash about dimshuffle. (abalkin)
+ * Fix crash at compilation. (Olivier D.)
+ * Fix openmp detection. (Pascal L.)
+   Resulted in a crash with EPD on Windows.
+ * Fix for new BLAS interface in SciPy. (Olivier D.)
+   Fix crash with some development version of SciPy.
+ * GpuSum works with bigger shapes when summing on the first dim of a 3d tensor. (Frederic B., reported by Chris Currivan)
+ * Windows compilation crash fix. (Frederic B.)
+ * Make CrossentropySoftmax1HotWithBiasDx and CrossentropySoftmaxArgmax1HotWithBias support uint* dtype. (Frederic B., reported by Mark Fenner)
+ * Fix GpuSoftmax and GpuSoftmaxWithBias crash on GTX285. (Frederic B.)
+ * Fix crash due to a race condition when importing theano. (Ian G.)
+ * Fix crash from path problem with `theano-nose --batch`. (Abalkin) 
+ * Fix crash with tensor.roll(Var, iscalar). (Frederic B., reported by Jeremiah Lowin)
+ * Fix compilation crash with llvm on Mac. (Abalkin)
+ * Fix the grad of Scan that wrongly reported that there is no connection between cost and parameters. (Razvan P.)
+ * The infer shape mechanism now forces broadcasted dimensions to have a shape known to be equivalent to one during compilation.
+   Sometimes, we were not able to know this before run time, which resulted in a crash. (Frederic B.)
+ * Fix compilation problems on GPU on Windows. (Frederic B.)
+ * Fix copy on the GPU with big shape for 4d tensor (Pascal L.)
+ * GpuSubtensor didn't set the stride to 0 for dimensions of 1. This could lead to a check failing later, which caused a crash. (Frederic B., reported by vmichals)
+Theoretical bugfix (bug that won't happen with current Theano code, but if you messed with the internals, could have affected you):
+ * GpuContiguous, GpuAlloc, GpuDownSampleGrad, Conv2d now check the strides of preallocated outputs before using them. (Pascal L.)
+ * GpuDownSample, GpuDownSampleGrad didn't work correctly with negative strides in their output due to a problem with nvcc (Pascal L, reported by abalkin?)
+Others:
+ * Fix race condition when determining if g++ is available. (Abalkin)
+ * Documentation improvements. (Many people including David W-F, abalkin, Amir Elaguizy, Olivier D., Frederic B.)
+ * The current GPU back-end has a new function CudaNdarray_prep_output(CudaNdarray ** arr, int nd, const int * dims) (Ian G)
+=============
+Release Notes
+=============
+Theano 0.6rc2 (November 21st, 2012)
+===================================
+Highlights:
+ * Fix for a few regressions introduced in 0.6rc1.
+ * A few new features.
+ * Speed-ups.
+ * Scan fixes.
+ * Crash fixes.
+ * A few small interface changes.
+Committers for this rc2 only:
+Razvan Pascanu
+Pascal Lamblin
+Frederic Bastien
+Ian Goodfellow
+Jeremiah Lowin
+Caglar Gulcehre
+Jey Kottalam
+Matthew Rocklin
+abalkin
+Regressions in 0.6rc1 fixed:
+ * Fixed the scan gradient dtype issue. In 0.6rc1, some upcasts were inserted. (Razvan P.)
+ * Now grad() will do as before 0.6rc1 for float, i.e. the grad dtype will be the same as the inputs inside the graph. If you ask for the direct grad, it will return the computed dtype. (Pascal L.)
+Wrong results fixes:
+ * Scan in some cases didn't return the right results. (Razvan P., reported by Jeremiah L.)
+   This happened if you had a state with only neg tap and the output of the state was a function of some sequence.
+   If you had multiple states, there was no problem.
+ * Fixed bug in Scan with multiple outputs,
+   where one output would sometimes overwrite another one. (Razvan P.)
+ * Clip.grad treated the gradient with respect to the clipping boundary as always 0. (Ian G.)
+Interface changes:
+ * We no longer support unaligned ndarrays in Python code. (Frederic B.)
+   We did not support it in C code and supporting it in Python code made
+   the detection harder.
+ * Now we only officially support SciPy 0.7.2 and NumPy 1.5.0 (Frederic B.)
+   We weren't and aren't testing with older versions.
+ * The theano.sparse.SparseType is available even when SciPy is not (Frederic B.)
+ * Fixed issue where members of consider_constant grad parameter
+   were treated differently from Constant variables. (Ian G.)
+ * Removed the parameter g_cost from theano.grad(). (Ian G.)
+   Use the new more powerful parameter known_grads instead.
+NumPy interface support:
+ * theano.tensor.where is an alias for theano.tensor.switch to support NumPy semantics. (Ian G.)
+ * TensorVariable objects now have dot, argmin, argmax, clip, conj, repeat, trace, std, round,
+   ravel and argsort functions and the real and imag properties as numpy.ndarray objects.
+   The functionality was already available in Theano. (abalkin)
+Speed-ups:
+ * A C version of the SoftMax op (Razvan P.)
+   There was C code for the softmax with bias code.
+ * Faster GpuIncSubtensor (Ian G.)
+ * Faster copy on the GPU for 4d tensor. (Ian G.)
+ * The fix of flatten infer_shape re-enables an optimization (Pascal L.)
+   * The bug was introduced in 0.6rc1.
+ * Enable inc_subtensor on the GPU when updating it with a float64 dtype. (Ian G.)
+   It was causing an optimization warning.
+ * Make DeepCopy reuse preallocated memory. (Frederic B.)
+ * Move the convolution to the GPU when the image shape and logical image shape differ. (Frederic Bastien)
+ * C code for the View Op (Razvan P., Pascal L.)
+New Features:
+ * Added a monitoring mode "MonitorMode" as a debugging tool. (Olivier D.)
+ * Allow integer axes when keepdims==True (Jeremiah Lowin)
+ * Added erfinv and erfcinv op. (Jey Kottalam)
+ * Added tensor.batched_dot(). (Caglar Gulcehre)
+   It uses scan behind the scenes, but makes doing this easier.
+ * theano.get_constant_value(x) (Frederic B.)
+   This tries to have x as a constant int.
+   This does some constant folding to try to convert x into an int.
+   Used by some optimizations.
+ * Add theano.tensor.io.{MPIRecv,MPIRecvWait,MPISend,MPISendWait} (Matthew Rocklin)
+   Theano does not automatically use them. It is up to you to use them and split your computations.
+ * Added theano.sandbox.linalg.eig (abalkin)
+ * Started some support for Python3 (abalkin)
+   setup.py supports python3 now.
+   It calls 2to3 during the setup.
+   Python3 is not fully supported as we didn't update the C code.
+Crash Fixes:
+ * Fix a crash related to scan.grad due to the new mechanism. (Ian G.)
+ * Fix an optimization warning. Now it gets optimized. (Frederic B.)
+ * Fix crash introduced in 0.6rc1 in theano.grad (Ian G.)
+ * Fix crash introduced in 0.6rc1 in the grad of scan (Razvan P.)
+ * Fix crash introduced in 0.6rc1 in the grad of clip (Ian G.)
+   Also implement the gradient on the min/max bound.
+ * Fix crash in the grad of tensor.switch for int (Ian G.)
+ * Fix crash when mixing shared variable on the GPU and sparse dot. (Pascal L.)
+ * Fix crash as sometimes sparse.dot would return a different dtype number
+   that is equivalent but not the one expected. (Pascal L., reported by Rami Al-Rfou)
+ * Better error msg (Ian G.)
+ * Move all sparse random functions back to sandbox as they don't have a state inside Theano. (Pascal L.)
+   They were moved outside the sandbox in 0.6rc1
+ * LoadFromDisk now is allowed to only support some memmap mode. (Pascal L.)
+   Otherwise, this was causing errors, segmentation faults or wrong results.
+ * Fix import problem on PiCloud (Jeremiah Lowin)
+    * You need to use the c|py linker with the default
+      environment. Otherwise, you need to create your own environment.
+ * Fix a crash during optimization when we take a subtensor of a constant with a non constant index. (Ian G.)
+ * Better handling and error message of gradients on integer. (Ian G.)
+ * Fixed a crash where Scan assumed all TypeErrors raised by the grad function were due to undefined gradients (Ian G.)
+Other:
+ * Doc typo fixes, Doc updates, Better error messages: Olivier D., David W.F., Frederic B., James B., Matthew Rocklin, Ian G., abalkin.
+=============
+Release Notes
+=============
+Theano 0.6rc1 (October 1st, 2012)
+=================================
+Highlights:
+ * Bug fixes, crash fixes, CPU and GPU speed up.
+ * theano_var.eval({other_var: val[,...]}) to simplify the usage of Theano (Ian G.)
+ * New default linker `cvm`. This is the execution engine that tells ops to run in certain orders.
+   It is now implemented in C and enables lazy evaluation of ifelse op.
+ * Faster theano.function compilation. (Pascal L., Ian G.)
+ * Big sparse submodule update and documentation of it. (Nicolas Bouchard)
+ * Use GPU asynchronous functionality (Frederic B.)
+ * Better Windows support.
+Known bugs:
+ * A few crash cases that will be fixed by the final release.
+Bug fixes:
+ * Outputs of Scan nodes could contain corrupted values: some parts of the
+   output would be repeated a second time, instead of the correct values.
+   It happened randomly, and quite infrequently, but the bug has been present
+   (both in Python and Cython) since April 2011. (Pascal L.)
+ * In Sparse sandbox, fix the grad of theano.sparse.sandbox.sp.row_scale.
+   It did not return the right number of elements. (Frederic B.)
+ * set_subtensor(x[int vector], new_value) when moved to the GPU
+   was transformed into inc_subtensor on the GPU. Now we have a correct
+   (but slow) GPU implementation.
+   Note 1: set_subtensor(x[slice[,...]], new_value) was working correctly
+   in all cases as well as all inc_subtensor.
+   Note 2: If your code was affected by the incorrect behavior, we now print
+   a warning by default (Frederic B.)
+ * Fixed an issue whereby config values were used as default arguments,
+   with those defaults then stuck at old values if the config variables were
+   changed during program execution. (David W-F)
+ * Fixed many subtle bugs involving mutable default arguments which may have
+   led to unexpected behavior, such as objects sharing instance variables
+   they were not supposed to share. (David W-F)
+ * Correctly record the GPU device number used when we let the driver select it.
+   (Frederic B.)
+ * Min, max with NaN in inputs did not return the right output. (Pascal L.)
+ * The grad of TensorDot, was returning the wrong shape for some combination of axes.
+   We now raise NotImplementedError in those cases. (Frederic B.)
+ * conv2d with subsample >2 returned wrong values. (Pascal L.)
+     * Fixed when mode==valid, disabled when mode==full
+ * theano.sparse.CSMGrad op (generated by the grad of CSM) didn't
+   handle unsorted input correctly and gradient that is sparser
+   than the input. In that case, a bad result was returned. But this could
+   happen only when a sparse input of a Theano function was not
+   sorted. This happens for example with sparse advanced indexing from
+   scipy. The result is, most of the time, NaN in the graph.
+   (Yann Dauphin)
+ * theano.sparse._dot(CSC matrix, dense) optimized version UsmmCSCDense didn't
+   correctly handle non-contiguous inputs/outputs. (Pascal L.)
+ * Fix a corner case in CVM updates. (Pascal L.)
+   This happened if the update to a shared variable is itself after optimization.
+   The CVM was not used by default.
+ * Fix the view_map of sparse.Transpose and sparse.sandbox.sp.RowScale. (Frederic B.)
+   This probably didn't cause problems as there is only the UsmmCscDense op
+   (used when calling Usmm with a CSC matrix) that could interfere with them.
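Two of the bug fixes above concern Python default arguments: config values captured as defaults, and mutable defaults shared between calls. The sketch below reproduces both pitfalls and the usual fix in plain Python. The `Config` class and all function names are hypothetical stand-ins, not Theano code.

```python
# Hypothetical stand-in for a configuration object; not Theano's config.
class Config:
    verbose = False


config = Config()


def buggy_append(item, bucket=[]):
    """Bug: the default list is created once and shared by every call."""
    bucket.append(item)
    return bucket


def fixed_append(item, bucket=None):
    """Fix: build a fresh list inside the call when none is given."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def buggy_is_verbose(verbose=config.verbose):
    """Bug: config.verbose is read once, at function definition time."""
    return verbose


def fixed_is_verbose(verbose=None):
    """Fix: read the config value at call time."""
    if verbose is None:
        verbose = config.verbose
    return verbose


buggy_append(1)
shared = buggy_append(2)      # still holds the 1 from the previous call
fresh = fixed_append(2)       # independent list

config.verbose = True         # change the setting during execution
stale = buggy_is_verbose()    # keeps the value seen at definition time
current = fixed_is_verbose()  # sees the updated value
print(shared, fresh, stale, current)
```

Using `None` as a sentinel and resolving the real default inside the function body avoids both problems.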
+Deprecation:
+ * Deprecated the Module class (Ian G.)
+   This was a predecessor of SharedVariable with a less pythonic philosophy.
+Interface changes:
+ * Now the base version requirements are numpy >= 1.5.0 and the optional scipy >= 0.7.2.
+ * In Theano 0.5, we removed the deprecated sharedvar.value property.
+   Now we raise an error if you access it. (Frederic B.)
+ * theano.function does not accept duplicate inputs, so function([x, x], ...)
+   does not work anymore. (Pascal L.)
+ * theano.function now raises an error if some of the provided inputs are
+   not part of the computational graph needed to compute the output, for
+   instance, function([x, y], [y]). You can use the kwarg
+   ``on_unused_input={'raise', 'warn', 'ignore'}`` to control this.
+   (Pascal L.)
+ * New Theano flag "on_unused_input" that defines the default value of the
+   previous point. (Frederic B.)
+ * tensor.alloc() now raises an error during graph build time
+   when we try to create fewer dimensions than the number of dimensions
+   the provided value has. In the past, the error was at run time.
+   (Frederic B.)
+ * Remove theano.Value and related stuff (Ian G.)
+   This was a test of what ended up as SharedVariable.
+ * Renamed Env to FunctionGraph, and object attribute "env" to "fgraph" (Ian G.)
+   Deprecation warning printed when you try to access the "env" attribute.
+ * Renamed the FunctionGraph.nodes attribute to FunctionGraph.apply_nodes (Ian G.)
+ * Warn when we don't handle correctly the parameter in Theano flags `nvcc.flags`
+   (Frederic B.)
+ * Do not reorder the user flags passed to the compiler. They get set after other flags. (Frederic B.)
+ * Make setuptools optional (Ilan Schnell)
+ * We warn when a user tries to use an old GPU with which Theano is untested.
+   This could cause crashes and will also be very slow. (Frederic B.)
+ * Make theano.grad able to differentiate between not implemented, undefined and disconnected grad.
+   Op.grad function should return theano.gradient.{grad_not_implemented,grad_undefined} or
+   something of DisconnectedType (Ian G.)
+ * Make theano.grad expect to always receive a float or undefined
+   gradient and enforce that op with integer output values always
+   return 0. (Ian G.)
+New memory output contract (was mentioned in the release notes of Theano 0.5):
+ * Now the output memory received can be preallocated by other stuff.
+   In the past it was always the previous output an Apply node allocated.
+   So this means that the shape and strides can be different from previous calls
+   and there can be links to this memory at other places.
+   This means it could receive preallocated output that is not c_contiguous.
+   But we don't do that now. (Pascal L.)
+ * New Theano flag to test this: DebugMode.check_preallocated_output (Pascal L.)
+ * Updated a few ops to respect this contract (Pascal L.)
+New Features:
+ * GPU scan now works (does not crash) when there is a mixture of float32 and other dtypes.
+ * theano_var.eval({other_var:val[,...]}) to simplify the usage of Theano (Ian G.)
+ * debugprint new param ids=["CHAR", "id", "int", ""]
+   This makes the identifier printed to be a unique char, the Python id, a
+   unique int, or not have it printed. We changed the default to be "CHAR"
+   as this is more readable. (Frederic B.)
+ * debugprint new param stop_on_name=[False, True]. If True, we don't print
+   anything below an intermediate variable that has a name. Defaults to False.
+   (Frederic B.)
+ * debugprint no longer prints the "|" symbol in a column after the last input. (Frederic B.)
+ * If you use Enthought Python Distribution (EPD) now we use its blas
+   implementation by default. (Frederic B., Graham Taylor, Simon McGregor)
+ * MRG random now raises an error with a clear message when the passed shape
+   contains dimensions with bad value like 0. (Frederic B. reported by Ian G.)
+ * "CudaNdarray[*] = ndarray" works in more cases (Frederic B.)
+ * "CudaNdarray[*] += ndarray" works in more cases (Frederic B.)
+ * We add dimensions to CudaNdarray to automatically broadcast more frequently.
+   (Frederic B.)
+ * New Theano flag cmodule.warn_no_version. Default False. If True,
+   will print a warning when compiling one or more Op with C code that
+   can't be cached because there is no c_code_cache_version() function
+   associated to at least one of those Ops. Such Ops have to be
+   recompiled every time. (Frederic B.)
+ * CPU alloc now always generates C code (Pascal L.)
+ * C code reuses preallocated outputs (only done by Scan) (Pascal L.)
+ * Garbage collection of intermediate results during Theano function calls
+   for Ops with C code (Pascal L.)
+ * Theano flag compiledir_format now supports the parameter "numpy_version" and "g++". (Frederic B.)
+ * Theano GPU variables, shared variables and constants now support <, <=,
+   > and >= similar to those not on the GPU.
+ * AdvancedIncSubtensor now supports the set_instead_of_inc parameter. (Eric L.)
+ * Added Advanced Indexing support to inc_subtensor and set_subtensor. (Eric L.)
+ * theano.tensor.{any,all,std,var,mean,prod,sum,argmin,argmax,min,max,max_and_argmax}
+   have a new parameter keepdims (Eric L.)
+   This allows the result to broadcast correctly against the input data, e.g. to normalize it.
+ * The Updates objects now check that the keys are SharedVariable when we pass them
+   in the __init__ function. (Pascal L.)
+ * Set a Theano Variable name on transposed op when the input has one (Frederic B).
+ * The cvm linker now supports garbage collection (enabled by default). (James B., Arnaud B., Pascal L.)
+ * The cvm linker is now the default linker.
+   This runs the "loop" around the execution of apply nodes in C, which lowers the overhead.
+ * theano_variable[numpy.newaxis] is now supported (James B.)
+ * Enable ifelse on the GPU. (Frederic B.)
+ * Correctly support numpy.memmap everywhere (Pascal L.)
+   We had partial support for them before. Just use the normal tensor operation
+   on them and it should work.
+   But be careful not to exhaust your computer memory! (we always generate normal ndarray)
+ * Add an optimization that stabilizes log(softmax(x)). (Ian G.)
+ * Re-enable the Images2Neibs grad. It was not broken, the problem was how we tested it. (Frederic B.)
+ * If `theano_fn.trust_input` is set to True, do not check if the inputs are good
+   when calling the theano function. (Frederic B.)
+ * Add theano.tensor.blas.gem{m,v} as shortcut.
+ * theano.grad(..., add_names=True). False for the old
+   behavior. Otherwise it tries to name the grad variables. (Ian G.)
+ * theano-nose (Pascal L.)
+   A wrapper around nosetests that adds needed extensions.
+   * --profile-time option, to print time spent in each test (Eric L.)
+   * --batch option, to run tests in batches to lower memory requirements.
+ * m = mean(log(1 - sigm(x)))
+   x - scalar * theano.grad(m, x)
+   There is a stabilization optimization for this.
+   Now it is applied more frequently. (Pascal L.)
+New Op/functions:
+ * Added element-wise operation theano.tensor.{GammaLn,Psi} (John Salvatier, Nicolas Bouchard)
+ * Added element-wise operation theano.tensor.{arcsin,arctan,arccosh,arcsinh,arctanh,exp2,arctan2} (Nicolas Bouchard)
+ * Added element-wise operation theano.tensor.{gamma,conj,complex_from_polar,expm1,deg2rad,rad2deg,trunc} (Nicolas Bouchard)
+ * Added theano.tensor.argsort that wraps numpy.argsort (Hani Almousli).
+ * Added theano.tensor.diff that wraps numpy.diff (Nicolas B.)
+ * Added theano.tensor.bincount that wraps numpy.bincount (Nicolas B., Pascal L, Frederic B.)
+ * Added theano.tensor.squeeze (Nicolas B.)
+   This removes broadcasted dimensions from the variable.
+   Theano-esque version of numpy.squeeze.
+ * Added theano.tensor.repeat that wraps numpy.repeat (Nicolas B. + PL)
+ * Added theano.tensor.bartlett that wraps numpy.bartlett (Eric L.)
+ * Added theano.tensor.fill_diagonal that wraps numpy.fill_diagonal (Eric L., Frederic B.)
+ * Added tensor.square that is an alias for tensor.sqr, as in NumPy (Ian G.)
+ * Added theano.tensor.load(path, dtype, broadcastable, mmap_mode=None) op
+   that loads a .npy file in a Theano graph (Matthew Rocklin)
+ * theano.sandbox.linalg.kron.py:Kron op. (Eric L.)
+   Kronecker product
+Speed up:
+ * CPU convolutions are now parallelized (Frederic B.)
+   By default use all cores/hyper-threads.
+   To control it, use the `OMP_NUM_THREADS=N` environment variable where N is the number of
+   parallel threads to use. By default it is equal to the number of CPU cores/hyper
+   threads that you have.
+   There is a new Theano flag `openmp` to allow/disallow openmp op.
+   If your BLAS library is parallelized, this flag won't affect it, but the
+   env variable will.
+ * Remove a corner case causing duplicated dot22/gemm in the graph. (Frederic B., Ian G.)
+ * Enable fusion of elemwise that have the same clients multiple times. (Frederic B.)
+ * New optimization: Remove reduction over broadcastable dimensions (James B., Frederic B.)
+ * Faster theano.function compilation. (Pascal L., Ian G.)
+ * Remove GPU transfer around specify_shape op. (Frederic B.)
+ * Implemented/tested MANY op.infer_shape method (Eric Larsen)
+   This allows Theano to make better shape inference.
+ * Implement Solve.infer_shape (Matthew Rocklin)
+ * Scan memory optimizations now work more frequently. (Razvan P.)
+   There was a warning printed by the subtensor optimization in those cases.
+ * Faster rng_mrg Python code. (mostly used for tests) (Frederic B.)
+Speed up GPU:
+ * Convolution on the GPU now checks the generation of the card to make
+   it faster in some cases (especially medium/big output images) (Frederic B.)
+     * We had hardcoded 512 as the maximum number of threads per block. Newer cards
+       support up to 1024 threads per block.
+ * Faster GpuAdvancedSubtensor1, GpuSubtensor, GpuAlloc (Frederic B.)
+ * We now pass the GPU architecture to nvcc when compiling (Frederic B.)
+ * Now we use the GPU function async feature by default. (Frederic B.)
+   Set the environment variable `CUDA_LAUNCH_BLOCKING` to `1` to disable this
+   for profiling or debugging.
+ * Faster creation of CudaNdarray objects (Frederic B.)
+ * Now some Max reductions are implemented on the GPU. (Ian G.)
+Sparse Sandbox graduate (moved from theano.sparse.sandbox.sp):
+ * sparse.remove0 (Frederic B., Nicolas B.)
+ * sparse.sp_sum(a, axis=None) (Nicolas B.)
+     * bugfix: the not structured grad was returning a structured grad.
+ * sparse.{col_scale,row_scale,ensure_sorted_indices,clean} (Nicolas B.)
+ * sparse.{diag,square_diagonal} (Nicolas B.)
+Sparse:
+ * Support for uint* dtype.
+ * Implement theano.sparse.mul(sparse1, sparse2) when both inputs don't
+   have the same sparsity pattern. (Frederic B.)
+ * New Ops: sparse.{expm1,deg2rad,rad2deg,trunc} (Nicolas B.)
+ * New Ops: sparse.{sqrt,sqr,log1p,floor,ceil,sgn,round_half_to_even} (Nicolas B.)
+ * New Ops: sparse.{arctanh,tanh,arcsinh,sinh,arctan,arcsin,tan,sin} (Nicolas B.)
+ * New functions: structured_{add,exp,log,pow,minimum,maximum,sigmoid} (Yann D., Nicolas B.)
+     * Optimized op: StructuredAddSV, StructuredAddSVCSR (inserted automatically)
+ * New Op: sparse.mul_s_v multiplication of sparse matrix by broadcasted vector (Yann D.)
+ * New Op: sparse.Cast() (Yann D., Nicolas B.)
+     * Add sparse_variable.astype() and theano.sparse.cast() and
+       theano.sparse.{b,w,i,l,f,d,c,z}cast() as their tensor equivalent (Nicolas B.)
+ * Op class: SamplingDot (Yann D., Nicolas B.)
+   * Optimized version: SamplingDotCsr, StructuredDotCSC
+   * Optimizations to insert the optimized version: local_sampling_dot_csr, local_structured_add_s_v
+ * New Ops: sparse.{Multinomial,Poisson,Binomial} (Yann D., NB)
+ * Implement the CSMProperties grad method (Yann Dauphin)
+ * Move optimizations to theano/sparse/opt.py (Nicolas B.)
+New flags:
+ * `profile=True` flag now prints the sum of all printed profiles. (Frederic B.)
+     * It works with the linkers vm/cvm (default).
+     * Also print compile time, optimizer time and linker time.
+     * Also print a summary by op class.
+ * new flag "profile_optimizer" (Frederic B.)
+   when profile=True, will also print the time spent in each optimizer.
+   Useful to find optimization bottleneck.
+ * new flag "cmodule.remove_gxx_opt" (Frederic B.)
+   If True, will remove -O* parameter passed to g++.
+   This is useful to debug, in gdb, modules compiled by Theano.
+   The parameter -g is passed by default to g++.
+ * new flag cmodule.compilation_warning
+   If True, will print compilation warnings.
+ * new flag `allow_gc` (Frederic B.)
+   When False, do not garbage collect intermediate results when they are not needed.
+   This uses more memory, but allocates memory less frequently, so it is faster.
+ * new flag `vm.lazy` (Frederic B.)
+   Useful only for the vm linkers. When lazy is None,
+   auto detect if lazy evaluation is needed and use the appropriate
+   version. If lazy is True/False, force the version used between
+   Loop/LoopGC and Stack.
+ * new flag `cxx`. This is the C++ compiler to use. If empty do not compile C code. (Frederic B.)
+ * New flag `print_active_device` that defaults to True. (Matthew R.)
+Documentation:
+ * Added in the tutorial documentation on how to extend Theano.
+   This explains how to make a Theano Op from a Python function.
+   http://deeplearning.net/software/theano/tutorial/extending_theano.html
+   (Frederic B.)
+ * New installation instructions for Windows using EPD (Pascal L.)
+ * New installation on Windows by using a Linux VM from ContinuumIO (Frederic B.)
+ * Revisions of Theano tutorial and addition of exercises to it. (Eric L.)
+ * New tutorial on Sparse variable. (Nicolas B., Sebastien Lemieux, Frederic Bastien)
+   http://www.deeplearning.net/software/theano/tutorial/sparse.html
+ * Installation documentation for CentOS6 (Frederic B.)
+ * Installation documentation for Ubuntu (with GPU) (Frederic B., Matthias Zoehrer)
+ * Doc typo fixes, Doc updates, Better error messages: Olivier D., David W.F., Frederic B., James B., Matthew Rocklin, Ian G.
+ * Python Memory Management tutorial (Steven Pigeon, Olivier D.)
+Proposal:
+ * Math framework for complex gradients (Pascal L.)
+Internal changes:
+ * Define new exceptions MissingInputError and UnusedInputError, and use them
+   in theano.function, instead of TypeError and ValueError. (Pascal L.)
+ * Better handling of bitwidth and max values of integers and pointers
+   across platforms (Pascal L.)
+ * Made a few Ops with C code versioned to reduce compilation time.
+   (Frederic B, Pascal L.)
+ * Better deletion of files in the compiledir (Frederic B.)
+ * Safer import on sort op (Nicolas Pinto)
+ * hash_from_dict for elemwise op (Frederic B.)
+ * Renamed BadCLinkerOutput into BadThunkOutput. (PL)
+ * tensor.utils.shape_of_variables (Matthew R.)
+ * Add the numpy abi version and g++/nvcc version in the key of compiled code. (Frederic B.)
+ * env.replace_all_validate_remove (Frederic B.)
+   This allows global optimizer to ensure it removed some nodes from the graph.
+   This is a generic way to catch errors that would otherwise duplicate
+   computation.
+   * It was used for GEMM and Scan optimization (Frederic B., Razvan P.)
+ * Fix how exceptions are raised in GPU code (James B.)
+ * Made code respect pep8: OD, Fred, Pascal L., Nicolas Bouchard, Eric Larsen and others.
+ * TensorType and CudaNdarrayType now have a value_zeros method that calls CudaNdarray.zeros or
+   numpy.zeros with the right dtype. (Pascal L., Olivier D.)
+   This allows the same code to work with both types.
+ * Renamed FunctionGraph.extend function to FunctionGraph.attach_feature. (Ian G.)
+ * New exception MissingGXX when we try to compile but there is no cxx compiler. (Frederic B.)
+ * New fct theano.gof.utils.give_variables_names(...) that gives unique names to variables. (Matthew R.)
+ * Use the new NumPy C-API most of the time, for later NumPy releases. (Frederic B.)
+ * New theano.gof.sched.sort_apply_nodes() that will allow other execution ordering. (Matthew R.)
+ * New attribute sort_schedule_fn, a way to specify a scheduler to use. (Matthew R.)
+Crash Fix:
+ * Fix import conflict name (usaar33, Frederic B.)
+    * This makes Theano work with PiCloud.
+ * Do not try to use the BLAS library when blas.ldflags is manually set to an
+   empty string (Frederic B., Pascal L.)
+ * When importing theano on a computer without GPU with the Theano
+   flags 'device' or 'init_gpu_device' set to gpu* (Frederic B., reported by  Luo Heng)
+ * Optimization printed a useless error when scipy was not available. (Frederic B.)
+ * GPU conv crash/slowdown on newer hardware (James B.)
+ * Better error handling in GPU conv (Frederic B.)
+ * GPU optimization that moves element-wise Ops to the GPU. Crash happened in
+   a particular execution order of this optimization and the
+   element-wise fusion optimization when upcasting some inputs to
+   float32 (to compute them on the GPU).
+   (Frederic B., reported by Sander Dieleman)
+ * GpuReshape in some particular case when the input is not contiguous
+   (Frederic B., reported by Sander Dieleman)
+ * GpuSoftmaxWithBias with shape (0, N) with N > 1.
+   (Frederic B., reported by Razvan P.)
+ * Fix crash under 64-bit Windows, when taking subtensors of the form a[n:]
+   (Pascal L., reported by Simon McGregor)
+ * Fixed issue with the MaxAndArgmax Op not properly preserving broadcastable
+   dimensions, which could typically result in optimization crashes (Olivier D.)
+ * Fixed crash when concatenating some arrays with specific broadcasting
+   patterns (Olivier D.)
+ * Work around a known issue with nvcc 4.1 on MacOS X. (Graham Taylor)
+ * In advanced indexing, if some inputs are constant, no need to call constant(...)
+   on their value any more. (Pascal L., reported by John Salvatier)
+ * Fix crash on GPU when the GpuSubtensor didn't put the right stride
+   when the result tensor had a dimension with size of 1. (Pascal L.,
+   reported by Graham T.)
+ * Fix scan crash that made it not run on the GPU in one case. (Guillaume D.)
+ * If you grad again a random state, don't crash (Razvan P.)
+ * GpuDownsampleFactorMax and its grad with inputs dimensions 0 and 1 bigger than 65535.
+   (Frederic B. reported by Gabe Schwartz)
+ * Potential crash due to parallel compilation when importing theano.sandbox.cuda
+   (Olivier D.)
+ * Crash fix on python 2.4 with slicing. (Pascal L.)
+ * grad of argmin and argmax (Razvan P.)
+ * Don't compute the Rop for shared variables with updates (mostly random).
+   We don't use them and they caused crash. (Razvan P.)
+ * MaxArgmax.grad() when one of the gradients it receives is None. (Razvan P, reported by Mark Fenner)
+ * Fix crash of GpuSum when some dimensions shape was 0. (Frederic B.)
+Tests:
+ * Use less memory (Olivier D.) (fix crash on 32-bit computers)
+ * Fix test with Theano flag "blas.ldflags=". (Frederic B., Pascal L.)
+ * Fix crash with advanced subtensor and numpy constant.
+ * Fix random tests crash due to random value. (Pascal L.)
+ * Always introduce Alloc node when calling alloc and let the optimizer remove them if needed.
+   This allows DebugMode to catch some shape error. (Pascal L.)
+ * DebugMode now checks the view_map for all types of Theano variables.
+   It was doing only variables of tensor type. (Frederic B.)
+Others:
+ * Remove a Python warning for some Python versions. (Gabe Schwartz)
+ * Remove useless fill op in fast_compile mode to make the graph more readable. (Frederic B.)
+ * Remove GpuOuter as it is a subset of the new GpuGer (Frederic B.)
+ * Now we use http://travis-ci.org/ to run all CPU tests (without SciPy)
+   with the default mode on all Pull Requests.
+   This should make the trunk more stable. (Frederic B.)
+ * Our nightly buildbot now checks on python 2.4 (Frederic B.)
+   This should make the trunk work on it more frequently.
+Other thanks:
+ * blaxill reported an error introduced into the trunk.
+New stuff that will probably be reworked/removed before the release:
+ * Better PyCUDA sharing of the GPU context. (fix crash at exit) (Frederic B.)
+   TODO: there is still a crash at exit!
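
The "Max reductions ... on the GPU" entry above depends on how each accumulator is seeded: the `_assign_init` helper in the basic_ops.py diff that follows returns the scalar op's identity when it has one, and otherwise the first input element for maximum/minimum. A minimal pure-Python sketch of that idea is given here. It is an illustration only: `reduce_with_init` and its `identity` argument are hypothetical names that are not part of Theano, and the loop is serial rather than a CUDA kernel.

```python
import operator


def reduce_with_init(op, values, identity=None):
    """Serial stand-in for one thread's accumulation loop (illustrative only)."""
    if not values:
        # Assumption of this sketch: empty inputs are simply rejected here.
        raise ValueError("empty input is not handled in this sketch")
    # With an identity (e.g. 0 for add), start from it. Otherwise start from
    # the first element, which is safe for max/min because op(x, x) == x.
    result = identity if identity is not None else values[0]
    for v in values:
        result = op(result, v)
    return result


data = [-3.0, -1.5, -7.0]
total = reduce_with_init(operator.add, data, identity=0.0)
largest = reduce_with_init(max, data)
smallest = reduce_with_init(min, data)
```

Seeding the maximum with 0.0 instead would return 0.0 for this all-negative input, which is why an element of the input is used when the op has no identity.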
--- a/theano/sandbox/cuda/basic_ops.py
+++ b/theano/sandbox/cuda/basic_ops.py
@@ -5,7 +5,7 @@ import sys
 import numpy
 import theano
-from theano import Op, Type, Apply, Variable, Constant
+from theano import Type, Apply
 from theano import tensor, scalar, config
 from theano.compat.six import StringIO
 from theano.scalar import Scalar
@@ -543,7 +543,9 @@ class GpuCAReduce(GpuOp):
                self.scalar_op == other.scalar_op)
    def __hash__(self):
-        return hash(type(self)) ^ hash(self.reduce_mask) ^ hash(type(self.scalar_op))
+        return (hash(type(self)) ^
+                hash(self.reduce_mask) ^
+                hash(type(self.scalar_op)))
    def __str__(self):
        return "GpuCAReduce{%s}{%s}" % (
@@ -565,7 +567,6 @@ class GpuCAReduce(GpuOp):
    def perform(self, node, inp, out):
        x, = inp
        z, = out
-        self._op_guard()
        # reduce_max is declared but does nothing but
        # raise NotImplementedError.
        # We can't call it here anyway because it hasn't
@@ -599,7 +600,7 @@ class GpuCAReduce(GpuOp):
        inp = ['fake_input_name_%d' % i for i in xrange(len(inputs))]
        out = ['fake_output_name_%d' % i for i in xrange(len(node.outputs))]
-        sub = { 'fail' : 'fake failure code' }
+        sub = {'fail': 'fake failure code'}
        try:
            self.c_code(node, name, inp, out, sub)
@@ -634,7 +635,8 @@ class GpuCAReduce(GpuOp):
        # but tensor.elemwise.CAReduce has this exact same check so I guess
        # this is OK to do
        if self.scalar_op in [scal.minimum, scal.maximum]:
-            conds = ["(CudaNdarray_HOST_DIMS(%s)[%d] == 0)" % (x, i) for i in xrange(nd_in) \
+            conds = ["(CudaNdarray_HOST_DIMS(%s)[%d] == 0)" % (x, i)
+                     for i in xrange(nd_in)
                     if self.reduce_mask[i]]
            assert len(conds) > 0
            cond = "(" + " || ".join(conds) + ")"
@@ -691,9 +693,18 @@ class GpuCAReduce(GpuOp):
        # \begin bracket the reduction in a check that there is
        # actually work to do
+        if getattr(self.scalar_op, 'identity', None) == 0:
+            zero_shp = "cudaMemset(%(z)s->devdata, 0, CudaNdarray_SIZE(%(z)s) * sizeof(float))" % locals()
+        #TODO: elif getattr(self.scalar_op, 'identity', None) == 1:
+        else:
+            zero_shp = """
+            PyErr_Format(PyExc_NotImplementedError,
+                         "GpuCAReduce not implemented when input shape is 0 for this scalar_op");
+            %(fail)s;
+            """ % locals()
        print >> sio, """
        if (CudaNdarray_SIZE(%(z)s) && ! CudaNdarray_SIZE(%(x)s)){
-            cudaMemset(%(z)s->devdata, 0, CudaNdarray_SIZE(%(z)s) * sizeof(float));
+            %(zero_shp)s;
        }
        else if (CudaNdarray_SIZE(%(z)s))
        {
@@ -710,10 +721,12 @@ class GpuCAReduce(GpuOp):
            print >> sio, 'if(CudaNdarray_is_c_contiguous(%(x)s)){'%locals()
            self.c_code_reduce_ccontig(sio, node, name, x, z, fail)
            print >> sio, "}else{"
-            getattr(self, 'c_code_reduce_%s'%(''.join(str(i) for i in self.reduce_mask)))(sio, node, name, x, z, fail)
+            getattr(self, 'c_code_reduce_%s'%(''.join(
+                str(i) for i in self.reduce_mask)))(sio, node, name, x, z, fail)
            print >> sio, "}"
        else:
-            getattr(self, 'c_code_reduce_%s'%(''.join(str(i) for i in self.reduce_mask)))(sio, node, name, x, z, fail)
+            getattr(self, 'c_code_reduce_%s'%(''.join(
+                str(i) for i in self.reduce_mask)))(sio, node, name, x, z, fail)
        # \end bracket the reduction ...
        print >> sio, """
@@ -886,6 +899,22 @@ class GpuCAReduce(GpuOp):
        """
+    def _assign_init(self, first_item):
+        """
+        Return the initial value for myresult.
+        If the scalar op has an identity value, return it.
+        Otherwise, check that the scalar op is maximum or minimum
+        and return first_item. It should be the first element of the reduction.
+        As max(x, x) == min(x, x) == x, reusing that element is safe.
+        """
+        if hasattr(self.scalar_op, 'identity'):
+            return str(self.scalar_op.identity)
+        else:
+            assert isinstance(self.scalar_op, (scal.Maximum,
+                                               scal.Minimum))
+            return first_item
    def _assign_reduce(self, node, name, left, right, sub):
        """
            node: the node argument to this op's c_code
@@ -897,17 +926,16 @@ class GpuCAReduce(GpuOp):
            returns C code to reduce left and right, assigning the
            result to left."""
-        x ,= node.inputs
+        x, = node.inputs
        dtype = x.dtype
+        dummy_left = scal.Scalar(dtype=dtype)()
-        dummy_left = scal.Scalar(dtype = dtype)()
+        dummy_right = scal.Scalar(dtype=dtype)()
-        dummy_right = scal.Scalar(dtype = dtype)()
        dummy_node = self.scalar_op.make_node(dummy_left, dummy_right)
-        dummy_name = name + '_scalar_op'+ str(self._n_scalar_op_calls)
+        dummy_name = name + '_scalar_op' + str(self._n_scalar_op_calls)
        self._n_scalar_op_calls += 1
        return self.scalar_op.c_code(dummy_node, dummy_name, (left, right),
@@ -954,7 +982,8 @@ class GpuCAReduce(GpuOp):
              float temp = buf[threadNum + halfPoint];
              """
-        new_version += self._assign_reduce(node, name, 'buf[threadNum]', 'temp', sub)
+        new_version += self._assign_reduce(node, name,
+                                           'buf[threadNum]', 'temp', sub)
        new_version += """
            }
@@ -984,7 +1013,8 @@ class GpuCAReduce(GpuOp):
            for (int i = threadNum + warpSize; i < threadCount; i += warpSize)
            {
                """
-        current_version += self._assign_reduce(node, name, 'myresult', 'buf[i]', sub) + """
+        current_version += self._assign_reduce(node, name,
+                                               'myresult', 'buf[i]', sub) + """
            }
            buf[threadNum] = myresult;
        /*Comment this optimization as it don't work on Fermi GPU.
@@ -992,9 +1022,11 @@ class GpuCAReduce(GpuOp):
            // no sync because only one warp is running
            if(threadCount >32)
            {"""
-        for num in [16,8,4,2,1]:
+        for num in [16, 8, 4, 2, 1]:
-            current_version += self._assign_reduce(node, name, 'buf[threadNum]',
+            current_version += self._assign_reduce(node, name,
-                    'buf[threadNum+%d]' % num, sub)
+                                                   'buf[threadNum]',
+                                                   'buf[threadNum+%d]' % num,
+                                                   sub)
        current_version += """
                if (threadNum == 0)
                {
@@ -1007,9 +1039,11 @@ class GpuCAReduce(GpuOp):
            {
                //reduce so that threadNum 0 has the reduction of everything
                """
-        for num in [16,8,4,2,1]:
+        for num in [16, 8, 4, 2, 1]:
            this_if = "if (threadNum + %d < threadCount) " % num + \
-                self._assign_reduce(node, name, 'buf[threadNum]','buf[threadNum+%d]' % num, sub)
+                self._assign_reduce(node, name,
+                                    'buf[threadNum]','buf[threadNum+%d]' % num,
+                                    sub)
            current_version += this_if
        current_version += """
                if (threadNum == 0)
@@ -1026,8 +1060,8 @@ class GpuCAReduce(GpuOp):
    #Threads must be organized as: threadNum%nb_reduce correspond to the same sum
    #nb_reduce<=warpSize
-    def _k_reduce_buf_multiple(self, z_pos, nb_reduce):
+    def _k_reduce_buf_multiple(self, z_pos, node, name, nb_reduce):
-        self._op_guard()
+        reduce_fct = self._assign_reduce(node, name, 'myresult', 'buf[i]', {})
        return """
        __syncthreads(); // some kernel do multiple reduction.
        buf[threadNum] = myresult;
@@ -1039,7 +1073,7 @@ class GpuCAReduce(GpuOp):
            //round up all the partial sums into the first `nb_reduce` elements
            for (int i = threadNum + %(nb_reduce)s; i < threadCount; i += %(nb_reduce)s)
            {
-                myresult += buf[i];
+                %(reduce_fct)s;
            }
            %(z_pos)s = myresult;
        }
@@ -1052,11 +1086,20 @@ class GpuCAReduce(GpuOp):
        is for the case where we are reducing on all axes and x is
        C contiguous.
        """
-        self._op_guard()
+        if getattr(self.scalar_op, 'identity', None) == 0:
+            zero_shp = "cudaMemset(%(z)s->devdata, 0, CudaNdarray_SIZE(%(z)s) * sizeof(float))" % locals()
+        #TODO: elif getattr(self.scalar_op, 'identity', None) == 1:
+        else:
+            zero_shp = """
+            PyErr_Format(PyExc_NotImplementedError,
+                         "GpuCAReduce not implemented when input shape is 0 for this scalar_op");
+            %(fail)s;
+            """ % locals()
        print >> sio, """
        {
          if(CudaNdarray_SIZE(%(x)s)==0){
-            cudaMemset(CudaNdarray_DEV_DATA(%(z)s),0,sizeof(float));
+            %(zero_shp)s;
          }else{
            int verbose = 0;
            dim3 n_threads(
@@ -1092,7 +1135,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_1(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1106,7 +1148,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_11(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1189,10 +1230,6 @@ class GpuCAReduce(GpuOp):
        self.c_code_reduce_01X(sio, node, name, x, z, fail, 3)
    def c_code_reduce_10(self, sio, node, name, x, z, fail):
-        if not isinstance(self.scalar_op, (scal.Add,
-                                           scal.Maximum,
-                                           scal.Minimum)):
-            raise NotImplementedError()
        print >> sio, """
        {
            int verbose = 0;
@@ -1242,7 +1279,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_010(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        makecall_inner = self._makecall(node, name, x, z, fail,
                                        pattern="010_inner")
@@ -1365,7 +1401,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_0101(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1385,7 +1420,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_100(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        # use threadIdx.x for i0
        # use blockIdx.x for i1
@@ -1406,7 +1440,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_110(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1428,7 +1461,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_001(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1451,7 +1483,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_111(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1484,7 +1515,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_0011(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1515,7 +1545,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_1111(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1548,7 +1577,6 @@ class GpuCAReduce(GpuOp):
        """ % locals()
    def c_code_reduce_1011(self, sio, node, name, x, z, fail):
-        self._op_guard()
        makecall = self._makecall(node, name, x, z, fail)
        print >> sio, """
        {
@@ -1587,19 +1615,17 @@ class GpuCAReduce(GpuOp):
        else:
            return ()
-    def _op_guard(self):
-        """ Raises NotImplementedError if op is not Add """
-        if not isinstance(self.scalar_op, theano.scalar.basic.Add):
-            raise NotImplementedError()
    def c_support_code_apply(self, node, nodename):
        sio = StringIO()
        nd_in = len(self.reduce_mask)
        if all(i == 1 for i in self.reduce_mask):
-            self._op_guard()
            #this kernel is ok for up to a few thousand elements, but
            # it only runs on ONE multiprocessor
-            reducebuf = self._k_reduce_buf('Z[0]', node, nodename, sub = {})
+            reducebuf = self._k_reduce_buf('Z[0]', node, nodename, sub={})
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0]",
+                                             {})
+            reduce_init = self._assign_init("A[0]")
            print >> sio, """
            static __global__ void kernel_reduce_ccontig_%(nodename)s(
                    const unsigned int d0,
@@ -1609,7 +1635,7 @@ class GpuCAReduce(GpuOp):
                const int threadCount = blockDim.x;
                const int threadNum = threadIdx.x;
                extern __shared__ float buf[];
-                float myresult = 0.0f;
+                float myresult = %(reduce_init)s;
                if (warpSize != 32)
                {
@@ -1618,16 +1644,19 @@ class GpuCAReduce(GpuOp):
                for (int i0 = threadIdx.x; i0 < d0; i0 += blockDim.x)
                {
-                    myresult += A[i0];
+                    %(reduce_fct)s
                }
                %(reducebuf)s
            }
            """ % locals()
        if self.reduce_mask == (1,):
-            self._op_guard()
            #this kernel is ok for up to a few thousand elements, but
            # it only runs on ONE multiprocessor
-            reducebuf = self._k_reduce_buf('Z[0]', node, nodename, sub = {})
+            reducebuf = self._k_reduce_buf('Z[0]', node, nodename, sub={})
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0]",
+                                             {})
+            reduce_init = self._assign_init("A[0]")
            print >> sio, """
            static __global__ void kernel_reduce_1_%(nodename)s(
                    const unsigned int d0,
@@ -1637,7 +1666,7 @@ class GpuCAReduce(GpuOp):
                const int threadCount = blockDim.x;
                const int threadNum = threadIdx.x;
                extern __shared__ float buf[];
-                float myresult = 0.0f;
+                float myresult = %(reduce_init)s;
                if (warpSize != 32)
                {
@@ -1646,17 +1675,20 @@ class GpuCAReduce(GpuOp):
                for (int i0 = threadIdx.x; i0 < d0; i0 += blockDim.x)
                {
-                    float Ai = A[i0 * sA0];
+                    %(reduce_fct)s
-                    myresult += Ai;
                }
                %(reducebuf)s
            }
            """ % locals()
        if self.reduce_mask == (1, 1):
-            self._op_guard()
            #this kernel is ok for up to a few thousand elements, but
            # it only runs on ONE multiprocessor
-            reducebuf = self._k_reduce_buf('Z[0]', node, nodename, sub = {})
+            reducebuf = self._k_reduce_buf('Z[0]', node, nodename, sub={})
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1]",
+                                             {})
+            reduce_init = self._assign_init("A[0]")
            print >> sio, """
            static __global__ void kernel_reduce_11_%(nodename)s(
                    const int d0,
@@ -1667,7 +1699,7 @@ class GpuCAReduce(GpuOp):
                const int threadCount = blockDim.x * blockDim.y;
                const int threadNum = threadIdx.y*blockDim.x + threadIdx.x;
                extern __shared__ float buf[];
-                float myresult = 0.0f;
+                float myresult = %(reduce_init)s;
                if (warpSize != 32)
                {
@@ -1678,8 +1710,7 @@ class GpuCAReduce(GpuOp):
                {
                    for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)
                    {
-                        float Ai = A[i0 * sA0 + i1 * sA1];
+                        %(reduce_fct)s;
-                        myresult += Ai;
                    }
                }
                %(reducebuf)s
@@ -1727,28 +1758,15 @@ class GpuCAReduce(GpuOp):
                first_i3 = 'threadIdx.x'
                sA3 = 'sA3'
-            reducebuf = self._k_reduce_buf('Z[i0 * sZ0]', node, nodename, sub = {})
+            reducebuf = self._k_reduce_buf('Z[i0 * sZ0]', node,
+                                           nodename, sub={})
            param_dim = ",".join(["const int d%d" % i
                                  for i in xrange(nd_in)])
            param_strides = ",".join(["const int sA%d" % i
                                      for i in xrange(nd_in)])
            decl = self._k_decl(node, nodename)
            init = self._k_init(node, nodename)
-            # TODO: ideally this would all be some clean function of scalar_op,
+            reduce_init = self._assign_init("A[%(first_i3)s * %(sA3)s + %(first_i2)s * %(sA2)s + %(first_i1)s * %(sA1)s + i0 * sA0]" % locals())
-            # but since sum is a special case where it's OK to reduce with an
-            # extra 0, I would need to change the behavior of the sum reduction
-            # code to do that. I don't want to benchmark and test changes to the
-            # sum code so I will leave that for later.
-            # max/min reduction is also a special case that is simple to implement.
-            # this is the special case where reduction is idempotent so it doesn't
-            # matter if we reduce with the first element multiple times.
-            if isinstance(self.scalar_op, (scal.Add, scal.Maximum, scal.Minimum)):
-                # special cased max/min code (special case because visits first
-                # member of each row twice)
-                if isinstance(self.scalar_op, scal.Add):
-                    reduce_init = "0.f;"
-                else:
-                    reduce_init = "A[%(first_i3)s * %(sA3)s + %(first_i2)s * %(sA2)s + %(first_i1)s * %(sA1)s + i0 * sA0];" % locals()
            reduce_fct = self._assign_reduce(
                node, nodename, "myresult",
                "A[i3 * sA3 + i2 * sA2 + i1 * sA1 + i0 * sA0]",
@@ -1769,22 +1787,7 @@ class GpuCAReduce(GpuOp):
                    }
                }
                """ % locals()
-            else:
-                # TODO: implement general case and get rid of the two special
-                # cases above
-                # it should initialize myresult to element 0,
-                # and the for loop should begin traversing from element 1
-                # raise an error if asked to reduce an empty dimension
-                # (maybe special-case sum to return 0 instead of returning an
-                # error)
-                # in both cases, benchmark the general case against the existing
-                # code to make sure it does not cause a slowdown
-                raise NotImplementedError()
        if self.reduce_mask == (0, 1, 0) or self.reduce_mask == (1, 0):
-            if not isinstance(self.scalar_op, (scal.Add,
-                                               scal.Maximum,
-                                               scal.Minimum)):
-                raise NotImplementedError()
            # this kernel uses one block for each column,
            # threads per block for each element per column.
@@ -1796,10 +1799,7 @@ class GpuCAReduce(GpuOp):
            reduce_fct = self._assign_reduce(node, nodename, "myresult",
                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2]",
                                             {})
-            if isinstance(self.scalar_op, scal.Add):
+            reduce_init = self._assign_init("A[i0 * sA0 + threadIdx.x * sA1 + i2 * sA2]")
-                reduce_init = "0.f;"
-            else:
-                reduce_init = "A[i0 * sA0 + threadIdx.x * sA1 + i2 * sA2];"
            print >> sio, """
            static __global__ void kernel_reduce_010_%(nodename)s(
                    const int d0,
@@ -1835,7 +1835,10 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (0, 1, 0):
-            self._op_guard()
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "X[a * sX0 + b * sX1 + c * sX2]",
+                                             {})
+            reduce_init = self._assign_init("X[a * sX0 + 0 * sX1 + c * sX2]")
            print >> sio, """
            static __global__ void kernel_reduce_010_AD_%(nodename)s(
                    const int A,
@@ -1863,10 +1866,10 @@ class GpuCAReduce(GpuOp):
                        int c = i2_D * 32 + threadIdx.x;
                        if (c < C)
                        {
-                            myresult = 0;
+                            myresult = %(reduce_init)s;
                            for (int b = 0; b < B; ++b)
                            {
-                                myresult += X[a * sX0 + b * sX1 + c * sX2];
+                                %(reduce_fct)s;
                            }
                            Z[a * sZ0 + c * sZ1] = myresult;
                        }
@@ -1876,7 +1879,6 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (0, 1, 0):
-            self._op_guard()
            #
            # This kernel is optimized when the inner most dimensions
            # have the smallest stride.
@@ -1891,9 +1893,12 @@ class GpuCAReduce(GpuOp):
            init = self._k_init(node, nodename)
            decl = self._k_decl(node, nodename, pattern="010_inner")
            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
+                                                    node, nodename,
                                                    'blockDim.x')
-            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
-                                                    'blockDim.x')
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2]",
+                                             {})
+            reduce_init = self._assign_init("A[i0 * sA0 + 0 * sA1 + i2 * sA2]")
            print >> sio, """
            %(decl)s
            {
@@ -1908,9 +1913,10 @@ class GpuCAReduce(GpuOp):
              {
                for (int i2 = blockIdx.y*blockDim.x+threadIdx.x; i2 < d2; i2 += gridDim.y*blockDim.x)
                 {
+                  myresult = %(reduce_init)s;
                  for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
                  {
-                      myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
+                      %(reduce_fct)s;
                  }
                  %(reducebuf)s
                 }
@@ -1918,7 +1924,6 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (1, 1, 0):
-            self._op_guard()
            # this kernel uses one block for each column,
            # threads per block for each element per column.
@@ -1926,6 +1931,10 @@ class GpuCAReduce(GpuOp):
            #      c_contiguous (typical case) then each warp is accessing non-contiguous
            #      memory (a segment of a column).
            reducebuf = self._k_reduce_buf('Z[blockIdx.x * sZ0]', node, nodename, sub = {})
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + blockIdx.x * sA2]",
+                                             {})
+            reduce_init = self._assign_init("A[blockIdx.x * sA2]")
            print >> sio, """
            static __global__ void kernel_reduce_110_%(nodename)s(
                    const int d0,
@@ -1938,7 +1947,7 @@ class GpuCAReduce(GpuOp):
                const int threadCount = blockDim.x * blockDim.y;
                const int threadNum = threadIdx.y * blockDim.x + threadIdx.x;
                extern __shared__ float buf[];
-                float myresult = 0.0f;
+                float myresult = %(reduce_init)s;
                if (warpSize != 32)
                {
@@ -1951,8 +1960,7 @@ class GpuCAReduce(GpuOp):
                {
                    for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)
                    {
-                        float Ai = A[i0 * sA0 + i1 * sA1 + blockIdx.x * sA2];
+                        %(reduce_fct)s;
-                        myresult += Ai;
                    }
                }
@@ -1960,11 +1968,14 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (1, 0, 0):
-            self._op_guard()
            reducebuf = self._k_reduce_buf('Z[i1 * sZ0 + i2 * sZ1]',
                                           node, nodename, sub={})
            decl = self._k_decl(node, nodename)
            init = self._k_init(node, nodename)
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2]",
+                                             {})
+            reduce_init = self._assign_init("A[i1 * sA1 + i2 * sA2]")
            print >> sio, """
            %(decl)s
            {
@@ -1973,10 +1984,10 @@ class GpuCAReduce(GpuOp):
                {
                    for (int i1 = blockIdx.x; i1 < d1; i1 += gridDim.x)
                    {
-                        myresult = 0;
+                        myresult = %(reduce_init)s;
                        for (int i0 = threadIdx.x; i0 < d0; i0 += blockDim.x)
                        {
-                            myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
+                            %(reduce_fct)s
                        }
                        %(reducebuf)s
                    }
@@ -1984,23 +1995,26 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (1, 1, 1):
-            self._op_guard()
            reducebuf = self._k_reduce_buf('Z[0]', node,
                                           nodename, sub={})
            decl = self._k_decl(node, nodename)
            init = self._k_init(node, nodename)
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2]",
+                                             {})
+            reduce_init = self._assign_init("A[0]")
            print >> sio, """
            %(decl)s
            {
                %(init)s
-                myresult = 0;
+                myresult = %(reduce_init)s;
                for (int i0 = threadIdx.z; i0 < d0; i0 += blockDim.z)
                {
                    for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
                    {
                        for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x)
                        {
-                            myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
+                            %(reduce_fct)s;
                        }
                    }
                }
@@ -2008,11 +2022,14 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (0, 0, 1):
-            self._op_guard()
            # this kernel uses one block for each row,
            # threads per block for each element per row.
            reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]',
-                    node, nodename, sub = {})
+                                           node, nodename, sub={})
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2]",
+                                             {})
+            reduce_init = self._assign_init("A[i0 * sA0 + i1 * sA1]")
            print >> sio, """
            static __global__ void kernel_reduce_001_%(nodename)s(
                    const int d0,
@@ -2035,10 +2052,10 @@ class GpuCAReduce(GpuOp):
                {
                    for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y)
                    {
-                        float myresult = 0.0f;
+                        float myresult = %(reduce_init)s;
                        for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x)
                        {
-                            myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2];
+                            %(reduce_fct)s;
                        }
                        %(reducebuf)s
                    }
@@ -2046,13 +2063,16 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (0, 0, 1, 1):
-            self._op_guard()
            # this kernel uses one block for each row,
            # threads per block for each element per row.
            reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]',
-                    node, nodename, sub = {})
+                                           node, nodename, sub={})
            decl = self._k_decl(node, nodename)
            init = self._k_init(node, nodename)
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3]",
+                                             {})
+            reduce_init = self._assign_init("A[i0 * sA0 + i1 * sA1]")
            print >> sio, """
            %(decl)s
            {
@@ -2062,12 +2082,12 @@ class GpuCAReduce(GpuOp):
                {
                    for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y)
                    {
-                        float myresult = 0.0f;
+                        float myresult = %(reduce_init)s;
                    for (int i2 = threadIdx.y; i2 < d2; i2 += blockDim.y)
                    {
                        for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)
                        {
-                            myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3];
+                            %(reduce_fct)s;
                        }
                    }
                        %(reducebuf)s
@@ -2076,13 +2096,16 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (0, 1, 0, 1):
-            self._op_guard()
            # this kernel uses one block for each row,
            # threads per block for each element per row.
            reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]',
-                    node, nodename, sub = {})
+                                           node, nodename, sub={})
            decl = self._k_decl(node, nodename)
            init = self._k_init(node, nodename)
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3]",
+                                             {})
+            reduce_init = self._assign_init("A[i0 * sA0 + i2 * sA2]")
            print >> sio, """
            %(decl)s
            {
@@ -2092,12 +2115,12 @@ class GpuCAReduce(GpuOp):
                {
                    for (int i2 = blockIdx.y; i2 < d2; i2 += gridDim.y)
                    {
-                        float myresult = 0.0f;
+                        float myresult = %(reduce_init)s;
                    for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)
                    {
                        for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)
                        {
-                            myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3];
+                            %(reduce_fct)s;
                        }
                    }
                        %(reducebuf)s
@@ -2106,16 +2129,19 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (1, 1, 1, 1):
-            self._op_guard()
            reducebuf = self._k_reduce_buf('Z[0]', node, nodename,
-                    sub = {})
+                                           sub={})
            decl = self._k_decl(node, nodename)
            init = self._k_init(node, nodename)
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3]",
+                                             {})
+            reduce_init = self._assign_init("A[0]")
            print >> sio, """
            %(decl)s
            {
                %(init)s
-                myresult = 0;
+                myresult = %(reduce_init)s;
              for (int i0 = 0; i0 < d0; i0++)
                for (int i1 = threadIdx.z; i1 < d1; i1 += blockDim.z)
                {
@@ -2123,7 +2149,7 @@ class GpuCAReduce(GpuOp):
                    {
                        for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)
                        {
-                            myresult += A[i0 * sA0 + i1 * sA1 + i2 * sA2 + i3 * sA3];
+                            %(reduce_fct)s;
                        }
                    }
                }
@@ -2131,9 +2157,12 @@ class GpuCAReduce(GpuOp):
            }
            """ % locals()
        if self.reduce_mask == (1, 0, 1, 1):
-            self._op_guard()
            reducebuf = self._k_reduce_buf('Z[blockIdx.x*sZ0]',
-                    node, nodename, sub = {})
+                                           node, nodename, sub={})
+            reduce_fct = self._assign_reduce(node, nodename, "myresult",
+                                             "A[i0 * sA0 + blockIdx.x * sA1 + i2 * sA2 + i3 * sA3]",
+                                             {})
+            reduce_init = self._assign_init("A[blockIdx.x * sA1]")
            print >> sio, """
            static __global__ void kernel_reduce_1011_%(nodename)s(
                    const unsigned int d0,
@@ -2147,7 +2176,7 @@ class GpuCAReduce(GpuOp):
                const int threadCount = blockDim.x * blockDim.y * blockDim.z;
                const int threadNum = threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
                extern __shared__ float buf[];
-                float myresult = 0.0f;
+                float myresult = %(reduce_init)s;
                if (warpSize != 32)
                {
@@ -2160,8 +2189,7 @@ class GpuCAReduce(GpuOp):
                    {
                        for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)
                        {
-                            float Ai = A[i0 * sA0 + blockIdx.x * sA1 + i2 * sA2 + i3 * sA3];
+                            %(reduce_fct)s;
-                            myresult += Ai;
                        }
                    }
                }
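
Review note: the hunks above replace the hard-coded `myresult = 0` / `myresult += ...` pair with two templated strings, `reduce_init` and `reduce_fct`. The bodies of `_assign_init` and `_assign_reduce` are not shown in this part of the diff, so the following is only a minimal sketch of the idea under stated assumptions. The helper names, the template table and the rule "use the identity when the op has one, otherwise start from the first element" are all illustrative assumptions, not the actual Theano implementation.

```python
# Hypothetical stand-ins for _assign_init / _assign_reduce.
# They only mimic the string-templating idea used in the diff.

# Assumed C statement per scalar op (illustrative, not Theano's c_code).
REDUCE_TEMPLATES = {
    "add": "%(acc)s = %(acc)s + %(val)s;",
    "mul": "%(acc)s = %(acc)s * %(val)s;",
    "maximum": "%(acc)s = max(%(acc)s, %(val)s);",
    "minimum": "%(acc)s = min(%(acc)s, %(val)s);",
}

# Assumed identities: ops listed here can start from a neutral value.
IDENTITIES = {"add": "0.0f", "mul": "1.0f"}


def assign_init(op_name, first_item):
    """Return the C expression that initialises the accumulator.

    Ops with an identity start from it. Idempotent ops (max/min) start
    from the first element, which may then be visited twice harmlessly.
    """
    return IDENTITIES.get(op_name, first_item)


def assign_reduce(op_name, acc, val):
    """Return the C statement that folds `val` into `acc`."""
    return REDUCE_TEMPLATES[op_name] % {"acc": acc, "val": val}


def render_loop(op_name):
    """Render the inner loop of a 1-d reduction kernel as a string."""
    reduce_init = assign_init(op_name, "A[0]")
    reduce_fct = assign_reduce(op_name, "myresult", "A[i0 * sA0]")
    return """
float myresult = %(reduce_init)s;
for (int i0 = threadIdx.x; i0 < d0; i0 += blockDim.x)
{
    %(reduce_fct)s
}
""" % {"reduce_init": reduce_init, "reduce_fct": reduce_fct}
```

Starting every thread from the first element is only safe for idempotent ops, because the strided loop reads that element again. For add or mul it would count the element more than once, which is why the sketch falls back to an identity value there.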

--- a/theano/sandbox/cuda/opt.py
+++ b/theano/sandbox/cuda/opt.py
@@ -602,7 +602,7 @@ def local_gpu_careduce(node):
        scalar_op = node.op.scalar_op
        # currently, only these two ops are supported at all,
        # and max does not support all combinations of axes
-        if node.op.scalar_op in [scal.add, scal.maximum, scal.minimum]:
+        if node.op.scalar_op in [scal.add, scal.mul, scal.maximum, scal.minimum]:
            x, = node.inputs
            if x.owner and x.owner.op == host_from_gpu:
                if node.op.axis is None:

--- a/theano/sandbox/cuda/tests/test_basic_ops.py
+++ b/theano/sandbox/cuda/tests/test_basic_ops.py
@@ -65,9 +65,16 @@ def test_careduce():
    TODO: test with broadcast
    """
    for scalar_op, careduce_op in [
+            (theano.scalar.mul, tensor.elemwise.CAReduceDtype),
            (theano.scalar.add, tensor.elemwise.CAReduceDtype),
            (theano.scalar.maximum, tensor.CAReduce),
-            (theano.scalar.minimum, tensor.CAReduce)]:
+            (theano.scalar.minimum, tensor.CAReduce)
+            # The following 2 cases could work if scalar_op.c_code worked with float* dtypes.
+            # Currently they fail with this compile error:
+            # error: invalid operands of types 'npy_float32' and 'npy_float32' to binary 'operator&'
+            #(theano.scalar.and_, tensor.elemwise.CAReduce),
+            #(theano.scalar.or_, tensor.elemwise.CAReduce),
+    ]:
        for shape, pattern in [((1,1),(1,)),
                               ((1,0),(1,)),
                               ((0,1),(1,)),
@@ -124,11 +131,6 @@ def test_careduce():
            op = careduce_op(scalar_op, axis=pattern)
            pat = tensor_pattern_to_gpu_pattern(shape, pattern)
-            #GpuCAReduce{maximum/minimum} support only those patterns
-            if scalar_op in [theano.scalar.maximum,
-                             theano.scalar.minimum] and pat not in [
-                                 (0, 1), (0, 1, 1), (0, 1, 1), (1, 0)]:
-                continue
            a = tensor.TensorType('float32', (False,) * len(shape))()
            b = op(a)
@@ -139,15 +141,22 @@ def test_careduce():
            f = theano.function([a], b, mode=mode_with_gpu)
            f2 = theano.function([a], b, mode=mode_without_gpu)
            assert tcn.GpuCAReduce in [x.op.__class__
-                                       for x in f.maker.fgraph.toposort()]
+                                       for x in f.maker.fgraph.toposort()], (
+                                           scalar_op, shape, pattern)
            assert op.__class__ in [x.op.__class__
-                                    for x in f2.maker.fgraph.toposort()]
+                                    for x in f2.maker.fgraph.toposort()], (
+                                           scalar_op, shape, pattern)
            f_caused_value_error = False
            try:
                f_out = f(val)
            except ValueError, e:
                exc = e
                f_caused_value_error = True
+            except NotImplementedError:
+                if (numpy.prod(shape) == 0 and
+                    getattr(scalar_op, 'identity', None) != 0):
+                    continue
+                raise
            f2_caused_value_error = False
            try:
@@ -179,6 +188,7 @@ def test_careduce():
                theano.tensor.basic.float32_rtol = 2e-5
                assert _allclose(f_out, f2_out), ('shape', shape,
                                                    'pattern', pattern,
+                                                    scalar_op,
                                                    sum([shape[i] for i in pattern]),
                                                    f2(val), f(val), val)
            finally:
@@ -193,11 +203,6 @@ def test_careduce():
                               ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3])]:
            op = careduce_op(scalar_op, axis=pattern)
            pat = tensor_pattern_to_gpu_pattern(shape, pattern)
-            #GpuCAReduce{maximum/minimum} support only those patterns
-            if scalar_op in [theano.scalar.maximum,
-                             theano.scalar.minimum] and pat not in [
-                                 (0, 1), (0, 1, 1), (0, 1, 1), (1, 0)]:
-                continue
            a = tensor.TensorType('float32', (False,) * len(shape))()
            dim_pattern = range(len(shape))
@@ -212,11 +217,14 @@ def test_careduce():
            f = theano.function([a], b, mode=mode_with_gpu)
            f2 = theano.function([a], b, mode=mode_without_gpu)
            assert tcn.GpuCAReduce in [x.op.__class__
-                                       for x in f.maker.fgraph.toposort()]
+                                       for x in f.maker.fgraph.toposort()], (
+                                           scalar_op, shape, pattern)
            assert op.__class__ in [x.op.__class__
-                                    for x in f2.maker.fgraph.toposort()]
+                                    for x in f2.maker.fgraph.toposort()], (
+                                           scalar_op, shape, pattern)
            assert _allclose(f2(val), f(val)), ('shape', shape,
                                                'pattern', pattern,
+                                                scalar_op,
                                                sum([shape[i] for i in pattern]))
            #test with broadcast
@@ -227,11 +235,6 @@ def test_careduce():
                               ((5,4,3,2),[0,1,2,3]), ((5,4,3,2),[0,2,3])]:
            op = careduce_op(scalar_op, axis=pattern)
            pat = tensor_pattern_to_gpu_pattern(shape, pattern)
-            #GpuCAReduce{maximum/minimum} support only those patterns
-            if scalar_op in [theano.scalar.maximum,
-                             theano.scalar.minimum] and pat not in [
-                                 (0, 1), (0, 1, 1), (0, 1, 1), (1, 0)]:
-                continue
            shape = numpy.asarray(shape) * 2
            a = tensor.TensorType('float32', (False,) * len(shape))()
@@ -258,9 +261,11 @@ def test_careduce():
            f = theano.function([a], b, mode=mode_without_gpu)
            f2 = theano.function([a2], b2, mode=mode_with_gpu)
            assert tcn.GpuCAReduce in [x.op.__class__
-                                       for x in f2.maker.fgraph.toposort()]
+                                       for x in f2.maker.fgraph.toposort()], (
+                                           scalar_op, shape, pattern)
            assert op.__class__ in [x.op.__class__
-                                    for x in f.maker.fgraph.toposort()]
+                                    for x in f.maker.fgraph.toposort()], (
+                                           scalar_op, shape, pattern)
            assert _allclose(f2(val2), f(val)), ('shape', shape,
                                                 'pattern', pattern,
                                                 sum([shape[i] for i in pattern]))
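
Review note: the new `except NotImplementedError` branch in `test_careduce` tolerates the error only when the input is empty and the op's `identity` attribute is not 0. An op with no `identity` attribute yields `None` from `getattr`, which also differs from 0. The condition can be read in isolation as the sketch below; `prod` and `may_skip_not_implemented` are illustrative names that do not exist in the test suite, and `prod` stands in for `numpy.prod`.

```python
def prod(shape):
    """Product of the dimensions; 0 as soon as one dimension is empty."""
    result = 1
    for dim in shape:
        result *= dim
    return result


def may_skip_not_implemented(shape, identity):
    """Mirror of the skip condition in the except branch.

    `identity` is what getattr(scalar_op, 'identity', None) would return.
    """
    return prod(shape) == 0 and identity != 0
```

Any other `NotImplementedError` is re-raised by the test, so a non-empty input or an op with identity 0 still has to run on the GPU.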

--- a/theano/sparse/tests/test_basic.py
+++ b/theano/sparse/tests/test_basic.py
@@ -2721,7 +2721,7 @@ class TrueDotTester(utt.InferShapeTester):
                assert tested.format == format
                assert tested.dtype == expected.dtype
                tested = tested.toarray()
-                assert numpy.allclose(tested, expected)
+                utt.assert_allclose(tested, expected)
    def test_op_sd(self):
        for format in sparse.sparse_formats:
@@ -2743,7 +2743,7 @@ class TrueDotTester(utt.InferShapeTester):
                assert tested.format == format
                assert tested.dtype == expected.dtype
                tested = tested.toarray()
-                assert numpy.allclose(tested, expected)
+                utt.assert_allclose(tested, expected)
    def test_infer_shape(self):
        for format in sparse.sparse_formats: