Commit 5432363f authored by Olivier Delalleau

Merged

Trunk since last release
------
* Sparse types are now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would bust the 256 byte limit on parameters to a GPU function).
* Speed up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
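The C-order vs Fortran-order distinction behind this gemv workaround can be sketched with numpy (illustrative only; the actual fix lives inside Theano's gemv wrapper):

```python
import numpy as np

# A C-ordered (row-major) matrix, numpy's default layout.
a_c = np.zeros((3, 4), order='C')
# The same data in Fortran (column-major) order, which BLAS gemv prefers.
a_f = np.asfortranarray(a_c)

assert a_c.flags['C_CONTIGUOUS'] and not a_c.flags['F_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS']

# A C-ordered matrix can be viewed as the transpose of a Fortran-ordered
# one without copying, which is one way to avoid the slow path.
assert a_c.T.flags['F_CONTIGUOUS']
```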
Theano 0.3 (2010-11-23)
-----------------------
......
......@@ -19,7 +19,8 @@ instructions below for detailed installation steps):
Linux, Mac OS X or Windows operating system
We develop mainly on 64-bit Linux machines. 32-bit architectures are
not well-tested. Note that GPU computing does not yet work under
Windows.
Python_ >= 2.4
The development package (``python-dev`` or ``python-devel``
......@@ -330,7 +331,7 @@ Mac
.. code-block:: bash
$ sudo port install gcc44 py26-scipy mercurial python_select
This will install all the required Theano dependencies. Note that
compiling gcc takes significant time (hours)! SciPy depends on ATLAS (a
......@@ -344,13 +345,13 @@ Mac
packages are updated quite frequently.
- In order to use the MacPorts version of python, you might
need to explicitly select it with ``sudo python_select python26``. The
reason this is necessary is because you might have an Apple-provided python
(via, for example, an Xcode installation). After performing this step, you
should check that the symbolic link provided by ``which python`` points to
the MacPorts python. For instance, on Snow Leopard with the latest MacPorts,
the output of ``which python`` is ``/opt/local/bin/python`` and this symbolic
link points to ``/opt/local/bin/python2.6``. When executing ``sudo
python_select python26-apple`` (which you should **not** do), the link
points to ``/usr/bin/python2.6``.
......@@ -364,7 +365,7 @@ Mac
- Please follow the same procedure with ``numpy``.
- Put ``export PYTHONPATH=/opt/local/lib/python2.6/site-packages:$PYTHONPATH``
in your ``.bashrc`` in order to include your MacPorts Python packages
(NumPy, SciPy) in Python's path.
......@@ -469,7 +470,7 @@ components as in Python(x,y) that are required by Theano, follow these steps:
sub-directory are in your system path. This may be done by
modifying the global ``PATH`` Windows environment variables, or by creating
a ``.profile`` file in your MinGW home, containing a line like
``export PATH=$PATH:/c/Python26:/c/Python26/Scripts`` (note that the latter
will work only when you run Theano from a MinGW shell).
- In order to run Theano's test-suite, you will need `nose
......@@ -661,10 +662,14 @@ follows:
Using the GPU
~~~~~~~~~~~~~
At this point, GPU computing does not work under Windows. The current main
issue is that the compilation commands used under Linux / MacOS to create
and use a CUDA-based shared library with the nvcc compiler do not work with
Windows DLLs. If anyone can figure out the proper compilation steps for
Windows, please let us know on the `theano-dev`_ mailing list.
The instructions below should at least get you started, so that you can
reproduce the above-mentioned issue.
These instructions are for the 32-bit version of Python (the one that comes
with Python(x,y) is 32-bit).
......@@ -679,44 +684,47 @@ use a compilation directory located somewhere else:
[global]
base_compiledir=path_to_a_directory_without_such_characters
Then
1) Install CUDA driver (32-bit on 32-bit Windows, idem for 64-bit).
2) Install CUDA toolkit 32-bit (even if your computer is 64-bit;
   must match the Python installation version).
3) Install CUDA SDK 32-bit.
4) Test some pre-compiled examples from the SDK.
5) Download Visual Studio 2008 Express (free; VS2010 is not supported by nvcc 3.1,
   and VS2005 is not available for download but is supported by nvcc; the
   non-free version should work too).
6) Follow the instructions in the GettingStartedWindows.pdf file from the CUDA
   web site to compile CUDA code with VS2008. If that does not work, you will
   not be able to compile GPU code with Theano.
7) Edit your Theano configuration file to add lines like the following
   (make sure these paths match your own specific installation):

   .. code-block:: cfg

      [cuda]
      nvccflags=-LC:\Python26\libs

      [nvcc]
      compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin

8) In Python do: ``import theano.sandbox.cuda``. This will compile the
   first CUDA file, and no error should occur.
9) Then run the Theano CUDA test files with nosetests from the
   ``theano/sandbox/cuda/tests`` subdirectory. In the current version of
   Theano, this should fail with an error like:

   .. code-block:: bash

      NVCC: nvcc fatal: Don't know what to do with
      'C:/CUDA/compile/tmpmkgqx6/../cuda_ndarray/cuda_ndarray.pyd'
Generating the documentation
----------------------------
......@@ -739,3 +747,4 @@ The PDF of the documentation is ``html/theano.pdf``.
.. _theano-users: http://groups.google.com/group/theano-users?pli=1
.. _theano-dev: http://groups.google.com/group/theano-dev?pli=1
......@@ -11,6 +11,7 @@ If you're feeling ambitious, go fix some `pylint
.. toctree::
:maxdepth: 2
release
dev_start_guide
lisa_labo
mammouth
......
......@@ -20,6 +20,7 @@ Types and Ops that you can use to build and compile expression graphs.
scalar/index
gof/index
scan
sandbox/index
There are also some top-level imports that you might find more convenient:
......
.. _libdoc_sandbox_cuda:
===========================================
:mod:`sandbox.cuda` -- The CUDA GPU backend
===========================================
.. module:: sandbox.cuda
:platform: Unix, Windows
:synopsis: Code for GPU programming
.. moduleauthor:: LISA
.. toctree::
:maxdepth: 1
var
type
.. ../../../../theano/sandbox/cuda/type.py
.. ../../../../theano/sandbox/cuda/var.py
.. ../../../../theano/sandbox/cuda/
.. _libdoc_cuda_type:
======================================================================
:mod:`sandbox.cuda.type` -- The Type object for CUDA-allocated arrays
======================================================================
.. module:: sandbox.cuda.type
:platform: Unix, Windows
:synopsis: The Type object for CUDA-allocated arrays
.. moduleauthor:: LISA
API
===
.. ../../../../theano/sandbox/cuda/type.py
.. ../../../../theano/sandbox/cuda/var.py
.. ../../../../theano/sandbox/cuda/
.. _libdoc_cuda_var:
===================================================================
:mod:`sandbox.cuda.var` -- The Variables for CUDA-allocated arrays
===================================================================
.. module:: sandbox.cuda.var
:platform: Unix, Windows
:synopsis: The Variables object for CUDA-allocated arrays
.. moduleauthor:: LISA
API
===
.. autoclass:: theano.sandbox.cuda.var.CudaNdarraySharedVariable
:members: get_value, set_value
.. _libdoc_sandbox:
==============================================================
:mod:`sandbox` -- Experimental Code
==============================================================
.. module:: sandbox
:platform: Unix, Windows
:synopsis: Experimental code
.. moduleauthor:: LISA
.. toctree::
:maxdepth: 1
cuda/index
......@@ -142,7 +142,7 @@ transparent. But when you are using a GPU (or in future perhaps a remote machin
is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never cast the internal object (it always returns in
constant time), but might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the numpy array).
......@@ -154,6 +154,12 @@ It is possible to use ``borrow=False`` in conjunction with
``return_internal_type=True``, which will return a deep copy of the internal object.
This is primarily for internal debugging, not for typical use.
So that Theano can transparently apply different kinds of optimization, the
policy is that ``get_value()``, by default, returns the same object type it
received when the shared variable was created. So if you manually created
data on the GPU and created a shared variable on the GPU with this data,
``get_value`` will always return GPU data, even when
``return_internal_type=False``.
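As a minimal sketch of these borrow semantics, using a plain numpy array as a stand-in for the shared variable's internal storage (the names ``internal`` and ``get_value`` here are illustrative, not Theano's actual implementation):

```python
import numpy as np

internal = np.zeros((2, 3), dtype='float32')  # stand-in for shared storage

def get_value(borrow=False):
    # Mimics the documented behaviour: copy by default, alias with borrow=True.
    return internal if borrow else internal.copy()

copied = get_value()              # safe: a fresh copy
aliased = get_value(borrow=True)  # fast: the same memory

assert copied is not internal
assert aliased is internal
aliased[0, 0] = 1.0               # modifying the alias is visible internally
assert internal[0, 0] == 1.0
```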
*Take home message:*
It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
......@@ -182,6 +188,30 @@ This pattern works regardless of the compute device, and when the compute device
makes it possible to expose Theano's internal variables without a copy, then it
goes as fast as an in-place update.
When shared variables are allocated on the GPU, the transfers to and from GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Theano 0.3.1, set_value did not work in-place on the GPU. This meant that sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily. *Solution*: update to a newer version of Theano.
* If you are going to swap several chunks of data in and out of a shared variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
*Solution*: upgrade to a recent version of Theano (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
* It is also worth mentioning that current GPU copying routines support only contiguous memory.
So Theano must make the ``value`` you provide ``c_contiguous`` prior to copying it.
This can require an extra copy of the data on the host. *Solution*: make sure that the value
you assign to a CudaNdarraySharedVariable is *already* ``c_contiguous``.
(Further remarks on the current implementation of the GPU version of set_value() can be found
here: :ref:`libdoc_cuda_var`)
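A host-side sketch of the padding tip above, assuming hypothetical variable-length chunks of float32 data (numpy only; the padded arrays would then be passed to ``set_value``):

```python
import numpy as np

# Hypothetical chunks of differing lengths that will be swapped in and out.
chunks = [np.ones((5, 3), dtype='float32'), np.ones((7, 3), dtype='float32')]
max_rows = max(c.shape[0] for c in chunks)

# Pad every chunk to a common shape so the GPU buffer can be reused
# in-place by set_value, and make each chunk c_contiguous up front.
padded = [np.ascontiguousarray(
              np.vstack([c, np.zeros((max_rows - c.shape[0], 3),
                                     dtype='float32')]))
          for c in chunks]

assert all(p.shape == (max_rows, 3) for p in padded)
assert all(p.flags['C_CONTIGUOUS'] for p in padded)
```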
Retrieving and assigning via the .value property
------------------------------------------------
......
......@@ -21,7 +21,7 @@ Toolkit installs a folder on your computer with subfolders *bin*, *lib*,
*include*, and some more too. (Sanity check: The *bin* subfolder should contain an *nvcc*
program which is the compiler for GPU code.) This folder is called the *cuda
root* directory.
On Linux or OS X >= 10.4, you must add the 'lib' subdirectory (and/or 'lib64' subdirectory if you have a 64-bit Linux
computer) to your ``$LD_LIBRARY_PATH`` environment variable.
......@@ -274,3 +274,10 @@ Tips for improving performance on GPU
that can tell you if not enough of your graph is on the GPU, or if there
is too much memory transfer.
Changing the value of shared variables
--------------------------------------
To change the value of a shared variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
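A sketch of the intended usage pattern, with a numpy computation standing in for the compiled Theano function (the names here are illustrative):

```python
import numpy as np

# New data to process on each iteration of a hypothetical training loop.
batches = [np.full((2, 2), i, dtype='float32') for i in range(3)]

results = []
for batch in batches:
    # With a real shared variable: shared_x.set_value(batch)
    # then call the compiled function; here we just sum as a stand-in.
    results.append(float(batch.sum()))

assert results == [0.0, 4.0, 8.0]
```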
......@@ -138,7 +138,9 @@ class SharedVariable(Variable):
def filter_update(self, update):
"""When this shared variable is updated by a pfunc, the update value will be run through this function.
"""
When this shared variable is updated by a pfunc, the update value will be run through this function.
This is a good spot to cast or convert the update expression as necessary.
Default behaviour is to return `update` unmodified if it is a Variable, otherwise to create a SharedVariable for it by calling ``shared(update)``.
......
......@@ -7,10 +7,12 @@ import re
from theano.configparser import config, AddConfigVar, StrParam
def default_compiledirname():
platform_id = '-'.join([
    platform.platform(),
    platform.processor(),
    platform.python_version()])
platform_id = re.sub(r"[\(\)\s]+", "_", platform_id)
return 'compiledir_' + platform_id
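The directory name this produces can be previewed standalone (same logic as above, reproduced here for illustration):

```python
import platform
import re

platform_id = '-'.join([
    platform.platform(),
    platform.processor(),
    platform.python_version()])
# Parentheses and whitespace are replaced so the name is filesystem-safe.
platform_id = re.sub(r"[\(\)\s]+", "_", platform_id)
name = 'compiledir_' + platform_id

assert name.startswith('compiledir_')
assert '(' not in name and ')' not in name and ' ' not in name
```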
def is_valid_compiledir(path):
if not os.access(path, os.R_OK | os.W_OK):
......
......@@ -163,19 +163,16 @@ class Container(object):
if value is None:
self.storage[0] = None
return
if self.type.__class__.__name__ == "CudaNdarrayType" and isinstance(value,numpy.ndarray):
#The filter method of CudaNdarray alloc a new memory region on the gpu.
#The ref count will be decremented after that.
#That cause 2 region allocated at the same time!
#We decrement the memory reference conter now to try to lower the memory usage.
self.storage[0] = None
kwargs = {}
if self.strict:
kwargs['strict'] = True
if self.allow_downcast is not None:
kwargs['allow_downcast'] = self.allow_downcast
self.storage[0] = self.type.filter(value, **kwargs)
if hasattr(self.type,'filter_inplace'):
self.storage[0] = self.type.filter_inplace(value, self.storage[0], **kwargs)
else:
self.storage[0] = self.type.filter(value, **kwargs)
except Exception, e:
e.args = e.args + (('Container name "%s"' % self.name),)
......
......@@ -89,7 +89,7 @@ if __name__ == "__main__":
* numpy with ATLAS from the distribution (FC9) package (1 thread)
* manually compiled numpy and ATLAS with 2 threads
* goto with 1, 2, 4 and 8 threads.
Xeon Xeon Xeon Core2 i7
lib/nb threads E5345 E5430 E5450 E8500 930
numpy_FC9_atlas/1 39.2s 35.0s 30.7s 29.6s 21.5s
......
......@@ -33,8 +33,8 @@ else:
except NotImplementedError:
b_sparse = False
a_cuda = False
b_cuda = False
if a.__class__.__name__ == "CudaNdarray":
a_cuda = True
if b.__class__.__name__ == "CudaNdarray":
......
......@@ -37,17 +37,18 @@ def test_may_share_memory():
# test that it raises an error when needed.
for a_, b_, rep in [(a, (0,), False), (a, 1, False), (a, None, False)]:
    assert may_share_memory(a_, b_, False) == rep
    assert may_share_memory(b_, a_, False) == rep
    try:
        may_share_memory(a_, b_)
        raise Exception("An error was expected")
    except TypeError:
        pass
    try:
        may_share_memory(b_, a_)
        raise Exception("An error was expected")
    except TypeError:
        pass
if scipy_imported:
def test_may_share_memory_scipy():
......@@ -64,14 +65,18 @@ if scipy_imported:
assert may_share_memory(a_, b_) == rep
assert may_share_memory(b_, a_) == rep

# test that it raises an error when needed.
for a_, b_, rep in [(a, (0,), False), (a, 1, False), (a, None, False)]:
    assert may_share_memory(a_, b_, False) == rep
    assert may_share_memory(b_, a_, False) == rep
    try:
        may_share_memory(a_, b_)
        raise Exception("An error was expected")
    except TypeError:
        pass
    try:
        may_share_memory(b_, a_)
        raise Exception("An error was expected")
    except TypeError:
        pass
......@@ -51,12 +51,11 @@ def set_cuda_disabled():
#cuda_ndarray compile and import
cuda_path = os.path.abspath(os.path.split(__file__)[0])
cuda_files = ('cuda_ndarray.cu', 'cuda_ndarray.cuh', 'conv_full_kernel.cu', 'conv_kernel.cu')
stat_times = [os.stat(os.path.join(cuda_path, cuda_file))[stat.ST_MTIME] for cuda_file in cuda_files]
date = max(stat_times)

cuda_ndarray_loc = os.path.join(config.compiledir, 'cuda_ndarray')
cuda_ndarray_so = os.path.join(cuda_ndarray_loc,
'cuda_ndarray.' + get_lib_extension())
compile_cuda_ndarray = True
......@@ -87,7 +86,7 @@ try:
'cuda_ndarray',
code,
location=cuda_ndarray_loc,
include_dirs=[cuda_path], libs=['cublas'])
from cuda_ndarray.cuda_ndarray import *
except Exception, e:
......@@ -105,17 +104,19 @@ if cuda_available:
cuda_available = False
cuda_initialization_error_message = e.message
# We must do these imports to be able to create the full doc when nvcc
# is not available.
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda.var import (CudaNdarrayVariable,
                                     CudaNdarrayConstant,
                                     CudaNdarraySharedVariable,
                                     float32_shared_constructor)

if cuda_available:
    # Check whether an old cuda_ndarray was loaded instead of the one we compiled!
    import cuda_ndarray.cuda_ndarray
    if cuda_ndarray_so != cuda_ndarray.cuda_ndarray.__file__:
        warning("WARNING: cuda_ndarray was loaded from",
                cuda_ndarray.cuda_ndarray.__file__,
                "This is not expected, as Theano should compile it "
                "automatically for you. Do you have a directory called "
                "cuda_ndarray in your LD_LIBRARY_PATH environment variable? "
                "If so, please remove it, as it is outdated!")

    shared_constructor = float32_shared_constructor
    import basic_ops
......
......@@ -1701,20 +1701,6 @@ class GpuSubtensor(tensor.Subtensor):
cdata = cdata[0]
out[0] = x.__getitem__(cdata)
class GpuIncSubtensor(tensor.IncSubtensor):
def make_node(self, x, y, *inputs):
assert isinstance(x.type, CudaNdarrayType)
......
......@@ -819,20 +819,46 @@ def test_shared_float32():
# Unregister
del theano.shared.constructors[-1]
def test_shared_cudandarray():
'''Test that we can create a CudaNdarraySharedVariable from a CudaNdarray'''
a = cuda.shared_constructor(cuda.CudaNdarray.zeros((2,3)))
assert isinstance(a.type, tcn.CudaNdarrayType)
import theano.tensor.tests.test_sharedvar
# This tests the case where the shared constructor receives a CudaNdarray as input
test_shared_options = theano.tensor.tests.test_sharedvar.makeSharedTester(
shared_constructor_ = tcn.shared_constructor,
dtype_ = 'float32',
get_value_borrow_true_alias_ = True,
shared_borrow_true_alias_ = True,  # True when the original value is already a CudaNdarray!
set_value_borrow_true_alias_ = True,
set_value_inplace_ = True,
set_casted_value_inplace_ = False,
shared_constructor_accept_ndarray_ = True,
internal_type_ = cuda_ndarray.CudaNdarray,
test_internal_type_ = lambda a: isinstance(a,cuda_ndarray.CudaNdarray),
theano_fct_ = theano.tensor.exp,
ref_fct_ = numpy.exp,
cast_value_ = cuda_ndarray.CudaNdarray,
op_by_matrix_ = True)
# This tests the case where the shared constructor receives an ndarray as input
test_shared_options2 = theano.tensor.tests.test_sharedvar.makeSharedTester(
shared_constructor_ = tcn.shared_constructor,
dtype_ = 'float32',
get_value_borrow_true_alias_ = False,
shared_borrow_true_alias_ = False,
set_value_borrow_true_alias_ = False,
set_value_inplace_ = True,
set_casted_value_inplace_ = True,
shared_constructor_accept_ndarray_ = True,
internal_type_ = cuda_ndarray.CudaNdarray,
test_internal_type_ = lambda a: isinstance(a,cuda_ndarray.CudaNdarray),
theano_fct_ = theano.tensor.exp,
ref_fct_ = numpy.exp,
cast_value_ = numpy.asarray,
op_by_matrix_ = True)
if __name__ == '__main__':
test_many_arg_elemwise()
......
......@@ -65,6 +65,48 @@ def test_softmax_optimizations():
assert env.outputs[0].owner.inputs[0].owner.op == cuda.host_from_gpu
assert env.outputs[0].owner.inputs[0].owner.inputs[0].owner.op == cuda.nnet.gpu_crossentropy_softmax_argmax_1hot_with_bias
def test_may_share_memory_cuda():
from theano.misc.may_share_memory import may_share_memory
a = cuda.CudaNdarray(numpy.zeros((3,4),dtype='float32'))
b = cuda.CudaNdarray(numpy.zeros((3,4),dtype='float32'))
na = numpy.zeros((3,4))
nb = numpy.zeros((3,4))
va = a.view()
vb = b.view()
ra = a.reshape((4,3))
rb = b.reshape((4,3))
# can't test the transpose, as assigning ta._strides is not implemented
#manual transpose of a
#ta = a.reshape((4,3))
#ta._strides = (ta._strides[1],ta._strides[0])#not implemented
#elem_size = numpy.zeros(0, dtype=a.dtype).dtype.itemsize
#ta.gpudata += ta.size*elem_size
for a_,b_,rep in [(a,a,True),(b,b,True),(a,b,False),
(a,na,False),(b,nb,False),(na,b,False),(nb,a,False),
(a,va,True),(b,vb,True),(va,b,False),(a,vb,False),
(a,ra,True),(b,rb,True),(ra,b,False),(a,rb,False),
]:
assert may_share_memory(a_,b_)==rep
assert may_share_memory(b_,a_)==rep
# test that it raises an error when needed.
for a_,b_,rep in [(a,(0,),False),(a,1,False),(a,None,False)]:
assert may_share_memory(a_,b_,False)==rep
assert may_share_memory(b_,a_,False)==rep
try:
may_share_memory(a_,b_)
raise Exception("An error was expected")
except TypeError:
pass
try:
may_share_memory(b_,a_)
raise Exception("An error was expected")
except TypeError:
pass
def test_grad_sqrt_sum():
"""
This triggered a bug in the past.
......
......@@ -8,10 +8,13 @@ from theano import Op, Type, Apply, Variable, Constant
from theano import tensor, config
from theano import scalar as scal
try:
    # We must do these imports to be able to create the full doc when nvcc
    # is not available.
    import cuda_ndarray.cuda_ndarray as cuda
    from theano.sandbox.cuda.nvcc_compiler import nvcc_module_compile_str
    import cuda_ndarray
except ImportError:
    pass
class CudaNdarrayType(Type):
......@@ -53,14 +56,18 @@ class CudaNdarrayType(Type):
self.dtype_specs() # error checking is done there
def filter(self, data, strict=False, allow_downcast=None):
    return self.filter_inplace(data, None, strict=strict, allow_downcast=allow_downcast)

def filter_inplace(self, data, old_data, strict=False, allow_downcast=None):
    if strict or allow_downcast or isinstance(data, cuda.CudaNdarray):
        return cuda.filter(data, self.broadcastable, strict, old_data)
else: # (not strict) and (not allow_downcast)
# Check if data.dtype can be accurately casted to self.dtype
if isinstance(data, numpy.ndarray):
up_dtype = scal.upcast(self.dtype, data.dtype)
if up_dtype == self.dtype:
return cuda.filter(data, self.broadcastable, strict, old_data)
else:
raise TypeError(
'%s, with dtype %s, cannot store a value of '
......@@ -75,10 +82,10 @@ class CudaNdarrayType(Type):
type(data) is float and
self.dtype==theano.config.floatX):
return cuda.filter(converted_data, self.broadcastable,
strict, old_data)
elif numpy.all(data == converted_data):
return cuda.filter(converted_data, self.broadcastable,
strict, old_data)
else:
raise TypeError(
'%s, with dtype %s, cannot store accurately value %s, '
......@@ -87,6 +94,7 @@ class CudaNdarrayType(Type):
% (self, self.dtype, data, converted_data, self.dtype),
data)
@staticmethod
def bound(a):
high = a.gpudata
......@@ -112,10 +120,11 @@ class CudaNdarrayType(Type):
if a.__class__ is b.__class__:
a_l, a_h = CudaNdarrayType.bound(a)
b_l, b_h = CudaNdarrayType.bound(b)
if b_l >= a_h or a_l >= b_h:
return False
return True
else:
    return False
@staticmethod
def values_eq(a, b):
......@@ -352,4 +361,8 @@ copy_reg.constructor(CudaNdarray_unpickler)
def CudaNdarray_pickler(cnda):
return (CudaNdarray_unpickler, (numpy.asarray(cnda),))
try:
    # In case cuda was not imported.
    copy_reg.pickle(cuda.CudaNdarray, CudaNdarray_pickler, CudaNdarray_unpickler)
except NameError:
    pass
......@@ -8,15 +8,18 @@ from theano import tensor
from theano.compile import SharedVariable
from theano.sandbox.cuda.type import CudaNdarrayType
try:
    # We must do these imports to be able to create the full doc when nvcc
    # is not available.
    from theano.sandbox.cuda import filter as type_support_filter
    from theano.sandbox.cuda.basic_ops import HostFromGpu, GpuFromHost
except ImportError:
    pass
class _operators(tensor.basic._tensor_py_operators):
"""Define a few properties and conversion methods for CudaNdarray Variables.
The default implementation of arithmetic operators is to build graphs of TensorType
variables.
The optimization pass (specialization) will insert pure GPU implementations.
This approach relieves the Cuda-Ops of having to deal with input argument checking and
......@@ -49,9 +52,34 @@ class CudaNdarrayConstant(Constant, _operators):
CudaNdarrayType.Constant = CudaNdarrayConstant
class CudaNdarraySharedVariable(SharedVariable, _operators):
"""
Shared Variable interface to CUDA-allocated arrays
"""
get_value_return_ndarray = True
def get_value(self, borrow=False, return_internal_type=False):
    """
    Return the value of this SharedVariable's internal array.

    :param borrow:
        permit the return of internal storage, when used in conjunction with
        ``return_internal_type=True``
    :param return_internal_type:
        True to return the internal ``cuda_ndarray`` instance rather than a
        ``numpy.ndarray`` (Default False)

    By default ``get_value()`` copies from the GPU to a ``numpy.ndarray``
    and returns that host-allocated array.

    ``get_value(False, True)`` will return a GPU-allocated copy of the
    original GPU array.

    ``get_value(True, True)`` will return the original GPU-allocated array
    without any copying.
    """
    if return_internal_type or not self.get_value_return_ndarray:
        # return a cuda_ndarray
        if borrow:
            return self.container.value
        else:
......@@ -60,6 +88,37 @@ class CudaNdarraySharedVariable(SharedVariable, _operators):
return numpy.asarray(self.container.value)
def set_value(self, value, borrow=False):
"""
Assign `value` to the GPU-allocated array.
:param borrow: ``True`` permits reusing `value` itself, ``False`` requires that this function
copies `value` into internal storage.
:note:
Prior to Theano 0.3.1, set_value did not work in-place on the GPU. This meant that sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory.
Beginning with Theano 0.3.1, set_value will work in-place on the GPU, if the following conditions
are met:
* The destination on the GPU must be c_contiguous.
* The source is on the CPU.
* The old value must have the same dtype as the new value (which is a given for now,
since only float32 is supported).
* The old and new value must have the same shape.
* The old value is being completely replaced by the new value (not partially modified,
e.g. by replacing some subtensor of it).
* You change the value of the shared variable via set_value, not via the .value
accessors. You should not use the .value accessors anyway, since they will soon be
deprecated and removed.
It is also worth mentioning that, for efficient transfer to the GPU, Theano will make the new data
``c_contiguous``. This can require an extra copy of the data on the host.
This works both when ``borrow=True`` and when ``borrow=False``.
"""
if not borrow:
#TODO: check for cuda_ndarray type
if not isinstance(value, numpy.ndarray):
......@@ -84,11 +143,11 @@ CudaNdarrayType.SharedVariable = CudaNdarraySharedVariable
def cuda_shared_constructor(value, name=None, strict=False,
allow_downcast=None, borrow=False, broadcastable=None):
"""SharedVariable Constructor for TensorType"""
"""SharedVariable Constructor for CudaNdarrayType"""
# THIS CONSTRUCTOR TRIES TO CAST VALUE TO A FLOAT32, WHICH THEN GOES ONTO THE CARD
# SO INT shared vars, float64 shared vars, etc. all end up on the card.
# THIS IS NOT THE DEFAULT BEHAVIOUR THAT WE WANT.
# SEE float32_shared_constructor
#TODO: what should strict mean in this context, since we always have to make a copy?
......@@ -115,22 +174,34 @@ def cuda_shared_constructor(value, name=None, strict=False,
def float32_shared_constructor(value, name=None, strict=False,
allow_downcast=None, borrow=False, broadcastable=None):
"""SharedVariable Constructor for TensorType"""
"""SharedVariable Constructor for CudaNdarrayType from numpy.ndarray or CudaNdarray"""
# if value isn't a float32 ndarray, then raise
if not isinstance(value, numpy.ndarray):
raise TypeError('ndarray required')
if value.dtype.num != CudaNdarrayType.typenum:
# if value isn't a float32 ndarray, or a CudaNdarray then raise
if not isinstance(value, (numpy.ndarray, theano.sandbox.cuda.CudaNdarray)):
raise TypeError('ndarray or CudaNdarray required')
if isinstance(value, numpy.ndarray) and value.dtype.num != CudaNdarrayType.typenum:
raise TypeError('float32 ndarray required')
if broadcastable is None:
broadcastable = (False,) * len(value.shape)
type = CudaNdarrayType(broadcastable=broadcastable)
get_value_return_ndarray = True
if isinstance(value, theano.sandbox.cuda.CudaNdarray):
    get_value_return_ndarray = False
    if borrow:
        deviceval = value
    else:
        deviceval = value.copy()
else:
    deviceval = type_support_filter(value, broadcastable, False, None)
try:
rval = CudaNdarraySharedVariable(type=type, value=deviceval, name=name, strict=strict)
except Exception, e:
print "ERROR", e
raise
return rval
rval.get_value_return_ndarray = get_value_return_ndarray
return rval
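The borrow logic above (alias the caller's array when `borrow=True`, otherwise copy it to the device) can be illustrated with plain ndarrays — a minimal sketch using numpy arrays as a stand-in for CudaNdarray, since the copy-vs-alias behaviour is the same:

```python
import numpy as np

def shared_deviceval(value, borrow=False):
    """Mimic the constructor's borrow logic: alias when borrow=True, copy otherwise."""
    return value if borrow else value.copy()

value = np.ones(3, dtype='float32')
aliased = shared_deviceval(value, borrow=True)   # shares memory with value
copied = shared_deviceval(value, borrow=False)   # independent copy

value[0] = 7.0
# the borrowed variable sees the mutation, the copied one does not
```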
......@@ -400,11 +400,16 @@ def test_neibs_grad_verify_grad_warp_centered():
try:
unittest_tools.verify_grad(fn, [images_val], mode=mode_without_gpu)
raise Exception("Expected an error")
if cuda.cuda_available:
unittest_tools.verify_grad(fn, [images_val], mode=mode_with_gpu)
except NotImplementedError:
pass
if cuda.cuda_available:
try:
unittest_tools.verify_grad(fn, [images_val], mode=mode_with_gpu)
raise Exception("Expected an error")
except NotImplementedError:
pass
if __name__ == '__main__':
#test_neibs_gpu()
#test_neibs()
......
......@@ -44,7 +44,7 @@ import tensor
import misc.safe_asarray as safe_asarray
from tensor import opt, TensorType
import gof
from gof import Optimizer, toolbox, Op, Apply
from gof import Optimizer, toolbox, Op, Apply, Variable
from compile import optdb, SharedVariable, function, Param
import compile
import gradient
......@@ -1559,8 +1559,15 @@ class Scan(Op):
theano.config.floatX))
inner_gfn_ins = inner_g_outs + self.inputs
g_args = [self.n_steps] + g_outs[:self.n_outs_not_shared] \
+ scan_outputs + args[1:]
# Make sure you don't have numbers in here
if not isinstance(self.n_steps, Variable):
n_steps = tensor.as_tensor(self.n_steps)
else:
n_steps = self.n_steps
g_args = [n_steps] + g_outs[:self.n_outs_not_shared] \
+ scan_outputs + args[1:]
truncate_gradient = self.truncate_gradient
for x in self.store_steps[:self.n_outs_not_shared]:
if x>0 :
......@@ -1571,8 +1578,11 @@ class Scan(Op):
self.n_seqs, self.n_outs, self.n_outs_not_shared,
self.go_backwards, self.seqs_taps, self.outs_taps,
truncate_gradient)
g_scan_outs = g_scan(g_args)
# We need to add several None's fpr shared vars with updates
if not type(g_scan_outs) in (list, tuple):
g_scan_outs = [ g_scan_outs ]
# We need to add several None's for shared vars with updates
gradients = [None] + g_scan_outs[:self.n_seqs+self.n_outs_not_shared]
gradients += [None for i in xrange(self.n_outs-self.n_outs_not_shared)]
gradients += g_scan_outs[self.n_seqs+self.n_outs_not_shared:]
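The layout of the returned gradients list can be sketched as a standalone helper (hypothetical function and placeholder gradient names, for illustration only): `None` for the `n_steps` input and for shared outputs with updates, real gradients elsewhere.

```python
def assemble_gradients(g_scan_outs, n_seqs, n_outs, n_outs_not_shared):
    """Mirror the list assembly above: None for n_steps and for the
    shared outputs, the computed gradients in every other slot."""
    gradients = [None] + g_scan_outs[:n_seqs + n_outs_not_shared]
    gradients += [None for i in range(n_outs - n_outs_not_shared)]
    gradients += g_scan_outs[n_seqs + n_outs_not_shared:]
    return gradients

# e.g. 2 sequences, 3 outputs of which 2 are not shared
grads = assemble_gradients(['g_s0', 'g_s1', 'g_o0', 'g_o1', 'g_extra'], 2, 3, 2)
```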
......
......@@ -15,7 +15,7 @@ def sparse_constructor(value, name=None, strict=False, allow_downcast=None,
writeme
"""
if not isinstance(value, scipy.sparse.spmatrix):
raise TypeError()
raise TypeError("Expected a sparse matrix in the sparse shared variable constructor. Received: ",value.__class__)
if format is None:
format = value.format
......@@ -24,5 +24,3 @@ def sparse_constructor(value, name=None, strict=False, allow_downcast=None,
value = copy.deepcopy(value)
return SparseTensorSharedVariable(type=type, value=value, name=name,
strict=strict, allow_downcast=allow_downcast)
......@@ -468,7 +468,7 @@ def test_shape_i():
a = SparseType('csr', dtype=sparse_dtype)()
f = theano.function([a], a.shape[1], mode='FAST_RUN')
assert f(sp.csr_matrix(random_lil((100,10), sparse_dtype, 3)))==(10)
assert f(sp.csr_matrix(random_lil((100,10), sparse_dtype, 3))) == 10
def test_shape():
# Test that getting the shape of a sparse variable
......@@ -501,11 +501,20 @@ def test_may_share_memory():
import theano.tensor.tests.test_sharedvar
test_shared_options=theano.tensor.tests.test_sharedvar.makeSharedTester(
theano.sparse.shared, 'float64',
True, True, True, scipy.sparse.csc_matrix, scipy.sparse.issparse,
lambda a: dense_from_sparse(a*2.),
lambda a: numpy.asarray((a*2).todense()),
scipy.sparse.csr_matrix)
shared_constructor_ = theano.sparse.shared,
dtype_ = 'float64',
get_value_borrow_true_alias_ = True,
shared_borrow_true_alias_ = True,
set_value_borrow_true_alias_ = True,
set_value_inplace_ = False,
set_casted_value_inplace_ = False,
shared_constructor_accept_ndarray_ = False,
internal_type_ = scipy.sparse.csc_matrix,
test_internal_type_ = scipy.sparse.issparse,
theano_fct_ = lambda a: dense_from_sparse(a*2.),
ref_fct_ = lambda a: numpy.asarray((a*2).todense()),
cast_value_ = scipy.sparse.csr_matrix)
if __name__ == '__main__':
unittest.main()
......@@ -3538,8 +3538,16 @@ tilegrad = TileGrad()
class Tile(Op):
"""Tiles its input according to reps. Reps is of same dimension as x
and contains the number of times to tile x in each dimension"""
"""
Construct an array by repeating the input x according to reps pattern.
Tiles its input according to reps. The len of reps is the number of
dimension of x and contains the number of times to tile x in each dimension.
:see: `numpy.tile http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html`_
"""
def __init__(self, ndim):
self.ndim = ndim
def __eq__(self, other):
......@@ -4273,7 +4281,7 @@ def grad(cost, wrt, g_cost=None, consider_constant=[], warn_type=False):
each element of the list. If an element of `wrt` is not differentiable
with respect to the output, then a zero variable is returned.
This function is a wrapper around a the more general function
This function is a wrapper around the more general function
`theano.gradient.grad_sources_inputs``.
"""
......
......@@ -941,12 +941,16 @@ class CAReduce(Op):
# If it's a zero-size array, use scalar_op.identity if available
if variable.shape[dimension] == 0:
if hasattr(self.scalar_op, 'identity'):
variable = self.scalar_op.identity
variable = numpy.array(self.scalar_op.identity)
break
else:
raise ValueError("Input (%s) has zero-size on axis %s, but self.scalar_op (%s) has no attribute 'identity'" % (variable, dimension, self.scalar_op))
else:
variable = self.ufunc.reduce(variable, dimension)
variable = numpy.asarray(variable)
if numpy.may_share_memory(variable, input):
# perhaps numpy is clever for reductions of size 1? We don't want this.
variable = variable.copy()
output[0] = theano._asarray(variable, dtype = node.outputs[0].type.dtype)
else:
output[0] = numpy.copy(variable)
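The defensive copy above guards against numpy handing back memory that aliases the input for degenerate reductions; a minimal numpy sketch of the same guard:

```python
import numpy as np

a = np.arange(3.0).reshape(3, 1)
# reduce over an axis of size 1: numpy may hand back memory aliasing `a`
r = np.asarray(np.add.reduce(a, 1))
if np.may_share_memory(r, a):
    # we don't want the reduction output to alias the input
    r = r.copy()
```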
......@@ -1169,27 +1173,79 @@ class Prod(CAReduce):
).get(idtype, idtype)
def grad(self, (prod_in, ), (gz, )):
'''
The grad of this Op would be very easy, were it not for the case
where zeros are present in a given "group" (ie. elements reduced
together to form the product).
If no zeros are found in the elements of the product, then the
partial derivative of the product relative to one of the elements
(one of the inputs) is simply the product of the other elements.
That's easy to see from the chain rule.
Now the trick (with no zeros) is to take the overall product, then
for every original element, the partial derivative is given by
this product divided by the element itself (which equals the product
of the other terms). This is easy to do by broadcasting the original
product.
(Note that we also need to broadcast-multiply by the "incoming gradient",
ie. the gradient of the cost relative to the output/product).
-----
With zeros, things get more complicated. For a given group, we have 3
cases:
* No zeros in the group. Use previous trick.
* If only one zero is present, then the gradient for that element is
non-zero, but is zero for all others.
* If more than one zero is present, then all the derivatives are zero.
For the last two cases (with 1 or more zeros), we can't use the division
trick, as this gives divisions by 0.
Implementing that case-by-case logic is not trivial, so a bunch of
hacks are piled up here to do it. Notably, for the "only one zero"
case, there's a special Op that computes the product of the elements
in the group, minus the zero (see ProdWithoutZeros). The trick is then
to use the division trick for groups with no zero, to use the
ProdWithoutZeros op where there's only one zero, and to output a
derivative of zero for any element part of a group with more than
one zero.
I do this by first counting the number of zeros in each group (see
the "T.eq()" bits), then taking this or that behavior (see T.switch)
based on the result of this count.
'''
if prod_in.dtype[0:3] in ('int','uin'):
return [None]
# Prepare the broadcasting that is used everywhere to broadcast
# over the original groups (ie. broadcast over the elements of a given
# product)
gz = as_tensor_variable(gz)
axis = self.axis
if axis is None:
axis = range(prod_in.type.ndim)
if axis == ():
return gz,
new_dims = []
new_dims = []
i = 0
for j, _ in enumerate(prod_in.type.broadcastable):
if j in axis:
new_dims.append('x')
else:
new_dims.append(i)
i += 1
i += 1
# result of the product, broadcastable over groups
prod_out = self(prod_in).dimshuffle(new_dims)
# incoming gradient, broadcastable over groups
gz = gz.dimshuffle(new_dims)
# division trick if we don't have zeros. This will contain
# NaNs to be eliminated in the T.switch if we do have zeros.
grad_case_without_zeros = (gz * prod_out / prod_in)
if self.no_zeros_in_input:
......@@ -1198,13 +1254,22 @@ class Prod(CAReduce):
else:
T = theano.tensor
where_zeros = T.eq(prod_in, 0.0)
where_zeros = T.eq(prod_in, 0.0)
sum_where_zeros = T.sum(where_zeros, axis=self.axis)
groups_with_single_zero = T.eq(sum_where_zeros, 1.0).dimshuffle(new_dims)
groups_with_single_zero = T.eq(sum_where_zeros, 1).dimshuffle(new_dims)
# tensor with 0 everywhere except for those places where
# a 0 part of a group with a single zero was to be found
where_single_zero = groups_with_single_zero * where_zeros
where_gz_not_zero = T.neq(gz, 0.0)
# further optimization to avoid computing ProdWithoutZeros
# if the incoming gradient is 0
where_gz_not_zero = T.neq(gz, 0.0)
# only take ProdWithoutZeros for the groups with single zeros
# with non-null incoming gradient
where_to_take_prod_without_zeros = \
groups_with_single_zero * where_gz_not_zero
# preprocess the original input so that we set 0 everywhere
# except for groups that contain a single zero, to avoid computing
# multiplications on other groups
prod_without_zeros_in = where_to_take_prod_without_zeros * prod_in
# TODO: put lazy switch here, if it'd work
# this is pretty efficient already (no multiplication if 0), but
......@@ -1212,7 +1277,8 @@ class Prod(CAReduce):
prod_without_zeros = ProdWithoutZeros(axis=self.axis)(prod_without_zeros_in)
prod_without_zeros = prod_without_zeros.dimshuffle(new_dims)
groups_without_zeros = T.eq(sum_where_zeros, 0.0).dimshuffle(new_dims)
groups_without_zeros = T.eq(sum_where_zeros, 0).dimshuffle(new_dims)
final_grad = T.switch(groups_without_zeros, grad_case_without_zeros,
T.switch(where_single_zero, prod_without_zeros, 0.0) * gz)
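As a sanity check, the case analysis above can be reproduced in plain numpy (a sketch, not the Op's implementation); for Prod(axis=1), the rows below exercise the no-zero, single-zero, and multiple-zero cases:

```python
import numpy as np

def prod_grad_rows(x):
    """Gradient of x.prod(axis=1).sum() w.r.t. x, by the case analysis above."""
    g = np.zeros_like(x)
    for i, row in enumerate(x):
        zeros = (row == 0)
        n_zeros = zeros.sum()
        if n_zeros == 0:
            # division trick: product of the row divided by each element
            g[i] = row.prod() / row
        elif n_zeros == 1:
            # only the zero's slot gets the product of the other elements
            g[i, zeros] = row[~zeros].prod()
        # two or more zeros: every derivative in the row stays 0
    return g

x = np.array([[1., 2., 3.],   # no zeros
              [0., 5., 6.],   # a single zero
              [0., 0., 9.]])  # more than one zero
```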
......@@ -1228,19 +1294,28 @@ class Prod(CAReduce):
return ()
class MulWithoutZeros(scalar.BinaryScalarOp):
identity = 1.
# "identity" here is zero, as in Reduce we don't want to start
# with reducing (1, something_else): this leads to the erronous
# case where a vector of zeros is reduced by binary reductions
# of (1, 0), which always ends up as 1 (ie. the result for
# the c version, for the product of [0,0,0], is 1.0)
identity = 0.
commutative = True
associative = True
def impl(self, *inputs):
if inputs[0] == 0.:
return inputs[1]
if inputs[1] == 0.:
return inputs[0]
return inputs[1] * inputs[2]
def impl(self, x, y):
if x == 0:
return y
if y == 0:
return x
return x*y
def c_code(self, node, name, (x,y), (z, ), sub):
return ("%(z)s = ((%(x)s == 0) ? (%(y)s) : " + \
"((%(y)s == 0) ? (%(x)s) : ((%(y)s)*(%(x)s))) );") % locals()
def c_code_cache_version(self):
return (1,)
mul_without_zeros = MulWithoutZeros(scalar.upcast_out, name = 'mul_without_zeros')
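Why the identity must be 0 rather than 1 can be checked with a pure-Python reduction mirroring the `impl` above:

```python
from functools import reduce

def mul_without_zeros(x, y):
    # same semantics as MulWithoutZeros.impl above
    if x == 0:
        return y
    if y == 0:
        return x
    return x * y

# with identity 0, an all-zero row correctly reduces to 0
all_zero = reduce(mul_without_zeros, [0., 0., 0.], 0.)
# with identity 1, the same row would wrongly reduce to 1
wrong = reduce(mul_without_zeros, [0., 0., 0.], 1.)
# and a row with a single zero yields the product of the other elements
single = reduce(mul_without_zeros, [0., 5., 2.], 0.)
```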
class ProdWithoutZeros(CAReduce):
......@@ -1263,4 +1338,3 @@ class ProdWithoutZeros(CAReduce):
return "ProdWithoutZeros"
else:
return "ProdWithoutZeros{%s}" % ", ".join(map(str, self.axis))
......@@ -595,5 +595,5 @@ def computeH(V,W,b,d):
return H
from . import ConvGrad3D
from . import ConvTransp3D
import ConvGrad3D
import ConvTransp3D
......@@ -90,6 +90,10 @@ def broadcast_like(value, template, env):
if template not in shape_of:
raise NotImplementedError('broadcast_like currently requires the template Variable to be in the env already')
rval = T.alloc(T.cast(value, template.dtype), *shape_of[template])
# the template may have 1s in its shape without being broadcastable
if rval.broadcastable != template.broadcastable:
rval = T.unbroadcast(rval, *[i for i in xrange(rval.ndim) if rval.broadcastable[i]
and not template.broadcastable[i]])
assert rval.type == template.type
return rval
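The axis selection fed to `T.unbroadcast` above can be sketched as a standalone helper (hypothetical name, for illustration only): it picks the dimensions where the alloc result came out broadcastable but the template, which may have 1s in its shape without being broadcastable, is not.

```python
def axes_to_unbroadcast(rval_broadcastable, template_broadcastable):
    """Axes where the alloc result is broadcastable but the template is not."""
    return [i for i, (r, t) in enumerate(zip(rval_broadcastable,
                                             template_broadcastable))
            if r and not t]

# a template with shape (1, 5) need not be broadcastable on axis 0
axes = axes_to_unbroadcast((True, False), (False, False))
```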
......@@ -663,14 +667,20 @@ def local_fill_to_alloc(node):
elif v.type.broadcastable == node.outputs[0].type.broadcastable:
# this is a cast
rval = [T.cast(v, node.outputs[0].type.dtype)]
elif r.type.broadcastable == node.outputs[0].type.broadcastable:
# we are broadcasting v somehow, but not r
rval = [broadcast_like(v, r, node.env)]
else:
# we are broadcasting v somehow
shape_of = node.env.shape_feature.shape_of
# we are broadcasting both v and r,
# the output shape must be computed
#
# TODO: implement this case (including a test!)
#
# I think the strategy should be to extend the shorter shape vector
# with 1s (how?) and then take the elementwise max of the two.
# - how to flag an error of shape mismatch where broadcasting should be illegal?
return
# TODO: cut out un-necessary dimshuffles of v
rval = [T.alloc(T.cast(v, node.outputs[0].dtype), *shape_of[node.outputs[0]])]
#if rval[0].type != node.outputs[0].type:
#print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
assert rval[0].type == node.outputs[0].type, ('rval', rval[0].type,
'orig', node.outputs[0].type,
......@@ -764,10 +774,12 @@ def local_subtensor_make_vector(node):
@gof.local_optimizer([T.Elemwise])
def local_useless_elemwise(node):
"""
eq(x,x) -> 1
neq(x,x) -> 0
mul(x) -> x
add(x) -> x
eq(x,x) -> 1
neq(x,x) -> 0
mul(x) -> x
add(x) -> x
identity(x) -> x
"""
if isinstance(node.op, T.Elemwise):
......@@ -783,6 +795,8 @@ add(x) -> x
return [node.inputs[0]]
if node.op.scalar_op == theano.scalar.add and len(node.inputs)==1:
return [node.inputs[0]]
if node.op.scalar_op == theano.scalar.identity and len(node.inputs)==1:
return [node.inputs[0]]
@register_specialize
......@@ -2255,8 +2269,7 @@ def local_mul_specialize(node):
neg ^= True #toggles
elif N.all(y == 0.0):
# if we find any zero, we just return right away
return [T.alloc(numpy.asarray(0, dtype=node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(0, node.outputs[0], node.env)]
else:
new_inputs.append(input)
......@@ -2273,21 +2286,14 @@ def local_mul_specialize(node):
else:
rval = T.mul(*new_inputs)
return [T.alloc(T.cast(rval, node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(rval, node.outputs[0], node.env)]
else:
# there are no variable inputs to mul
# N.B. this could have been constant-folded...
if neg:
# return output's worth of -1
return [T.alloc(
numpy.asarray(-1, dtype=node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(-1, node.outputs[0], node.env)]
else:
# return output's worth of 1
return [T.alloc(
numpy.asarray(1, dtype=node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(1, node.outputs[0], node.env)]
register_specialize(local_mul_specialize)
......
"""
This file implement specialization optimization that break the canonicalization form
This file implements specialization optimizations that break the canonical form of the graph.
Currently there is a problem with the order of optimizations and the definition of a canonical graph.
Right now there is a canonicalization optimization phase that tries to make all equivalent graphs identical. This is not always achieved, but it does canonicalize many of the basic cases. We need to extend the definition of canonicalization to make this true more often.
The problem this file intends to fix in the future is that in the "Equilibrium" specialization optimization phase, some optimizations require that the graph is canonical, some require that it is not, and some break the canonical form. As we can't control the order of those optimizations, there are cases where an optimization requesting a canonical graph won't be applied because an optimization that breaks the canonical form executed before it.
To fix this, we need to split the specialization phase into a phase where optimizations can't break the canonical form and one where this is allowed. This is also needed for the stabilization optimization phase, but as it happens before the specialization phase, it causes fewer problems.
Also, we should make the env refuse optimizations that break the canonical form of the graph in the optimization phases where the graph is supposed to be canonical.
"""
# TODO: intelligent merge for mul/add
......@@ -30,7 +40,7 @@ from theano import scalar as scal
class MaxAndArgmaxOptimizer(Optimizer):
"""Replace MaxAndArgmax by CAReduce when the argmax is not used
This is faster as MaxAndArgmax don't have c code and execute it
This is faster as MaxAndArgmax don't have c code and execute it
in two pass.
"""
......@@ -70,7 +80,7 @@ def local_max_to_min(node):
This is tested in tensor/tests/test_basic.py:test_min_max
:note: we don't need an opt that will do the reverse as by default
:note: we don't need an opt that will do the reverse as by default
the interface put only MaxAndArgmax into the graph.
"""
if node.op == T.neg and node.inputs[0].owner:
......@@ -81,5 +91,3 @@ def local_max_to_min(node):
return [CAReduce(scal.minimum,max.owner.op.axis)(neg.owner.inputs[0])]
return False
......@@ -104,9 +104,9 @@ class test_Broadcast(unittest.TestCase):
xv = numpy.asarray(numpy.random.rand(*xsh))
yv = numpy.asarray(numpy.random.rand(*ysh))
zv = xv + yv
f(xv, yv)
assert xv.shape==zv.shape
def test_perform(self):
......@@ -217,11 +217,11 @@ class test_CAReduce(unittest.TestCase):
f(xv)
except ValueError:
pass
else:
else:
self.fail()
else:
self.failUnless((numpy.abs(f(xv) - zv) < 1e-10).all())
#test CAReduce.infer_shape
#the Shape op don't implement c_code!
......@@ -248,7 +248,7 @@ class test_CAReduce(unittest.TestCase):
self.with_linker(gof.CLinker(), maximum)
self.with_linker(gof.CLinker(), minimum)
#need other dtype then real
#need other dtype then real
#no c_code for or_, and_
#self.with_linker(gof.CLinker(), or_)
#self.with_linker(gof.CLinker(), and_)
......@@ -258,23 +258,28 @@ class test_Prod(unittest.TestCase):
def setUp(self):
unittest_tools.seed_rng()
# we want to allow nans in the matrices, so we disable this DEBUG_MODE check
mode = theano.compile.mode.get_default_mode()
mode = copy(mode)
mode.check_isfinite = False
self.mode = mode
def test_verify_grad(self):
# including zeros, as the case with zeros is important
# (and special cases: 1 zero in the row, more than 1 zero in the row)
x_val = numpy.asarray([[1,2,3],[4,5,6],[7,8,9]], dtype='float32')
x = theano.tensor.dmatrix()
# now with verify_grad
unittest_tools.verify_grad(Prod(axis=1), [x_val])
unittest_tools.verify_grad(Prod(axis=1), [x_val], mode=self.mode)
# second time, with some added complexity
# verify_grad takes the sum of the matrices anyway
def fn(x2):
return theano.tensor.sqr(Prod(axis=1)(x2))
unittest_tools.verify_grad(fn, [x_val])
unittest_tools.verify_grad(fn, [x_val], mode=self.mode)
def test_verify_grad_with_zeros(self):
......@@ -287,18 +292,18 @@ class test_Prod(unittest.TestCase):
x2 = theano.tensor.dmatrix()
p = Prod(axis=1)(x)
p2 = Prod(axis=1)(x2)
fn = theano.function([x,x2],[p-p2])
fn = theano.function([x,x2],[p-p2], mode=self.mode)
#print "hand computed diff for each row"
x2_val = numpy.asarray([[1., 2., 3.003], [0.003,5.,6], [0.,0.,9.01]])
#print fn(x_val, x2_val)
fn2 = theano.function([x],[theano.tensor.grad(p.sum(),x)])
fn2 = theano.function([x],[theano.tensor.grad(p.sum(),x)], mode=self.mode)
#print "real grad"
#print fn2(x_val)
fn3 = theano.function([x],[p])
fn3 = theano.function([x],[p], mode=self.mode)
assert numpy.allclose(fn3(x_val), [6.,0.,0.])
# now with verify_grad
unittest_tools.verify_grad(Prod(axis=1), [x_val])
unittest_tools.verify_grad(Prod(axis=1), [x_val], mode=self.mode)
# second time, with some added complexity
# verify_grad takes the sum of the matrices anyway
......@@ -318,11 +323,11 @@ class test_Prod(unittest.TestCase):
x = theano.tensor.dmatrix()
x_val = numpy.array([[1,2,3],[0,5,6],[0,0,9]], dtype='float32')
pwz = ProdWithoutZeros(axis=1)(x)
fn = theano.function([x], pwz)
fn = theano.function([x], pwz, mode=self.mode)
assert numpy.allclose(fn(x_val), [6,30,9])
pwz_a0 = ProdWithoutZeros(axis=0)(x)
fn_a0 = theano.function([x], pwz_a0)
fn_a0 = theano.function([x], pwz_a0, mode=self.mode)
assert numpy.allclose(fn_a0(x_val), [1, 10, 162])
def test_other_grad_tests(self):
......@@ -333,24 +338,32 @@ class test_Prod(unittest.TestCase):
p = Prod(axis=1)
grad_p = theano.tensor.grad(p(x).sum(), x)
grad_fn = theano.function([x], grad_p)
grad_fn = theano.function([x], grad_p, mode=self.mode)
assert numpy.allclose(grad_fn(x_val1), [[6.,3.,2.],[30.,0.,0.],[0.,0.,0.]])
assert numpy.allclose(grad_fn(x_val2), [[0., 0., 2.], [30., 0., 0.], [72., 63., 56.], [0., 0., 90.]])
p_axis0 = Prod(axis=0)
grad_p_axis0 = theano.tensor.grad(p_axis0(x).sum(), x)
grad_fn_axis0 = theano.function([x], grad_p_axis0)
grad_fn_axis0 = theano.function([x], grad_p_axis0, mode=self.mode)
assert numpy.allclose(grad_fn_axis0(x_val2), [[0., 400., 0.],[63., 160., 0.], [0., 100., 0.], [0., 80., 0.]])
tensor.verify_grad(p, [x_val1], rng=rng)
tensor.verify_grad(p, [x_val1], rng=rng, mode=self.mode)
def test_mul_without_zeros_zeros(self):
a = numpy.zeros((3,3))
x = theano.tensor.dmatrix()
mul1 = ProdWithoutZeros(axis=0)(x)
fn_debug = theano.function([x], mul1, mode=self.mode)
fn_debug(a)
if __name__ == '__main__':
unittest.main()
#suite = unittest.TestSuite([test_Prod('test_verify_grad')])
#unittest.main()
suite = unittest.TestSuite([test_Prod('test_mul_without_zeros_zeros')])
#suite.addTest(test_Prod('test_verify_grad_with_zeros'))
#suite.addTest(test_Prod('test_prod_without_zeros'))
#suite.addTest(test_Prod('test_other_grad_tests'))
#unittest.TextTestRunner().run(suite)
unittest.TextTestRunner().run(suite)
......@@ -1039,6 +1039,34 @@ class T_Scan(unittest.TestCase):
assert updates[b].type.ndim == b.type.ndim
def test_scan_as_tensor_on_gradients(self):
"""
Bug reported by cityhall on scan when computing the gradients
"""
to_scan = theano.tensor.dvector('to_scan')
seq = theano.tensor.dmatrix('seq')
f1 = theano.tensor.dscalar('f1')
def scanStep(prev, seq, f1):
return prev + f1 * seq
scanned, _ = theano.scan(fn = scanStep, \
sequences = [seq], \
outputs_info = [to_scan], \
non_sequences = [f1])
f_scan = theano.function(inputs=[to_scan, seq, f1], outputs=scanned)
f_scan([1,2,3], numpy.arange(12).reshape([4,3]), 1.)
t_grad = theano.tensor.grad(scanned.sum(), wrt=[to_scan, f1],
consider_constant=[seq])
f_grad = theano.function(inputs=[to_scan, seq, f1], outputs=t_grad)
f_scan([1,2,3], numpy.arange(12).reshape([4,3]), 1.)
f_grad([1,2,3], numpy.arange(12).reshape([4,3]), 1.)
if __name__ == '__main__':
unittest.main()
""" test code snipet in the Theano tutorials.
""" test code snippet in the Theano tutorials.
"""
import unittest
......