Commit 134658d7 authored by James Bergstra

merge

@@ -37,8 +37,9 @@ Roughly in order of what you'll want to check out:
* :ref:`optimizations` -- Guide to Theano's graph optimizations.
* :ref:`extending` -- Learn to add a Type, Op, or graph optimization.
* :ref:`internal` -- How to maintain Theano, LISA-specific tips, and more...
* `API <api/>`_ -- The automatically-generated API
You can download the latest `PDF documentation <http://deeplearning.net/theanodoc/theano.pdf>`_, rather than reading it online.
Community
=========
@@ -47,7 +48,7 @@ Community
* Register and post to `theano-dev`_ if you want to talk to the developers.
* We try to stay organized with `Theano's Trac <http://trac-hg.assembla.com/theano/report/1>`__
* Come visit us in Montreal! Most of the developers are students in the LISA_ group at the `University of Montreal`_.
@@ -70,8 +71,6 @@ Community
LICENSE
.. _theano-dev: http://groups.google.com/group/theano-dev
.. _theano-users: http://groups.google.com/group/theano-users
.. _tickets: http://pylearn.org/theano/trac/query?status=accepted&status=assigned&status=new&status=reopened&group=milestone&max=200&col=id&col=summary&col=status&col=owner&col=type&col=priority&col=component&col=time&report=9&order=priority
...
@@ -20,7 +20,7 @@ to be installed:
We develop mainly on 64-bit Linux machines. 32-bit architectures are
not well-tested.
python >= 2.5 (2.4 should be supported as well)
`numpy <http://numpy.scipy.org/>`_ >= 1.2
Earlier versions have memory leaks.
@@ -30,6 +30,8 @@ to be installed:
is buggy in 0.6. (scipy.csc_matrix dot has a bug with singleton
dimensions. There may be more bugs.)
A BLAS installation (with Level 3 functionality)
The following libraries and software are optional:
g++, python-dev
@@ -42,41 +44,49 @@ The following libraries and software are optional:
`mercurial <http://www.selenic.com/mercurial/>`_
    To download the bleeding-edge version of Theano.
.. _install_bleeding_edge:
Getting the code
-----------------
If you are a developer of Theano, then check out the :ref:`dev_start_guide` guide.

The following are general instructions that will set you up with the bleeding-edge
version of Theano. First, get the code using `mercurial <http://www.selenic.com/mercurial/wiki/>`__:

.. code-block:: bash

    hg clone http://hg.assembla.com/theano Theano
Configuring PYTHONPATH
----------------------
The subdirectory Theano/theano has to be located in a path
mentioned in your PYTHONPATH. In order to do that, you can either
create a symbolic link to Theano/theano in a directory already
mentioned in your PYTHONPATH environment variable, or modify the
PYTHONPATH so that it mentions Theano.
To create a symbolic link:

.. code-block:: bash

    ln -s Theano/theano <someplace on your PYTHONPATH>/theano
To modify the environment variable PYTHONPATH in bash, you may do this:

.. code-block:: bash

    export PYTHONPATH=<path to Theano's parent dir>/Theano:$PYTHONPATH

In csh:

.. code-block:: csh

    setenv PYTHONPATH <path to Theano's parent dir>/Theano:$PYTHONPATH
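As a quick sanity check on the path setup above, the way the import system locates a package can be mimicked in a few lines. This is a hypothetical helper for illustration only, not part of Theano:

```python
import os
import sys

def find_package(name, search_paths=None):
    """Return the directory providing package `name`, searching the way the
    import system does: the first path entry containing name/__init__.py."""
    for entry in (search_paths if search_paths is not None else sys.path):
        candidate = os.path.join(entry or '.', name)
        if os.path.isfile(os.path.join(candidate, '__init__.py')):
            return candidate
    return None
```

After configuring PYTHONPATH as above, ``find_package('theano')`` should return the ``Theano/theano`` subdirectory; ``None`` means the interpreter will not find Theano either.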
Configuring Theano's environment variables
------------------------------------------
Two environment variables are used to control automatic code
generation. It is possible to use Theano in a way which avoids all
@@ -118,6 +128,33 @@ automatic code generation, but that way is much, much slower.
Omitting this variable defaults the mode to ``'FAST_RUN'``.
Testing your installation
---------------------------
Once you have completed these steps, you should run the Theano test suite like this:
.. code-block:: bash
    cd Theano
    nosetests  # execute all the tests
All tests should pass. If some test fails on your machine, you are
encouraged to tell us what went wrong on the ``theano-users@googlegroups.com``
mailing list.
Updating
-------------
To update your library to the latest revision, change directory (``cd``)
to your ``Theano`` folder and execute the following command:
.. code-block:: bash
    hg pull -u
You should update frequently; bugs are fixed on a very regular basis.
Mac
---
@@ -126,20 +163,21 @@ Mac

-

  .. code-block:: bash

      $ sudo port install gcc44 py25-zlib py25-numpy py25-scipy mercurial

Note that compiling gcc takes a significant time (hours) so it is probably
not the best solution if you are in a rush! It may happen that SciPy
fails to compile the first time and still compiles just fine on a second
try. Same thing with py25-zlib.
- scipy depends on ATLAS (a BLAS library), which will be installed by MacPorts.
- Set ``THEANO_BLAS_LDFLAGS`` to something which will link against said BLAS
  library. E.g., ``THEANO_BLAS_LDFLAGS='-lcblas -latlas -lgfortran'``.

These installation instructions have not been tested recently, so please inform us of your results!
We would be especially interested in dependencies that we missed listing, as well as tests
that fail on your platform (use the ``theano-users@googlegroups.com`` mailing list).
Windows
@@ -240,16 +278,19 @@ but this has not been tested yet.
``export PYTHONPATH=$PYTHONPATH:$HOME/Theano``.
- Please note that at this time, some tests (launched using ``nosetests``) are
  still failing under Windows: we are working on fixing them.
  It may also happen that many tests fail while running the test suite,
  due to insufficient memory resources: one workaround is to run nosetests
  multiple times under individual subdirectories.
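The per-subdirectory workaround can be scripted. A minimal sketch that only builds the command lines (it assumes ``nosetests`` is on your PATH and that you run the commands yourself, e.g. via ``subprocess``):

```python
import os

def per_dir_test_commands(root):
    """Build one `nosetests` invocation per immediate subdirectory of `root`,
    so each run happens in a fresh process and uses less memory."""
    return [['nosetests', os.path.join(root, d)]
            for d in sorted(os.listdir(root))
            if os.path.isdir(os.path.join(root, d))]
```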
Generating the documentation
----------------------------
You can read the latest HTML documentation `here
<http://deeplearning.net/theanodoc>`__.
You can download the latest PDF documentation `here
<http://deeplearning.net/theanodoc/theano.pdf>`__.
We recommend you look at the documentation on the website, since it
will be more current than the documentation included with the package.
...
@@ -21,11 +21,10 @@ Developer Start Guide
Accounts
========
To obtain developer access: register with `Assembla
<http://www.assembla.com/>`_ and add yourself as a watcher on the `Theano space
<http://www.assembla.com/spaces/theano>`_. Then send an email to an admin asking
to be promoted to a member of the project.
Theano code
@@ -34,10 +33,9 @@ Theano code
*To get the source via mercurial,* you must have `mercurial
<http://www.selenic.com/mercurial/wiki/>`__ installed.
The code that makes up Theano is in a `single repository
<http://www.assembla.com/spaces/theano/trac_mercurial_tool>`__. As a developer,
you should clone this repository like this:

.. code-block:: bash
@@ -121,9 +119,6 @@ to your ``Theano`` folder and execute the following command:

    hg pull -u
Nightly test
============
...
@@ -129,7 +129,8 @@ Getting started
the :ref:`tutorial` first though.
A PDF version of the online documentation may be found `here
<http://deeplearning.net/theanodoc/theano.pdf>`_.
Contact us
...
@@ -331,6 +331,8 @@ Indexing
Basic indexing.
Mirrors numpy's `basic indexing <http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html>`_. Read that page first.
Advanced indexing.
.. _libdoc_tensor_elementwise:
...
@@ -40,10 +40,10 @@ This is a sort of memo for developers and would-be developers.
.. _mercurial: http://www.selenic.com/mercurial/wiki/
.. _nosetests: http://somethingaboutorange.com/mrl/projects/nose/
.. _numpy: http://numpy.scipy.org/
.. _python: http://www.python.org
.. _scipy: http://scipy.org/
.. _autodiff: http://www.autodiff.org
.. _boost.python: http://www.boost.org/doc/libs/1_38_0/libs/python/doc/index.html
.. _cython: http://www.cython.org/
.. _liboil: http://liboil.freedesktop.org/wiki/
...
@@ -41,9 +41,10 @@ details about these building blocks see :ref:`variable`, :ref:`op`,

.. figure:: apply.png
   :align: center
   Arrows represent references to the Python objects pointed at. The blue
   box is an :ref:`apply` node. Red boxes are :ref:`variable` nodes. Green
   circles are :ref:`Ops <op>`. Purple boxes are :ref:`Types <type>`.
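The reference structure in the figure can be sketched with plain Python objects. This is a simplified model for intuition only, not Theano's actual classes:

```python
class Type(object):
    """Describes the kind of data a Variable may hold (purple boxes)."""
    def __init__(self, name):
        self.name = name

class Op(object):
    """The operation an Apply node performs (green circles)."""
    def __init__(self, name):
        self.name = name

class Variable(object):
    """A value node; `owner` is the Apply that computes it, or None
    for graph inputs (red boxes)."""
    def __init__(self, type, owner=None):
        self.type = type
        self.owner = owner

class Apply(object):
    """Application of an Op to input Variables (blue box); its outputs
    point back to it through their `owner` field."""
    def __init__(self, op, inputs):
        self.op = op
        self.inputs = inputs
        # simplifying assumption: one output, same Type as the first input
        self.outputs = [Variable(inputs[0].type, owner=self)]
```

Traversing from an output, following ``owner`` and then ``inputs`` repeatedly, walks the whole graph back to its inputs.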
The graph can be traversed starting from outputs (the result of some
@@ -104,7 +105,7 @@ how to compute the gradient of the node's outputs with respect to its
inputs. Note that if an :ref:`op` does not provide this information,
it is assumed that the gradient is not defined. Using the
`chain rule <http://en.wikipedia.org/wiki/Chain_rule>`_,
these gradients can be composed in order to obtain the expression of the
gradient of the graph's output with respect to the graph's inputs.
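A numeric illustration of that composition (nothing Theano-specific; f and g here are arbitrary example functions):

```python
import math

def g(x):  return x * x        # inner function, g'(x) = 2x
def dg(x): return 2.0 * x
def f(u):  return math.sin(u)  # outer function, f'(u) = cos(u)
def df(u): return math.cos(u)

def grad_composed(x):
    """d/dx f(g(x)) = f'(g(x)) * g'(x), by the chain rule."""
    return df(g(x)) * dg(x)
```

The result agrees with a central finite-difference estimate of d/dx sin(x^2), which is exactly the kind of check one uses to validate an Op's `grad` implementation.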
...
@@ -29,9 +29,10 @@ class ConvOp(Op):
#TODO: make the stacksize its own parameter, and make imshp a pair
def __init__(self, imshp=None, kshp=None, nkern=None, bsize=None, dx=None, dy=None, output_mode='valid',
             unroll_batch=0,
             unroll_kern=0,
             unroll_patch=False,
             imshp_logical=None,
             kshp_logical=None,
             kshp_logical_top_aligned=True,
@@ -47,6 +48,7 @@ class ConvOp(Op):
dx - patch stride rows
dy - patch stride cols
out_mode - 'valid', 'full'
unroll_patch - c code generation option
unroll_batch - c code generation option
unroll_kern - c code generation option
verbose - passed to GpuConv
@@ -60,6 +62,7 @@ class ConvOp(Op):
gradient on the filters.
unroll_patch. If True, will use a version that unrolls the patch loop, which is faster than leaving it rolled.
unroll_batch. If >0 will use a version that will unroll the batch loop by the value of the option. By default don't use this version of the code.
unroll_nkern. Same as unroll_batch, but unrolls the kernel loop.
@@ -95,6 +98,7 @@ class ConvOp(Op):
self.unroll_batch=unroll_batch
self.unroll_kern=unroll_kern
self.unroll_patch=unroll_patch
if self.unroll_batch>0 and self.bsize % self.unroll_batch!=0:
    if self.bsize<=self.unroll_batch:
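The check above enforces that ``bsize`` be a multiple of ``unroll_batch``. The fallback handling is cut off by the hunk, so the following is a hypothetical stand-alone sketch of one reasonable policy, not the actual code in ``ConvOp.__init__``:

```python
def resolve_unroll(size, unroll):
    """Pick a usable unroll factor: keep `unroll` when it divides `size`;
    otherwise unroll the whole loop if it is small enough, or give up.
    Hypothetical helper, for illustration only."""
    if unroll <= 0:
        return 0                 # unrolling disabled
    if size % unroll == 0:
        return unroll            # requested factor divides the loop count
    if size <= unroll:
        return size              # unroll the entire loop instead
    return 0                     # no clean factor: fall back to no unrolling
```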
@@ -407,6 +411,7 @@ using namespace std;
d["self_imshp0"]=self.imshp[0]
d["self_imshp1"]=self.imshp[1]
d["self_imshp2"]=self.imshp[2]
d["mode"]=self.out_mode.upper()
d["self_kshp0"]=self.kshp[0]
d["self_kshp1"]=self.kshp[1]
d["self_kshp_logical_r"] = self.kshp_logical[0]
@@ -439,8 +444,12 @@ using namespace std;
#print self.out_mode, d["self_imshp_logical_stride_r"]
if self.imshp != self.imshp_logical or self.kshp != self.kshp_logical:
    # print "return imshp!=imshp_logical or self.kshp != self.kshp_logical shape version"
    return _conv_op_code_a % d
if self.unroll_patch:
    # print "return unroll patch version",self.dx,self.dy
    return _conv_op_code_unroll_patch % d
if self.unroll_batch>0 or self.unroll_kern>0:
    if self.unroll_batch<=0: self.unroll_batch=1
    if self.unroll_kern<=0: self.unroll_kern=1
@@ -1212,3 +1221,295 @@ Py_XDECREF(img2d);
Py_XDECREF(filtersflipped);
"""
return ret
_conv_op_code_unroll_patch = """
const int mode=%(mode)s;
int typenum=0, typenum_f=0;
PyArrayObject *ain1=NULL, *ain2=NULL, *filtersflipped_arr=NULL, *img2d_arr=NULL;
const %(type)s fill_value = 0;
int type_im=PyArray_TYPE(%(img2d)s);
int type_ker=PyArray_TYPE(%(filtersflipped)s);
npy_intp dim_zz[2]={%(self_outshp0)s,%(self_outshp1)s};
npy_intp dim_im[2]={%(self_imshp1)s,%(self_imshp2)s};
npy_intp dim_ker[2]={%(self_kshp0)s,%(self_kshp1)s};
PyArray_Dims img2d_shape;
npy_intp img2d_dim[4]={1,1,0,0};
img2d_shape.ptr=img2d_dim;
img2d_shape.len=4;
PyArray_Dims kerns_shape;
npy_intp kerns_dim[4]={1,1,0,0};
kerns_shape.ptr=kerns_dim;
kerns_shape.len=4;
PyObject *img2d=NULL, *contig, *filtersflipped=NULL;
if(%(img2d)s->nd==2){
img2d_dim[3]=%(img2d)s->dimensions[1];
img2d_dim[2]=%(img2d)s->dimensions[0];
}else if(%(img2d)s->nd==3){
img2d_dim[3]=%(img2d)s->dimensions[2];
img2d_dim[2]=%(img2d)s->dimensions[1];
img2d_dim[0]=%(img2d)s->dimensions[0];
}else if(%(img2d)s->nd==4){
img2d_dim[3]=%(img2d)s->dimensions[3];
img2d_dim[2]=%(img2d)s->dimensions[2];
img2d_dim[1]=%(img2d)s->dimensions[1];
img2d_dim[0]=%(img2d)s->dimensions[0];
}else {
PyErr_SetString(PyExc_ValueError, "img doesn't have a good shape");
%(fail)s;
}
if(%(filtersflipped)s->nd==3){
kerns_dim[3]=%(filtersflipped)s->dimensions[2];
kerns_dim[2]=%(filtersflipped)s->dimensions[1];
kerns_dim[0]=%(filtersflipped)s->dimensions[0];
}else if(%(filtersflipped)s->nd==4){
kerns_dim[3]=%(filtersflipped)s->dimensions[3];
kerns_dim[2]=%(filtersflipped)s->dimensions[2];
kerns_dim[1]=%(filtersflipped)s->dimensions[1];
kerns_dim[0]=%(filtersflipped)s->dimensions[0];
}else{
std::stringstream temp;
temp << "nddim="<<%(filtersflipped)s->nd;
std::string param = temp.str();
PyErr_SetString(PyExc_ValueError,
  ("kernel doesn't have a good shape. " + param).c_str());
%(fail)s;
}
img2d = PyArray_Newshape(%(img2d)s,&img2d_shape, PyArray_CORDER);
img2d_arr = (PyArrayObject*)img2d;
if ((img2d_arr->strides[3] != sizeof(%(type)s))
|| (img2d_arr->strides[2] != img2d_arr->dimensions[3]*sizeof(%(type)s))){
contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)img2d));
Py_DECREF(img2d);
img2d = contig;
if (!PyArray_ISCONTIGUOUS(img2d)){
PyErr_SetString(PyExc_ValueError, "img2d isn't contiguous");
%(fail)s;
}
}
img2d_arr = (PyArrayObject*)img2d;
filtersflipped = PyArray_Newshape(%(filtersflipped)s,&kerns_shape, PyArray_CORDER);
filtersflipped_arr = (PyArrayObject*)filtersflipped;
if ((filtersflipped_arr->strides[3] != sizeof(%(type)s))
|| (filtersflipped_arr->strides[2] != filtersflipped_arr->dimensions[3]*sizeof(%(type)s))){
contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)filtersflipped));
Py_DECREF(filtersflipped);
filtersflipped = contig;
if (!PyArray_ISCONTIGUOUS(filtersflipped)){
PyErr_SetString(PyExc_ValueError, "filtersflipped isn't contiguous");
%(fail)s;
}
}
filtersflipped_arr = (PyArrayObject*)filtersflipped;
if(mode != VALID && mode != FULL){
PyErr_SetString(PyExc_ValueError, "invalid mode, only full and valid are supported"); %(fail)s;
}
typenum = PyArray_ObjectType((PyObject*)%(img2d)s, 0);
typenum_f = PyArray_ObjectType((PyObject*)%(filtersflipped)s, 0);
if (typenum < 0) {PyErr_SetString(PyExc_ValueError, "Invalid type"); %(fail)s;}
if (typenum != typenum_f) {PyErr_SetString(PyExc_ValueError, "Input types must match"); %(fail)s;}
if (!img2d) %(fail)s;
if (!filtersflipped) %(fail)s;
if ((!%(z)s)
|| %(z)s->nd != 4
||(%(z)s->dimensions[0] != %(self_bsize)s)
||(%(z)s->dimensions[1] != %(self_nkern)s)
||(%(z)s->dimensions[2] != dim_zz[0])
|| (%(z)s->dimensions[3] != dim_zz[1])
)
{
if (%(z)s) Py_DECREF(%(z)s);
npy_intp dims[4] = {0,0,0,0};
dims[0]=%(self_bsize)s;
dims[1]=%(self_nkern)s;
dims[2]=dim_zz[0];
dims[3]=dim_zz[1];
%(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0);
}else{
//PyArray_FILLWBYTE((PyObject*)%(z)s,0);
}
int Os[2];
Os[0]=%(self_outshp0)s;
Os[1]=%(self_outshp1)s;
//I keep the formula to calculate Os in case we need it in the future.
//if (mode == FULL) {Os[0] = (int)ceil((dim_im[0]+dim_ker[0]-1)/float(%(self_dx)s)); Os[1] = ceil((dim_im[1]+dim_ker[1]-1)/float(%(self_dy)s));}
//else {Os[0] = (int)ceil((dim_im[0]-dim_ker[0]+1)/float(%(self_dx)s)); Os[1] = (int)ceil((dim_im[1]-dim_ker[1]+1)/float(%(self_dy)s));}
for(int b=0;b< %(self_bsize)s;b++){
for(int n_kern=0;n_kern<%(self_nkern)s;n_kern++){
//assertions
if (%(z)s->strides[0] != %(z)s->dimensions[1] *%(z)s->dimensions[2] *%(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
if (%(z)s->strides[1] != %(z)s->dimensions[2] * %(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
if (%(z)s->strides[2] != %(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
if (%(z)s->strides[3] != sizeof(%(type)s)) %(fail)s;
%(type)s * __restrict__ out=(%(type)s *)(PyArray_GETPTR2(%(z)s,b,n_kern));
for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i) out[i] = 0;
for(int stack_size=0;stack_size<%(self_imshp0)s;stack_size++){
const %(type)s * __restrict__ in=(%(type)s *)(PyArray_GETPTR2(img2d,b,stack_size));
const %(type)s * __restrict__ hvals=(%(type)s *)(PyArray_GETPTR2(filtersflipped,n_kern,stack_size));
int new_m;
for (int iter_m=0; iter_m < Os[0]; iter_m++) {
// Reposition index into input image based on requested output size
int pos_m = iter_m*%(self_dx)s;//The position of the patch in the image
if (mode == FULL) new_m = pos_m ;
else new_m = (pos_m+dim_ker[0]-1);
for (int iter_n=0; iter_n < Os[1]; iter_n++) { // loop over columns
int pos_n=iter_n*%(self_dy)s;
%(type)s sum=0;
%(type)s sum2=0;
%(type)s sum3=0;
%(type)s sum4=0;
int nb_sum=0;
// Sum over kernel, if index into image is out of bounds
// fill with the value
for (int j=0; j < dim_ker[0]; j++) {
int ind0 = (new_m-j);
if(mode==FULL){
const %(type)s * idx_hvals=&hvals[j*dim_ker[1]];
if(ind0 < 0 || ind0 >= dim_im[0]){
if(fill_value!=0)
for (int k=0; k < dim_ker[1]; k++) {
sum+= idx_hvals[k] * fill_value;
}
}else{
//do the part where kernel is to the right of the img
//TODO: implement unroll patch for fill_value!=0
int k=0,max_k=max((int)(pos_n-dim_im[1])+1,0);
if(fill_value!=0){
for(k=0;k<max_k;k++){
sum+= idx_hvals[k]*fill_value;
}
}else {k=max_k;}
//do the part where the kernel is on the img
max_k=min(pos_n+1,(int)dim_ker[1]);
const %(type)s * idx_in=&in[ind0*dim_im[1]];
if(iter_n + 4*%(self_dy)s < Os[1]
&& iter_n>dim_ker[1]-1+3
&& iter_n<dim_im[1]-dim_ker[1]+1-3){
nb_sum=4;
//cout<<4<<endl;
for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
sum+=idx_hvals[k]*idx_in[ind1];
sum2+=idx_hvals[k]*idx_in[ind1+%(self_dy)s];
sum3+=idx_hvals[k]*idx_in[ind1+2*%(self_dy)s];
sum4+=idx_hvals[k]*idx_in[ind1+3*%(self_dy)s];
}
}else if(iter_n + 2*%(self_dy)s < Os[1]
&& iter_n>dim_ker[1]-1
&& iter_n<dim_im[1]-dim_ker[1]+1){
//cout<<2<<endl;
nb_sum=2;
// if(iter_n==dim_ker[1]-1){//k-1<min(pos_n+%(self_dy)s,(int)dim_ker[1])){
// sum2+=idx_hvals[k-1]*idx_in[pos_n-k-%(self_dy)s];
// }
for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
sum+=idx_hvals[k]*idx_in[ind1];
sum2+=idx_hvals[k]*idx_in[ind1+%(self_dy)s];
}
// sum2+=idx_hvals[k]*idx_in[pos_n-k+%(self_dy)s];
// sum+=idx_hvals[k]*idx_in[pos_n-k];
// k++;
}else{
//cout<<1<<endl;
nb_sum=1;
/*
%(type)s sum_=0;
if((k-max_k) & 0x1 != 0){
sum+= idx_hvals[k] * idx_in[pos_n-k];
}
for (int ind1=pos_n-k; k<max_k; k+=2,ind1-=2) {
sum+= idx_hvals[k] * idx_in[ind1];
sum_+= idx_hvals[k+1] * idx_in[ind1-1];
}
sum+=sum_;
*/
for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
sum+=idx_hvals[k]*idx_in[ind1];
}
}
//do the part to the left of the img
if(fill_value!=0)
for(;k<dim_ker[1];k++) sum+= idx_hvals[k]*fill_value;
}
}else{//valid mode
const %(type)s* idx_in=&in[ind0*dim_im[1]];
const %(type)s* idx_hvals=&hvals[j*dim_ker[1]];
if(iter_n + 4*%(self_dy)s < Os[1]){
nb_sum=4;
for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
sum+=idx_hvals[k]*idx_in[im_idx];
sum2+=idx_hvals[k]*idx_in[im_idx+%(self_dy)s];
sum3+=idx_hvals[k]*idx_in[im_idx+2*%(self_dy)s];
sum4+=idx_hvals[k]*idx_in[im_idx+3*%(self_dy)s];
}
}else if(iter_n + 2*%(self_dy)s < Os[1]){
nb_sum=2;
for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
sum+=idx_hvals[k]*idx_in[im_idx];
sum2+=idx_hvals[k]*idx_in[im_idx+%(self_dy)s];
}
}else{
nb_sum=1;
for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
sum+=idx_hvals[k]*idx_in[im_idx];
}
}
}//else valid mode
}//for j
switch(nb_sum){
case 4: out[iter_m*dim_zz[1]+iter_n+3] %(affectation)s sum4;
case 3: out[iter_m*dim_zz[1]+iter_n+2] %(affectation)s sum3;
case 2: out[iter_m*dim_zz[1]+iter_n+1] %(affectation)s sum2;
case 1: out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum;
}
iter_n+=nb_sum-1;
/*
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum;
if(nb_sum>=2){
iter_n++;
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum2;
}
if(nb_sum>=3){
iter_n++;
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum3;
}
if(nb_sum>=4){
iter_n++;
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum4;
}
*/
}//for iter_n
}//for iter_m
}//for stack_size
if (0 && (mode==FULL)){
for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i)
std::cout << " " << out[i];
std::cout << "\\n";
}
}//for n_kern
}//for b
Py_XDECREF(img2d);
Py_XDECREF(filtersflipped);
"""
import os, sys
from theano.gof.compiledir import get_compiledir
from theano.compile import optdb
import theano.config as config
import logging, copy
_logger_name = 'theano_cuda_ndarray'
@@ -15,8 +16,34 @@ def debug(*msg):
    _logger.debug(_logger_name+'DEBUG: '+' '.join(str(m) for m in msg))

# Compile type_support.cu
# This needs nvcc (part of CUDA) to be installed. If it is not, a warning is
# printed and this module will not be working properly (we set `enable_cuda`
# to False).
# This variable is True by default, and set to False if something goes wrong
# when trying to initialize cuda.
enable_cuda = True
# Global variable to avoid displaying the same warning multiple times.
cuda_warning_is_displayed = False
# Code factorized within a function so that it may be called from multiple
# places (which is not currently the case, but may be useful in the future).
def set_cuda_disabled():
    """Function used to disable cuda.

    A warning is displayed, so that the user is aware that cuda-based code is
    not going to work.

    Note that there is no point calling this function from outside of
    `cuda.__init__`, since it has no effect once the module is loaded.
    """
    global enable_cuda, cuda_warning_is_displayed
    enable_cuda = False
    if not cuda_warning_is_displayed:
        cuda_warning_is_displayed = True
        warning('Cuda is disabled, cuda-based code will thus not be '
                'working properly')
old_file = os.path.join(os.path.split(__file__)[0],'type_support.so')
if os.path.exists(old_file):
@@ -30,6 +57,10 @@ except ImportError:
import nvcc_compiler

if not nvcc_compiler.is_nvcc_available():
    set_cuda_disabled()

if enable_cuda:
    print __file__
    cuda_path=os.path.split(old_file)[0]
@@ -64,21 +95,20 @@ except ImportError:
from type_support.type_support import *

if enable_cuda:
    from theano.sandbox.cuda.type import CudaNdarrayType
    from theano.sandbox.cuda.var import (CudaNdarrayVariable,
                                         CudaNdarrayConstant,
                                         CudaNdarraySharedVariable,
                                         shared_constructor)
    import basic_ops
    from basic_ops import (GpuFromHost, HostFromGpu, GpuElemwise,
                           GpuDimShuffle, GpuSum, GpuReshape,
                           GpuSubtensor, GpuIncSubtensor, GpuFlatten, GpuShape)
    import opt
    import cuda_ndarray

def use(device=config.THEANO_GPU):
    if use.device_number is None:
...
@@ -19,6 +19,15 @@ def debug(*args):
    #sys.stderr.write('DEBUG:'+ ' '.join(str(a) for a in args)+'\n')
    _logger.debug("DEBUG: "+' '.join(str(a) for a in args))
def is_nvcc_available():
    """Return True iff the nvcc compiler is found."""
    try:
        subprocess.call(['nvcc', '--version'], stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
        return True
    except OSError:
        # nvcc is not on the PATH
        return False
def nvcc_module_compile_str(module_name, src_code, location=None, include_dirs=[], lib_dirs=[], libs=[],
                            preargs=[]):
    """
...
import numpy
import theano
import theano.sandbox.scan
# generator network, only one output , type scalar ; no sequence or
# non sequence arguments
def test_1():
def f_pow2(x_tm1):
return (2*x_tm1, {})
s = theano.tensor.dvector()
n_steps = theano.tensor.dscalar()
Y = theano.sandbox.scan.scan(f_pow2, [],s, [],n_steps = n_steps)
f1 = theano.function([s,n_steps], Y)
    assert( numpy.allclose(f1([1],3), [2,4,8]) )
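The recurrence test_1 exercises can be checked by hand; this is a plain-Python sketch of the same unrolled loop (no Theano involved), collecting every intermediate value the way scan does:

```python
def pow2_scan(x0, n_steps):
    # x_t = 2 * x_{t-1}, collecting every step
    out = []
    x = x0
    for _ in range(n_steps):
        x = 2 * x
        out.append(x)
    return out

print(pow2_scan(1, 3))  # [2, 4, 8]
```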
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars
def test_2():
def f_rnn(u_t,x_tm1,W_in, W):
return (u_t*W_in+x_tm1*W, {})
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
W_in = theano.tensor.dscalar()
W = theano.tensor.dscalar()
Y = theano.sandbox.scan.scan(f_rnn, u,x0,[W_in,W])
f2 = theano.function([u,x0,W_in,W], Y)
    assert( numpy.allclose(f2([1,2,3,4],[1],.1,1), numpy.array([1.1,1.3,1.6,2.])) )
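The expected values in test_2 come from unrolling x_t = u_t*W_in + x_{t-1}*W by hand. A plain-Python sketch of that recurrence, under the same inputs (no Theano involved):

```python
def rnn_scan(u, x0, w_in, w):
    # x_t = u_t * w_in + x_{t-1} * w, seeded with the last initial state
    out = []
    x = x0[-1]
    for u_t in u:
        x = u_t * w_in + x * w
        out.append(x)
    return out

result = rnn_scan([1, 2, 3, 4], [1], 0.1, 1.0)
# approximately [1.1, 1.3, 1.6, 2.0]
```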
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars; using shared variables
def test_3():
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
W_in = theano.shared(.1, name = 'w_in')
W = theano.shared(1., name ='w')
def f_rnn_shared(u_t,x_tm1):
return (u_t*W_in+x_tm1*W, {})
Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0,[])
f3 = theano.function([u,x0], Y)
    assert( numpy.allclose(f3([1,2,3,4],[1]), numpy.array([1.1,1.3,1.6,2.])) )
# some rnn with multiple outputs and multiple inputs; other dimension
# instead of scalars/vectors
def test_4():
W_in2 = theano.shared(numpy.array([1.,2.]), name='win2')
W = theano.shared(numpy.array([[2.,1.],[1.,1.]]), name='w')
W_out = theano.shared(numpy.array([.5,1.]), name = 'wout')
W_in1 = theano.tensor.dmatrix('win')
u1 = theano.tensor.dmatrix('u1')
u2 = theano.tensor.dvector('u2')
x0 = theano.tensor.dmatrix('x0')
y0 = theano.tensor.dvector('y0')
    ## Why doesn't dot work with scalars!?
    ## Why doesn't * support SharedVariable and TensorVariable?
def f_rnn_cmpl(u1_t, u2_t, x_tm1, y_tm1, W_in1):
return ({}, [theano.dot(u1_t,W_in1) + u2_t* W_in2 + \
theano.dot(x_tm1, W), theano.dot(x_tm1, W_out)])
Y = theano.sandbox.scan.scan(f_rnn_cmpl,[u1,u2],[x0,y0],W_in1)
f4 = theano.function([u1,u2,x0,y0,W_in1], Y)
(x,y) = f4( numpy.array([[1,2],[1,2],[1,2]]), \
numpy.array([1,2,3]), \
numpy.array([[0,0]]), \
numpy.array([1]), \
numpy.array([[1,1],[1,1]]))
assert( numpy.all(x == numpy.array([[4.,5.],[18.,16.],[58.,43.]])))
assert( numpy.all(y == numpy.array([0.,7.,25.])))
# basic ESN using updates
def test_5():
W_in = theano.shared(numpy.array([1.,1.]), name='win')
W = theano.shared(numpy.array([[.1,0.],[.0,.1]]),name='w')
W_out= theano.shared(numpy.array([.5,1.]), name='wout')
u = theano.tensor.dvector('u')
x = theano.shared(numpy.array([0.,0.]),'x')
y0 = theano.tensor.dvector('y0')
def f_ESN(u_t):
return ( theano.dot(x,W_out), \
{ x: W_in*u_t + theano.dot(x,W) } )
Y = theano.sandbox.scan.scan(f_ESN,u,y0,[],outputs_taps={0:[]})
f5 = theano.function([u,y0],Y)
    assert( numpy.allclose(f5( numpy.array([1,2,3]), numpy.array([0])), \
            numpy.array([0.,1.5,3.15])) )
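The expected sequence here follows from the update semantics: the output is read from the state *before* the update x := W_in*u_t + dot(x, W) is applied, so the first output is 0 and the second is dot([1,1],[.5,1]) = 1.5 (the value the updated test later in this merge also uses). A plain-Python sketch of that ordering, with the same weights:

```python
def esn_scan(u, w_in, w, w_out):
    # Output is read from the current state, *then* the state update
    # x := w_in * u_t + dot(x, W) is applied, mirroring scan's updates.
    x = [0.0, 0.0]
    out = []
    for u_t in u:
        out.append(w_out[0] * x[0] + w_out[1] * x[1])
        x = [w_in[0] * u_t + x[0] * w[0][0] + x[1] * w[1][0],
             w_in[1] * u_t + x[0] * w[0][1] + x[1] * w[1][1]]
    return out

result = esn_scan([1, 2, 3], [1, 1], [[0.1, 0.0], [0.0, 0.1]], [0.5, 1.0])
# approximately [0.0, 1.5, 3.15]
```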
# basic ESN using updates ; moving backwards
def test_6():
W_in = theano.shared(numpy.array([1.,1.]), name='win')
W = theano.shared(numpy.array([[.1,0.],[.0,.1]]),name='w')
W_out= theano.shared(numpy.array([.5,1.]), name='wout')
u = theano.tensor.dvector('u')
x = theano.shared(numpy.array([0.,0.]),'x')
y0 = theano.tensor.dvector('y0')
def f_ESN(u_t):
return ( theano.dot(x,W_out), \
{ x: W_in*u_t + theano.dot(x,W) } )
Y = theano.sandbox.scan.scan(f_ESN,u,y0,[],outputs_taps={0:[]}, \
go_backwards = True)
f6 = theano.function([u,y0],Y)
    assert( numpy.allclose(f6( numpy.array([1,2,3]), numpy.array([0])), \
            numpy.array([0., 4.5, 3.45])) )
'''
TO TEST:
- test taps (for sequences and outputs )
- test gradient (one output)
- test gradient (multiple outputs)
   - test gradient (go_backwards)
- test gradient (multiple outputs / some uncomputable )
- test gradient (truncate_gradient)
- test gradient (force_gradient)
- test inplace map
'''
if __name__=='__main__':
test_1()
test_2()
test_3()
test_4()
test_5()
test_6()
...@@ -62,17 +62,6 @@ def scan(fn, sequences, initial_states, non_sequences, inplace_map={},
    # compute number of sequences and number of seqs
    n_seqs = len(seqs)
# see if there are outputs that do not feed anything back to the function
# applied recursively
#outs_tapkeys = outputs_taps.keys()
#outs_tapkeys.sort()
#for k in outs_tapkeys:
# if outputs_taps[k] == []:
# # add empty lists where you have outputs that do not have past
# # values
# init_outs = init_outs[:k] + [[]] + init_outs[k:]
    n_outs = len(init_outs)
...@@ -185,7 +174,8 @@ class Scan(theano.Op):
        self.destroy_map = {}
        if inplace:
            for i in inplace_map.keys():
self.destroy_map.update({i: [inplace_map[i]] } )
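The loop above only wraps each in-place target in a list, because Theano's destroy_map maps an output index to a *list* of input indices. The transformation, in isolation, with a hypothetical inplace_map:

```python
# inplace_map: one input index per output; destroy_map wants lists of them.
inplace_map = {0: 0, 2: 1}
destroy_map = {}
for i in inplace_map.keys():
    destroy_map.update({i: [inplace_map[i]]})

print(destroy_map)  # {0: [0], 2: [1]}
```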
        self.seqs_taps = seqs_taps
        self.outs_taps = outs_taps
...@@ -205,11 +195,23 @@ class Scan(theano.Op):
                updates = updates, mode = mode)
        g_y = [outputs[0].type()]
def compute_gradient(y, g_y):
gmap = theano.gradient.grad_sources_inputs( \
[(y,g_y)], theano.gof.graph.inputs([y]), False)
def zero(p):
return theano.tensor.TensorConstant(theano.tensor.TensorType(\
dtype=p.type.dtype, broadcastable=[]),
numpy.asarray(0,dtype = p.type.dtype))
return [gmap.get(p, zero(p)) for p in inputs]
g_args = compute_gradient( outputs[0], g_y[-1])
        # for all outputs compute gradients and then sum them up
        for y in outputs[1:]:
            g_y += [y.type()]
            g_args_y = compute_gradient(y, g_y[-1])
            for i in xrange(len(g_args)):
                g_args[i] += g_args_y[i]
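Reduced to plain numbers, the accumulation above does the following: each output contributes one gradient per input, and contributions are summed input-wise. A standalone sketch of that pattern:

```python
def sum_gradients(per_output_grads):
    # per_output_grads: one list of input-gradients per output,
    # all lists the same length; returns their elementwise sum.
    total = list(per_output_grads[0])
    for grads in per_output_grads[1:]:
        for i in range(len(total)):
            total[i] += grads[i]
    return total

print(sum_gradients([[1.0, 2.0], [0.5, -1.0]]))  # [1.5, 1.0]
```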
...@@ -256,6 +258,7 @@ class Scan(theano.Op):
                (self.n_args == other.n_args)
        return rval

    def __hash__(self):
        return hash(type(self)) ^ \
               hash(self.n_seqs) ^ \
...
...@@ -41,7 +41,7 @@ def flip(kern, kshp):
global_rng = N.random.RandomState(3423489)
dmatrix4=T.TensorType('float64', (False, False, False, False))

def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll_batch=0, unroll_kern=0, img=T.dmatrix(), validate=True, conv_op_py=False, do_convolve2=False, do_print=True, repeat=1, unroll_patch=0):
    # build actual input images
    imgval = global_rng.rand(bsize, imshp[0], imshp[1], imshp[2])
...@@ -121,7 +121,7 @@ def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll
    hidval1=outval.copy()

    # ConvOp
    conv_op = ConvOp(imshp, kshp, nkern, bsize, ss[0],ss[1], conv_mode, unroll_batch=unroll_batch, unroll_kern=unroll_kern, unroll_patch=unroll_patch)(inputs4, kerns4)
    l1shp=N.hstack((nkern,
                    getFilterOutShp(imshp, kshp, ss, conv_mode)))
    propup2 = function([inputs4, kerns4], conv_op)
...@@ -328,7 +328,7 @@ class TestConvOp(unittest.TestCase):
        ssizess = [[(1,1),(1,2)],[(1,1),(2,2)]]
        convmodes = ['valid','full']
        do_convolve2=True
        unroll = [(0,0,False),(0,0,True),(1,1,False),(2,2,False),(3,2,False)]#(batch,kern,patch)
        do_speed_test = False

        # TODO: this version shows a bug that was fixed
...@@ -338,6 +338,11 @@ class TestConvOp(unittest.TestCase):
#        nkerns = [2,2] # per output pixel
#        ssizes = [(1,1),(2,2)]#2,2)]
# bsizes = [1,1] # batch size
# imshp_starts = [(1,10,10),(1,5,6)]
# kshpss = ([[2,3],[3,2]],[[2,2],[2,2]])
# nkernss = [[1,1],[1,1]] # per output pixel
        N.set_printoptions(threshold=N.nan)

        # symbolic stuff
...@@ -356,8 +361,8 @@ class TestConvOp(unittest.TestCase):
        unroll_batch = [1,2,4,5,10,20]
        unroll_kern = [1,2,4,5,10,20]
        unroll_batch = [1,4,5]
        unroll_kern = [1,4,5]
        bsize = 20 # batch size
        imshp_start = (1,48,48) # non-square shape to test more corner cases
...@@ -374,9 +379,17 @@ class TestConvOp(unittest.TestCase):
        timing = N.zeros((len(unroll_batch),len(unroll_kern),3))
        t_b_k=[]

        #calculate the timing with unrolling
t_=[[ 7.60572791, 3.95069814, 3.74271464], [ 4.05631089, 2.90384555, 2.93613672], [ 3.90551591, 2.92595196, 3.00102282]]
best=[]
worst=[]
best=[0.52690219879150391, 2.4266397953033447]
worst=[0.92042708396911621, 6.8822150230407715]
t_=[]
        for unroll_b, n_b in zip(unroll_batch,range(len(unroll_batch))):
            for unroll_k, n_k in zip(unroll_kern,range(len(unroll_kern))):
                t_b_k.append(str(unroll_b)+"/"+str(unroll_k))
                if not t_:
                    tctot, tpytot, ntot=[],[],[]
                    for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                        for ss, n_ss in zip(ssizes,range(len(ssizes))):
...@@ -384,36 +397,68 @@ class TestConvOp(unittest.TestCase):
                            tctot+=[tctot_]
                            tpytot+=[tpytot_]
                            ntot+=[ntot_]
if unroll_b==4 and unroll_k==4:
print "unroll 4/4",tctot
best=tctot
if unroll_b==1 and unroll_k==1:
print "unroll 1/1",tctot
worst=tctot
                timing[n_b,n_k]=[sum(tctot), sum(tpytot), sum(ntot)]
if not t_:
t=timing[:,:,0]#We select only the c timing.
else:
t=t_
t=N.asarray(t)
        #calculate the old timing
        tctot_=[0.52555489540100098, 6.6634182929992676]
#        tctot_=[]
        tctot,tpytot,ntot=[],[],[]
        if not tctot_:
            for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                for ss, n_ss in zip(ssizes,range(len(ssizes))):
                    tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate)
                    tctot+=[tctot_]
                    tpytot+=[tpytot_]
                    ntot+=[ntot_]
        else: tctot=N.asarray(tctot_)
print "old code timing %.3fs"%sum(tctot),tctot
        best=N.asarray(best)
        worst=N.asarray(worst)
        print "timing for unrolled version"
        print t_b_k
        print t
        print "max %.3fs"%t.max(), "max param(batch unloop size/kernel unloop size)", t_b_k[t.argmax()]
        print "min %.3fs"%t.min(), "min param(batch unloop size/kernel unloop size)", t_b_k[t.argmin()]
        print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t.min(),sum(tctot)/t.min())
print worst/best,tctot/best
tctot_patch = []
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizes,range(len(ssizes))):
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=2)
tctot_patch += [tctot_]
t_patch=sum(tctot_patch)
print "unroll_patch time", tctot_patch
print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t_patch,sum(tctot)/t_patch)
print best/tctot_patch, worst/tctot_patch
print best
print worst
print tctot
print tctot_patch
        return

        for i in range(len(kshpss)):
            for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                for ss, n_ss in zip(ssizess[i],range(len(ssizess[i]))):
                    for un_b, un_k, un_p in unroll:
                        tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(
                            conv_mode, ss, bsizes[i], imshp_starts[i],
                            kshpss[i], nkernss[i],
                            img=img, unroll_batch=un_b, unroll_kern=un_k,
                            unroll_patch=un_p,
                            validate=True)
                        tctot+=[tctot_]
                        tpytot+=[tpytot_]
...@@ -428,6 +473,11 @@ class TestConvOp(unittest.TestCase):
        d=N.asarray(ntot)/tpytot
        print 'speed up py theano(ConvOp) vs convolve2d: %.3fx'%d.mean(),d
    def init_data(self,shape):
        # deterministic data for reproducible runs; swap with the
        # commented line below to use random data instead
        return N.ones(shape)
        #return N.random.random(shape)
    def test_ConvOpGrad(self):
        """
        test the gradient in float and double
...@@ -442,7 +492,7 @@ class TestConvOp(unittest.TestCase):
        kshps = [(2,3)]
        imshps = [(2,3,4)]
        modes = ['valid', 'full']
        unroll = [(0,0,True),(1,1,False),(2,3,False),(1,1,False),(0,0,False)]#(batch,kern,patch)
        ssizes = [(1,1),(2,2)]

        for typ in types:
...@@ -457,12 +507,12 @@ class TestConvOp(unittest.TestCase):
                imgvals = N.array(N.random.random(N.hstack((bsize,imshp))),dtype=imgs.dtype)
                for kshp in kshps:
                    t=numpy.array([imshp[1]-kshp[0],imshp[2]-kshp[1]])
                    kernvals = N.array(self.init_data((nkern,visdim,kshp[0],
                                                       kshp[1])),dtype=kerns.dtype)
                    # 'full' mode should support kernels bigger than the input
                    if mode == 'valid' and (t<0).any():
                        continue
                    for un_b, un_k, un_p in unroll:
                        for ss in ssizes:
                            print 'test_ConvOpGrad'
                            print 'mode type:', mode, typ
...@@ -476,14 +526,14 @@ class TestConvOp(unittest.TestCase):
                            def test_i(imgs):
                                convop = ConvOp(imshp, kshp, nkern, bsize, ss[0], ss[1],
                                                output_mode=mode, unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p)
                                return convop(imgs, kernvals)

                            def test_k(kerns):
                                convop = ConvOp(imshp, kshp, nkern, bsize, ss[0], ss[1],
                                                output_mode=mode, unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p)
                                return convop(imgvals, kerns)

                            print mode, imshp, kshp, un_b, un_k, ss
                            #TODO the tolerance needed to pass is very high for float32(0.17). Is this acceptable? Expected?
                            tol = None
                            if typ=="float32":
...
from scan import Scan
import unittest
import theano
import theano.sandbox.scan
import random
import numpy.random
...@@ -74,6 +75,14 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps = None, tol = None,
def compareArrays(a,b):
if type(a) in (list,tuple):
a = numpy.array(a)
if type(b) in (list, tuple):
b = numpy.array(b)
return numpy.all( abs(a-b) < 1e-5)
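compareArrays above is essentially an absolute-tolerance comparison (numpy.allclose covers similar ground). A standalone sketch of the same helper, outside the test class:

```python
import numpy

def compare_arrays(a, b, tol=1e-5):
    # coerce lists/tuples to arrays, then compare with an absolute tolerance
    return numpy.all(numpy.abs(numpy.asarray(a) - numpy.asarray(b)) < tol)

print(compare_arrays([1.0, 2.0], (1.0, 2.0 + 1e-7)))  # True
print(compare_arrays([1.0, 2.0], [1.0, 2.1]))         # False
```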
...@@ -82,10 +91,9 @@ class T_Scan(unittest.TestCase):
        utt.seed_rng()

    # generator network, only one output , type scalar ; no sequence or
    # non sequence arguments
    def test_1(self):
        def f_pow2(x_tm1):
            return (2*x_tm1, {})
...@@ -94,11 +102,12 @@ class T_Scan(unittest.TestCase):
        Y = theano.sandbox.scan.scan(f_pow2, [],s, [],n_steps = n_steps)
        f1 = theano.function([s,n_steps], Y)
        assert(compareArrays(f1([1],3), [2,4,8]))
    # simple rnn, one input, one state, weights for each; input/state are
    # vectors, weights are scalars
    def test_2(self):
        def f_rnn(u_t,x_tm1,W_in, W):
            return (u_t*W_in+x_tm1*W, {})
...@@ -110,13 +119,14 @@ class T_Scan(unittest.TestCase):
        Y = theano.sandbox.scan.scan(f_rnn, u,x0,[W_in,W])
        f2 = theano.function([u,x0,W_in,W], Y)
        v_u = numpy.array([1.,2.,3.,4.])
        v_x0 = numpy.array([1])
        v_out = numpy.array([1.1,1.3,1.6,2.])
        assert(compareArrays( f2(v_u,v_x0,.1,1), v_out ) )
    # simple rnn, one input, one state, weights for each; input/state are
    # vectors, weights are scalars; using shared variables
    def test_3(self):
        u = theano.tensor.dvector()
        x0 = theano.tensor.dvector()
...@@ -129,13 +139,15 @@ class T_Scan(unittest.TestCase):
        Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0,[])
        f3 = theano.function([u,x0], Y)
        v_u = numpy.array([1.,2.,3.,4.])
        v_x0 = numpy.array([1.])
        v_out = numpy.array([1.1,1.3,1.6,2.])
        assert(compareArrays(f3(v_u,v_x0),v_out))
    # some rnn with multiple outputs and multiple inputs; other dimension
    # instead of scalars/vectors
    def test_4(self):
        W_in2 = theano.shared(numpy.array([1.,2.]), name='win2')
        W = theano.shared(numpy.array([[2.,1.],[1.,1.]]), name='w')
...@@ -153,19 +165,21 @@ class T_Scan(unittest.TestCase):
        Y = theano.sandbox.scan.scan(f_rnn_cmpl,[u1,u2],[x0,y0],W_in1)
        f4 = theano.function([u1,u2,x0,y0,W_in1], Y)

        v_u1 = numpy.array([[1.,2.],[1.,2.],[1.,2.]])
        v_u2 = numpy.array([1.,2.,3.])
        v_x0 = numpy.array([[0.,0.]])
        v_y0 = numpy.array([1])
        v_Win1 = numpy.array([[1.,1.],[1.,1.]])
        v_x = numpy.array([[4.,5.],[18.,16.],[58.,43.]])
        v_y = numpy.array([0.,7.,25.])
        (x,y) = f4( v_u1, v_u2, v_x0, v_y0, v_Win1)
        assert( compareArrays(x,v_x))
        assert( compareArrays(y,v_y))
    # basic ESN using updates
    def test_5(self):
        W_in = theano.shared(numpy.array([1.,1.]), name='win')
        W = theano.shared(numpy.array([[.1,0.],[.0,.1]]),name='w')
        W_out= theano.shared(numpy.array([.5,1.]), name='wout')
...@@ -181,11 +195,14 @@ class T_Scan(unittest.TestCase):
        Y = theano.sandbox.scan.scan(f_ESN,u,y0,[],outputs_taps={0:[]})
        f5 = theano.function([u,y0],Y)
        v_u = numpy.array([1.,2.,3.])
        v_y0 = numpy.array([0.])
        v_out = numpy.array([0.,1.5,3.15])
        out = f5( v_u, v_y0 )
        assert( compareArrays(v_out, out))
    # basic ESN using updates ; moving backwards
    def test_6(self):
        W_in = theano.shared(numpy.array([1.,1.]), name='win')
        W = theano.shared(numpy.array([[.1,0.],[.0,.1]]),name='w')
        W_out= theano.shared(numpy.array([.5,1.]), name='wout')
...@@ -202,20 +219,100 @@ class T_Scan(unittest.TestCase):
                                        go_backwards = True)
        f6 = theano.function([u,y0],Y)
        v_u = numpy.array([1.,2.,3.])
        v_y0 = numpy.array([0])
        v_out = numpy.array([0.,4.5,3.45])
        out = f6(v_u, v_y0)
        assert( compareArrays(out, v_out))
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars; using shared variables and past
# taps (sequences and outputs)
def test_7(self):
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
W_in = theano.shared(.1, name = 'w_in')
W = theano.shared(1., name ='w')
def f_rnn_shared(u_tm2, x_tm1, x_tm2):
return (u_tm2*W_in+x_tm1*W+x_tm2, {})
Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0, [], \
sequences_taps = {0:[-2]}, outputs_taps = {0:[-1,-2]})
f7 = theano.function([u,x0], Y)
v_u = numpy.asarray([1.,2.,3.,4.])
v_x0 = numpy.asarray([1.,2.])
out = numpy.asarray([3.1,5.3])
assert (compareArrays( out, f7(v_u, v_x0)))
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars; using shared variables and past
# taps (sequences and outputs) and future taps for sequences
def test_8(self):
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
W_in = theano.shared(.1, name = 'w_in')
W = theano.shared(1., name ='w')
def f_rnn_shared(u_tm2,u_tp2, x_tm1, x_tm2):
return ((u_tm2+u_tp2)*W_in+x_tm1*W+x_tm2, {})
Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0, [], \
sequences_taps = {0:[-2,2]}, outputs_taps = {0:[-1,-2]})
f8 = theano.function([u,x0], Y)
v_u = numpy.array([1.,2.,3.,4.,5.,6.])
v_x0 = numpy.array([1.,2.])
out = numpy.array([3.6, 6.4])
assert (compareArrays( out, f8(v_u, v_x0) ) )
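The expected values in test_8 can be reproduced by hand: with sequence taps {-2, +2}, only steps where both u[t-2] and u[t+2] exist are run, so a length-6 input yields two steps. A plain-Python sketch assuming those tap semantics (no Theano involved):

```python
def taps_scan(u, x0, w_in=0.1, w=1.0):
    # x_t = (u[t-2] + u[t+2]) * w_in + x[t-1] * w + x[t-2]
    x = list(x0)
    n_steps = len(u) - 4  # two steps lost at each end of u
    for t in range(n_steps):
        x.append((u[t] + u[t + 4]) * w_in + x[-1] * w + x[-2])
    return x[len(x0):]

result = taps_scan([1., 2., 3., 4., 5., 6.], [1., 2.])
# approximately [3.6, 6.4]
```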
'''
    NOTE : BROKEN .. inplace does not work due to a stochastic optimization
    TODO : talk to James
# simple rnn ; compute inplace
def test_9(self):
u = theano.tensor.dvector()
mu = theano.Param( u, mutable = True)
x0 = theano.tensor.dvector()
W_in = theano.shared(.1)
W = theano.shared(1.)
def f_rnn_shared(u_t, x_tm1):
return (u_t*W_in + x_tm1*W, {})
Y = theano.sandbox.scan.scan(f_rnn_shared, u, x0,[], \
inplace_map={0:0} )
f9 = theano.function([mu,x0], Y , #mode = 'FAST_RUN')
mode = 'DEBUG_MODE')
v_u = numpy.array([1.,2.,3.])
v_x0 = numpy.array([1.])
out = f9(v_u, v_x0)
v_out = numpy.array([1.1,1.3,1.6])
assert (compareArrays(out, v_out))
print v_u
assert (compareArrays(v_u, out))
'''
# test gradient simple network
def test_10(self):
pass
    '''
     TO TEST:
        - test gradient (one output)
        - test gradient (multiple outputs)
        - test gradient (go_backwards)
        - test gradient (multiple outputs / some uncomputable )
        - test gradient (truncate_gradient)
        - test gradient (force_gradient)
        - test gradient (taps past/future)
    '''
...
...@@ -1020,13 +1020,18 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
    #   / softmax(x)
    # which arises from the gradient of log(softmax(x))[arange(y.shape[0]), y]
    #
    # In some cases, in case 2., instead of "-1. like (AdvancedSubtensor...)",
    # we can have "-1. like ([-1] * AdvancedSubtensor...)". This case will be
    # recognized too, but other variants, even with the same shape, might not
    # (yet).
# The base cases are realized when the gradient of the
# cost wrt the output is equal to 1. When this gradient
# has another (scalar) value, it typically appears in the
# second argument of AdvancedIncSubtensor. In that case, we
# try to extract it, and feed it as the output gradient of
# crossentropy_softmax_1hot_with_bias_dx.
    #
    # N.B. Regarding clients -- This substitution is important for numerical stability, so we
    # perform the substitution even when intermediate values have multiple clients.
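What crossentropy_softmax_1hot_with_bias_dx ultimately computes is the well-known closed form dx = g * (softmax(x) - onehot(y)); the scalar g is what this optimization extracts as the output-gradient factor. A numeric sketch of that formula in plain Python (not the Theano op itself):

```python
import math

def softmax(z):
    # numerically stable softmax of a list of floats
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_softmax_dx(out_grad, x, y):
    # gradient of out_grad * (-log softmax(x)[y]) with respect to x:
    # out_grad * (softmax(x) - onehot(y))
    p = softmax(x)
    return [out_grad * (p_i - (1.0 if i == y else 0.0))
            for i, p_i in enumerate(p)]

g = ce_softmax_dx(1.0, [0.0, 1.0, 2.0], 1)
# entries sum to ~0, and only the target index is negative
```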
...@@ -1052,43 +1057,60 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
        else:
            return

        # In the base case (output gradient = 1), incr is -1./sm[arange(len(y)), y]
        # Here, we are looking for the AdvancedSubtensor term (sm[arange(len(y)), y]);
        # the remainder of the expression will be used to compute outgrad_factor.
        # outgrad_factor is constructed in 3 steps, as follows:
# outgrad_factor = +/- 1 (initial sign)
# outgrad_factor *= numerator
# outgrad_factor /= denominator
adv_subtensor = None
outgrad_factor = 1.
# If there's a 'minus' sign before the whole expression, put it in
# outgrad_factor and iterate
if incr.owner and incr.owner.op == tensor.neg:
outgrad_factor = -1.
incr = incr.owner.inputs[0]
        if incr.owner and incr.owner.op == tensor.true_div:
            num, denom = incr.owner.inputs

            # set outgrad_factor according to the numerator,
# it may be divided later
if hasattr(num, 'data') and numpy.all(num.data == -1):
# Base case, num is -1
outgrad_factor *= 1.
elif numpy.all(num.broadcastable):
# Otherwise, it should be a scalar
outgrad_factor *= -num
else:
                return

            if not denom.owner:
                return

            if isinstance(denom.owner.op, tensor.AdvancedSubtensor):
# Base case
                adv_subtensor = denom
                outgrad_factor /= 1.
            elif denom.owner.op == tensor.mul:
                # Try to find the AdvancedSubtensor node mentioned above,
                # and a scalar that is equal to the output gradient
                for i, input in enumerate(denom.owner.inputs):
                    if input.owner and isinstance(input.owner.op, tensor.AdvancedSubtensor):
                        other_inputs = [in_ for (j, in_) in enumerate(denom.owner.inputs) if j!=i]
                        if len(other_inputs) == 1:
                            rest = other_inputs[0]
                        else:
                            rest = tensor.mul(*other_inputs)

                        # Check that rest is a scalar
                        if numpy.all(rest.broadcastable):
                            adv_subtensor = input
                            outgrad_factor /= rest
                            break
else:
# That subtensor was not right
adv_subtensor = None
            else:
                return
...@@ -1103,6 +1125,8 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
            #else: OK
        else:
            return
else:
return
    # Check that rows is arange(labels.shape[0])
    if not _check_rows_is_arange_len_labels(rows, labels):
...@@ -1147,7 +1171,7 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
    if incr.owner and incr.owner.op == tensor.fill:
        model, value = incr.owner.inputs
        adv_subtensor = None
        outgrad_factor = None
        if model.owner and isinstance(model.owner.op, tensor.AdvancedSubtensor):
            adv_subtensor = model
        else:
...@@ -1169,17 +1193,16 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
        if not (maybe_log_sm is log_sm and maybe_rows is rows and maybe_labels is labels):
            return
        #else: OK
    else:
        return

    # In the base case, value is the constant '-1'
    if hasattr(value, 'data') and numpy.all(value.data == -1):
        outgrad_factor = 1.
    # Otherwise, it should be a scalar, and the output gradient
    # would be -value
    elif numpy.all(value.broadcastable):
        outgrad_factor = -value
    else:
        return
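Both this hunk and the one rewriting the denominator decide "is this a scalar?" via `numpy.all(v.broadcastable)`: a Theano variable behaves as a scalar exactly when every dimension is broadcastable, and the 0-d case yields an empty pattern, which `numpy.all` treats as true. A plain-NumPy check of that convention (the tuples below stand in for hypothetical `broadcastable` patterns):

```python
import numpy

# Hypothetical broadcastable patterns, as they would appear on
# Theano variables of various shapes.
scalar_0d = ()              # 0-d tensor: empty pattern
one_by_one = (True, True)   # 1x1 matrix, broadcastable in both dims
row_vector = (False, True)  # first dim is not broadcastable

assert numpy.all(scalar_0d)        # vacuously true: acts as a scalar
assert numpy.all(one_by_one)       # also scalar-like
assert not numpy.all(row_vector)   # not a scalar; the rewrite bails out
```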
...@@ -1204,11 +1227,10 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
    # Dimension check before substitution
    if labels.ndim == 1 and x_var.ndim == 2:
        if outgrad_factor is not None:
            out_grad = tensor.fill(x_var[:,0], outgrad_factor)
            return [crossentropy_softmax_1hot_with_bias_dx(out_grad, sm, labels)]
        else:
            return
    else:
        return
...
...@@ -346,7 +346,7 @@ def local_IncSubtensor_serialize(node):
    #
    # add(x, incsubtensor(b, c), incsubtensor(b, d))
    # -> incsubtensor(incsubtensor(add(x,b,b), c), d)
    """
    def movable(i):
...@@ -354,7 +354,8 @@ def local_IncSubtensor_serialize(node):
        return i.owner \
                and isinstance(i.owner.op, T.IncSubtensor) \
                and i.type == o_type \
                and len(i.clients) == 1 \
                and not i.owner.op.set_instead_of_inc

    if node.op == T.add:
        o_type = node.outputs[0].type
...@@ -383,7 +384,8 @@ def local_IncSubtensor_serialize(node):
@gof.local_optimizer([None])
def local_inplace_setsubtensor(node):
    if isinstance(node.op, T.IncSubtensor) and not node.op.inplace:
        new_op = T.IncSubtensor(node.op.idx_list, inplace=True,
                set_instead_of_inc=node.op.set_instead_of_inc)
        new_node = new_op(*node.inputs)
        return [new_node]
    return False
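The rewrite rule in the `local_IncSubtensor_serialize` docstring can be sanity-checked numerically. Below, `inc` is a stand-in for `incsubtensor` (it increments one element of a copy); the function name, index choice, and values are illustrative only:

```python
import numpy as np

def inc(t, v):
    # Stand-in for incsubtensor: increment element 0 of a copy of t.
    out = t.copy()
    out[0] += v
    return out

x = np.array([1.0, 2.0])
b = np.array([10.0, 20.0])
c, d = 3.0, 4.0

# add(x, incsubtensor(b, c), incsubtensor(b, d))
lhs = x + inc(b, c) + inc(b, d)
# -> incsubtensor(incsubtensor(add(x, b, b), c), d)
rhs = inc(inc(x + b + b, c), d)
assert np.allclose(lhs, rhs)
```

Note that `b` appears twice in the serialized `add`, which is exactly why the docstring was corrected from `add(x,b)` to `add(x,b,b)`.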
...@@ -932,8 +934,11 @@ def local_neg_neg(node):
@register_specialize
@gof.local_optimizer([T.neg])
def local_neg_div_neg(node):
    """- (-a / b) -> a / b
    Also performs - (c / b) -> ((-c) / b) when c is a scalar constant.
    """
    if node.op == T.neg:
        if node.inputs[0].owner and node.inputs[0].owner.op == T.true_div:
            frac = node.inputs[0]
            num, denom = frac.owner.inputs
...@@ -942,6 +947,11 @@ def local_neg_div_neg(node):
                # No other clients of the original division
                new_num = num.owner.inputs[0]
                return [T.true_div(new_num, denom)]
            elif numpy.all(num.broadcastable) and isinstance(num, gof.Constant):
                if len(frac.clients) == 1:
                    new_num = -num.data
                    return [T.true_div(new_num, denom)]

@gof.local_optimizer([T.mul])
def local_mul_zero(node):
...
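The two rewrites documented in `local_neg_div_neg` are plain algebraic identities; a quick NumPy check (not Theano code, values chosen arbitrarily):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = -2.0  # scalar constant, as the second rewrite requires

# - (-a / b) -> a / b
assert np.allclose(-(-a / b), a / b)
# - (c / b) -> ((-c) / b)
assert np.allclose(-(c / b), (-c) / b)
```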
...@@ -265,7 +265,9 @@ def permutation_helper(random_state, n, shape):
    """
    # n should be a 0-dimension array
    assert n.shape == ()
    # Note that it is important to convert `n` into an integer, because if it
    # is a long, the numpy permutation function will crash on Windows.
    n = int(n.item())
    out_shape = list(shape)
    out_shape.append(n)
...
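The `int(...)` cast above matters because, on Python 2, `ndarray.item()` could hand back a `long` for some integer dtypes, which (per the comment) NumPy's permutation rejected on Windows. A sketch of the conversion path (the seed is arbitrary):

```python
import numpy as np

n = np.asarray(5)        # 0-d array, as permutation_helper receives it
assert n.shape == ()     # same precondition as the assert in the code
n_int = int(n.item())    # plain Python int, safe to pass everywhere
assert isinstance(n_int, int)

perm = np.random.RandomState(0).permutation(n_int)
assert sorted(perm.tolist()) == [0, 1, 2, 3, 4]
```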
...@@ -35,7 +35,7 @@ class ScalarSharedVariable(SharedVariable, _tensor_py_operators):
@shared_constructor
def scalar_constructor(value, name=None, strict=False, dtype=None):
    """SharedVariable constructor for scalar values. Default: int64 or float64.

    :note: We implement this using 0-d tensors for now.
...@@ -50,12 +50,14 @@ def scalar_constructor(value, name=None, strict=False, dtype=None):
    else:
        dtype = type(value).__name__

    tensor_type = TensorType(dtype=dtype, broadcastable=[])
    try:
        # Do not pass the dtype to asarray because we want this to fail if
        # strict is True and the types do not match.
        rval = ScalarSharedVariable(type=tensor_type,
                value=numpy.asarray(value),
                name=name, strict=strict)
        return rval
    except:
        traceback.print_exc()
...
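The `type` → `tensor_type` rename is more than style: binding `type` locally shadows the builtin, so any later `type(...)` call in the same scope would break (note the function itself calls `type(value).__name__` a few lines earlier). A minimal reproduction of the hazard (illustrative, not the patched function):

```python
def shadowed_builtin():
    type = "not a callable any more"   # same mistake as the old code
    return type(3).__name__            # raises TypeError

try:
    shadowed_builtin()
    raised = False
except TypeError:
    raised = True
assert raised
```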
...@@ -223,93 +223,13 @@ class T_CrossentropyCategorical1Hot(unittest.TestCase):
        assert not has_softmax
        assert not has_softmaxdx

    def test_get_rid_of_advanced_indexing_version_of_xent(self):
        verbose = 0
        # TODO: add the optimization in FAST_COMPILE?
        # In the mean time, run it as 'FAST_RUN' instead
        mode = theano.compile.mode.get_default_mode()
        if mode == 'FAST_COMPILE':
            mode = 'FAST_RUN'
        rng = numpy.random.RandomState(utt.fetch_seed())
...@@ -326,13 +246,15 @@ def test_get_rid_of_advanced_indexing_version_of_xent():
            print i, node
            # Last node should be the output
            print i, pprint(node.outputs[0])
            print

        ## Basic case
        expressions = [
                T.sum(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
                -T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),
                -T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),
                T.sum(-T.log(softmax(x))[T.arange(y.shape[0]), y])
                ]

        for expr in expressions:
            # Verify the optimizer worked on the expressions
...@@ -401,6 +323,187 @@ def test_get_rid_of_advanced_indexing_version_of_xent():
            g(x_val, b_val, y_val)
    def test_scale_cost(self):
        # TODO: add the optimization in FAST_COMPILE?
        # In the mean time, run it as 'FAST_RUN' instead
        mode = theano.compile.mode.get_default_mode()
        if mode == 'FAST_COMPILE':
            mode = 'FAST_RUN'

        rng = numpy.random.RandomState(utt.fetch_seed())

        x_val = rng.randn(3,5)
        b_val = rng.randn(5)
        y_val = numpy.asarray([2,4,1])

        x = T.dmatrix('x')
        b = T.dvector('b')
        y = T.lvector('y')
        a = T.dscalar('a')

        def print_graph(func):
            for i, node in enumerate(func.maker.env.toposort()):
                print i, node
            # Last node should be the output
            print i, pprint(node.outputs[0])
            print

        def validate_fn_graph(func):
            # The graph of the function should not have softmax anymore
            has_cx1hot = False
            has_softmax = False
            for node in func.maker.env.toposort():
                if node.op == crossentropy_softmax_argmax_1hot_with_bias:
                    has_cx1hot = True
                if node.op == softmax:
                    has_softmax = True
            assert has_cx1hot
            assert not has_softmax

        def validate_grad_graph(func):
            # The graph of the gradient should not have softmaxgrad anymore
            has_cx1hotdx = False
            has_softmax = False
            has_softmaxdx = False
            for node in func.maker.env.toposort():
                if node.op == crossentropy_softmax_1hot_with_bias_dx:
                    has_cx1hotdx = True
                if node.op == softmax:
                    has_softmax = True
                if node.op == softmax_grad:
                    has_softmaxdx = True
            assert has_cx1hotdx
            assert has_softmax
            assert not has_softmaxdx

        ## Cases to test
        expressions = [
                a * T.sum(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
                -a * T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),
                a * (-T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y]))),
                a * T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),

                a * T.sum(-T.log(softmax(x))[T.arange(y.shape[0]), y]),
                -a * T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),
                a * (-T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y])),
                a * T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),

                a * T.mean(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
                -a * T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y])),
                a * (-T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y]))),
                a * T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y])),

                a * T.mean(-T.log(softmax(x))[T.arange(y.shape[0]), y]),
                -a * T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y]),
                a * (-T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y])),
                a * T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y]),
                ]

        for expr in expressions:
            # Verify the optimizer worked on the expressions
            f = theano.function([x,y,a], expr, mode=mode)
            assert 5 <= len(f.maker.env.toposort()) <= 10
            validate_fn_graph(f)
            f(x_val, y_val, 0.1)

            # Verify the gradient wrt x
            g = theano.function([x,y,a], T.grad(expr, x), mode=mode)
            assert 5 <= len(g.maker.env.toposort()) <= 12
            validate_grad_graph(g)
            g(x_val, y_val, 0.1)

            # Verify the gradient when providing output gradient
            h = theano.function([x,y,a], T.grad(expr, x, g_cost=a*x.sum()), mode=mode)
            assert 8 <= len(h.maker.env.toposort()) <= 17
            validate_grad_graph(h)
            h(x_val, y_val, 0.1)
def test_argmax_pushdown():
    x = tensor.dmatrix()

    env = gof.Env(
            [x],
            [tensor.max(softmax(tensor.exp(tensor.tanh(sigmoid(x)))))])

    theano.compile.mode.optdb.query(
            theano.compile.mode.OPT_FAST_RUN).optimize(env)

    #print 'AFTER'
    #for node in env.toposort():
    #    print node.op
    assert len(env.toposort()) == 2 # an output_guard is second
    assert env.toposort()[0].op == tensor._max_and_argmax

def test_argmax_pushdown_bias():
    x = tensor.dmatrix()
    b = tensor.dvector()

    env = gof.Env(
            [x,b],
            [tensor.max(softmax_with_bias(x, b))])

    theano.compile.mode.optdb.query(
            theano.compile.mode.OPT_FAST_RUN).optimize(env)

    print 'AFTER'
    for node in env.toposort():
        print node.op
    assert len(env.toposort()) == 4
    assert isinstance(env.toposort()[0].op, tensor.DimShuffle)
    assert isinstance(env.toposort()[1].op, tensor.Elemwise)
    assert isinstance(env.toposort()[2].op, tensor.MaxAndArgmax)
    assert str(env.toposort()[3].op) == 'OutputGuard'

def test_asymptotic_32():
    """
    This test makes sure that our functions behave sensibly when huge values are present
    """
    #TODO: consider adding the optimization of crossentropy into the current mode for the
    #      purpose of running this test
    for dtype in 'float32', 'float64':
        if dtype == 'float32':
            x = tensor.fmatrix()
            x2 = tensor.fvector()
        else:
            x = tensor.dmatrix()
            x2 = tensor.dvector()
        y = tensor.lvector()

        c = categorical_crossentropy(softmax(x+x2), y)
        f = theano.function([x,y,x2], [c.sum(), tensor.grad(c.sum(), x)], mode='FAST_RUN')
        if 0:
            for i, n in enumerate(f.maker.env.toposort()):
                print i, n

        xval = numpy.zeros((5, 5), dtype=dtype)
        x2val = numpy.zeros(5, dtype=xval.dtype)
        for i in xrange(100):
            cval, gxval = f(xval, numpy.arange(5), x2val)
            xval -= 100.3 * gxval
        #print cval, gxval
        assert cval == 0 # no problem going to zero error

        # what about when x gets really big?
        xval = numpy.zeros((5, 5), dtype=dtype)
        x2val = numpy.zeros(5, dtype=xval.dtype)
        for i in xrange(100):
            cval, gxval = f(xval, numpy.arange(5), x2val)
            xval += 100000.3 * gxval
        #print cval, gxval
        assert cval > 61750000
        assert gxval[0,0] == -1.0
        assert gxval[0,1] == 0.25
    # hint - call the argmax push-down optimization first too
...
...@@ -283,12 +283,12 @@ class T_RandomStreams(unittest.TestCase):
        assert numpy.all(fn_val1 == numpy_val1)

    def test_shuffle_row_elements(self):
        """Ensure RandomStreams.shuffle_row_elements generates right results"""
        # Check over two calls to see if the random state is correctly updated.

        # On matrices, for each row, the elements of that row should be
        # shuffled.
        # Note that this differs from numpy.random.shuffle, where all the
        # elements of the matrix are shuffled.
        mm = Module()
        mm.random = RandomStreams(234)
        m_input = tensor.dmatrix()
...
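For reference, the contrast the reworded comment draws can be sketched in plain NumPy: shuffling each row's elements independently, versus `numpy.random.shuffle` applied to a whole 2-d array, which permutes along the first axis (the seed and shapes are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(234)
m = np.arange(12).reshape(3, 4)

# Per-row shuffle, as shuffle_row_elements does on matrices.
per_row = m.copy()
for row in per_row:
    rng.shuffle(row)          # shuffles this row's elements in place
for before, after in zip(m, per_row):
    # each row keeps the same elements, possibly reordered
    assert sorted(after.tolist()) == sorted(before.tolist())

# numpy.random.shuffle on the whole matrix reorders rows as units.
whole = m.copy()
rng.shuffle(whole)
assert sorted(whole.tolist()) == sorted(m.tolist())
```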