Commit 24ef1606 authored by Olivier Delalleau

Merged

......@@ -168,7 +168,7 @@ latex_font_size = '11pt'
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, document class [howto/manual]).
latex_documents = [
('contents', 'theano.tex', 'theano Documentation',
('index', 'theano.tex', 'theano Documentation',
'LISA lab, University of Montreal', 'manual'),
]
......
......@@ -37,7 +37,7 @@ Roughly in order of what you'll want to check out:
* :ref:`extending` -- Learn to add a Type, Op, or graph optimization.
* :ref:`internal` -- How to maintain Theano, LISA-specific tips, and more...
You can download the latest `PDF documentation <http://pylearn.org/theano/theano.pdf>`_, rather than reading it online.
You can download the latest `PDF documentation <http://deeplearning.net/theanodoc/theano.pdf>`_, rather than reading it online.
Community
=========
......@@ -46,7 +46,7 @@ Community
* Register and post to `theano-dev`_ if you want to talk to the developers.
* We try to stay organized with `Theano's Trac <trac/>`__
* We try to stay organized with `Theano's Trac <http://trac-hg.assembla.com/theano/report/1>`__
* Come visit us in Montreal! Most of the developers are students in the LISA_ group at the `University of Montreal`_.
......
......@@ -20,7 +20,7 @@ to be installed:
We develop mainly on 64-bit Linux machines. 32-bit architectures are
not well-tested.
python >= 2.5
python >= 2.5 (2.4 should be supported as well)
`numpy <http://numpy.scipy.org/>`_ >= 1.2
Earlier versions have memory leaks.
......@@ -30,6 +30,8 @@ to be installed:
is buggy in 0.6. (scipy.csc_matrix dot has a bug with singleton
dimensions. There may be more bugs.)
A BLAS installation (with Level 3 functionality)
The following libraries and software are optional:
g++, python-dev
......@@ -42,41 +44,49 @@ The following libraries and software are optional:
`mercurial <http://www.selenic.com/mercurial/>`_
To download the bleeding-edge version of Theano.
.. _install_bleeding_edge:
Getting the code
-----------------
Easy install
------------
If you are a developer of Theano, then check out the :ref:`dev_start_guide` guide.
The following command will install the latest release of Theano
on your system:
The following are general instructions that will set you up with the bleeding-edge
version of Theano. First, get the code using `mercurial <http://www.selenic.com/mercurial/wiki/>`__:
.. code-block:: bash
easy_install Theano
hg clone http://hg.assembla.com/theano Theano
Manual install
--------------
Configuring PYTHONPATH
---------------------------
The subdirectory Theano/theano must be on your PYTHONPATH. To achieve
this, you can either create a symbolic link to Theano/theano in a
directory already listed in your PYTHONPATH environment variable, or
modify PYTHONPATH so that it includes the Theano directory.
To install the latest release of Theano from source, visit the `downloads
<http://pylearn.org/theano/downloads/>`_ page and download the release you
want. Unpack the release, and type:
To create a symbolic link:
.. code-block:: bash
python setup.py build
python setup.py test
python setup.py install
ln -s Theano/theano <someplace on your PYTHONPATH>/theano
.. _install_bleeding_edge:
To modify the environment variable PYTHONPATH in bash, you may do this:
Bleeding Edge
--------------
.. code-block:: bash
Feeling lucky and want to run bleeding-edge code?
Then check out the :ref:`dev_start_guide` guide.
export PYTHONPATH=<path to Theano's parent dir>/Theano:$PYTHONPATH
In csh:
Configuring the environment
---------------------------
.. code-block:: csh
setenv PYTHONPATH <path to Theano's parent dir>/Theano:$PYTHONPATH
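Once PYTHONPATH is set, a quick sanity check (a minimal sketch, assuming a
standard Python setup) is to confirm that the checkout is the copy being
imported:

.. code-block:: python

    # The printed path should point into your Theano/theano checkout.
    import theano
    print theano.__file__
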
Configuring Theano's environment variables
---------------------------------------------
Two environment variables are used to control automatic code
generation. It is possible to use Theano in a way which avoids all
......@@ -118,6 +128,33 @@ automatic code generation, but that way is much, much slower.
Omitting this variable defaults the mode to ``'FAST_RUN'``.
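For illustration, here is a minimal sketch of a function compiled under the
default ``'FAST_RUN'`` mode (the expression itself is only an example):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dscalar('x')
    # With mode 'FAST_RUN' (the default), the graph is optimized and
    # C code may be generated for the computation.
    f = theano.function([x], 2 * x + 1, mode='FAST_RUN')
    print f(3.0)   # prints 7.0
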
Testing your installation
---------------------------
Once you have completed these steps, you should run the Theano test suite like this:
.. code-block:: bash
cd Theano
nosetests  # execute all the tests
All tests should pass. If some test fails on your machine, you are
encouraged to tell us what went wrong on the ``theano-users@googlegroups.com``
mailing list.
Updating
-------------
To update your library to the latest revision, change directory (``cd``)
to your ``Theano`` folder and execute the following command:
.. code-block:: bash
hg pull -u
You should update frequently; bugs are fixed on a very regular basis.
Mac
---
......@@ -126,20 +163,21 @@ Mac
-
.. code-block:: bash
$ sudo port install gcc42 py25-zlib py25-numpy py25-scipy mercurial
$ sudo port install gcc44 py25-zlib py25-numpy py25-scipy mercurial
Note that compiling gcc42 takes a significant time (hours) so it is probably
Note that compiling gcc takes a significant time (hours) so it is probably
not the best solution if you are in a rush! It may happen that SciPy
fails to compile the first time and still compiles just fine on a second
try. Same thing with py25-zlib.
- Install some kind of BLAS library (TODO: how?)
- scipy depends on ATLAS (a BLAS library), which will be installed by MacPorts.
- Set ``THEANO_BLAS_LDFLAGS`` to something which will link against said BLAS
library. E.g., ``THEANO_BLAS_LDFLAGS='-lcblas -latlas -lgfortran'``.
This advice has not been tested recently, so please inform us of your results.
These installation instructions have not been tested recently, so please inform us of your results!
We would be especially interested in dependencies that we missed listing, as well as tests
that fail on your platform (use the ``theano-users@googlegroups.com`` mailing list).
Windows
......@@ -247,9 +285,9 @@ Generating the documentation
----------------------------
You can read the latest HTML documentation `here
<http://pylearn.org/theano/contents.html>`__.
<http://deeplearning.net/theanodoc>`__.
You can download the latest PDF documentation `here
<http://pylearn.org/theano/theano.pdf>`__.
<http://deeplearning.net/theanodoc/theano.pdf>`__.
We recommend you look at the documentation on the website, since it
will be more current than the documentation included with the package.
......
......@@ -21,11 +21,10 @@ Developer Start Guide
Accounts
========
To obtain developer access: send an email to an admin with a username and
temporary password. Pending approval, this will give you access to both the
repository and Trac. You should then change your password in the
`<http://pylearn.org/theano/prefs preferences>` tab - do *NOT* use a good
password! We are using plain text http which is not secure.
To obtain developer access: register with `Assembla
<http://www.assembla.com/>`_ and add yourself as a watcher on the `Theano space
<http://www.assembla.com/spaces/theano>`_. Then send an email to an admin asking
to be promoted to a member of the project.
Theano code
......@@ -34,10 +33,9 @@ Theano code
*To get the source via mercurial,* you must have `mercurial
<http://www.selenic.com/mercurial/wiki/>`__ installed.
The code that makes up Theano is in a single repository available in
`<http://pylearn.org/hg/Theano>`__.
As a developer, you should clone this repository like this:
The code that makes up Theano is in a `single repository
<http://www.assembla.com/spaces/theano/trac_mercurial_tool>`__. As a developer,
you should clone this repository like this:
.. code-block:: bash
......@@ -121,9 +119,6 @@ to your ``Theano`` folder and execute the following command:
hg pull -u
You may also download the latest source directly as a gzip'd tar file:
`<http://pylearn.org/hg/Theano/archive/tip.tar.gz>`__.
Nightly test
============
......
......@@ -5,43 +5,40 @@
Theano at a Glance
==================
Theano is a Python library that allows you to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays. Using Theano it is
Theano is a Python library that lets you define, optimize, and evaluate
mathematical expressions, especially ones with multi-dimensional arrays
(numpy.ndarray). Using Theano it is
possible to attain speeds rivaling hand-crafted C implementations for problems
involving large amounts of data. It can also surpass C on a CPU by many orders
of magnitude by taking advantage of recent GPUs.
Theano melds some aspects of a computer algebra system (CAS) with
aspects of an optimizing compiler. It can even transform some or all
of the mathematical expression into C code and compile it into native
machine instructions. This combination of CAS with optimizing
compilation is particularly useful for tasks in which complicated
mathematical expressions are evaluated repeatedly and evaluation speed
is critical.
Theano supports a range of numerical types in multiple dimensions and
a number of well-tested operations. It also allows you to compute the
gradient of an expression with respect to another. Symbolic
expressions may be compiled into functions, which work on the same
data structures as numpy_, allowing for easy interoperability.
Theano combines aspects of a computer algebra system (CAS) with aspects of an
optimizing compiler. It can also generate customized C code for many
mathematical operations. This combination of CAS with optimizing compilation
is particularly useful for tasks in which complicated mathematical expressions
are evaluated repeatedly and evaluation speed is critical. For situations
where many different expressions are each evaluated once, Theano can minimize
the amount of compilation/analysis overhead, but still provide symbolic
features such as automatic differentiation.
Theano's compiler applies many optimizations of varying complexity to
these symbolic expressions. These optimizations include, but are not
limited to:
* use of GPU for computations
* constant folding
* merging of similar subgraphs, to avoid calculating the same values
more than once
* arithmetic simplification (``x*y/x -> y``)
* inserting efficient BLAS_ operations
* using inplace operations wherever it is safe to do so.
Theano defines several optimizations which improve the numerical
stability of computations.
Theano was written at the LISA_ lab to support the development of
efficient machine learning algorithms while minimizing human time. We
use it especially in gradient-based learning techniques. Theano is
* merging of similar subgraphs, to avoid redundant calculation
* arithmetic simplification (e.g. ``x*y/x -> y``, ``--x -> x``; see the sketch after this list)
* inserting efficient BLAS_ operations (e.g. ``GEMM``) in a variety of
contexts
* using memory aliasing to avoid calculation
* using inplace operations wherever it does not interfere with aliasing
* loop fusion for elementwise sub-expressions
* improvements to numerical stability (e.g. :math:`\log(1+\exp(x))` and :math:`\log(\sum_i \exp(x[i]))`)
* for a complete list, see :ref:`optimizations`
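As a minimal sketch of the arithmetic simplification above (the variable
names are ours):

.. code-block:: python

    import theano
    import theano.tensor as T

    x, y = T.dscalars('x', 'y')
    # The optimizer rewrites x*y/x to y, so the compiled function
    # does not perform the multiplication or the division.
    f = theano.function([x, y], x * y / x)
    print f(2.0, 5.0)   # prints 5.0
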
Theano was written at the LISA_ lab to support rapid development of
efficient machine learning algorithms. Theano is
named after the `Greek mathematician`_, who may have been Pythagoras'
wife. Theano is released under a BSD license (:ref:`link <license>`).
......@@ -92,30 +89,28 @@ machine instructions.
What does it do that they don't?
================================
Theano is a python library and optimizing compiler for manipulating
Theano is a Python library and optimizing compiler for manipulating
and evaluating expressions, especially matrix-valued
ones. Manipulation of matrices is typically done using the numpy
package, so what does Theano do that Python and numpy do not?
- *execution speed optimizations*: Theano can use `g++` to compile
parts your expression graph into native machine code, which runs
much faster than python.
- *execution speed optimizations*: Theano can use `g++` or `nvcc` to compile
parts of your expression graph into CPU or GPU instructions, which run
much faster than pure Python.
- *symbolic differentiation*: Theano can automatically build symbolic graphs
for computing gradients (see the sketch after this list).
- *stability optimizations*: Theano can recognize numerically unstable
- *stability optimizations*: Theano can recognize some numerically unstable
expressions and compute them with more stable algorithms.
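As a minimal sketch of symbolic differentiation (the expression is only an
example):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dscalar('x')
    gx = T.grad(x ** 2, x)         # symbolic graph for d(x**2)/dx = 2*x
    f = theano.function([x], gx)
    print f(4.0)   # prints 8.0
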
There exists another symbolic package in Python, namely sympy_. Theano
is different from sympy in the sense that while Theano allows symbolic
manipulation it puts more emphasis on the evaluation of these expressions
and being able to repeatedly evaluate them on many different inputs. Theano
is also better suited to handling large tensors which have no
assumed structures.
The closest Python package to Theano is sympy_.
Theano focuses more on tensor expressions than Sympy, and has more machinery
for compilation. Sympy has more sophisticated algebra rules and can
handle a wider variety of mathematical operations (such as series, limits, and integrals).
If numpy_ is to be compared to MATLAB_ and sympy_ to Mathematica_,
Theano is a sort of hybrid of the two which tries to make the best of
Theano is a sort of hybrid of the two which tries to combine the best of
both worlds.
......@@ -134,7 +129,8 @@ Getting started
the :ref:`tutorial` first though.
A PDF version of the online documentation may be found `here <theano.pdf>`_.
A PDF version of the online documentation may be found `here
<http://deeplearning.net/theanodoc/theano.pdf>`_.
Contact us
......
......@@ -331,6 +331,8 @@ Indexing
Basic indexing.
Mirrors numpy's `basic indexing <http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html>`_. Read that page first.
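A small illustration of basic indexing on a symbolic matrix (a sketch; the
variable names are ours):

.. code-block:: python

    import theano.tensor as T

    x = T.dmatrix('x')
    row = x[0]       # first row, as in numpy
    col = x[:, 1]    # second column
    sub = x[1:3]     # rows 1 and 2
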
Advanced indexing.
.. _libdoc_tensor_elementwise:
......
......@@ -40,10 +40,10 @@ This is a sort of memo for developers and would-be developers.
.. _mercurial: http://www.selenic.com/mercurial/wiki/
.. _nosetests: http://somethingaboutorange.com/mrl/projects/nose/
.. _numpy: http://numpy.scipy.org/
.. _python: http://www.python.or
.. _python: http://www.python.org
.. _scipy: http://scipy.org/
.. _autodiff: http://autodiff.org
.. _autodiff: http://www.autodiff.org
.. _boost.python: http://www.boost.org/doc/libs/1_38_0/libs/python/doc/index.html
.. _cython: http://www.cython.org/
.. _liboil: http://liboil.freedesktop.org/wiki/
......
......@@ -41,9 +41,10 @@ details about these building blocks see :ref:`variable`, :ref:`op`,
.. figure:: apply.png
:align: center
Arrows represent references to the Python objects pointed at. The blue
box is an :ref:`apply` node. Red boxes are :ref:`variable` nodes. Green
circles are :ref:`Ops <op>`. Purple boxes are :ref:`Types <type>`.
The graph can be traversed starting from outputs (the result of some
......@@ -104,7 +105,7 @@ how to compute the gradient of the node's outputs with respect to its
inputs. Note that if an :ref:`op` does not provide this information,
it is assumed that the gradient is not defined.
Using the
`chain rule <http://en.wikipedia.org/wiki/Chain_rile>`_
`chain rule <http://en.wikipedia.org/wiki/Chain_rule>`_
these gradients can be composed in order to obtain the expression of the
gradient of the graph's output with respect to the graph's inputs.
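As a minimal sketch of this composition (the expression is only an
example), for :math:`\sin(x^2)` the per-node gradients compose into
:math:`\cos(x^2) \cdot 2x`:

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dscalar('x')
    # T.grad composes the per-node gradients via the chain rule.
    g = T.grad(T.sin(x ** 2), x)
    f = theano.function([x], g)
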
......
......@@ -29,9 +29,10 @@ class ConvOp(Op):
#TODO: make the stacksize its own parameter, and make imshp a pair
def __init__(self, imshp, kshp, nkern, bsize, dx, dy, output_mode='valid',
unroll_batch=4,
unroll_kern=4,
def __init__(self, imshp=None, kshp=None, nkern=None, bsize=None, dx=None, dy=None, output_mode='valid',
unroll_batch=0,
unroll_kern=0,
unroll_patch=False,
imshp_logical=None,
kshp_logical=None,
kshp_logical_top_aligned=True,
......@@ -47,6 +48,7 @@ class ConvOp(Op):
dx - patch stride rows
dy - patch stride cols
out_mode - 'valid', 'full'
unroll_patch - c code generation option
unroll_batch - c code generation option
unroll_kern - c code generation option
verbose - passed to GpuConv
......@@ -60,6 +62,7 @@ class ConvOp(Op):
gradient on the filters.
unroll_patch. If True, use a code version that unrolls the patch loop; it is faster than the version without unrolling.
unroll_batch. If >0, use a code version that unrolls the batch loop by that factor. By default (0), that version of the code is not used.
unroll_kern. Same as unroll_batch, but unrolls the kernel loop.
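For example (a hedged sketch: the import path and the shape values below
are illustrative assumptions, not part of this changeset)::

    from theano.tensor.nnet.conv import ConvOp

    # Unroll the batch loop and the kernel loop by a factor of 4; these
    # factors should divide bsize and nkern respectively.
    op = ConvOp(imshp=(1, 32, 32), kshp=(5, 5), nkern=8, bsize=4,
                dx=1, dy=1, output_mode='valid',
                unroll_batch=4, unroll_kern=4)

    # Alternatively, select the patch-unrolled code path.
    op_patch = ConvOp(imshp=(1, 32, 32), kshp=(5, 5), nkern=8, bsize=4,
                      dx=1, dy=1, output_mode='valid', unroll_patch=True)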
......@@ -95,6 +98,7 @@ class ConvOp(Op):
self.unroll_batch=unroll_batch
self.unroll_kern=unroll_kern
self.unroll_patch=unroll_patch
if self.unroll_batch>0 and self.bsize % self.unroll_batch!=0:
if self.bsize<=self.unroll_batch:
......@@ -407,6 +411,7 @@ using namespace std;
d["self_imshp0"]=self.imshp[0]
d["self_imshp1"]=self.imshp[1]
d["self_imshp2"]=self.imshp[2]
d["mode"]=self.out_mode.upper()
d["self_kshp0"]=self.kshp[0]
d["self_kshp1"]=self.kshp[1]
d["self_kshp_logical_r"] = self.kshp_logical[0]
......@@ -439,8 +444,12 @@ using namespace std;
#print self.out_mode, d["self_imshp_logical_stride_r"]
if self.imshp != self.imshp_logical or self.kshp != self.kshp_logical:
# print "return imshp!=imshp_logical or self.kshp != self.kshp_logical shape version"
return _conv_op_code_a % d
if self.unroll_patch:
# print "return unroll patch version",self.dx,self.dy
return _conv_op_code_unroll_patch%d
if self.unroll_batch>0 or self.unroll_kern>0:
if self.unroll_batch<=0: self.unroll_batch=1
if self.unroll_kern<=0: self.unroll_kern=1
......@@ -1212,3 +1221,295 @@ Py_XDECREF(img2d);
Py_XDECREF(filtersflipped);
"""
return ret
_conv_op_code_unroll_patch = """
const int mode=%(mode)s;
int typenum=0, typenum_f=0;
PyArrayObject *ain1=NULL, *ain2=NULL, *filtersflipped_arr=NULL, *img2d_arr=NULL;
const %(type)s fill_value = 0;
int type_im=PyArray_TYPE(%(img2d)s);
int type_ker=PyArray_TYPE(%(filtersflipped)s);
npy_intp dim_zz[2]={%(self_outshp0)s,%(self_outshp1)s};
npy_intp dim_im[2]={%(self_imshp1)s,%(self_imshp2)s};
npy_intp dim_ker[2]={%(self_kshp0)s,%(self_kshp1)s};
PyArray_Dims img2d_shape;
npy_intp img2d_dim[4]={1,1,0,0};
img2d_shape.ptr=img2d_dim;
img2d_shape.len=4;
PyArray_Dims kerns_shape;
npy_intp kerns_dim[4]={1,1,0,0};
kerns_shape.ptr=kerns_dim;
kerns_shape.len=4;
PyObject *img2d=NULL, *contig, *filtersflipped=NULL;
if(%(img2d)s->nd==2){
img2d_dim[3]=%(img2d)s->dimensions[1];
img2d_dim[2]=%(img2d)s->dimensions[0];
}else if(%(img2d)s->nd==3){
img2d_dim[3]=%(img2d)s->dimensions[2];
img2d_dim[2]=%(img2d)s->dimensions[1];
img2d_dim[0]=%(img2d)s->dimensions[0];
}else if(%(img2d)s->nd==4){
img2d_dim[3]=%(img2d)s->dimensions[3];
img2d_dim[2]=%(img2d)s->dimensions[2];
img2d_dim[1]=%(img2d)s->dimensions[1];
img2d_dim[0]=%(img2d)s->dimensions[0];
}else {
PyErr_SetString(PyExc_ValueError, "img doesn't have a good shape");
%(fail)s;
}
if(%(filtersflipped)s->nd==3){
kerns_dim[3]=%(filtersflipped)s->dimensions[2];
kerns_dim[2]=%(filtersflipped)s->dimensions[1];
kerns_dim[0]=%(filtersflipped)s->dimensions[0];
}else if(%(filtersflipped)s->nd==4){
kerns_dim[3]=%(filtersflipped)s->dimensions[3];
kerns_dim[2]=%(filtersflipped)s->dimensions[2];
kerns_dim[1]=%(filtersflipped)s->dimensions[1];
kerns_dim[0]=%(filtersflipped)s->dimensions[0];
}else{
std::stringstream temp;
temp << "nddim="<<%(filtersflipped)s->nd;
std::string param = temp.str();
PyErr_SetString(PyExc_ValueError,
("kernel doesn't have a good shape. " + param).c_str());
%(fail)s;
}
img2d = PyArray_Newshape(%(img2d)s,&img2d_shape, PyArray_CORDER);
img2d_arr = (PyArrayObject*)img2d;
if ((img2d_arr->strides[3] != sizeof(%(type)s))
|| (img2d_arr->strides[2] != img2d_arr->dimensions[3]*sizeof(%(type)s))){
contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)img2d));
Py_DECREF(img2d);
img2d = contig;
if (!PyArray_ISCONTIGUOUS(img2d)){
PyErr_SetString(PyExc_ValueError, "img2d isn't contiguous");
%(fail)s;
}
}
img2d_arr = (PyArrayObject*)img2d;
filtersflipped = PyArray_Newshape(%(filtersflipped)s,&kerns_shape, PyArray_CORDER);
filtersflipped_arr = (PyArrayObject*)filtersflipped;
if ((filtersflipped_arr->strides[3] != sizeof(%(type)s))
|| (filtersflipped_arr->strides[2] != filtersflipped_arr->dimensions[3]*sizeof(%(type)s))){
contig = (PyObject*)(PyArray_GETCONTIGUOUS((PyArrayObject*)filtersflipped));
Py_DECREF(filtersflipped);
filtersflipped = contig;
if (!PyArray_ISCONTIGUOUS(filtersflipped)){
PyErr_SetString(PyExc_ValueError, "filtersflipped isn't contiguous");
%(fail)s;
}
}
filtersflipped_arr = (PyArrayObject*)filtersflipped;
if(mode != VALID && mode != FULL){
PyErr_SetString(PyExc_ValueError, "invalid mode, only full and valid are supported"); %(fail)s;
}
typenum = PyArray_ObjectType((PyObject*)%(img2d)s, 0);
typenum_f = PyArray_ObjectType((PyObject*)%(filtersflipped)s, 0);
if (typenum < 0) {PyErr_SetString(PyExc_ValueError, "Invalid type"); %(fail)s;}
if (typenum != typenum_f) {PyErr_SetString(PyExc_ValueError, "Input types must match"); %(fail)s;}
if (!img2d) %(fail)s;
if (!filtersflipped) %(fail)s;
if ((!%(z)s)
||(%(z)s->nd != 4)
||(%(z)s->dimensions[0] != %(self_bsize)s)
||(%(z)s->dimensions[1] != %(self_nkern)s)
||(%(z)s->dimensions[2] != dim_zz[0])
|| (%(z)s->dimensions[3] != dim_zz[1])
)
{
if (%(z)s) Py_DECREF(%(z)s);
npy_intp dims[4] = {0,0,0,0};
if(!dims) %(fail)s;
dims[0]=%(self_bsize)s;
dims[1]=%(self_nkern)s;
dims[2]=dim_zz[0];
dims[3]=dim_zz[1];
%(z)s = (PyArrayObject*) PyArray_ZEROS(4, dims, typenum,0);
}else{
//PyArray_FILLWBYTE((PyObject*)%(z)s,0);
}
int Os[2];
Os[0]=%(self_outshp0)s;
Os[1]=%(self_outshp1)s;
//I keep the formula to calculate Os in case we need it in the future.
//if (mode == FULL) {Os[0] = (int)ceil((dim_im[0]+dim_ker[0]-1)/float(%(self_dx)s)); Os[1] = ceil((dim_im[1]+dim_ker[1]-1)/float(%(self_dy)s));}
//else {Os[0] = (int)ceil((dim_im[0]-dim_ker[0]+1)/float(%(self_dx)s)); Os[1] = (int)ceil((dim_im[1]-dim_ker[1]+1)/float(%(self_dy)s));}
for(int b=0;b< %(self_bsize)s;b++){
for(int n_kern=0;n_kern<%(self_nkern)s;n_kern++){
//assertions
if (%(z)s->strides[0] != %(z)s->dimensions[1] *%(z)s->dimensions[2] *%(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
if (%(z)s->strides[1] != %(z)s->dimensions[2] * %(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
if (%(z)s->strides[2] != %(z)s->dimensions[3] * sizeof(%(type)s)) %(fail)s;
if (%(z)s->strides[3] != sizeof(%(type)s)) %(fail)s;
%(type)s * __restrict__ out=(%(type)s *)(PyArray_GETPTR2(%(z)s,b,n_kern));
for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i) out[i] = 0;
for(int stack_size=0;stack_size<%(self_imshp0)s;stack_size++){
const %(type)s * __restrict__ in=(%(type)s *)(PyArray_GETPTR2(img2d,b,stack_size));
const %(type)s * __restrict__ hvals=(%(type)s *)(PyArray_GETPTR2(filtersflipped,n_kern,stack_size));
int new_m;
for (int iter_m=0; iter_m < Os[0]; iter_m++) {
// Reposition index into input image based on requested output size
int pos_m = iter_m*%(self_dx)s;//The position of the patch in the image
if (mode == FULL) new_m = pos_m ;
else new_m = (pos_m+dim_ker[0]-1);
for (int iter_n=0; iter_n < Os[1]; iter_n++) { // loop over columns
int pos_n=iter_n*%(self_dy)s;
%(type)s sum=0;
%(type)s sum2=0;
%(type)s sum3=0;
%(type)s sum4=0;
int nb_sum=0;
// Sum over kernel, if index into image is out of bounds
// fill with the value
for (int j=0; j < dim_ker[0]; j++) {
int ind0 = (new_m-j);
if(mode==FULL){
const %(type)s * idx_hvals=&hvals[j*dim_ker[1]];
if(ind0 < 0 || ind0 >= dim_im[0]){
if(fill_value!=0)
for (int k=0; k < dim_ker[1]; k++) {
sum+= idx_hvals[k] * fill_value;
}
}else{
//do the part where kernel is to the right of the img
//TODO: implement unroll patch for fill_value!=0
int k=0,max_k=max((int)(pos_n-dim_im[1])+1,0);
if(fill_value!=0){
for(k=0;k<max_k;k++){
sum+= idx_hvals[k]*fill_value;
}
}else {k=max_k;}
//do the part where the kernel is on the img
max_k=min(pos_n+1,(int)dim_ker[1]);
const %(type)s * idx_in=&in[ind0*dim_im[1]];
if(iter_n + 4*%(self_dy)s < Os[1]
&& iter_n>dim_ker[1]-1+3
&& iter_n<dim_im[1]-dim_ker[1]+1-3){
nb_sum=4;
//cout<<4<<endl;
for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
sum+=idx_hvals[k]*idx_in[ind1];
sum2+=idx_hvals[k]*idx_in[ind1+%(self_dy)s];
sum3+=idx_hvals[k]*idx_in[ind1+2*%(self_dy)s];
sum4+=idx_hvals[k]*idx_in[ind1+3*%(self_dy)s];
}
}else if(iter_n + 2*%(self_dy)s < Os[1]
&& iter_n>dim_ker[1]-1
&& iter_n<dim_im[1]-dim_ker[1]+1){
//cout<<2<<endl;
nb_sum=2;
// if(iter_n==dim_ker[1]-1){//k-1<min(pos_n+%(self_dy)s,(int)dim_ker[1])){
// sum2+=idx_hvals[k-1]*idx_in[pos_n-k-%(self_dy)s];
// }
for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
sum+=idx_hvals[k]*idx_in[ind1];
sum2+=idx_hvals[k]*idx_in[ind1+%(self_dy)s];
}
// sum2+=idx_hvals[k]*idx_in[pos_n-k+%(self_dy)s];
// sum+=idx_hvals[k]*idx_in[pos_n-k];
// k++;
}else{
//cout<<1<<endl;
nb_sum=1;
/*
%(type)s sum_=0;
if((k-max_k) & 0x1 != 0){
sum+= idx_hvals[k] * idx_in[pos_n-k];
}
for (int ind1=pos_n-k; k<max_k; k+=2,ind1-=2) {
sum+= idx_hvals[k] * idx_in[ind1];
sum_+= idx_hvals[k+1] * idx_in[ind1-1];
}
sum+=sum_;
*/
for (int ind1=pos_n-k; k<max_k; k++,ind1--) {
sum+=idx_hvals[k]*idx_in[ind1];
}
}
//do the part to the left of the img
if(fill_value!=0)
for(;k<dim_ker[1];k++) sum+= idx_hvals[k]*fill_value;
}
}else{//valid mode
const %(type)s* idx_in=&in[ind0*dim_im[1]];
const %(type)s* idx_hvals=&hvals[j*dim_ker[1]];
if(iter_n + 4*%(self_dy)s < Os[1]){
nb_sum=4;
for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
sum+=idx_hvals[k]*idx_in[im_idx];
sum2+=idx_hvals[k]*idx_in[im_idx+%(self_dy)s];
sum3+=idx_hvals[k]*idx_in[im_idx+2*%(self_dy)s];
sum4+=idx_hvals[k]*idx_in[im_idx+3*%(self_dy)s];
}
}else if(iter_n + 2*%(self_dy)s < Os[1]){
nb_sum=2;
for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
sum+=idx_hvals[k]*idx_in[im_idx];
sum2+=idx_hvals[k]*idx_in[im_idx+%(self_dy)s];
}
}else{
nb_sum=1;
for (int k=dim_ker[1]-1,im_idx=pos_n; k >=0; k--,im_idx++) {
sum+=idx_hvals[k]*idx_in[im_idx];
}
}
}//else valid mode
}//for j
switch(nb_sum){
case 4: out[iter_m*dim_zz[1]+iter_n+3] %(affectation)s sum4;
case 3: out[iter_m*dim_zz[1]+iter_n+2] %(affectation)s sum3;
case 2: out[iter_m*dim_zz[1]+iter_n+1] %(affectation)s sum2;
case 1: out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum;
}
iter_n+=nb_sum-1;
/*
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum;
if(nb_sum>=2){
iter_n++;
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum2;
}
if(nb_sum>=3){
iter_n++;
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum3;
}
if(nb_sum>=4){
iter_n++;
out[iter_m*dim_zz[1]+iter_n] %(affectation)s sum4;
}
*/
}//for iter_n
}//for iter_m
}//for stack_size
if (0 && (mode==FULL)){
for (int i = 0; i < dim_zz[0]*dim_zz[1]; ++i)
std::cout << " " << out[i];
std::cout << "\\n";
}
}//for n_kern
}//for b
Py_XDECREF(img2d);
Py_XDECREF(filtersflipped);
"""
......@@ -62,17 +62,6 @@ def scan(fn, sequences, initial_states, non_sequences, inplace_map={},
# compute the number of sequences and the number of outputs
n_seqs = len(seqs)
# see if there are outputs that do not feed anything back to the function
# applied recursively
#outs_tapkeys = outputs_taps.keys()
#outs_tapkeys.sort()
#for k in outs_tapkeys:
# if outputs_taps[k] == []:
# # add empty lists where you have outputs that do not have past
# # values
# init_outs = init_outs[:k] + [[]] + init_outs[k:]
n_outs = len(init_outs)
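As a hedged usage sketch of this interface (it mirrors ``test_1`` in the
scan tests further down in this changeset; the type chosen for ``n_steps``
is our assumption):

.. code-block:: python

    import theano
    import theano.tensor
    import theano.sandbox.scan

    def f_pow2(x_tm1):
        # One step of the recurrence: x_t = 2 * x_{t-1}.
        return (2 * x_tm1, {})

    s = theano.tensor.dvector()
    n_steps = theano.tensor.lscalar()
    Y = theano.sandbox.scan.scan(f_pow2, [], s, [], n_steps=n_steps)
    f1 = theano.function([s, n_steps], Y)
    # f1([1], 3) -> [2, 4, 8]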
......
......@@ -41,7 +41,7 @@ def flip(kern, kshp):
global_rng = N.random.RandomState(3423489)
dmatrix4=T.TensorType('float64', (False, False, False, False))
def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll_batch=0, unroll_kern=0, img=T.dmatrix(), validate=True, conv_op_py=False, do_convolve2=False, do_print=True, repeat=1):
def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll_batch=0, unroll_kern=0, img=T.dmatrix(), validate=True, conv_op_py=False, do_convolve2=False, do_print=True, repeat=1, unroll_patch=0):
# build actual input images
imgval = global_rng.rand(bsize, imshp[0], imshp[1], imshp[2])
......@@ -121,7 +121,7 @@ def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll
hidval1=outval.copy()
# ConvOp
conv_op = ConvOp(imshp, kshp, nkern, bsize, ss[0],ss[1], conv_mode, unroll_batch=unroll_batch, unroll_kern=unroll_kern)(inputs4, kerns4)
conv_op = ConvOp(imshp, kshp, nkern, bsize, ss[0],ss[1], conv_mode, unroll_batch=unroll_batch, unroll_kern=unroll_kern, unroll_patch=unroll_patch)(inputs4, kerns4)
l1shp=N.hstack((nkern,
getFilterOutShp(imshp, kshp, ss, conv_mode)))
propup2 = function([inputs4, kerns4], conv_op)
......@@ -328,7 +328,7 @@ class TestConvOp(unittest.TestCase):
ssizess = [[(1,1),(1,2)],[(1,1),(2,2)]]
convmodes = ['valid','full']
do_convolve2=True
unroll = [(0,0),(1,1),(2,2),(3,2)]#(batch,kern)
unroll = [(0,0,False),(0,0,True),(1,1,False),(2,2,False),(3,2,False)]#(batch,kern,patch)
do_speed_test = False
# TODO: this version show a bug that was fixed
......@@ -338,6 +338,11 @@ class TestConvOp(unittest.TestCase):
# nkerns = [2,2] # per output pixel
# ssizes = [(1,1),(2,2)]#2,2)]
# bsizes = [1,1] # batch size
# imshp_starts = [(1,10,10),(1,5,6)]
# kshpss = ([[2,3],[3,2]],[[2,2],[2,2]])
# nkernss = [[1,1],[1,1]] # per output pixel
N.set_printoptions(threshold=N.nan)
# symbolic stuff
......@@ -356,8 +361,8 @@ class TestConvOp(unittest.TestCase):
unroll_batch = [1,2,4,5,10,20]
unroll_kern = [1,2,4,5,10,20]
unroll_batch = [1,2,5]
unroll_kern = [1,2,5]
unroll_batch = [1,4,5]
unroll_kern = [1,4,5]
bsize = 20 # batch size
imshp_start = (1,48,48) # shape chosen to exercise more corner cases.
......@@ -374,46 +379,86 @@ class TestConvOp(unittest.TestCase):
timing = N.zeros((len(unroll_batch),len(unroll_kern),3))
t_b_k=[]
#calculate the timing with unrolling
t_=[[ 7.60572791, 3.95069814, 3.74271464], [ 4.05631089, 2.90384555, 2.93613672], [ 3.90551591, 2.92595196, 3.00102282]]
best=[]
worst=[]
best=[0.52690219879150391, 2.4266397953033447]
worst=[0.92042708396911621, 6.8822150230407715]
t_=[]
for unroll_b, n_b in zip(unroll_batch,range(len(unroll_batch))):
for unroll_k, n_k in zip(unroll_kern,range(len(unroll_kern))):
t_b_k.append(str(unroll_b)+"/"+str(unroll_k))
tctot, tpytot, ntot=[],[],[]
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizes,range(len(ssizes))):
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=unroll_b, unroll_kern=unroll_k, validate=validate)
tctot+=[tctot_]
tpytot+=[tpytot_]
ntot+=[ntot_]
timing[n_b,n_k]=[sum(tctot), sum(tpytot), sum(ntot)]
if not t_:
tctot, tpytot, ntot=[],[],[]
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizes,range(len(ssizes))):
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=unroll_b, unroll_kern=unroll_k, validate=validate)
tctot+=[tctot_]
tpytot+=[tpytot_]
ntot+=[ntot_]
if unroll_b==4 and unroll_k==4:
print "unroll 4/4",tctot
best=tctot
if unroll_b==1 and unroll_k==1:
print "unroll 1/1",tctot
worst=tctot
timing[n_b,n_k]=[sum(tctot), sum(tpytot), sum(ntot)]
if not t_:
t=timing[:,:,0]#We select only the c timing.
else:
t=t_
t=N.asarray(t)
#calculate the old timing
tctot,tpytot,ntot=0,0,0
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizes,range(len(ssizes))):
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate)
tctot+=tctot_
tpytot+=tpytot_
ntot+=ntot_
print "old code timing %.3fs"%tctot
# print timing
t=timing[:,:,0]#We select only the c timing.
tctot_=[0.52555489540100098, 6.6634182929992676]
# tctot_=[]
tctot,tpytot,ntot=[],[],[]
if not tctot_:
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizes,range(len(ssizes))):
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate)
tctot+=[tctot_]
tpytot+=[tpytot_]
ntot+=[ntot_]
else: tctot=N.asarray(tctot_)
print "old code timing %.3fs"%sum(tctot),tctot
best=N.asarray(best)
worst=N.asarray(worst)
print "timing for unrolled version"
print t_b_k
print t
print "max %.3fs"%t.max(), "max param(batch unloop size/kernel unloop size)", t_b_k[t.argmax()]
print "min %.3fs"%t.min(), "min param(batch unloop size/kernel unloop size)", t_b_k[t.argmin()]
print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t.min(),tctot/t.min())
print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t.min(),sum(tctot)/t.min())
print worst/best,tctot/best
tctot_patch = []
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizes,range(len(ssizes))):
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=2)
tctot_patch += [tctot_]
t_patch=sum(tctot_patch)
print "unroll_patch time", tctot_patch
print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t_patch,sum(tctot)/t_patch)
print best/tctot_patch, worst/tctot_patch
print best
print worst
print tctot
print tctot_patch
return
for i in range(len(kshpss)):
for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
for ss, n_ss in zip(ssizess[i],range(len(ssizess[i]))):
for un_b, un_k in unroll:
for un_b, un_k, un_p in unroll:
tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(
conv_mode, ss, bsizes[i], imshp_starts[i],
kshpss[i], nkernss[i],
img=img, unroll_batch=un_b, unroll_kern=un_k,
unroll_patch=un_p,
validate=True)
tctot+=[tctot_]
tpytot+=[tpytot_]
......@@ -428,6 +473,11 @@ class TestConvOp(unittest.TestCase):
d=N.asarray(ntot)/tpytot
print 'speed up py theano(ConvOp) vs convolve2d: %.3fx'%d.mean(),d
def init_data(self,shape):
    # Deterministic data by default; the random alternative is kept below.
    return N.ones(shape)
    #return N.random.random(shape)
def test_ConvOpGrad(self):
"""
test the gradient in float and double
......@@ -442,9 +492,9 @@ class TestConvOp(unittest.TestCase):
kshps = [(2,3)]
imshps = [(2,3,4)]
modes = ['valid', 'full']
unroll = [(0,0),(1,1),(2,3)]
unroll = [(0,0,True),(1,1,False),(2,3,False),(1,1,False),(0,0,False)]#(batch,kern,patch)
ssizes = [(1,1),(2,2)]
for typ in types:
imgs = T.TensorType(typ, (False, False, False, False),'imgs')
kerns = T.TensorType(typ, (False, False, False, False),'kerns')
......@@ -457,12 +507,12 @@ class TestConvOp(unittest.TestCase):
imgvals = N.array(N.random.random(N.hstack((bsize,imshp))),dtype=imgs.dtype)
for kshp in kshps:
t=numpy.array([imshp[1]-kshp[0],imshp[2]-kshp[1]])
kernvals = N.array(N.random.rand(nkern,visdim,kshp[0],
kshp[1]),dtype=kerns.dtype)
kernvals = N.array(self.init_data((nkern,visdim,kshp[0],
kshp[1])),dtype=kerns.dtype)
# 'full' mode should support kernels bigger than the input
if mode == 'valid' and (t<0).any():
continue
for un_b,un_k in unroll:
for un_b,un_k, un_p in unroll:
for ss in ssizes:
print 'test_ConvOpGrad'
print 'mode type:', mode, typ
......@@ -476,14 +526,14 @@ class TestConvOp(unittest.TestCase):
def test_i(imgs):
convop = ConvOp(imshp, kshp, nkern, bsize, ss[0], ss[1],
output_mode=mode, unroll_batch=un_b, unroll_kern=un_k)
output_mode=mode, unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p)
return convop(imgs, kernvals)
def test_k(kerns):
convop = ConvOp(imshp, kshp, nkern, bsize, ss[0], ss[1],
output_mode=mode, unroll_batch=un_b, unroll_kern=un_k)
output_mode=mode, unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p)
return convop(imgvals, kerns)
print mode, imshp, kshp, un_b, un_k, ss
#TODO the tolerance needed to pass is very high for float32(0.17). Is this acceptable? Expected?
tol = None
if typ=="float32":
......
from scan import Scan
import unittest
import theano
import theano.sandbox.scan
import random
import numpy.random
......@@ -74,6 +75,14 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps = None, tol = None,
def compareArrays(a, b):
    # Coerce lists/tuples to arrays, then compare elementwise with a
    # fixed absolute tolerance.
    if type(a) in (list, tuple):
        a = numpy.array(a)
    if type(b) in (list, tuple):
        b = numpy.array(b)
    return numpy.all(abs(a - b) < 1e-5)
......@@ -85,7 +94,7 @@ class T_Scan(unittest.TestCase):
# generator network, only one output, type scalar; no sequence or
# non-sequence arguments
def test_1():
def test_1(self):
def f_pow2(x_tm1):
return (2*x_tm1, {})
......@@ -94,11 +103,12 @@ class T_Scan(unittest.TestCase):
Y = theano.sandbox.scan.scan(f_pow2, [],s, [],n_steps = n_steps)
f1 = theano.function([s,n_steps], Y)
assert( numpy.any(f1([1],3)== [2,4,8]) )
assert(compareArrays(f1([1],3), [2,4,8]))
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars
def test_2():
def test_2(self):
def f_rnn(u_t,x_tm1,W_in, W):
return (u_t*W_in+x_tm1*W, {})
......@@ -109,14 +119,15 @@ class T_Scan(unittest.TestCase):
Y = theano.sandbox.scan.scan(f_rnn, u,x0,[W_in,W])
f2 = theano.function([u,x0,W_in,W], Y)
assert(numpy.any(f2([1,2,3,4],[1],.1,1)== \
numpy.array([1.1,1.3,1.6,2.])))
f2 = theano.function([u,x0,W_in,W], Y)
v_u = numpy.array([1.,2.,3.,4.])
v_x0 = numpy.array([1])
v_out = numpy.array([1.1,1.3,1.6,2.])
assert(compareArrays( f2(v_u,v_x0,.1,1), v_out ) )
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars; using shared variables
def test_3():
def test_3(self):
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
......@@ -128,14 +139,16 @@ class T_Scan(unittest.TestCase):
Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0,[])
f3 = theano.function([u,x0], Y)
assert(numpy.any(f3([1,2,3,4],[1])== numpy.array([1.1,1.3,1.6,2.])))
f3 = theano.function([u,x0], Y)
v_u = numpy.array([1.,2.,3.,4.])
v_x0 = numpy.array([1.])
v_out = numpy.array([1.1,1.3,1.6,2.])
assert(compareArrays(f3(v_u,v_x0),v_out))
# some rnn with multiple outputs and multiple inputs; other dimension
# instead of scalars/vectors
def test_4():
def test_4(self):
W_in2 = theano.shared(numpy.array([1.,2.]), name='win2')
W = theano.shared(numpy.array([[2.,1.],[1.,1.]]), name='w')
......@@ -152,20 +165,22 @@ class T_Scan(unittest.TestCase):
Y = theano.sandbox.scan.scan(f_rnn_cmpl,[u1,u2],[x0,y0],W_in1)
f4 = theano.function([u1,u2,x0,y0,W_in1], Y)
(x,y) = f4( numpy.array([[1,2],[1,2],[1,2]]), \
numpy.array([1,2,3]), \
numpy.array([[0,0]]), \
numpy.array([1]), \
numpy.array([[1,1],[1,1]]))
assert( numpy.all(x == numpy.array([[4.,5.],[18.,16.],[58.,43.]])))
assert( numpy.all(y == numpy.array([0.,7.,25.])))
f4 = theano.function([u1,u2,x0,y0,W_in1], Y)
v_u1 = numpy.array([[1.,2.],[1.,2.],[1.,2.]])
v_u2 = numpy.array([1.,2.,3.])
v_x0 = numpy.array([[0.,0.]])
v_y0 = numpy.array([1])
v_Win1 = numpy.array([[1.,1.],[1.,1.]])
v_x = numpy.array([[4.,5.],[18.,16.],[58.,43.]])
v_y = numpy.array([0.,7.,25.])
(x,y) = f4( v_u1, v_u2, v_x0, v_y0, v_Win1)
assert( compareArrays(x,v_x))
assert( compareArrays(y,v_y))
# basic ESN using updates
def test_5():
def test_5(self):
W_in = theano.shared(numpy.array([1.,1.]), name='win')
W = theano.shared(numpy.array([[.1,0.],[.0,.1]]),name='w')
W_out= theano.shared(numpy.array([.5,1.]), name='wout')
......@@ -180,12 +195,15 @@ class T_Scan(unittest.TestCase):
Y = theano.sandbox.scan.scan(f_ESN,u,y0,[],outputs_taps={0:[]})
f5 = theano.function([u,y0],Y)
assert( f5( numpy.array([1,2,3]), numpy.array([0])) == \
numpy.array([0.,1.4,3.15]))
f5 = theano.function([u,y0],Y)
v_u = numpy.array([1.,2.,3.])
v_y0 = numpy.array([0.])
v_out = numpy.array([0.,1.5,3.15])
out = f5( v_u, v_y0 )
assert( compareArrays(v_out, out))
# basic ESN using updates ; moving backwards
def test_6():
def test_6(self):
W_in = theano.shared(numpy.array([1.,1.]), name='win')
W = theano.shared(numpy.array([[.1,0.],[.0,.1]]),name='w')
W_out= theano.shared(numpy.array([.5,1.]), name='wout')
......@@ -201,9 +219,55 @@ class T_Scan(unittest.TestCase):
Y = theano.sandbox.scan.scan(f_ESN,u,y0,[],outputs_taps={0:[]}, \
go_backwards = True)
f6 = theano.function([u,y0],Y)
assert( f6( numpy.array([1,2,3]), numpy.array([0])) == \
numpy.array([0., 4.5, 3.45]))
f6 = theano.function([u,y0],Y)
v_u = numpy.array([1.,2.,3.])
v_y0 = numpy.array([0])
v_out = numpy.array([0.,4.5,3.45])
out = f6(v_u, v_y0)
assert( compareArrays(out, v_out))
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars; using shared variables and past
# taps (sequences and outputs)
def test_7(self):
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
W_in = theano.shared(.1, name = 'w_in')
W = theano.shared(1., name ='w')
def f_rnn_shared(u_tm2, x_tm1, x_tm2):
return (u_tm2*W_in+x_tm1*W+x_tm2, {})
Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0, [], \
sequences_taps = {0:[-2]}, outputs_taps = {0:[-1,-2]})
f7 = theano.function([u,x0], Y)
#print f7([1,2,3,4],[1,2])
# simple rnn, one input, one state, weights for each; input/state are
# vectors, weights are scalars; using shared variables and past
# taps (sequences and outputs) and future taps for sequences
def test_8(self):
u = theano.tensor.dvector()
x0 = theano.tensor.dvector()
W_in = theano.shared(.1, name = 'w_in')
W = theano.shared(1., name ='w')
def f_rnn_shared(u_tm2,u_tp2, x_tm1, x_tm2):
return ((u_tm2+u_tp2)*W_in+x_tm1*W+x_tm2, {})
Y = theano.sandbox.scan.scan(f_rnn_shared, u,x0, [], \
sequences_taps = {0:[-2,2]}, outputs_taps = {0:[-1,-2]})
f8 = theano.function([u,x0], Y)
#print f8([1,2,3,4,5,6],[1,2])
'''
......@@ -214,7 +278,8 @@ class T_Scan(unittest.TestCase):
- test gradient (go_backwards)
- test gradient (multiple outputs / some uncomputable )
- test gradient (truncate_gradient)
- test gradient (force_gradient)
- test gradient (force_gradient)
- test_gradient (taps past/future)
- test inplace map
'''
......
......@@ -1020,13 +1020,18 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
# / softmax(x)
# which arises from the gradient of log(softmax(x))[arange(y.shape[0]), y]
#
# TODO: explain variants of case 1.
# TODO: explain other variants of case 2.
# In some cases, in case 2., instead of "-1. like (AdvancedSubtensor...)",
# we can have "-1. like ([-1] * AdvancedSubtensor...)". This case will be
# recognized too, but other variants, even with the same shape, might not
# (yet).
# The base cases are realized when the gradient of the
# cost wrt the output is equal to 1. When this gradient
# has another (scalar) value, it typically appears in the
# second argument of AdvancedIncSubtensor. In that case, we
# try to extract it, and feed it as the output gradient of
# crossentropy_softmax_1hot_with_bias_dx.
#
# N.B. Regarding clients -- This substitution is important for numerical stability, so we
# perform the substitution even when intermediate values have multiple clients.
......@@ -1052,43 +1057,60 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
else:
return
# Check that incr has the form -1./sm[arange(len(y)), y]
# In the base case (output gradient = 1), incr is -1./sm[arange(len(y)), y]
# Here, we are looking for the AdvancedSubtensor term (sm[arange(len(y)), y]),
# the remainder of the expression will be used to compute outgrad_factor
# outgrad_factor will be constructed in 3 steps as follows:
# outgrad_factor = +/- 1 (initial sign)
# outgrad_factor *= numerator
# outgrad_factor /= denominator
adv_subtensor = None
outgrad_factor = 1.
# If there's a 'minus' sign before the whole expression, put it in
# outgrad_factor and iterate
if incr.owner and incr.owner.op == tensor.neg:
outgrad_factor = -1.
incr = incr.owner.inputs[0]
if incr.owner and incr.owner.op == tensor.true_div:
num, denom = incr.owner.inputs
if not (hasattr(num, 'data') and numpy.all(num.data == -1)):
# set outgrad_factor according to the numerator,
# it may be divided later
if hasattr(num, 'data') and numpy.all(num.data == -1):
# Base case, num is -1
outgrad_factor *= 1.
elif numpy.all(num.broadcastable):
# Otherwise, it should be a scalar
outgrad_factor *= -num
else:
return
#else: OK
if not denom.owner:
return
adv_subtensor = None
if isinstance(denom.owner.op, tensor.AdvancedSubtensor):
# Base case
adv_subtensor = denom
mult_factor = 1
outgrad_factor /= 1.
elif denom.owner.op == tensor.mul:
# Try to find the AdvancedSubtensor node mentionned above
# For now, we support only the case where the other inputs
# of the "mul" node are of integer type, so we are sure it
# does not affect the gradient computation.
# Try to find the AdvancedSubtensor node mentioned above,
# and a scalar that is equal to the output gradient
for i, input in enumerate(denom.owner.inputs):
if input.owner and isinstance(input.owner.op, tensor.AdvancedSubtensor):
adv_subtensor = input
other_inputs = [in_ for (j, in_) in enumerate(denom.owner.inputs) if j!=i]
if len(other_inputs) == 1:
mult_factor = other_inputs[0]
rest = other_inputs[0]
else:
mult_factor = tensor.mul(*[other_inputs])
rest = tensor.mul(*[other_inputs])
# Check that mult_factor is of integer type
if mult_factor.dtype.startswith('int')\
or mult_factor.dtype.startswith('uint'):
#OK
# Check that rest is a scalar
if numpy.all(rest.broadcastable):
adv_subtensor = input
outgrad_factor /= rest
break
else:
# That subtensor was not right
adv_subtensor = None
else:
return
......@@ -1101,6 +1123,8 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
if not (maybe_sm is sm and maybe_rows is rows and maybe_labels is labels):
return
#else: OK
else:
return
else:
return
......@@ -1147,7 +1171,7 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
if incr.owner and incr.owner.op == tensor.fill:
model, value = incr.owner.inputs
adv_subtensor = None
mult_factor = 1
outgrad_factor = None
if model.owner and isinstance(model.owner.op, tensor.AdvancedSubtensor):
adv_subtensor = model
else:
......@@ -1169,17 +1193,16 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
if not (maybe_log_sm is log_sm and maybe_rows is rows and maybe_labels is labels):
return
#else: OK
else:
return
# In the base case, value is the constant '-1'
if hasattr(value, 'data') and numpy.all(value.data == -1):
mult_factor = 1
# In the case of -1/denom, if denom is of integer type
elif value.owner and value.owner.op == tensor.true_div:
val_num, val_denom = value.owner.inputs
if hasattr(val_num, 'data') and numpy.all(val_num.data == -1):
if val_denom.dtype.startswith('int')\
or val_denom.dtype.startswith('uint'):
mult_factor = val_denom
outgrad_factor = 1.
# Otherwise, it should be a scalar, and the output gradient
# would be -value
elif numpy.all(value.broadcastable):
outgrad_factor = -value
else:
return
......@@ -1204,11 +1227,10 @@ def local_advanced_indexing_crossentropy_onehot_grad(node):
# Dimension check before substitution
if labels.ndim == 1 and x_var.ndim == 2:
if mult_factor is not None:
out_grad = tensor.fill(x_var[:,0], 1./mult_factor)
if outgrad_factor is not None:
out_grad = tensor.fill(x_var[:,0], outgrad_factor)
return [crossentropy_softmax_1hot_with_bias_dx(out_grad, sm, labels)]
else:
print 'mult_factor is None?'
return
else:
return
......
......@@ -346,7 +346,7 @@ def local_IncSubtensor_serialize(node):
#
# add(x, incsubtensor(b, c), incsubtensor(b, d))
# -> incsubtensor(incsubtensor(add(x,b), c), d)
# -> incsubtensor(incsubtensor(add(x,b,b), c), d)
"""
def movable(i):
......@@ -354,7 +354,8 @@ def local_IncSubtensor_serialize(node):
return i.owner \
and isinstance(i.owner.op, T.IncSubtensor) \
and i.type == o_type \
and len(i.clients) == 1
and len(i.clients) == 1 \
and not i.owner.op.set_instead_of_inc
if node.op == T.add:
o_type = node.outputs[0].type
......@@ -383,7 +384,8 @@ def local_IncSubtensor_serialize(node):
@gof.local_optimizer([None])
def local_inplace_setsubtensor(node):
if isinstance(node.op, T.IncSubtensor) and not node.op.inplace:
new_op = T.IncSubtensor(node.op.idx_list, inplace=True)
new_op = T.IncSubtensor(node.op.idx_list, inplace=True, \
set_instead_of_inc=node.op.set_instead_of_inc)
new_node = new_op(*node.inputs)
return [new_node]
return False
......@@ -932,8 +934,11 @@ def local_neg_neg(node):
@register_specialize
@gof.local_optimizer([T.neg])
def local_neg_div_neg(node):
"""- (-a / b) -> a / b
Also performs - (c / b) -> ((-c) / b) when c is a scalar constant.
"""
if node.op == T.neg:
"""- (-a / b) -> a / b"""
if node.inputs[0].owner and node.inputs[0].owner.op == T.true_div:
frac = node.inputs[0]
num, denom = frac.owner.inputs
......@@ -942,6 +947,11 @@ def local_neg_div_neg(node):
# No other clients of the original division
new_num = num.owner.inputs[0]
return [T.true_div(new_num, denom)]
elif numpy.all(num.broadcastable) and isinstance(num, gof.Constant):
if len(frac.clients) == 1:
new_num = -num.data
return [T.true_div(new_num, denom)]
@gof.local_optimizer([T.mul])
def local_mul_zero(node):
......
......@@ -223,6 +223,204 @@ class T_CrossentropyCategorical1Hot(unittest.TestCase):
assert not has_softmax
assert not has_softmaxdx
def test_get_rid_of_advanced_indexing_version_of_xent(self):
verbose = 0
# TODO: add the optimization in FAST_COMPILE?
# In the meantime, run it as 'FAST_RUN' instead
mode = theano.compile.mode.get_default_mode()
if mode == 'FAST_COMPILE':
mode = 'FAST_RUN'
rng = numpy.random.RandomState(utt.fetch_seed())
x_val = rng.randn(3,5)
b_val = rng.randn(5)
y_val = numpy.asarray([2,4,1])
x = T.dmatrix('x')
b = T.dvector('b')
y = T.lvector('y')
def print_graph(func):
for i, node in enumerate(func.maker.env.toposort()):
print i, node
# Last node should be the output
print i, pprint(node.outputs[0])
print
## Basic case
expressions = [
T.sum(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),
T.sum(-T.log(softmax(x))[T.arange(y.shape[0]), y])
]
for expr in expressions:
# Verify the optimizer worked on the expressions
f = theano.function([x,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 4
f(x_val, y_val)
# Also verify the gradient wrt x
g = theano.function([x,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 4
g(x_val, y_val)
## Test that a biased softmax is optimized correctly
bias_expressions = [
T.sum(-T.log(softmax(x+b)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(b+x)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(x+b))[T.arange(y.shape[0]), y]),
T.sum(-T.log(softmax(b+x))[T.arange(y.shape[0]), y])]
for expr in bias_expressions:
f = theano.function([x,b,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 2 # [big_op, sum]
f(x_val, b_val, y_val)
g = theano.function([x,b,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 4
g(x_val, b_val, y_val)
## Test that using "mean" instead of sum works, too
mean_expressions = [
T.mean(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y]),
T.mean(-T.log(softmax(x))[T.arange(y.shape[0]), y])]
for expr in mean_expressions:
f = theano.function([x,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 7
f(x_val, y_val)
g = theano.function([x,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 8
g(x_val, y_val)
mean_bias_expressions = [
T.mean(-T.log(softmax(x+b)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(b+x)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(x+b))[T.arange(y.shape[0]), y]),
T.mean(-T.log(softmax(b+x))[T.arange(y.shape[0]), y])]
for expr in mean_bias_expressions:
f = theano.function([x,b,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 5
g = theano.function([x,b,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 8
g(x_val, b_val, y_val)
def test_scale_cost(self):
# TODO: add the optimization in FAST_COMPILE?
# In the meantime, run it as 'FAST_RUN' instead
mode = theano.compile.mode.get_default_mode()
if mode == 'FAST_COMPILE':
mode = 'FAST_RUN'
rng = numpy.random.RandomState(utt.fetch_seed())
x_val = rng.randn(3,5)
b_val = rng.randn(5)
y_val = numpy.asarray([2,4,1])
x = T.dmatrix('x')
b = T.dvector('b')
y = T.lvector('y')
a = T.dscalar('a')
def print_graph(func):
for i, node in enumerate(func.maker.env.toposort()):
print i, node
# Last node should be the output
print i, pprint(node.outputs[0])
def validate_fn_graph(func):
# The graph of the function should not have softmax anymore
has_cx1hot = False
has_softmax = False
for node in func.maker.env.toposort():
if node.op == crossentropy_softmax_argmax_1hot_with_bias:
has_cx1hot = True
if node.op == softmax:
has_softmax = True
assert has_cx1hot
assert not has_softmax
def validate_grad_graph(func):
# The graph of the gradient should not have softmaxgrad anymore
has_cx1hotdx = False
has_softmax = False
has_softmaxdx = False
for node in func.maker.env.toposort():
if node.op == crossentropy_softmax_1hot_with_bias_dx:
has_cx1hotdx = True
if node.op == softmax:
has_softmax = True
if node.op == softmax_grad:
has_softmaxdx = True
assert has_cx1hotdx
assert has_softmax
assert not has_softmaxdx
## Cases to test
expressions = [
a * T.sum(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
-a * T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),
a * (-T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y]))),
a * T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),
a * T.sum(-T.log(softmax(x))[T.arange(y.shape[0]), y]),
-a * T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),
a * (-T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y])),
a * T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),
a * T.mean(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
-a * T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y])),
a * (-T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y]))),
a * T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y])),
a * T.mean(-T.log(softmax(x))[T.arange(y.shape[0]), y]),
-a * T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y]),
a * (-T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y])),
a * T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y]),
]
for expr in expressions:
# Verify the optimizer worked on the expressions
f = theano.function([x,y,a], expr, mode=mode)
assert 5 <= len(f.maker.env.toposort()) <= 10
validate_fn_graph(f)
f(x_val, y_val, 0.1)
# Verify the gradient wrt x
g = theano.function([x,y,a], T.grad(expr, x), mode=mode)
assert 5 <= len(g.maker.env.toposort()) <= 12
validate_grad_graph(g)
g(x_val, y_val, 0.1)
# Verify the gradient when providing output gradient
h = theano.function([x,y,a], T.grad(expr, x, g_cost=a*x.sum()), mode=mode)
assert 8 <= len(h.maker.env.toposort()) <= 17
validate_grad_graph(h)
h(x_val, y_val, 0.1)
def test_argmax_pushdown():
x = tensor.dmatrix()
......@@ -306,101 +504,6 @@ def test_asymptotic_32():
assert gxval[0,1] == 0.25
def test_get_rid_of_advanced_indexing_version_of_xent():
verbose = 0
if 0: mode = 'DEBUG_MODE'
else: mode = 'FAST_RUN'
rng = numpy.random.RandomState(utt.fetch_seed())
x_val = rng.randn(3,5)
b_val = rng.randn(5)
y_val = numpy.asarray([2,4,1])
x = T.dmatrix('x')
b = T.dvector('b')
y = T.lvector('y')
def print_graph(func):
for i, node in enumerate(func.maker.env.toposort()):
print i, node
# Last node should be the output
print i, pprint(node.outputs[0])
## Basic case
expressions = [
T.sum(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(x))[T.arange(y.shape[0]), y]),
T.sum(-T.log(softmax(x))[T.arange(y.shape[0]), y])]
for expr in expressions:
# Verify the optimizer worked on the expressions
f = theano.function([x,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 4
f(x_val, y_val)
# Also verify the gradient wrt x
g = theano.function([x,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 4
g(x_val, y_val)
## Test that a biased softmax is optimized correctly
bias_expressions = [
T.sum(-T.log(softmax(x+b)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(b+x)[T.arange(y.shape[0]), y])),
-T.sum(T.log(softmax(x+b))[T.arange(y.shape[0]), y]),
T.sum(-T.log(softmax(b+x))[T.arange(y.shape[0]), y])]
for expr in bias_expressions:
f = theano.function([x,b,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 2 # [big_op, sum]
f(x_val, b_val, y_val)
g = theano.function([x,b,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 4
g(x_val, b_val, y_val)
## Test that using "mean" instead of sum works, too
mean_expressions = [
T.mean(-T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(x)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(x))[T.arange(y.shape[0]), y]),
T.mean(-T.log(softmax(x))[T.arange(y.shape[0]), y])]
for expr in mean_expressions:
f = theano.function([x,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 7
f(x_val, y_val)
g = theano.function([x,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 8
g(x_val, y_val)
mean_bias_expressions = [
T.mean(-T.log(softmax(x+b)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(b+x)[T.arange(y.shape[0]), y])),
-T.mean(T.log(softmax(x+b))[T.arange(y.shape[0]), y]),
T.mean(-T.log(softmax(b+x))[T.arange(y.shape[0]), y])]
for expr in mean_bias_expressions:
f = theano.function([x,b,y], expr, mode=mode)
if verbose: print_graph(f)
assert len(f.maker.env.toposort()) == 5
g = theano.function([x,b,y], T.grad(expr, x), mode=mode)
if verbose: print_graph(g)
assert len(g.maker.env.toposort()) == 8
g(x_val, b_val, y_val)
# hint - call the argmax push-down optimization first too
......