Commit f4cecd01 authored by Olivier Delalleau

Merged

......@@ -137,17 +137,23 @@ following methods:
the gradient of the Op's output but rather the gradient of some
other criterion C with respect to the Op's input.
If the outputs of your op are [ f_1, ... f_n], then
``output_derivatives`` gives [ grad_{f_1} C, grad_{f_2} C, ... , grad_{f_n} C ]
If the inputs of your op are [x_1, ..., x_n], then your Op.grad should
return [ grad_{x_1} C, grad_{x_2} C, ..., grad_{x_n} C ]
where (grad_{y} z)_i = partial z / partial y_i (and i can have any
number of dimensions)
(note: in the case where i is 2 dimensional, this definition of grad
If the outputs of your op are :math:`[ f_1, ... f_n]`, then
``output_derivatives`` gives
:math:`[ grad_{f_1}(C), grad_{f_2}(C), ... , grad_{f_n}(C) ]`.
If the inputs of your op are :math:`[x_1, ..., x_m]`, then your Op.grad
should return :math:`[ grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C) ]`,
where :math:`(grad_{y} z)_i = \frac{\partial z}{\partial y_i}`
(and :math:`i` can have any number of dimensions).
(Note: in the case where :math:`i` is 2-dimensional, this definition of grad
is different from the standard mathematical definition of the gradient
of a scalar with respect to a matrix, where you transpose the indices)
of a scalar with respect to a matrix, where you transpose the indices.)
In other words, :func:`grad` does not return
:math:`\frac{\partial f_i}{\partial x_j}`, but
:math:`\frac{\partial C}{\partial x_j} =
\frac{\partial C}{\partial f_i} \cdot \frac{\partial f_i}{\partial x_j}`.
Both the partial differentiation and that multiplication have to be
performed by :func:`grad`.
At a bare minimum, a new Op must define ``make_node`` and ``perform``, which have no defaults.
......
......@@ -8,7 +8,7 @@ arrays efficiently. Theano features:
* **tight integration with numpy** -- Use `numpy.ndarray` in Theano-compiled functions.
* **transparent use of a GPU** -- Perform data-intensive calculations up to 140x faster than on a CPU (float32 only).
* **symbolic differentiation** -- Let Theano do your derivatives.
* **efficient symbolic differentiation** -- Theano does your derivatives for functions with one or many inputs.
* **speed and stability optimizations** -- Get the right answer for ``log(1+x)`` even when ``x`` is really tiny.
* **dynamic C code generation** -- Evaluate expressions faster.
* **extensive unit-testing and self-verification** -- Detect and diagnose many types of mistakes.
......
......@@ -272,61 +272,84 @@ that fail on your platform (use the ``theano-users@googlegroups.com`` mailing li
but note that you must first register to it, by going to `theano-users`_).
Windows V1(bigger install, but simpler instruction + try instruction for gpu)
-----------------------------------------------------------------------------
- If you don't have Python yet, I would recommend the Python(x,y)
distribution. It is only one installation and contains the most
important packages (NumPy, SciPy, IPython, Matplotlib, Mingw, Nose,
etc.).
- Next you should install Mercurial and download Theano.
Command line version: http://mercurial.selenic.com/
One gui version(Tortoise hg): http://mercurial.selenic.com/downloads/
the command is
hg clone http://hg.assembla.com/theano Theano
- Theano needs 1 environment variable:
a) system variable PYTHONPATH with value C:\...\Theano
(installation folder of theano)
In the USERPROFILE directory
you should create a configuration file .theanorc.
.theanorc.txt is also accepted on Windows if the environment
variable THEANORC is not set. The file should have the following
two lines:
Windows V1 (bigger install, but simpler instructions + tentative GPU instructions)
----------------------------------------------------------------------------------
- Install `Python(x,y) <http://www.pythonxy.com>`_. It is a single installation
file that contains additional packages like Numpy, Scipy, IPython, Matplotlib,
MinGW, Nose, etc. Note that this implies you do not already have a Python
installation (if you do have one, then you will need to either remove it first,
or install those additional packages manually as described in the V2 instructions).
- Install Mercurial. You can use either the
`command-line version <http://mercurial.selenic.com/>`_ or the
`GUI version (TortoiseHg) <http://mercurial.selenic.com/downloads/>`_ (for the purpose of
simply downloading Theano, the command line version is enough).
- Start a shell (hit the Start button and run the ``cmd`` command) and navigate to
the directory where you want to install Theano (it is ok to just stay in the
default directory, which should be your user profile directory). Then download
Theano with:
.. code-block:: bash
hg clone http://hg.assembla.com/theano Theano
- Add (or edit) the PYTHONPATH environment variable (available through Control
Panel / System / Advanced / Environment Variables), so that it contains
the full installation directory of Theano. Restart a shell (``cmd``) to verify
that it works:
.. code-block:: bash
C:\Users\login>echo %PYTHONPATH%
C:\Users\login\Theano
- Create a new ``.theanorc`` text file (or ``.theanorc.txt``, which is easier
to create under Windows) in your user profile directory, with the following
two lines:
.. code-block:: bash
[blas]
ldflags =
This is enough to run Theano! It will use NumPy for dot products
which, however, is pretty fast (see below).
To test that theano read correctly the .theanorc or .theanorc.txt file,
in python run:
- You are now ready to run Theano.
It will use NumPy for dot products, which is still pretty fast (see below for
optional instructions on how to compile your own BLAS library).
To test that Theano correctly read your configuration file, run Python (the
easiest way is to just type ``python`` in a shell) and run the following:
.. code-block:: bash
.. code-block:: python
import theano
print theano.config.blas.ldflags
That should print the same content as what is in your config file.
- (Optional) If you want a faster and/or multithreaded BLAS library, you can
compile GotoBLAS2. I did not try to compile ATLAS because I read that
it is slower than Goto and very difficult to compile (especially for
This should print the same content as in your config file, i.e. nothing
(if your config file was not read properly, it would print ``-lblas``).
Windows V1.5 (optional follow-up to V1 instructions)
----------------------------------------------------
- If you want a faster and/or multithreaded BLAS library, you can
compile GotoBLAS2. We did not try to compile ATLAS because we read that
it is slower than Goto and more difficult to compile (especially on
Windows).
GotoBLAS can be downloaded after a simple registration (the most
recent version is 1.13 right now). To compile it, you need to install
two more programs: MSYS and Perl (for example ActivePerl). Actually,
the GotoBLAS makefiles expect a full UNIX environment (like Cygwin)
but the BLAS compilation seems to work with only MSYS and Perl. The
LAPACK compilation fails, but we don't need it anyway.
GotoBLAS2 can be downloaded
`here <http://www.tacc.utexas.edu/tacc-projects/gotoblas2/downloads>`_
after registering on the website (we tested v1.13).
To compile it, you also need to install MSYS and Perl (for instance
ActivePerl).
The GotoBLAS makefiles actually expect a full UNIX environment (like
Cygwin) but the BLAS compilation seems to work with only MSYS and Perl.
The LAPACK compilation fails, but is not needed anyway.
(WORK-IN-PROGRESS, TO BE CONTINUED)
Compilation steps:
a) Unpack GotoBLAS2 (using 7-zip or the MSYS tar command)
a) Unpack GotoBLAS2 (using `7-zip <http://www.7-zip.org/>`_ or the
MSYS tar command).
b) open MSYS, change directory to GotoBLAS2 (cd command)
......@@ -354,7 +377,7 @@ Windows V1(bigger install, but simpler instruction + try instruction for gpu)
b) The Windows binaries of NumPy were compiled with ATLAS and are surprisingly fast.
c) GotoBLAS is even faster, in particular if you have several kernels.
- (Optional) Gpu on Windows. Not sur it work! Can you report success/error on the `theano-user <http://groups.google.ca/group/theano-users?pli=1>`_ mailing list?
- (Optional) GPU on Windows. We are not sure it works! Can you report success or failure on the `theano-users <http://groups.google.com/group/theano-users>`_ mailing list?
These are instructions for the 32-bit version of Python; the one that comes with Python(x,y) is 32-bit.
......
......@@ -174,9 +174,9 @@ Config Attributes
A list of optimizer tags that we don't want included in the default Mode.
If multiple tags, separate them by ':'.
Ex: to remove the elemwise inplace optimizer(slow for big graph)
use the flags: optimizer_excluding:inplace_opt
inplace_opt is the name of that optimization.
Ex: to remove the elemwise inplace optimizer (slow for big graphs),
use the flag ``optimizer_excluding:inplace_opt``, where
``inplace_opt`` is the name of that optimization.
.. attribute:: optimizer_including
......
......@@ -18,11 +18,15 @@ awkward to use when :func:`tensor.grad` can do the job.
.. function:: grad_sources_inputs(sources, graph_inputs, warn_type=True)
A gradient source is a pair (``r``, ``g_r``), in which ``r`` is a `Variable`, and ``g_r`` is a
`Variable` that is a gradient wrt ``r``.
A gradient source is a pair (``v``, ``g_v``), in which ``v`` is
a `Variable`, and ``g_v`` is a `Variable` that is a gradient wrt
``v``. More specifically, ``g_v`` is the gradient of an external
scalar cost, ``cost`` (that is not explicitly used), wrt ``v``.
This function traverses the graph backward from the ``v`` sources,
calling ``op.grad(...)`` for all ops with some non-None gradient on an output.
calling ``op.grad(...)`` for all ops with some non-None gradient
on an output, to compute gradients of ``cost`` wrt intermediate
variables and ``graph_inputs``.
The ``op.grad(...)`` functions are called like this:
......@@ -30,14 +34,20 @@ awkward to use when :func:`tensor.grad` can do the job.
op.grad(op.inputs[:], [total_gradient(v) for v in op.outputs])
This call to ``op.grad`` should return a list or tuple: one symbolic gradient per input.
If ``op`` has a single input, then ``op.grad`` should return a list or tuple of length 1.
This call to ``op.grad`` should return a list or tuple: one symbolic
gradient per input. These gradients represent the gradients of
the same implicit ``cost`` mentioned above, wrt ``op.inputs``. Note
that this is **not** the same as the gradient of ``op.outputs`` wrt
``op.inputs``.
For each input wrt to which ``op`` is not differentiable, it should return ``None`` instead
of a `Variable` instance.
If ``op`` has a single input, then ``op.grad`` should return a list
or tuple of length 1.
For each input wrt to which ``op`` is not differentiable, it should
return ``None`` instead of a `Variable` instance.
If a source ``r`` receives a gradient from another source ``r2``,
then the effective gradient on ``r`` is the sum of both gradients.
If a source ``r`` receives a gradient from another source ``r2``, then the effective
gradient on ``r`` is the sum of both gradients.
:type sources: list of pairs of Variable: (v, gradient-on-v) to
initialize the total_gradient dictionary
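The summing rule above can be sketched in plain Python (a toy accumulator, not Theano's actual implementation): when a variable receives gradient contributions from several sources, the entries of the ``total_gradient`` dictionary are accumulated by addition.

```python
# Toy gradient accumulator; the names are illustrative only.
total_gradient = {}

def accumulate(var, g):
    """Add a gradient contribution for `var` into the running total."""
    total_gradient[var] = total_gradient.get(var, 0.0) + g

# For C = x*y + x at x=2, y=3, dC/dx arrives from two ops and is summed:
x, y = 2.0, 3.0
accumulate('x', y)    # contribution from the product term
accumulate('x', 1.0)  # contribution from the linear term
```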
......
......@@ -463,15 +463,6 @@ TensorVariable
(0, 'x', 1) -> AxB to Ax1xB
(1, 'x', 0) -> AxB to Bx1xA
See :func:`dimshuffle`.
(The above link just points back to this paragraph. Maybe whoever
wrote that meant to refer to theano.tensor.DimShuffle)
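The two dimshuffle patterns listed above have direct numpy analogues, shown here only to illustrate the shape transformations:

```python
import numpy as np

a = np.zeros((4, 5))           # an AxB matrix with A=4, B=5
ax1xb = a[:, np.newaxis, :]    # (0, 'x', 1): AxB -> Ax1xB
bx1xa = a.T[:, np.newaxis, :]  # (1, 'x', 0): AxB -> Bx1xA
```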
.. method:: flatten(ndim=1)
Returns a view of this tensor with `ndim` dimensions, whose shape for the first
......@@ -500,11 +491,11 @@ Shaping and Shuffling
=====================
To re-order the dimensions of a variable, to insert or remove broadcastable
dimensions, see :meth:`_tensor_py_operators.dimshuffle`
dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
.. function:: shape(x)
Returns lvector representing shape of `x`
Returns an lvector representing the shape of `x`.
.. function:: reshape(x, newshape, ndim=None)
......@@ -562,7 +553,8 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`
Make `x` broadcastable in the specified axes `axes`. For
example, `unbroadcast(x,0)` will make the first dimension of `x`
broadcastable.
broadcastable. When performing the function, if the length of `x`
along that dimension is not 1, a ``ValueError`` will be raised.
.. function:: flatten(x, outdim=1)
......@@ -1105,10 +1097,14 @@ Gradient / Differentiation
Return symbolic gradients for one or more variables with respect to some
cost.
For more information about how automatic differentiation works in Theano,
see :mod:`gradient`. For information on how to implement the gradient of
a certain Op, see :func:`grad`.
:type cost: 0-d tensor variable
:type wrt: tensor variable or list of tensor variables
:type g_cost: same as `cost`
:type g_cost: same as type of `cost`
:type consider_constant: list of variables
:type warn_type: bool
......@@ -1121,7 +1117,7 @@ Gradient / Differentiation
expression
:rtype: variable or list of variables (matching `wrt`)
:returns: gradients with respect to cost for each of the `wrt` terms
:returns: gradients of the cost with respect to each of the `wrt` terms
......
.. _basictutaliasing:
===============
Memory Aliasing
===============
The aggressive reuse of memory is one of the ways Theano makes code fast, and
it's important for the correctness and speed of your program that you understand
which buffers Theano might alias to which others.
This file describes the principles for how Theano treats memory, and explains
when you might want to change the default behaviour of some functions and
methods for faster performance.
The memory model: 2 spaces
==========================
There are some simple principles that guide Theano's treatment of memory. The
main idea is that there is a pool of memory managed by Theano, and Theano tracks
changes to values in that pool.
1. Theano manages its own memory space, which typically does not overlap with
the memory of normal python variables that non-theano code creates.
2. Theano functions only modify buffers that are in its memory space.
3. Theano's memory space includes the buffers allocated to store shared
variables and the temporaries used to evaluate Functions.
4. Physically, Theano's memory space may be spread across the host, one or
more GPU devices, and in the future may even include objects on a remote machine.
5. The memory allocated for a shared variable buffer is unique: it is never
aliased to another shared variable.
6. Theano's managed memory is constant while Theano Functions are not running
and Theano library code is not running.
7. The default behaviour of Function is to return user-space values for
outputs, and to expect user-space values for inputs.
The distinction between Theano-managed memory and user-managed memory can be
broken down by some Theano functions (e.g. ``In``, ``Out``, ``shared``, ``get_value``) by using
a ``borrow=True`` flag. This can make those methods faster (by avoiding copy
operations) at the expense of risking subtle bugs in the overall program (by
aliasing memory).
The rest of this section is aimed at helping you to understand when it is safe
to use the ``borrow=True`` argument and reap the benefit of faster code.
Borrowing when creating shared variables
========================================
A ``borrow`` argument can be provided to the shared-variable constructor.
.. code-block:: python
import numpy, theano
np_array = numpy.ones(2, dtype='float32')
s_default = shared(np_array)
s_false = shared(np_array, borrow=False)
s_true = shared(np_array, borrow=True)
By default (``s_default``) and when explicitly setting ``borrow=False``, the
shared variable we construct gets a [deep] copy of ``np_array``. So changes we
subsequently make to ``np_array`` have no effect on our shared variable.
.. code-block:: python
np_array += 1  # now it is an array of 2.0 s
s_default.value  # -> array([1.0, 1.0])
s_false.value    # -> array([1.0, 1.0])
s_true.value     # -> array([2.0, 2.0])
If we are running this with the CPU as the device,
then changes we make to ``np_array`` *right away* will show up in ``s_true.value``,
because numpy arrays are mutable and ``s_true`` is using the ``np_array``
object as its internal buffer.
However, this aliasing of ``np_array`` and ``s_true`` is *inconsistent and fragile*!
It is inconsistent because if Theano is using a GPU device, then the borrow flag
has no effect.
It is fragile because
if we call a Theano function that updates the value of ``s_true``, the aliasing
relationship *may* or *may not* be broken (it depends on what the Theano
function does).
*Take home message:*
It is safe practice (and a good idea) to use ``borrow=True`` in a shared
variable constructor when the shared variable stands for a large object (in
terms of memory footprint) and you do not want to create copies of it in
memory.
It is not a reliable technique to use ``borrow=True`` to modify shared variables
by side-effect, because with some devices (e.g. GPU devices) this technique will
not work.
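The borrow semantics can be mimicked with plain numpy (an illustrative sketch only; with a real shared variable on a GPU the borrow flag has no effect):

```python
import numpy as np

np_array = np.ones(2, dtype='float32')
s_false = np_array.copy()  # borrow=False: the shared variable owns a copy
s_true = np_array          # borrow=True (CPU): the shared variable aliases the buffer

np_array += 1              # mutate the original array in place
# the copy is unaffected; the alias sees the change
```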
Borrowing when accessing value of shared variables
==================================================
Retrieving
----------
A ``borrow`` argument can also be used to control how a shared variable's value is retrieved.
.. code-block:: python
s = shared(np_array)
v_false = s.get_value(borrow=False) # N.B. borrow default is False
v_true = s.get_value(borrow=True)
When ``borrow=False`` is passed to ``get_value``, it means that the return value
may not be aliased to any part of Theano's internal memory.
When ``borrow=True`` is passed to ``get_value``, it means that the return value
*might* be aliased to some of Theano's internal memory.
But both of these calls might create copies of the internal memory.
The reason that ``borrow=True`` might still make a copy is that the internal
representation of a shared variable might not be what you expect. When you
create a variable by passing a numpy array for example, then ``get_value()``
must return a numpy array too. That's how Theano can make the GPU use
transparent. But when you are using a GPU (or in future perhaps a remote machine), then the numpy.ndarray
is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never copy the internal object (always return in
constant time), but might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the numpy array).
.. code-block:: python
v_internal = s.get_value(borrow=True, return_internal_type=True)
It is possible to use ``borrow=False`` in conjunction with
``return_internal_type=True``, which will return a deep copy of the internal object.
This is primarily for internal debugging, not for typical use.
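A toy model of these combinations (an assumption-laden sketch, not Theano's real internals; ``GpuBuffer`` and ``SharedVar`` are made-up names standing in for an on-device array type and a shared variable):

```python
import numpy as np

class GpuBuffer:
    """Stand-in for an internal, non-ndarray representation (e.g. on a GPU)."""
    def __init__(self, data):
        self.data = np.asarray(data)
    def copy(self):
        return GpuBuffer(self.data.copy())
    def to_ndarray(self):
        return self.data.copy()  # transferring off the device implies a copy

class SharedVar:
    def __init__(self, value):
        self._internal = GpuBuffer(value)
    def get_value(self, borrow=False, return_internal_type=False):
        if return_internal_type:
            # borrow=True: hand back the internal object itself (constant time)
            return self._internal if borrow else self._internal.copy()
        # must return an ndarray, so a copy happens even with borrow=True
        return self._internal.to_ndarray()

s = SharedVar([1.0, 2.0])
```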
*Take home message:*
It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
your code does not modify the return value. *Do not use this to modify a shared
variable by side-effect* because it will make your code device-dependent.
Modification of GPU variables by this sort of side-effect is impossible.
Assigning
---------
Shared variables also have a ``set_value`` method that can accept an optional ``borrow=True`` argument.
The semantics are similar to those of creating a new shared variable -
``borrow=False`` is the default and ``borrow=True`` means that Theano *may*
reuse the buffer you provide as the internal storage for the variable.
A standard pattern for manually updating the value of a shared variable is as
follows.
.. code-block:: python
s.set_value(
some_inplace_fn(s.get_value(borrow=True)),
borrow=True)
This pattern works regardless of the compute device, and when the compute device
makes it possible to expose Theano's internal variables without a copy, then it
goes as fast as an in-place update.
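Why this pattern avoids copies can be seen with a minimal mock (a hypothetical ``Shared`` class, not Theano's API) that honors the borrow flag on both accessors:

```python
import numpy as np

class Shared:
    """Toy shared variable honoring the borrow flag on get/set."""
    def __init__(self, value):
        self._buf = np.array(value)  # default: take a copy
    def get_value(self, borrow=False):
        return self._buf if borrow else self._buf.copy()
    def set_value(self, new, borrow=False):
        self._buf = new if borrow else np.array(new)

s = Shared(np.zeros(3))
buf = s.get_value(borrow=True)                     # no copy on read
s.set_value(np.add(buf, 1, out=buf), borrow=True)  # in-place update, no copy on write
```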
Borrowing when constructing Function objects
============================================
A ``borrow`` argument can also be provided to the ``In`` and ``Out`` objects
that control how ``theano.function`` handles its arguments and return value(s).
.. code-block:: python
import theano, theano.tensor
x = theano.tensor.matrix()
y = 2*x
f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))
Borrowing an input means that Theano will treat the argument you provide as if
it were part of Theano's pool of temporaries. Consequently, your input
may be reused as a buffer (and overwritten!) during the computation of other variables in the
course of evaluating that function (e.g. ``f``).
Borrowing an output means that Theano will not insist on allocating a fresh
output buffer every time you call the function. It will possibly reuse the same one as
a previous call, and overwrite the old contents. Consequently, it may overwrite
old return values by side effect.
It is also possible to pass a ``return_internal_type=True`` flag to the ``Out``
variable, which has the same interpretation as the ``return_internal_type`` flag
of the shared variable's ``get_value`` method.
*Take home message:*
When an input ``x`` to a function is not needed after the function returns and you
would like to make it available to Theano as additional workspace, then consider
marking it with ``In(x, borrow=True)``. It may make the function faster and
reduce its memory requirement.
When a return value ``y`` is large (in terms of memory footprint) and you only
need to read from it once, right after it is returned, then consider marking it
with ``Out(y, borrow=True)``.
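What ``Out(y, borrow=True)`` permits can be sketched with an ordinary Python callable that recycles one output buffer across calls (names are illustrative, not Theano's machinery):

```python
import numpy as np

class DoubleFn:
    """Toy compiled function y = 2*x that reuses one output buffer per call."""
    def __init__(self):
        self._out = None  # the reused output buffer
    def __call__(self, x):
        x = np.asarray(x, dtype='float64')
        if self._out is None or self._out.shape != x.shape:
            self._out = np.empty_like(x)
        np.multiply(x, 2, out=self._out)  # overwrites the previous return value
        return self._out

f = DoubleFn()
first = f([1.0, 2.0])
second = f([10.0, 20.0])  # the buffer held by `first` is overwritten here
```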
......@@ -144,6 +144,8 @@ array([[ 0.25 , 0.19661193],
The resulting function computes the gradient of its first argument
with respect to the second. In this way, Theano can be used for
`automatic differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_.
Contrary to what this page says, Theano performs efficient symbolic differentiation
even for functions with many inputs.
.. note::
......
......@@ -510,7 +510,7 @@ class Function(object):
# Set positional arguments
i = 0
for arg in args:
for arg_index, arg in enumerate(args):
#TODO: provide a Param option for skipping the filter if we
# really want speed.
s = self.input_storage[i]
......@@ -520,7 +520,7 @@ class Function(object):
try:
s.storage[0] = s.type.filter(arg, strict=s.strict)
except Exception, e:
e.args = tuple(list(e.args)+["Bad input argument at index %d"%(list(args).index(arg))])
e.args = tuple(list(e.args)+["Bad input argument at index %d" % arg_index])
raise
s.provided += 1
i+=1
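The change from ``list(args).index(arg)`` to ``enumerate`` matters because ``.index`` returns the position of the *first* equal element, so the error message pointed at the wrong argument whenever two arguments compared equal:

```python
args = (1.0, 5.0, 1.0)

# buggy: duplicate values always report the first occurrence
buggy = [list(args).index(a) for a in args]
# fixed: enumerate yields the true position of each argument
fixed = [i for i, _ in enumerate(args)]
```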
......
import os
import subprocess
import logging
from theano.configparser import TheanoConfigParser, AddConfigVar, EnumStr, StrParam, IntParam, FloatParam, BoolParam
_logger = logging.getLogger('theano.configdefaults')
def warning(*msg):
_logger.warning('WARNING theano.configdefaults: '+' '.join(msg))
config = TheanoConfigParser()
AddConfigVar('floatX',
......@@ -24,10 +31,22 @@ AddConfigVar('mode',
"Default compilation mode",
EnumStr('Mode', 'ProfileMode', 'DebugMode', 'FAST_RUN', 'FAST_COMPILE', 'PROFILE_MODE', 'DEBUG_MODE'))
#Keep the default linker the same as the one for the mode FAST_RUN
AddConfigVar('linker',
"Default linker. If not None, will use this linker with the Mode object(not ProfileMode or DebugMode)",
EnumStr('c|py', 'py', 'c', 'c|py_nogc', 'c&py'))
# Test whether or not gcc is present: disable C code if it is not
try:
subprocess.Popen('gcc', stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# Keep the default linker the same as the one for the mode FAST_RUN
AddConfigVar('linker',
"Default linker. If not None, will use this linker with the Mode "+
"object (not ProfileMode or DebugMode)",
EnumStr('c|py', 'py', 'c', 'c|py_nogc', 'c&py'))
except OSError:
# gcc is not present, linker should default to python only
AddConfigVar('linker',
"Default linker. If not None, will use this linker with the Mode object(not ProfileMode or DebugMode)",
EnumStr('py', 'c|py', 'c', 'c|py_nogc', 'c&py'))
warning('GCC not detected! Theano will be unable to execute optimized '+
'C-implementations (for both CPU and GPU) and will default to '+
'Python implementations. Performance will be severely degraded.')
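The detection logic above boils down to: spawning a program that is not on the PATH raises ``OSError``. A minimal standalone sketch:

```python
import subprocess

def program_exists(name):
    """Return True if `name` can be spawned, False if absent from PATH."""
    try:
        p = subprocess.Popen([name], stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        p.communicate()  # wait for the process so no zombie is left behind
        return True
    except OSError:
        return False
```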
#Keep the default optimizer the same as the one for the mode FAST_RUN
AddConfigVar('optimizer',
......@@ -88,19 +107,23 @@ AddConfigVar('experimental.mrg',
###
### To disable some warning about old bug that are fixed now.
###
AddConfigVar('warn.old_bug_default',
"If False, disables by default the warnings about old Theano bugs. If you have never used Theano before, you can set it to False.",
BoolParam(True))
default_warn = config.warn.old_bug_default
AddConfigVar('warn.argmax_pushdown_bug',
"Warn if past versions of Theano generated a bug with the theano.tensor.nnet.nnet.local_argmax_pushdown optimization. Fixed 27 May 2010.",
BoolParam(True))
BoolParam(default_warn))
AddConfigVar('warn.gpusum_01_011_0111_bug',
"Warn if we are in a case where old versions of Theano had a silent bug with the GpuSum patterns 01, 011 and 0111 when the first dimension was larger than 4096. Fixed 31 May 2010.",
BoolParam(True))
BoolParam(default_warn))
AddConfigVar('warn.sum_sum_bug',
"Warn if we are in a case where Theano versions between revision 9923a40c7b7a and 2 August 2010 (fix date) generated bad code when there were 2 consecutive sums in the graph. Fixed 2 August 2010.",
BoolParam(True))
BoolParam(default_warn))
AddConfigVar('warn.sum_div_dimshuffle_bug',
"Warn if previous versions of Theano (between rev. 3bd9b789f5e8, 2010-06-16, and cfc6322e5ad4, 2010-08-03) would have given incorrect results. This bug was triggered by a sum of a division of dimshuffled tensors.",
BoolParam(True))
BoolParam(default_warn))
......@@ -50,7 +50,7 @@ class CLinkerObject(object):
- `MethodNotDefined`: Subclass does not implement this method
"""
raise utils.MethodNotDefined("c_lib_dirs", type(self), self.__class__.__name__)
raise utils.MethodNotDefined("c_header_dirs", type(self), self.__class__.__name__)
def c_libraries(self):
"""Optional: Return a list of libraries required by code returned by
......
......@@ -23,7 +23,7 @@ def debugprint(obj, depth=-1, print_type=False, file=None):
:type file: None, 'str', or file-like object
:param file: print to this file ('str' means to return a string)
:returns: str if `file`=='str', else file arg
:returns: string if `file` == 'str', else file arg
Each line printed represents a Variable in the graph.
The indentation of each line corresponds to its depth in the symbolic graph.
......
......@@ -18,6 +18,7 @@ from theano.sandbox.cuda.nnet import (
GpuCrossentropySoftmax1HotWithBiasDx,
GpuSoftmax, GpuSoftmaxWithBias)
from theano.compile import optdb
from theano.tensor.blas import _is_real_vector, _is_real_matrix
#optdb.print_summary() # this shows what is currently registered (in a so-far crude way...)
gpu_optimizer = EquilibriumDB()
......@@ -57,12 +58,12 @@ class InputToGpuOptimizer(Optimizer):
if new_input.type==input.type:
env.replace_validate(input, new_input, "To allow further optimisation to move Ops to gpu")
except Exception, e:
#as we currently only support float32, this can fail.
#Using try except make that we won't need
#as we currently only support float32, this can fail.
#Using try except make that we won't need
pass
#we register it before all other gpu optimizer to be sure that the input are on the gpu.
gpu_seqopt.register('InputToGpuOptimizer', InputToGpuOptimizer(),
gpu_seqopt.register('InputToGpuOptimizer', InputToGpuOptimizer(),
0, 'fast_run', 'fast_compile', 'merge')#TODO: how to make it mandatory for gpu_seqopt?
@local_optimizer([])
......@@ -72,9 +73,9 @@ def local_cut_gpu_host_gpu(node):
if tensor.opt.opt.check_chain(node, host_from_gpu, gpu_from_host):
return [node.inputs[0].owner.inputs[0]]
return False
gpu_cut_copies.register('cut_gpu_host_transfers', local_cut_gpu_host_gpu,
gpu_cut_copies.register('cut_gpu_host_transfers', local_cut_gpu_host_gpu,
'fast_run', 'inplace', 'gpu')
gpu_cut_copies.register('cut_gpu_constant_transfers', tensor.opt.constant_folding,
gpu_cut_copies.register('cut_gpu_constant_transfers', tensor.opt.constant_folding,
'fast_run', 'gpu')
#register it into canonicalize to allow other optimization to work without
#botering with this useless pattern.
......@@ -83,7 +84,7 @@ compile.optdb['canonicalize'].register('local_cut_gpu_host_gpu', local_cut_gpu_h
@register_opt()
@local_optimizer([])
def local_gpu_elemwise_0(node):
"""elemwise(..., host_from_gpu, ...)
"""elemwise(..., host_from_gpu, ...)
-> host_from_gpu(elemwise(gpu_from_host, ..., gpu_from_host)
"""
if isinstance(node.op, tensor.Elemwise):
......@@ -92,25 +93,29 @@ def local_gpu_elemwise_0(node):
#don't set any inplace pattern. gpu_insert_inplace_optimizer will do it later
new_op = GpuElemwise(node.op.scalar_op)
# first establish that float32 can store all inputs
upcastable = set(['float32', 'int8', 'int16', 'uint8', 'uint16'])
# case 1 - all inputs are already float32
if numpy.all([i.type.dtype == 'float32' for i in node.inputs]):
#TODO: change this when fusion makes Elemwise with multiple outputs
return [host_from_gpu(new_op(*(gpu_from_host(i) for i in node.inputs)))]
# THIS IS PROBABLY TRUE....
# case 2 - it would still be ok if some inputs were upcast to float32
# first establish that float32 can store all inputs
upcastable = set(['float32', 'int8', 'int16', 'uint8', 'uint16'])
if numpy.all([i.type.dtype in upcastable for i in node.inputs]):
gpu_elemwise = new_op(*(gpu_from_host(i) for i in node.inputs))
# case 2 - it is still ok if some inputs were upcast to float32
elif numpy.all([i.type.dtype in upcastable for i in node.inputs]):
# second - establish that a new node with upcasted inputs has the same outputs
# types as the original node
casted = node.op.make_node(*[tensor.cast(i, 'float32') for i in node.inputs])
if [o.type for o in casted.outputs] == [o.type for o in node.outputs]:
new_inputs = [gpu_from_host(tensor.cast(i, 'float32')) for i in node.inputs]
gpu_elemwise = new_op(*new_inputs)
else:
return False
else:
return False
return [host_from_gpu(new_op(*new_inputs))]
gpu_elemwise = split_huge_add_or_mul(gpu_elemwise.owner).outputs[0]
return [host_from_gpu(gpu_elemwise)]
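Why exactly these dtypes are considered upcastable: float32 has a 24-bit significand, so every (u)int8 and (u)int16 value is represented exactly, while int32 or float64 values would lose precision. numpy's safe-casting rules agree:

```python
import numpy as np

upcastable = {'float32', 'int8', 'int16', 'uint8', 'uint16'}
# every dtype in the set casts safely (losslessly) to float32
safe = all(np.can_cast(np.dtype(dt), np.float32) for dt in upcastable)
# wider types do not: they would lose precision
lossy = np.can_cast(np.int32, np.float32) or np.can_cast(np.float64, np.float32)
```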
@register_opt()
@local_optimizer([])
def local_gpu_elemwise_1(node):
......@@ -124,7 +129,9 @@ def local_gpu_elemwise_1(node):
#don't set any inplace pattern. gpu_insert_inplace_optimizer will do it later
new_op = GpuElemwise(elemwise_node.op.scalar_op)
if all([i.dtype=='float32' for i in elemwise_node.inputs]):
return [new_op(*[gpu_from_host(i) for i in elemwise_node.inputs])]
gpu_elemwise = new_op(*[gpu_from_host(i) for i in elemwise_node.inputs])
gpu_elemwise = split_huge_add_or_mul(gpu_elemwise.owner).outputs[0]
return [gpu_elemwise]
return False
@register_opt()
......@@ -138,18 +145,68 @@ def local_gpu_dimshuffle_0(node):
input, = node.inputs
if input.owner and isinstance(input.owner.op, HostFromGpu):
# move the add to a GpuAdd
new_op = GpuDimShuffle(node.op.input_broadcastable,
new_op = GpuDimShuffle(node.op.input_broadcastable,
node.op.new_order)
return [host_from_gpu(new_op(gpu_from_host(input)))]
if node.op == gpu_from_host:
host_input = node.inputs[0]
if host_input.owner and isinstance(host_input.owner.op, tensor.DimShuffle):
dimshuffle_node = host_input.owner
new_op = GpuDimShuffle(dimshuffle_node.op.input_broadcastable,
new_op = GpuDimShuffle(dimshuffle_node.op.input_broadcastable,
dimshuffle_node.op.new_order)
return [new_op(gpu_from_host(dimshuffle_node.inputs[0]))]
return False
@register_opt()
@local_optimizer([])
def local_gpu_dot_to_dot22(node):
"""
gpu_from_host(dot) -> gpudot(gpu_from_host)
dot(host_from_gpu) -> host_from_gpu(gpudot)
This optimization solves the vector-matrix multiplication issue by
transforming the vector into a matrix, applying gpu_dot22 and reshaping
the output.
A more suitable solution would be to use the right cuBLAS call.
"""
if node.op == gpu_from_host:
host_input = node.inputs[0]
if host_input.owner and host_input.owner.op == tensor.basic.dot:
x, y = host_input.owner.inputs
# case one vector X matrix
if _is_real_vector(x) and _is_real_matrix(y):
new_op = GpuDimShuffle((False,), ['x',0])
shape_out = y.shape[1].dimshuffle(['x'])
gpu_x = new_op(gpu_from_host(x))
gpu_y = gpu_from_host(y)
# case two matrix X vector
elif _is_real_matrix(x) and _is_real_vector(y):
new_op = GpuDimShuffle((False,), [0,'x'])
shape_out = x.shape[0].dimshuffle(['x'])
gpu_x = gpu_from_host(x)
gpu_y = new_op(gpu_from_host(y))
return [GpuReshape(1)(gpu_dot22(gpu_x, gpu_y), shape_out)]
if node.op == tensor.basic.dot:
if numpy.any([(i.owner and i.owner.op == host_from_gpu) for i in node.inputs]):
x, y = node.inputs
if _is_real_vector(x) and _is_real_matrix(y):
new_op = GpuDimShuffle((False,), ['x',0])
shape_out = y.shape[1].dimshuffle(['x'])
gpu_x = new_op(gpu_from_host(x))
gpu_y = gpu_from_host(y)
elif _is_real_matrix(x) and _is_real_vector(y):
new_op = GpuDimShuffle((False,), [0,'x'])
shape_out = x.shape[0].dimshuffle(['x'])
gpu_x = gpu_from_host(x)
gpu_y = new_op(gpu_from_host(y))
return [host_from_gpu(GpuReshape(1)(gpu_dot22(gpu_x, gpu_y),
shape_out))]
return False
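The transformation above can be sketched in plain numpy, with `numpy.newaxis` standing in for the `GpuDimShuffle` that adds the broadcastable axis and `numpy.dot` for `gpu_dot22` (array values here are illustrative):

```python
import numpy

# Vector-matrix case: promote the vector to shape (1, n) (the 'x' axis added
# by the dimshuffle), do a 2-d "dot22", then reshape back to 1-d.
x = numpy.arange(3, dtype='float32')                 # vector, shape (3,)
y = numpy.arange(12, dtype='float32').reshape(3, 4)  # matrix, shape (3, 4)

x2 = x[numpy.newaxis, :]                    # shape (1, 3)
out = numpy.dot(x2, y).reshape(y.shape[1])  # (1, 4) -> (4,)
assert numpy.allclose(out, numpy.dot(x, y))

# Matrix-vector case: the vector becomes an (n, 1) column instead.
v = numpy.arange(4, dtype='float32')
out2 = numpy.dot(y, v[:, numpy.newaxis]).reshape(y.shape[0])  # (3, 1) -> (3,)
assert numpy.allclose(out2, numpy.dot(y, v))
```

Note that the 1-d result length is `y.shape[1]` in the vector-matrix case and `x.shape[0]` in the matrix-vector case.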
@register_opt()
@local_optimizer([])
def local_gpu_dot22(node):
......@@ -188,6 +245,50 @@ def local_gpu_dot22scalar(node):
return [host_from_gpu(gpu_dot22scalar(gpu_from_host(x), gpu_from_host(y),tensor.blas._as_scalar(scalar)))]
return False
@register_opt()
@local_optimizer([])
def local_gpu_gemv_as_gemm(node):
"""
gpu_from_host(gemv) -> gpu_gemm(gpu_from_host)
gemv(host_from_gpu) -> host_from_gpu(gpu_gemm)
This optimization implements gemv on the gpu by transforming the vectors
into single-column matrices, applying gemm, and removing the extra
dimension from the output.
"""
gemvs = {tensor.blas.gemv_inplace: gpu_gemm_inplace,
tensor.blas.gemv_no_inplace: gpu_gemm_no_inplace}
if node.op == gpu_from_host:
host_input = node.inputs[0]
if host_input.owner and host_input.owner.op in gemvs:
op = host_input.owner.op
z, a, x, y, b = host_input.owner.inputs
return [
GpuDimShuffle((False,True),[0])(gemvs[op](
GpuDimShuffle((False,),[0,'x'])(gpu_from_host(z))
, a
, gpu_from_host(x)
, GpuDimShuffle((False,),[0,'x'])(gpu_from_host(y))
, b))]
if node.op in gemvs:
z, a, x, y, b = node.inputs
x_on_gpu = (x.owner and x.owner.op == host_from_gpu)
y_on_gpu = (y.owner and y.owner.op == host_from_gpu)
z_on_gpu = (z.owner and z.owner.op == host_from_gpu)
if x_on_gpu or y_on_gpu or z_on_gpu:
return [host_from_gpu(GpuDimShuffle((False,True),[0])(
gemvs[node.op](
GpuDimShuffle((False,),[0,'x'])(gpu_from_host(z))
, a
, gpu_from_host(x)
, GpuDimShuffle((False,),[0,'x'])(gpu_from_host(y))
, b)))]
return False
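Numerically, the rewrite above rests on the identity that gemv, `b*z + a*dot(A, y)` on vectors, gives the same result as a gemm on single-column matrices. A plain-numpy sketch (the `alpha`/`beta` names and values are illustrative):

```python
import numpy

rng = numpy.random.RandomState(0)
A = rng.rand(3, 4).astype('float32')
y = rng.rand(4).astype('float32')
z = rng.rand(3).astype('float32')
alpha, beta = numpy.float32(0.5), numpy.float32(2.0)

# gemv: beta*z + alpha*dot(A, y), everything 1-d
gemv_out = beta * z + alpha * numpy.dot(A, y)

# gemm form: dimshuffle the vectors into (n, 1) columns, run the 2-d gemm,
# then drop the extra axis again (the outer GpuDimShuffle in the code above)
z_col = z[:, numpy.newaxis]   # shape (3, 1)
y_col = y[:, numpy.newaxis]   # shape (4, 1)
gemm_out = (beta * z_col + alpha * numpy.dot(A, y_col))[:, 0]

assert numpy.allclose(gemv_out, gemm_out)
```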
@register_opt()
@local_optimizer([])
def local_gpu_gemm(node):
......@@ -421,7 +522,7 @@ def local_gpu_crossentorpy_softmax_argmax_1hot_with_bias(node):
x,b,y = node.inputs
if x.owner and x.owner.op == host_from_gpu:
gpu_x, = x.owner.inputs
# if y is a cast to integers, we can go to the underlying thing if we want,
# since this gpu op will cast to integers internally anyway
int_cast_ops = (
tensor.basic._convert_to_int32,
......@@ -436,8 +537,8 @@ def local_gpu_crossentorpy_softmax_argmax_1hot_with_bias(node):
gpu_from_host(b),
gpu_from_host(cast(y, 'float32')))
am_dtype = node.outputs[2].type.dtype
return [host_from_gpu(gpu_nll),
host_from_gpu(gpu_sm),
cast(host_from_gpu(gpu_am), am_dtype)]
return False
......@@ -633,7 +734,7 @@ else:
#GpuElemwise inplace
gpu_insert_inplace_optimizer = tensor.opt.insert_inplace_optimizer_op(GpuElemwise)
compile.optdb.register('gpu_inplace_opt', gpu_insert_inplace_optimizer, 75, 'fast_run', 'inplace','gpu_inplace')
@register_opt()
@local_optimizer([tensor.Alloc])
......@@ -654,7 +755,7 @@ def local_gpualloc(node):
new_out = host_from_gpu(gpu_alloc(val2, *shp))
# Sigh. it's an annoying thing about theano
# that you can't add information to the graph.
# If for some reason it has come to light that
# one of the dimensions is broadcastable, we have to hide that
# or the optimization won't go through.
if new_out.type != old_out.type:
......@@ -668,24 +769,42 @@ def local_gpualloc(node):
#if old_out.type != new_out.type:
#import pdb; pdb.set_trace()
return [new_out]
def max_inputs_to_GpuElemwise(node):
"""
Return the maximum number of inputs this GpuElemwise Apply node can accept.
This is needed as there is currently a limit of 256 bytes of parameters
for the gpu function. This measures the number of parameters we put in our
gpu function and computes the maximum number of inputs that respects the
256 byte limit.
"""
#TODO: detect the size of gpu pointer and c int.
int_size = 8
ptr_size = 8
argument_limit = 256  # it was 240, with this note: 16 bytes are used for block and thread coords etc.
size_param_mandatory = int_size #for numels
size_param_mandatory += int_size * node.inputs[0].type.ndim  # for the shape
size_param_mandatory += sum((ptr_size + int_size * i.type.ndim) for i in node.outputs)
nb_bytes_avail = argument_limit-size_param_mandatory
nb_bytes_per_inputs = (node.inputs[0].ndim*int_size)+ptr_size
max_nb_inputs = nb_bytes_avail//nb_bytes_per_inputs
return max_nb_inputs
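Under the assumptions hard-coded above (8-byte ints and pointers, a 256-byte parameter limit), the budget works out as follows for an elemwise node with 2-d inputs and a single output. The numbers are a worked example of the same arithmetic, not measured values:

```python
int_size = 8
ptr_size = 8
argument_limit = 256
ndim = 2         # 2-d inputs and output
n_outputs = 1

mandatory = int_size                                    # numels
mandatory += int_size * ndim                            # the shape
mandatory += (ptr_size + int_size * ndim) * n_outputs   # output: data ptr + strides
assert mandatory == 48

bytes_avail = argument_limit - mandatory                # 208 bytes left for inputs
bytes_per_input = ndim * int_size + ptr_size            # strides + data ptr = 24
max_nb_inputs = bytes_avail // bytes_per_input
assert max_nb_inputs == 8  # so a 2-d add/mul node must be split at 8 inputs
```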
def split_huge_add_or_mul(node):
"""
For add and mul, it can happen that we have too many inputs.
That will make nvcc fail to compile our generated code.
We don't want nodes in the graph that can't execute,
as this breaks DebugMode.
This should not happen for other GpuElemwise ops, as only the fusion
optimization can generate ops with too many inputs, and it checks for that.
"""
if node.op.scalar_op in (scal.add, scal.mul):
max_nb_inputs = max_inputs_to_GpuElemwise(node)
while len(node.inputs)>max_nb_inputs:
inner_op = []
for i in range(0,len(node.inputs),max_nb_inputs):
inner_op.append(node.op(*node.inputs[i:i+max_nb_inputs]))
node = node.op(*inner_op).owner
return node
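The splitting strategy reduces to repeatedly grouping the inputs into chunks of at most `max_nb_inputs` and replacing each chunk by one partial result. A pure-Python sketch, with `sum` standing in for the GpuElemwise add:

```python
def split_huge_sum(inputs, max_nb_inputs):
    # Mirror of split_huge_add_or_mul: chunk the inputs until a single
    # node (here, one call to sum) can take them all.
    while len(inputs) > max_nb_inputs:
        inputs = [sum(inputs[i:i + max_nb_inputs])
                  for i in range(0, len(inputs), max_nb_inputs)]
    return sum(inputs)

# 100 inputs with a limit of 10 become 10 partial sums, then one final sum,
# and the result is unchanged.
assert split_huge_sum(list(range(100)), 10) == sum(range(100))
```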
......@@ -759,28 +759,47 @@ def test_many_arg_elemwise():
rng = numpy.random.RandomState( [1,2,3])
for num_args in [25]:
for op_to_test in [ theano.tensor.add, theano.tensor.mul ]:
for nb_dim in [2,3,4,5]:
shapes = [rng.randint(1,5) for i in range(nb_dim)]
args = [ numpy.cast['float32'](rng.randn(*shapes)) for arg in xrange(0,num_args) ]
symb_args = [ theano.tensor.TensorType('float32', (False,)*nb_dim)() for arg in xrange(0,num_args) ]
outputs = []
for mode in [ mode_with_gpu, mode_without_gpu ]:
#test the optimization local_gpu_elemwise_0
f = theano.function( symb_args, op_to_test(*symb_args), mode = mode.excluding("local_gpu_elemwise_1") )
outputs.append( f( * args) )
#assert that the test was done on the gpu.
if mode is mode_with_gpu:
assert any([isinstance(node.op, cuda.GpuElemwise) for node in f.maker.env.nodes])
#test the optimization local_gpu_elemwise_1
f = theano.function( symb_args,
cuda.gpu_from_host(op_to_test(*symb_args)),
mode = mode.excluding("local_gpu_elemwise_0") )
out = f( * args)
#assert that the test was done on the gpu.
if mode is mode_with_gpu:
assert any([isinstance(node.op, cuda.GpuElemwise) for node in f.maker.env.nodes])
assert numpy.allclose(out, outputs[-1])
results_gpu, results_cpu = outputs
assert numpy.allclose(results_gpu, results_cpu)
def test_duplicate_arg_elemwise():
A = theano.tensor.fmatrix()
B = A + A
f = theano.function([A],B, mode = mode_with_gpu)
Aval = numpy.random.RandomState([1,2,3]).randn(5,5)
Bval = Aval + Aval
assert numpy.allclose(Bval,f(Aval))
......
import sys, time
from theano.compile.sharedvalue import shared
import numpy
from theano.compile.pfunc import pfunc
from theano import tensor
import theano
# Skip test if cuda_ndarray is not available.
from nose.plugins.skip import SkipTest
import theano.sandbox.cuda as cuda
if cuda.cuda_available == False:
raise SkipTest('Optional package cuda disabled')
import theano.compile.mode
from theano.sandbox.cuda.type import CudaNdarrayType
if theano.config.mode=='FAST_COMPILE':
......@@ -49,6 +49,9 @@ def test_int_pow():
#theano.printing.debugprint(f)
def test_softmax():
x = tensor.fmatrix()
......@@ -78,7 +81,7 @@ def test_opt_gpujoin_onlyajoin():
b = cuda.shared_constructor(_b)
c = tensor.join(1,a,b)
f = theano.function([], c, mode=mode_with_gpu)
#theano.printing.debugprint(f)
......@@ -105,7 +108,7 @@ def test_opt_gpujoin_joinvectors_elemwise_then_minusone():
b_prime = tensor.sin(b)
c = tensor.join(0,a_prime,b_prime)
d = c[:-1]
f = theano.function([], d, mode=mode_with_gpu)
......
import sys, time
from theano import shared
from theano.compile.pfunc import pfunc
from theano import tensor
import numpy
import theano
import theano.tensor as TT
# Skip test if cuda_ndarray is not available.
from nose.plugins.skip import SkipTest
import theano.sandbox.cuda as cuda_ndarray
if cuda_ndarray.cuda_available == False:
raise SkipTest('Optional package cuda disabled')
import theano.sandbox.cuda as tcn
import theano.sandbox.cuda as cuda
import theano.sandbox.cuda.basic_ops as B
import theano.sandbox.cuda.blas as blasop
import theano.compile.mode
from theano.tests import unittest_tools as utt
### Tolerance factor used in these tests !!!
atol = 1e-6
##########################
if theano.config.mode=='FAST_COMPILE':
mode_with_gpu = theano.compile.mode.get_mode('FAST_RUN').including('gpu')
mode_without_gpu = theano.compile.mode.get_mode('FAST_RUN').excluding('gpu')
else:
mode_with_gpu = theano.compile.mode.get_default_mode().including('gpu')
mode_without_gpu = theano.compile.mode.get_default_mode().excluding('gpu')
def test_dot_vm():
''' Test vector dot matrix '''
v = theano.shared( numpy.array(numpy.random.rand(2), dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2),
dtype='float32'))
no_gpu_f = theano.function([], theano.dot(v,m), mode = mode_without_gpu)
gpu_f = theano.function([], theano.dot(v,m), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuDot22) for node in
gpu_f.maker.env.toposort() ]) == 1
def test_dot_mv():
''' Test matrix dot vector '''
v = theano.shared( numpy.array(numpy.random.rand(2), dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2),
dtype='float32'))
no_gpu_f = theano.function([], theano.dot(m,v), mode = mode_without_gpu)
gpu_f = theano.function([], theano.dot(m,v), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuDot22) for node in
gpu_f.maker.env.toposort() ]) == 1
def test_gemv1():
''' Is this the same test as test_gemv2 ? '''
v1 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
v2 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2), dtype='float32'))
no_gpu_f = theano.function([], v2+theano.dot(m,v1), mode = mode_without_gpu)
gpu_f = theano.function([], v2+theano.dot(m,v1), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuGemm) for node in
gpu_f.maker.env.toposort() ]) == 1
def test_gemv2():
''' Is this the same test as test_gemv1 ? '''
v1 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
v2 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2), dtype='float32'))
no_gpu_f = theano.function([], v2+theano.dot(v1,m), mode = mode_without_gpu)
gpu_f = theano.function([], v2+theano.dot(v1,m), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuGemm) for node in
gpu_f.maker.env.toposort() ]) == 1
if __name__=='__main__':
test_dot_vm()
test_dot_mv()
test_gemv1()
test_gemv2()
......@@ -15,7 +15,7 @@ class DebugLinker(gof.WrapLinker):
copy_originals = False,
check_types = True,
compare_variables = True,
compare_fn = (lambda x, y: x == y)):
gof.WrapLinker.__init__(self,
linkers = linkers,
wrapper = self.wrapper)
......
......@@ -46,9 +46,6 @@ class Images2Neibs(Op):
return Apply(self, [ten4, neib_shape,neib_step], [T.matrix(dtype=ten4.type.dtype)])
def grad(self, (pvals, unis), (gz,)):
return [None, None]
def c_code_cache_version(self):
return (3,)
......@@ -224,6 +221,11 @@ def neibs2images(neibs, neib_shape, original_shape):
Return a 4d tensor of shape `original_shape`.
"""
# TODO: handle the case where patches either overlap
# TODO: handle the case where patches are not directly adjacent
# TODO: at least separate these cases so that the following code does not incorrectly
# handle them by accident.
raise NotImplementedError('check for overlapping patches or non-adjacent patches.')
neibs = T.as_tensor_variable(neibs)
neib_shape = T.as_tensor_variable(neib_shape)
original_shape = T.as_tensor_variable(original_shape)
......
......@@ -3,22 +3,27 @@ import unittest
from nose.plugins.skip import SkipTest
import numpy
try:
import scipy.sparse as sp
import scipy.sparse
except ImportError:
pass # the variable enable_sparse will be used to disable the test file.
import theano
from theano import compile
from theano.sparse import enable_sparse
if enable_sparse == False:
raise SkipTest('Optional package sparse disabled')
from theano.sparse.basic import _is_dense, _is_sparse, _is_dense_variable, _is_sparse_variable
from theano.sparse.basic import _mtypes
from theano.sparse import as_sparse_variable, CSC, CSR, CSM, CSMProperties, SparseType, StructuredDotCSC
from theano.sparse import add, structured_dot, transpose
from theano.sparse import csc_from_dense, csr_from_dense, dense_from_sparse
from theano.tests import unittest_tools as utt
from theano import tensor
def eval_outputs(outputs):
return compile.function([], outputs)()[0]
......
......@@ -57,7 +57,7 @@ __oplist_constructor_list = []
"""List of functions to be listed as op constructors in the oplist (`gen_oplist`, doc/oplist.txt)."""
def constructor(f):
"""Add `f` to :doc:`oplist`.
Make `f` appear as a constructor in the oplist (`gen_oplist`, doc/oplist.txt).
"""
__oplist_constructor_list.append(f)
......@@ -80,7 +80,7 @@ if 0:
if hasattr(x, '_as_CudaNdarrayVariable'):
return x._as_CudaNdarrayVariable() #TODO: pass name and ndim arguments
return as_tensor_variable(x, name, ndim)
def as_tensor_variable(x, name = None, ndim=None):
"""Return `x`, transformed into a `TensorType`
......@@ -158,7 +158,7 @@ class NumpyAutocaster(object):
When config.floatX is float32 (at the time of calling), then this function downcasts float
and numpy.float arguments to numpy.float32, if float32 is in the self.dtypes list.
Python ints are always 64bit and floats are always double precision.
This class uses the algorithm in __call__ to use a narrower dtype when no precision would
be lost, and to even lose precision when this is demanded by the list of dtypes (e.g. to
......@@ -182,7 +182,7 @@ class NumpyAutocaster(object):
# recall: float is numpy.float
if isinstance(x, float) and config.floatX in self.dtypes and config.floatX == 'float32':
return theano._asarray(x, dtype='float32')
for dtype in self.dtypes:
x_ = theano._asarray(x, dtype=dtype)
if numpy.all(x == x_):
......@@ -200,7 +200,7 @@ autocast_float = NumpyAutocaster(('float32', 'float64'))
# this autocasting, and in future, our ops might be smarter about factoring out upcasts. The
# advantage of this mechanism is to combine it with floatX so that 1.0 + xmatrix() will always
# have the same type as the xmatrix().
#
class autocast_float_as(object):
"""This class makes it possible to temporarily and locally adjust autocasting behaviour.
......@@ -222,7 +222,7 @@ class autocast_float_as(object):
def constant_or_value(x, rtype, name=None, ndim=None, dtype=None):
"""Return a symbolic `Constant` with value `x`
:Exceptions:
- `TypeError`: `x` could not be converted to a numpy.ndarray
- `ValueError`: `x` could not be expanded to have ndim dimensions
......@@ -295,19 +295,19 @@ if int(config.tensor.cmp_sloppy)>1:
# useful to test the GPU as they don't use extended precision and
# this cause some difference bigger then the normal sloppy.
float32_atol = 5e-4
float32_rtol = 1e-3
float64_rtol = 1e-4
float64_atol = 1e-3
elif int(config.tensor.cmp_sloppy):
float32_atol = 1e-4
float32_rtol = 1e-3
float64_rtol = 1e-4
float64_atol = 1e-3
else:
#If you change those values in a test, don't forget to put them back when the test ends.
#Don't forget the case when the test fails.
float32_atol = 1e-5
float32_rtol = 1e-3
# defaults in numpy.allclose
float64_rtol = 1.0000000000000001e-05
......@@ -395,7 +395,7 @@ class TensorType(Type):
if self.dtype=='floatX':
self.dtype=config.floatX
### broadcastable is immutable, and all elements are either True or False
self.broadcastable = tuple(bool(b) for b in broadcastable)
self.dtype_specs() # error checking is done there
self.name = name
self.numpy_dtype = numpy.dtype(self.dtype)
......@@ -438,12 +438,12 @@ class TensorType(Type):
except Exception, e:
return str(e)
return "value is valid"
def dtype_specs(self):
"""Return a tuple (python type, c type, numpy typenum) that corresponds to
self.dtype.
This function is used internally as part of C code generation.
"""
#TODO: add more type correspondances for e.g. int32, int64, float32,
......@@ -483,7 +483,7 @@ class TensorType(Type):
a_eq_b = (a==b)
r = numpy.all(a_eq_b)
if r: return True
# maybe the trouble is that there are NaNs
a_missing = numpy.isnan(a)
if a_missing.any():
b_missing = numpy.isnan(b)
......@@ -546,7 +546,7 @@ class TensorType(Type):
#set it to False
cmp_elemwise = numpy.where(both_inf&cmp_elemwise,
a==b,cmp_elemwise)
#check the sign of the inf
both_inf = numpy.where(both_inf,a==b,both_inf)
......@@ -554,7 +554,7 @@ class TensorType(Type):
both_inf += a_inf
if allow_remove_nan:
both_missing += a_missing
# Combine all information.
return (cmp_elemwise + both_missing + both_inf).all()
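The comparison logic above can be condensed into a small numpy sketch (a simplification: it ignores the `allow_remove_nan` option and the sloppy-tolerance machinery): finite entries must be close, NaNs must coincide, and infinities must match in sign.

```python
import numpy

def values_eq_approx_sketch(a, b, rtol=1e-5, atol=1e-8):
    # Approximate equality that treats matching NaNs and same-signed
    # infinities as equal, unlike plain numpy.allclose.
    a, b = numpy.asarray(a, dtype='float64'), numpy.asarray(b, dtype='float64')
    if a.shape != b.shape:
        return False
    finite = numpy.isfinite(a) & numpy.isfinite(b)
    with numpy.errstate(invalid='ignore'):  # inf - inf would warn
        close = finite & (numpy.absolute(a - b) <= (atol + rtol * numpy.absolute(b)))
    both_nan = numpy.isnan(a) & numpy.isnan(b)
    both_inf = numpy.isinf(a) & numpy.isinf(b) & (a == b)  # same sign of inf
    return bool((close | both_nan | both_inf).all())

assert values_eq_approx_sketch([1.0, numpy.nan, numpy.inf],
                               [1.0 + 1e-7, numpy.nan, numpy.inf])
assert not values_eq_approx_sketch([numpy.inf], [-numpy.inf])
assert not values_eq_approx_sketch([numpy.nan], [0.0])
```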
......@@ -634,8 +634,6 @@ class TensorType(Type):
def c_extract(self, name, sub):
"""Override `CLinkerOp.c_extract` """
return """
%(name)s = NULL;
if (py_%(name)s == Py_None) {
......@@ -649,11 +647,13 @@ class TensorType(Type):
PyErr_SetString(PyExc_ValueError, "expected an ndarray");
%(fail)s
}
type_num_%(name)s = ((PyArrayObject*)py_%(name)s)->descr->type_num; //we expect %(type_num)s
if (!PyArray_ISALIGNED(py_%(name)s)) {
PyErr_Format(PyExc_NotImplementedError,
"expected an aligned array of type %%d (%(type_num)s), got non-aligned array of type %%d",
%(type_num)s, type_num_%(name)s);
%(fail)s
}
if (type_num_%(name)s != %(type_num)s) {
PyErr_Format(PyExc_ValueError, "expected type_num %%d (%(type_num)s) got %%d", %(type_num)s, type_num_%(name)s);
%(fail)s
......@@ -885,7 +885,7 @@ class _tensor_py_operators:
def __abs__(self): return abs_(self)
def __neg__(self): return neg(self)
#CASTS
#### REMOVED THESE BECAUSE PYTHON appears to require __int__ to return an int. -JB 20081112
#def __int__(self): return convert_to_int32(self)
#def __float__(self): return convert_to_float64(self)
......@@ -898,7 +898,7 @@ class _tensor_py_operators:
def __ge__(self,other): return ge(self, other)
#BITWISE
def __invert__(self): return invert(self)
def __and__(self,other): return and_(self, other)
def __or__(self,other): return or_(self, other)
def __xor__(self,other): return xor(self, other)
......@@ -910,27 +910,27 @@ class _tensor_py_operators:
# def __ixor__(self, other): return _xor_inplace(self, other)
#ARITHMETIC - NORMAL
def __add__(self,other):
try:
return add(self,other)
except Exception, e:
return NotImplemented
def __sub__(self,other):
try:
return sub(self,other)
except Exception, e:
return NotImplemented
def __mul__(self,other):
try:
return mul(self,other)
except Exception, e:
return NotImplemented
def __div__(self,other):
try:
return div_proxy(self,other)
except Exception, e:
return NotImplemented
def __pow__(self,other):
try:
return pow(self,other)
except Exception, e:
......@@ -1031,12 +1031,12 @@ class _tensor_py_operators:
def __getslice__(self, *args):
args = slice(*args),
return self.__getitem__(args)
#COPYING
def copy(self):
return tensor_copy(self)
def __iter__(self):
def __iter__(self):
try:
for i in xrange(get_vector_length(self)):
yield self[i]
......@@ -1044,7 +1044,7 @@ class _tensor_py_operators:
# This prevents accidental iteration via builtin.sum(self)
raise TypeError('TensorType does not support iteration. '
'Maybe you are using builtin.sum instead of theano.tensor.sum? (Maybe .max?)')
# CONVENIENT ACCESS TO TYPE PROPERTIES
ndim = property(lambda self: self.type.ndim)
......@@ -1053,7 +1053,7 @@ class _tensor_py_operators:
"""The broadcastable signature of this tensor.
See :doc:`broadcasting` for details.
"""
dtype = property(lambda self: self.type.dtype)
""" The dtype of this tensor. """
......@@ -1095,7 +1095,7 @@ class _tensor_py_operators:
def get_constant_value(self):
return get_constant_value(self)
class TensorVariable(Variable, _tensor_py_operators):
"""Subclass to add the tensor operators to the basic `Variable` class."""
TensorType.Variable = TensorVariable
......@@ -1115,7 +1115,7 @@ class TensorConstantSignature(tuple):
#N.B. compare shape to ensure no broadcasting in ==
#N.B. compare elementwise last because it is the most expensive check
return (t0 == t1) and (d0.shape == d1.shape) \
and (self.sum == other.sum) and (numpy.all(d0 == d1))
def __hash__(self):
t, d = self
return hashtype(self) ^ hash(t) ^ hash(d.shape) ^ hash(self.sum)
......@@ -1130,7 +1130,7 @@ class TensorConstantSignature(tuple):
class TensorConstant(Constant, _tensor_py_operators):
"""Subclass to add the tensor operators to the basic `Constant` class.
To create a TensorConstant, use the `constant` function in this module.
"""
def signature(self):
......@@ -1139,7 +1139,7 @@ TensorType.Constant = TensorConstant
class TensorValue(Value, _tensor_py_operators):
"""Subclass to add the tensor operators to the basic `Value` class.
To create a TensorValue, use the `value` function in this module.
:note: Value is deprecated by SharedVariable
......@@ -1167,8 +1167,8 @@ def _elemwise(scalar_op, name, doc_prefix=''):
inplace = elemwise.Elemwise(inplace_scalar_op, {0: 0}, name = name+"_inplace")
# don't add the inplace versions, they aren't supposed to be part of the user interface
_constructor_list.append(straight)
# This is here so that gen_oplist can detect which module declared these variables.
straight.__module__ = 'tensor'
......@@ -1181,7 +1181,7 @@ def _elemwise(scalar_op, name, doc_prefix=''):
def _redefine(real_symbol_value, module='tensor'):
"""Replace the value associated with a function symbol.
This is useful to trick epydoc into doing what we want. It's a hack.
"""
real_symbol_value.__module__ = 'tensor.basic'
......@@ -1275,7 +1275,7 @@ def _conversion(real_value, name):
_convert_to_int8 = _conversion(elemwise.Elemwise(scal.convert_to_int8), 'int8')
"""Cast to 8-bit integer"""
_convert_to_int16 = _conversion(elemwise.Elemwise(scal.convert_to_int16), 'int16')
"""Cast to 16-bit integer"""
......@@ -1287,7 +1287,7 @@ _convert_to_int64 = _conversion(elemwise.Elemwise(scal.convert_to_int64), 'int64
_convert_to_uint8 = _conversion(elemwise.Elemwise(scal.convert_to_uint8), 'uint8')
"""Cast to unsigned 8-bit integer"""
_convert_to_uint16 = _conversion(elemwise.Elemwise(scal.convert_to_uint16), 'uint16')
"""Cast to unsigned 16-bit integer"""
......@@ -1324,9 +1324,9 @@ _cast_mapping = {
'complex128': _convert_to_complex128}
@constructor
def cast(x, dtype):
"""Symbolically cast `x` to a Tensor of type `dtype`."""
if dtype=='floatX': dtype = config.floatX
_x = as_tensor_variable(x)
if _x.type.dtype == dtype:
return _x
......@@ -1382,7 +1382,7 @@ pprint.assign(_shape, printing.MemberPrinter('shape'))
class MaxAndArgmax(Op):
"""Calculate the max and argmax over a given axis.
.. note::
If axis is None it means to calculate the max over the last dimension which is
......@@ -1393,7 +1393,7 @@ class MaxAndArgmax(Op):
nin=2 # tensor, axis
nout=2 # max val, max idx
E_axis = 'invalid axis'
def __eq__(self,other):
return type(self)==type(other)
def __hash__(self):
......@@ -1422,7 +1422,7 @@ class MaxAndArgmax(Op):
inputs = [x, axis]
#TODO: figure things out if axis is a constant
broadcastable = [False] * (x.type.ndim - 1)
outputs = [tensor(x.type.dtype, broadcastable,name='max'),
tensor('int32', broadcastable,name='argmax')]
return Apply(self, inputs, outputs)
def perform(self, node, (x, axis), (max, max_idx)):
......@@ -1445,7 +1445,7 @@ class MaxAndArgmax(Op):
# gMax * dMax/dx + gArgMax * dArgMax/dx, gMax * dMax/daxis + gArgMax * dArgMax/daxis
# g_max has one less dimension than x, so you need to complete g_max to x's shape
# when axis=0 the broadcasting mechanism does it automatically
if not ( axis.data == 0 or axis.data == x.ndim-1):
raise NotImplementedError('MaxAndArgmax gradient with axis corresponding to internal dimension')
if axis.data==0:
......@@ -1874,7 +1874,7 @@ if 0:
class Alloc(gof.Op):
"""Create a Tensor from an initial value and a desired shape
alloc(value, shape0, shape1, ..., shapeN)
Returns an N-dimensional tensor initialized by `value` using something equivalent to
>>> z = numpy.zeros(shape, value.dtype)
......@@ -1883,7 +1883,7 @@ class Alloc(gof.Op):
The result has N dimensions, has the dtype of `value` and is obtained by broadcasting value
over the output ndarray.
This Op is used to replace fill() during optimizations because after shapes are lifted,
the first argument to fill can often be pruned from the graph.
"""
def __init__(self):
......@@ -1943,7 +1943,7 @@ class Alloc(gof.Op):
pass
return ret
alloc = Alloc()
pprint.assign(alloc, printing.FunctionPrinter('alloc'))
......@@ -2006,8 +2006,8 @@ def mean(input, axis = None, op = False):
:param axis: compute the mean along this axis of the tensor.
None means all axes (like numpy).
:type axis: None or int or (list of int) (see `Sum`)
:note: for gpu, if you manually cast the input to float32 before calling
mean, everything will be done on the gpu.
"""
if op:
......@@ -2117,7 +2117,7 @@ class Default(gof.Op):
if x is None:
# why copy? Theano can't yet understand out[0] being a view of either x or y,
# so we can be a view of x, but only a copy of y.
out[0] = default.copy()
else:
out[0] = x
default = Default()
......@@ -2221,7 +2221,7 @@ class Subtensor(Op):
integers are indexes into the inputs array, and the start/stop/step members
of each slice are also integer indexes into the inputs array (or None). The
inputs array is the tensor x, followed by scalar integer variables.
@todo: add support for advanced tensor indexing (in Subtensor_dx too).
The idx_list is a tuple similar in structure to the sort of key you might expect in numpy's
......@@ -2230,7 +2230,11 @@ class Subtensor(Op):
can additionally be a Scalar instance, and slice components can also be Scalar instances
too.
"""
e_invalid = ('The index list is longer (size %d) than the number of '
             'dimensions of the tensor (namely %d). You are asking for '
             'a dimension of the tensor that does not exist! You might '
             'need to use dimshuffle to add extra dimensions to your '
             'tensor.')
e_subslice = 'nested slicing is not supported'
e_indextype = "Invalid index type or slice for Subtensor"
debug = 0
......@@ -2246,7 +2250,7 @@ class Subtensor(Op):
elif isinstance(entry, slice):
helper(entry.start)
helper(entry.stop)
helper(entry.step)
for idx in idxs:
helper(idx)
return ret
......@@ -2312,11 +2316,13 @@ class Subtensor(Op):
def make_node(self, x, *inputs):
x = as_tensor_variable(x)
inputs = tuple(self.my_as_scalar(a) for a in inputs)
idx_list = list(self.idx_list)
if len(idx_list) > x.type.ndim:
exception = ValueError(Subtensor.e_invalid%(len(idx_list),
x.type.ndim))
exception.subtensor_invalid = True
raise exception
#infer the broadcasting pattern
padded = idx_list + [slice(0,sys.maxint,1)] * (x.type.ndim - len(idx_list))
......@@ -2412,7 +2418,7 @@ class Subtensor(Op):
msg += [(entry.start, entry.stop, entry.step)]
else:
msg += [entry]
idx_list = tuple(msg)
#backport
#idx_list = tuple((entry.start, entry.stop, entry.step)
......@@ -2472,7 +2478,7 @@ class SubtensorPrinter:
msg3 = ""
else:
msg3 = ":%s" % entry.step
sidxs.append("%s:%s%s" % (msg1, msg2, msg3))
#backport
#sidxs.append("%s:%s%s" % ("" if entry.start is None or entry.start == 0 else entry.start,
......@@ -2531,10 +2537,10 @@ def inc_subtensor(x, y, inplace=False, set_instead_of_inc=False):
class IncSubtensor(Op):
"""Increment a subtensor.
This is like numpy's
This is like numpy's
It is used internally to implement the gradient on SubTensor.
:param set_instead_of_inc: if True set the subtensor to the value instead
......@@ -2592,11 +2598,13 @@ class IncSubtensor(Op):
def make_node(self, x, y, *inputs):
x, y = map(as_tensor_variable, [x, y])
inputs = tuple(map(Subtensor.my_as_scalar, inputs))
idx_list = list(self.idx_list)
if len(idx_list) > x.type.ndim:
exception = ValueError(Subtensor.e_invalid%(len(idx_list),
x.type.ndim))
exception.subtensor_invalid = True
raise exception
#infer the broadcasting pattern
padded = idx_list + [slice(0,sys.maxint,1)] * (x.type.ndim - len(idx_list))
......@@ -2671,11 +2679,11 @@ class Split(Op):
"""Partition a `TensorVariable` along some axis.
.. python::
x = vector()
splits = lvector()
# you have to declare right away how many split_points there will be.
ra, rb, rc = split(x, splits, n_splits = 3, axis = 0)
f = function([x, splits], [ra, rb, rc])
......@@ -2709,16 +2717,16 @@ class Split(Op):
node = self.make_node(*inputs, **kwargs)
node.tag.trace = traceback.extract_stack()[:-1]
return node.outputs
def make_node(self, x, axis, splits):
"""WRITEME"""
x = as_tensor_variable(x)
axis = as_tensor_variable(axis)
splits = as_tensor_variable(splits)
if splits.type not in int_vector_types:
raise TypeError('splits must have type tensor.lvector', splits.type)
if axis.type not in int_types:
raise TypeError('axis must have type lscalar', axis.type)
# # The following lines are necessary if we allow splits of zero
......@@ -2738,21 +2746,21 @@ class Split(Op):
#in python 2.4, x.shape[numpy.asarray(1)] doesn't work.
if sys.version_info[0:2]==(2, 4) and axis.size==1:
axis=int(axis)
try:
len_along_axis = x.shape[axis]
except:
raise ValueError('Split.perform() with axis=(%s) is invalid for x.shape==(%s)'
%(axis, x.shape))
if len(splits) != self.len_splits:
raise ValueError('In Split.perform(), len(splits) != len_splits.',
(len(splits), self.len_splits))
if numpy.sum(splits) != len_along_axis:
raise ValueError('The splits sum to %s, expected %s' % (numpy.sum(splits), len_along_axis))
if not all(splits):
raise ValueError('Cannot have a split of zero.')
# Checking is done, let's roll the splitting algorithm!
# Basically we step along the given axis of x, extracting subtensors of size splits[i]
# as we go along.
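The comment above describes stepping along the axis, taking subtensors of size `splits[i]`. A numpy sketch of the same algorithm (`split_np` is an illustrative stand-in, not the Op's `perform`; `np.split` takes cut points rather than piece sizes, hence the cumulative sum):

```python
import numpy as np

def split_np(x, splits, axis=0):
    """Sketch of Split.perform: cut x into len(splits) pieces along
    `axis`, piece i having size splits[i].  The checks mirror the
    ones above: splits must sum to the length along the axis and
    contain no zeros."""
    assert np.sum(splits) == x.shape[axis]
    assert all(splits)
    return np.split(x, np.cumsum(splits)[:-1], axis=axis)

x = np.arange(6)
a, b, c = split_np(x, [1, 2, 3])
```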
......@@ -2826,7 +2834,7 @@ def addbroadcast(x, *axes):
def unbroadcast(x, *axes):
"""
Make the input impossible to broadcast in the specified axes.
We apply the opt here so as not to pollute the graph, especially during GPU optimization.
"""
rval = Rebroadcast(*[(axis, False) for axis in axes])(x)
......@@ -2835,7 +2843,7 @@ def unbroadcast(x, *axes):
def patternbroadcast(x, broadcastable):
"""
Make the input adopt the given broadcastable pattern.
We apply the opt here so as not to pollute the graph, especially during GPU optimization.
"""
rval = Rebroadcast(*[(i,broadcastable[i]) for i in range(len(broadcastable))])(x)
......@@ -2853,7 +2861,7 @@ class Join(Op):
For joins involving scalar values, see @stack.
.. python::
x, y, z = tensor.matrix(), tensor.matrix(), tensor.matrix()
u = tensor.vector()
......@@ -2952,7 +2960,7 @@ class Join(Op):
return [None] + split_gz
else:
# assume that this isn't differentiable
return [None] * (1 + len(tensors))
def _native_grad(self, axis_and_tensors, (gz,)):
"""WRITEME"""
......@@ -3006,7 +3014,7 @@ pprint.assign(lambda pstate, r: r.owner and isinstance(r.owner.op, Join),
@constructor
def shape_padleft(t, n_ones=1):
"""Reshape `t` by left-padding the shape with `n_ones` 1s
See also: `shape_padright` and `Dimshuffle`
"""
_t = as_tensor_variable(t)
......@@ -3017,7 +3025,7 @@ def shape_padleft(t, n_ones=1):
@constructor
def shape_padright(t, n_ones=1):
"""Reshape `t` by right-padding the shape with `n_ones` 1s
See also: `shape_padleft` and `Dimshuffle`
"""
_t = as_tensor_variable(t)
......@@ -3045,10 +3053,10 @@ def stack(*tensors):
@constructor
def concatenate(tensor_list, axis=0):
"""Alias for `join`(axis, *tensor_list).
This function is similar to `join`, but uses the signature of numpy's concatenate function.
:Exceptions:
- `TypeError` : the tensor_list must be a tuple or list
......@@ -3072,7 +3080,7 @@ def get_vector_length(v):
:Exceptions:
- `TypeError` : `v` does not have the proper type.
- `ValueError` : No special case applies, the length is not known.
In general this is not possible, but for a number of special cases the length can be
determined at compile / graph-construction time. This function implements these special
cases.
......@@ -3165,7 +3173,7 @@ else:
class Reshape(Op):
"""Perform a reshape operation of the input x to the new shape shp.
The number of dimensions to reshape to (ndim) must be known at graph
build time."""
view_map = {0: [0]} #output 0 is potentially aliased to inputs [0]
def __init__(self, ndim, name = None):
......@@ -3248,7 +3256,7 @@ class Flatten(Op):
def grad(self, (x,), (g_out,)):
return [reshape(g_out, shape(x), x.ndim)]
def flatten(x, outdim=1):
return Flatten(outdim)(x)
class TileGrad(Op):
......@@ -3634,7 +3642,7 @@ class AdvancedSubtensor(Op):
# TODO: in general, we need to re-pack the inputs into a valid index, just like
# subtensor
out[0] = inputs[0].__getitem__(inputs[1:])
#return
#raise NotImplementedError()
def grad(self, inputs, (gz,)):
......@@ -3703,7 +3711,7 @@ class Dot(Op):
return hash(type(self))
# the rationale for Dot22 is related to getting GEMM Ops into the graph. See Dot22 in tensor.blas for details.
def make_node(self, *inputs):
inputs = map(as_tensor_variable, inputs)
......@@ -3764,7 +3772,7 @@ class Dot(Op):
elif x.type.ndim == 1 and y.type.ndim > 1:
rval = dot(gz, y.T), outer(x.T, gz)
elif x.type.ndim > 1 and y.type.ndim == 1:
rval = outer(gz, y.T), dot(x.T, gz)
else:
rval = dot(gz, y.T), dot(x.T, gz)
return cast(rval[0], x.dtype), cast(rval[1], y.dtype)
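The matrix-matrix branch above returns `(dot(gz, y.T), dot(x.T, gz))`. A quick numpy sanity check of that formula against a finite difference (illustrative only, not Theano's `verify_grad`; the cost `C = sum(gz * dot(x, y))` is a stand-in for "some scalar cost whose gradient w.r.t. dot(x, y) is gz"):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(2, 3)
y = rng.rand(3, 4)
gz = rng.rand(2, 4)          # gradient of some scalar cost C w.r.t. dot(x, y)

# Analytic gradients, matching the ndim>1 / ndim>1 branch above.
gx = np.dot(gz, y.T)
gy = np.dot(x.T, gz)

# Finite-difference estimate of dC/dx[0,0] for C = sum(gz * dot(x, y)).
eps = 1e-6
xp = x.copy()
xp[0, 0] += eps
num = (np.sum(gz * np.dot(xp, y)) - np.sum(gz * np.dot(x, y))) / eps
```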
......@@ -3865,7 +3873,7 @@ class TensorDot(Op):
if len(axes[0])!=len(axes[1]):
raise ValueError("The two sub-lists in axes must have the same length")
assert len(axes[0])==len(axes[1])
self.axes = axes
def __eq__(self, other):
......@@ -3887,7 +3895,7 @@ class TensorDot(Op):
if axesdim > x.type.ndim or axesdim > y.type.ndim:
raise TypeError('Cannot sum over more dimensions than input. %i > %i,%i' %
(axesdim, x.type.ndim, y.type.ndim))
outdim = x.type.ndim + y.type.ndim - 2*axesdim
output = tensor(dtype=scal.upcast(x.dtype, y.dtype),
broadcastable=[False]*outdim);
......@@ -3904,7 +3912,7 @@ class TensorDot(Op):
def grad(self, (x, y), (gz,)):
gx, gy = tensordot_grad(self.axes)(x, y, gz)
return [gx, gy]
def __str__(self):
return "tensordot"
tensordot = TensorDot
......@@ -3923,7 +3931,7 @@ class Outer(Op):
if nx != 1: raise TypeError('non-vector arg0 to outer()', x)
if ny != 1: raise TypeError('non-vector arg1 to outer()', y)
bz = [x.type.broadcastable[0], y.type.broadcastable[0]]
i_dtypes = [input.type.dtype for input in inputs]
......@@ -3997,8 +4005,8 @@ class numeric_grad:
#
# There is a relationship between the step size and the function value and the measurement
# error that is incurred due to rounding. The finite difference we measure is
# delta = f(x0) - f(x0+eps)
#
# For maximum precision, f should be close to zero.
# For every power of 2 that f departs from zero, we lose a bit of precision in delta.
#
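The precision argument above (every power of 2 that f departs from zero costs a bit of precision in delta) is easy to demonstrate with a forward difference at a small and a large base point; `fd` is an illustrative helper, not `numeric_grad` itself:

```python
# For f(x) = x**2 the exact derivative at x0 is 2*x0.  With a fixed
# float64 step (cf. type_eps below), the forward difference is accurate
# when f(x0) is small, and degrades badly as f(x0) grows, because
# f(x0+eps) - f(x0) is computed in the floating-point grid of f(x0).
def fd(f, x0, eps=1e-7):
    return (f(x0 + eps) - f(x0)) / eps

f = lambda x: x * x
err_small = abs(fd(f, 1.0) - 2.0)      # tiny: truncation ~ eps
err_large = abs(fd(f, 1e6) - 2e6)      # large: rounding dominates
```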
......@@ -4009,7 +4017,7 @@ class numeric_grad:
# bias into our measurement in general for non-linear functions.
#
# It would be interesting to have a version of numeric grad that used an adaptive stepsize.
#
# For now, we use a heuristic that catches very bad gradients, but is not perfectly
# accurate.
type_eps = {'float64': 1e-7,
......@@ -4161,7 +4169,7 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None, abs_tol=None, rel_tol=No
mode=None, cast_to_output_type=False):
""" Test a gradient by Finite Difference Method. Raise error on failure.
Example:
>>> verify_grad(theano.tensor.tanh,
(numpy.asarray([[2,3,4], [-1, 3.3, 9.9]]),),
rng=numpy.random)
......@@ -4187,8 +4195,8 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None, abs_tol=None, rel_tol=No
debug mode, which can be very slow if it has to verify a lot
of intermediate computations.
:note: This op does not support multiple outputs. In tests/test_scan.py there is
an experimental verify_grad that covers that case as well by using random
projections.
"""
assert isinstance(pt, (list,tuple))
......@@ -4244,7 +4252,7 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None, abs_tol=None, rel_tol=No
t_r = shared(random_projection())
#random projection of o onto t_r
cost = sum(t_r * o_output) #This sum() is defined above, it's not the builtin sum.
cost_fn = function(tensor_pt, cost)
#todo-- determine if this is actually needed
......
......@@ -101,6 +101,11 @@ class DimShuffle(Op):
self.new_order = new_order
self.inplace = inplace
for i in xrange(len(new_order)-1):
j = new_order[i]
if j != 'x' and j in new_order[i+1:]:
raise ValueError("The same input dimension may not appear twice in the list of output dimensions", (new_order))
# list of dimensions of the input to drop
self.drop = []
i2j = {} # this maps i before dropping dimensions to j after dropping dimensions so self.shuffle can be set properly later on
......
......@@ -848,9 +848,39 @@ class ConvOp(Op):
using namespace std;
""" + tensor.blas.blas_header_text()
def use_blas(self):
""" Return True if we will generate code that uses gemm.
"""
#the gemm version only supports that case
if self.out_mode == 'valid' and self.dx==0 and self.dy==0:
#We use a faster version in those cases.
if (self.imshp != self.imshp_logical or self.kshp != self.kshp_logical
or self.unroll_patch or self.unroll_batch>0 or self.unroll_kern>0):
return False
return True
return False
def c_libraries(self):
if self.use_blas():
return tensor.blas.ldflags()
return []
def c_compile_args(self):
if self.use_blas():
return tensor.blas.ldflags(libs=False, flags=True)
return []
def c_lib_dirs(self):
if self.use_blas():
return tensor.blas.ldflags(libs=False, libs_dir=True)
return []
def c_header_dirs(self):
if self.use_blas():
return tensor.blas.ldflags(libs=False, include_dir=True)
return []
def c_code(self, node, name, (img2d, filtersflipped), (z, ), sub):
if node.inputs[0].type.dtype != node.inputs[1].type.dtype:
raise NotImplementedError()
......
......@@ -119,11 +119,11 @@ def insert_inplace_optimizer_op(OP):
"""
#we should not validate too often as this takes too much time to execute!
#It is the _dfs_toposort() fct in theano/gof/destroyhandler.py
#that takes so much time.
#Should we try to use another lib that does toposort?
# igraph: http://igraph.sourceforge.net/
# networkx: https://networkx.lanl.gov/
#Should we try to use cython?
# compiling only that fct is not enough; should we try to add the deque class too?
# and init the deque and other list to an upper bound number of element?
#Should Theano do online toposort as in http://code.google.com/p/acyclic/?
......@@ -213,7 +213,7 @@ def insert_inplace_optimizer_op(OP):
insert_inplace_optimizer = insert_inplace_optimizer_op(T.Elemwise)
compile.optdb.register('inplace_opt', insert_inplace_optimizer, 75, 'fast_run', 'inplace')
def register_canonicalize(lopt, *tags, **kwargs):
name = (kwargs and kwargs.pop('name')) or lopt.__name__
......@@ -304,7 +304,7 @@ class MakeVector(T.Op):
"""Concatenate a number of scalars together into a vector
This is a simple version of stack() that introduces far less cruft into the graph.
Should work with 0 inputs. The constant_folding optimization will remove it.
"""
def __init__(self, dtype='int64'):
......@@ -398,7 +398,7 @@ class Shape_i(T.Op):
%(out)s=(PyArrayObject*)PyArray_ZEROS(0, NULL, PyArray_INT64, 0);
((npy_int64*)PyArray_DATA(%(out)s))[0]=%(x)s->dimensions[%(i)s];
"""%locals()
elif node.inputs[0].type.__class__.__name__=="CudaNdarrayType":
#Don't want to import cuda stuff here.
return """
......@@ -413,12 +413,12 @@ class Shape_i(T.Op):
class ShapeFeature(object):
"""Graph optimizer for removing all calls to shape()
This optimizer replaces all Shapes and Subtensors of Shapes with Shape_i and MakeVector
Ops.
This optimizer has several goals:
1. to 'lift' Shapes to as close to the inputs as possible.
2. to infer the shape of every node in the graph in terms of the input shapes.
3. remove all fills (T.second, T.fill) from the graph
......@@ -430,7 +430,7 @@ class ShapeFeature(object):
Many optimizations refuse to work on nodes with multiple clients.
Lifting is done by using an `<Op>.infer_shape` function if one is present, or else using a
conservative default. An Op that supports shape-lifting should define an
infer_shape(self, node, input_shapes) function. The argument input_shapes is a tuple
of tuples... there is an interior tuple for each input to the node. The tuple has as many
elements as dimensions. The element in position i of tuple j represents the i'th shape
......@@ -439,9 +439,9 @@ class ShapeFeature(object):
the output[j].shape[i] of the function. If an output is not a TensorType, then None should
be returned instead of a tuple for that output.
For example, the infer_shape for a matrix-matrix product would accept
input_shapes=((x0,x1), (y0,y1)) and return ((x0, y1),).
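Under the convention just described, an `infer_shape` for a matrix-matrix product might look like the sketch below. In Theano the shape elements are symbolic scalars, but plain ints illustrate the contract the same way; the concrete call with `None` placeholders is for demonstration only:

```python
def infer_shape(self, node, input_shapes):
    """Sketch of the shape-lifting hook described above: for a
    matrix-matrix product, input_shapes is ((x0, x1), (y0, y1)) and
    the output shape is (x0, y1).  There is one interior tuple per
    input, with one element per dimension."""
    (x0, x1), (y0, y1) = input_shapes
    return ((x0, y1),)

# Illustration with concrete shapes (self and node unused in this sketch):
out_shapes = infer_shape(None, None, ((5, 3), (3, 7)))
```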
Inferring the shape of internal nodes in the graph is important for doing size-driven
optimizations. If we know how big various intermediate results will be, we can estimate
......@@ -495,7 +495,7 @@ class ShapeFeature(object):
return T.constant(s_i, dtype='int64')
if type(s_i) in (tuple,list):
# this dimension is the same as many of the inputs
# which tells us that if one of the inputs is known,
# the others all become known.
# TODO: should be implemented in Elemwise, and Dot
#
......@@ -506,7 +506,7 @@ class ShapeFeature(object):
raise TypeError('Shape element must be scalar', s_i)
return s_i
else:
raise TypeError('Unsupported shape element',
s_i, type(s_i), getattr(s_i, 'type', None))
def set_shape(self, r, s):
......@@ -534,7 +534,7 @@ class ShapeFeature(object):
assert not hasattr(env, 'shape_feature')
env.shape_feature = self
self.shape_of = {} # Variable -> tuple(scalars) or None (All tensor vars map to tuple)
self.scheduled = {} # Variable ->
self.lscalar_one = T.constant(1, dtype='int64')
assert self.lscalar_one.type == T.lscalar
for node in env.toposort():
......@@ -622,7 +622,7 @@ def local_fill_to_alloc(node):
This is an important optimization because with the shape_to_shape_i optimization, the
dependency on 's' is often removed.
"""
if node.op == T.fill:
r, v = node.inputs
......@@ -637,7 +637,7 @@ def local_fill_to_alloc(node):
shape_of = node.env.shape_feature.shape_of
# TODO: cut out un-necessary dimshuffles of v
rval = [T.alloc(T.cast(v, node.outputs[0].dtype), *shape_of[node.outputs[0]])]
#if rval[0].type != node.outputs[0].type:
#print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
......@@ -700,7 +700,7 @@ def local_subtensor_make_vector(node):
raise
if isinstance(idx, (scalar.Scalar, T.TensorType)):
# The idx is a Scalar, ie a Type. This means the actual index
# is contained in node.inputs[1]
old_idx, idx = idx, node.inputs[1]
assert idx.type == old_idx
......@@ -773,7 +773,7 @@ class Assert(T.Op):
cond = [T.as_tensor_variable(c) for c in conds]
assert numpy.all([c.type.ndim == 0 for c in cond])
return gof.Apply(self, [value]+cond, [value.type()])
def __str__(self):
return self.__class__.__name__
def perform(self, node, inputs, (out,)):
......@@ -807,7 +807,7 @@ class Assert(T.Op):
def infer_shape(self, node, input_shapes):
return [input_shapes[0]]
assert_ = Assert()
@register_specialize
......@@ -818,13 +818,13 @@ def local_remove_useless_assert(node):
for c in node.inputs[1:]:
try:
const = get_constant_value(c)
if 0!=const.ndim or const==0:
#Should we raise an error here? How to be sure it is not caught?
cond.append(c)
except TypeError:
cond.append(c)
if len(cond)==0:
return [node.inputs[0]]
if len(cond)!=len(node.inputs)-1:
......@@ -873,12 +873,12 @@ def local_alloc_elemwise(node):
isinstance(i.owner.inputs[0].owner.op,T.Alloc)):
no_broad_idx = idx
break
assert no_broad_idx>=0
assert_op = node.inputs[no_broad_idx]
cmp_op = assert_op
new = []
for i in node.inputs:
if i.owner and isinstance(i.owner.op,T.Alloc) and i.owner.inputs[0].type != i.owner.outputs[0].type:
#when i.owner.inputs[0].type == i.owner.outputs[0].type we will remove that alloc later
......@@ -1017,8 +1017,8 @@ def local_IncSubtensor_serialize(node):
IncSubtensor(Elemwise{second}(a, 0), g(f(a[2])), [2])
This is much worse because this time we have to produce 3 matrices the size of 'a', just so
we can add them together.
This Op rearranges IncSubtensor's that all work on the same initial argument (here,
Elemwise{second}(a,0)) into a chain. The advantage of the chain structure is that each one
can be optimized later in the pipeline to operate inplace.
......@@ -1028,7 +1028,7 @@ def local_IncSubtensor_serialize(node):
#
# add(x, incsubtensor(b, c), incsubtensor(b, d))
# -> incsubtensor(incsubtensor(add(x,b,b), c), d)
"""
def movable(i):
# Return True iff this is a incsubtensor that we can move
......@@ -1138,7 +1138,7 @@ def local_rebroadcast_lift(node):
def apply_rebroadcast_opt(rval):
"""
Apply as many times as required the optimization local_useless_rebroadcast
and local_rebroadcast_lift.
:param rval: a Variable
......@@ -1149,7 +1149,7 @@ def apply_rebroadcast_opt(rval):
while changed and rval.owner:
changed = False
rval2 = theano.tensor.opt.local_useless_rebroadcast.transform(rval.owner)
if rval2:
assert len(rval2)==1
rval = rval2[0]
changed = True
......@@ -1216,7 +1216,7 @@ def local_mul_switch_sink(node):
fct[0].values_eq_approx = fct[0].type.values_eq_approx_remove_nan
return fct
except TypeError:
pass
try:
if get_constant_value(switch.inputs[2]) == 0.:
listmul = node.inputs[:idx] + node.inputs[idx+1:]
......@@ -1274,7 +1274,7 @@ def local_reshape_chain(node):
"""
if not opt.check_chain(node, T.Reshape, T.Reshape):
return False
# TODO: this can permit a failing program to run by eliminating the lower
# reshape
return [node.op(node.inputs[0].owner.inputs[0], node.inputs[1])]
......@@ -1304,7 +1304,7 @@ if 0:
y_shape = node.env.shape_feature.shape_of[y]
def tmp(thing):
try:
return T.get_constant_value(thing)
except (TypeError, ValueError), e:
print e, thing.owner.inputs[0]
......@@ -1322,15 +1322,15 @@ def local_fill_cut(node):
If c.type == a.type.
"""
# this optimization is essentially for getting broadcasting to replace fill.
# This is always possible when using a Compound Elemwise operation,
# but it is not always possible without one (consider filling a large matrix with a scalar,
# and then adding another scalar. The only numbers that count are the two scalars, but we
# can't ignore the large matrix because it gives the shape of the result.
if not opt.check_chain(node, T.Elemwise):
return False
output = node.outputs[0]
try:
#reference is some input with the same type as the input but that is not produced by a fill
......@@ -1397,7 +1397,7 @@ class Canonizer(gof.LocalOptimizer):
Simplification tool.
Usage: Canonizer(main, inverse, reciprocal, calculate)
* main: a suitable Op class that is commutative, associative and
takes one to an arbitrary number of inputs, e.g. add or
mul
......@@ -1421,7 +1421,7 @@ class Canonizer(gof.LocalOptimizer):
T = theano.tensor
add_canonizer = Canonizer(T.add, T.sub, T.neg, lambda n, d: sum(n) - sum(d))
mul_canonizer = Canonizer(T.mul, T.true_div, T.inv, lambda n, d: prod(n) / prod(d))
Examples of optimizations mul_canonizer can perform:
x / x -> 1
(x * y) / x -> y
......@@ -1659,7 +1659,7 @@ class Canonizer(gof.LocalOptimizer):
# Lists representing the *constant* elements of num and denum
numct, denumct = [], []
for v in orig_num:
ct = self.get_constant(v)
if ct is not None:
......@@ -1788,7 +1788,7 @@ register_canonicalize(local_mul_canonizer, name = 'local_mul_canonizer')
@gof.local_optimizer([T.neg])
def local_neg_to_mul(node):
if node.op == T.neg:
return [T.mul(numpy.array(-1, dtype = node.inputs[0].dtype),
node.inputs[0])]
register_canonicalize(local_neg_to_mul)
......@@ -1797,7 +1797,7 @@ register_canonicalize(local_neg_to_mul)
def local_sum_mul_by_scalar(node):
"""sum(scalar * smth) -> scalar * sum(smth)
"""
# TODO: if the thing inside the Sum is a division,
# we should get at the numerator....
if isinstance(node.op, T.Sum):
thing_summed, = node.inputs
......@@ -1935,7 +1935,7 @@ def local_sum_sum(node):
# special case of local_cut_useless_reduce
return [T.Sum(None)(summed.owner.inputs[0])]
if node.op.axis is None:
# we're summing up everything anyway so let's
# do it all at once
return [T.Sum(None)(summed.owner.inputs[0])]
......@@ -1983,7 +1983,6 @@ def local_sum_alloc(node):
if summed.owner and isinstance(summed.owner.op, T.Alloc):
input = summed.owner.inputs[0]
shapes = summed.owner.inputs[1:]
if node.op.axis is None or node.op.axis == tuple(range(input.ndim)):
try:
val = get_constant_value(input)
......@@ -2019,7 +2018,7 @@ register_specialize(local_mul_to_neg)
@register_specialize
@gof.local_optimizer([T.neg])
def local_neg_neg(node):
# other specializations shouldn't put this in,
# but sometimes they do
if node.op == T.neg:
if node.inputs[0].owner and node.inputs[0].owner.op == T.neg:
......@@ -2177,11 +2176,11 @@ def local_pow_specialize_device(node):
rval1 = None
rval1_scal = None
while y_to_do>0:
log_to_do = int(numpy.log2(y_to_do))
if rval1:
rval1 *= pow2[log_to_do]
rval1_scal *= pow2_scal[log_to_do]
else:
rval1 = pow2[log_to_do]
rval1_scal = pow2_scal[log_to_do]
y_to_do -= 2**log_to_do
......@@ -2197,7 +2196,7 @@ def local_pow_specialize_device(node):
rval[0] = T.cast(rval[0], odtype)
assert rval[0].type == node.outputs[0].type, (rval, node.outputs)
return rval
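The `log2` loop above greedily decomposes an integer exponent into powers of two, so that `x**y` can be built from repeated squarings. A standalone sketch of just the decomposition (`pow2_decompose` is an illustrative helper, not part of the optimizer):

```python
import numpy

def pow2_decompose(y):
    """Greedily express a positive integer y as a sum of powers of
    two, mirroring the log_to_do loop in local_pow_specialize_device."""
    parts = []
    y_to_do = y
    while y_to_do > 0:
        log_to_do = int(numpy.log2(y_to_do))
        parts.append(2 ** log_to_do)
        y_to_do -= 2 ** log_to_do
    return parts

# x**11 becomes x**8 * x**2 * x**1, i.e. three terms from the table of
# repeated squarings instead of ten multiplications.
parts = pow2_decompose(11)
```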
@gof.local_optimizer([T.mul])
def local_mul_specialize(node):
"""Remove special-case constants from mul arguments
......@@ -2210,7 +2209,7 @@ def local_mul_specialize(node):
neg = False
new_inputs = []
for input in node.inputs:
# remove any neg arguments
while input.owner and input.owner.op == T.neg:
neg ^= True
input = input.owner.inputs[0]
......@@ -2303,8 +2302,8 @@ def check_for_x_over_absX(numerators, denominators):
if den.owner and den.owner.op == T.abs_ and den.owner.inputs[0] in numerators:
if den.owner.inputs[0].type.dtype.startswith('complex'):
#TODO: Make an Op that projects a complex number to have unit length
# but projects 0 to 0. That would be a weird Op, but consistent with the
# special case below. I heard there's some convention in Matlab that is
# similar to this... but not sure.
pass
else:
......@@ -2319,7 +2318,7 @@ local_mul_canonizer.add_simplifier(check_for_x_over_absX, 'X_over_absX')
def local_abs_lift(node):
"""
Move the abs toward the input. This is needed for check_for_x_over_absX to apply in more cases.
"""
if node.op == T.abs_ and node.inputs[0].owner:
assert node.nin == 1
......@@ -2328,13 +2327,13 @@ def local_abs_lift(node):
if node.inputs[0].owner.op == T.true_div:
i = node.inputs[0].owner.inputs
return [T.true_div(T.abs_(i[0]),T.abs_(i[1]))]
@register_specialize
@gof.local_optimizer([])
def local_abs_merge(node):
"""
Merge the abs generated by local_abs_lift when the canonizer no longer needs it.
"""
if node.op == T.mul and sum([i.owner.op == T.abs_ for i in node.inputs if i.owner])>1:
inputs = []
......@@ -2570,7 +2569,7 @@ def constant_folding(node):
return msg
register_canonicalize(constant_folding, 'fast_compile')
register_stabilize(constant_folding) # because
register_specialize(constant_folding)
......@@ -2598,7 +2597,7 @@ def _is_minus1(expr):
return False
#1+erf(x)=>erfc(-x)
local_one_plus_erf = gof.PatternSub((T.add,
dict(pattern='y', constraint = _is_1),
(T.erf, 'x')),
(T.erfc, (T.neg, 'x')),
......@@ -2608,7 +2607,7 @@ register_stabilize(local_one_plus_erf, name='local_one_plus_erf')
register_specialize(local_one_plus_erf, name='local_one_plus_erf')
#1-erf(x)=>erfc(x)
local_one_minus_erf = gof.PatternSub((T.sub,
dict(pattern='y', constraint = _is_1),
(T.erf, 'x')),
(T.erfc, 'x'),
......@@ -2629,7 +2628,7 @@ register_specialize(local_one_minus_erf2)
#1+(-erf(x))=>erfc(x)
#This is a different graph than the previous one, as the canonicalizer doesn't work completely
local_one_plus_neg_erf = gof.PatternSub((T.add,
dict(pattern='y', constraint = _is_1),
(T.neg,(T.erf, 'x'))),
(T.erfc, 'x'),
......@@ -2640,7 +2639,7 @@ register_specialize(local_one_plus_neg_erf, name='local_one_plus_neg_erf')
#(-1)+erf(x) => -erfc(x)
#don't need erf(x)+(-1) as the canonicalize will put the -1 as the first argument.
local_erf_minus_one = gof.PatternSub((T.add,
dict(pattern='y', constraint = _is_minus1),
(T.erf, 'x')),
(T.neg,(T.erfc, 'x')),
......@@ -2650,7 +2649,7 @@ register_stabilize(local_erf_minus_one, name='local_erf_minus_one')
register_specialize(local_erf_minus_one, name='local_erf_minus_one')
#1-erfc(x) => erf(x)
local_one_minus_erfc = gof.PatternSub((T.sub,
dict(pattern='y', constraint = _is_1),
(T.erfc, 'x')),
(T.erf, 'x'),
......@@ -2665,7 +2664,7 @@ local_one_minus_erfc2 = gof.PatternSub((T.add,
(T.erf, 'x'),
allow_multiple_clients = True,
name='local_one_minus_erfc2')
register_canonicalize(local_one_minus_erfc2)
register_stabilize(local_one_minus_erfc2)
register_specialize(local_one_minus_erfc2)
......@@ -2675,13 +2674,13 @@ local_one_minus_erfc3 = gof.PatternSub((T.add,
(T.erf, 'x'),
allow_multiple_clients = True,
name='local_one_minus_erfc3')
register_canonicalize(local_one_minus_erfc3)
register_stabilize(local_one_minus_erfc3)
register_specialize(local_one_minus_erfc3)
#1+(-erfc(x)) => erf(x)
#This is a different graph than the previous one, as the canonicalizer doesn't work completely
local_one_add_neg_erfc = gof.PatternSub((T.add,
dict(pattern='y', constraint = _is_1),
(T.neg,(T.erfc, 'x'))),
(T.erf, 'x'),
......@@ -2691,7 +2690,7 @@ register_stabilize(local_one_add_neg_erfc, name='local_one_add_neg_erfc')
register_specialize(local_one_add_neg_erfc, name='local_one_add_neg_erfc')
#(-1)+erfc(-x)=>erf(x)
local_erf_neg_minus_one = gof.PatternSub((T.add,
dict(pattern='y', constraint = _is_minus1),
(T.erfc, (T.neg,'x'))),
(T.erf, 'x'),
......@@ -2701,7 +2700,7 @@ register_stabilize(local_erf_neg_minus_one, name='local_erf_neg_minus_one')
register_specialize(local_erf_neg_minus_one, name='local_erf_neg_minus_one')
#(-1)+erfc(-1*x)=>erf(x)
local_erf_neg_minus_one2 = gof.PatternSub((T.add,
dict(pattern='y', constraint = _is_minus1),
(T.erfc, (T.mul,-1,'x'))),
(T.erf, 'x'),
......@@ -2732,7 +2731,7 @@ def local_log_erfc(node):
x = node.inputs[0].owner.inputs[0]
stab_value = -x**2-T.log(x)-.5*T.log(numpy.pi)+T.log(1-1/(2*x**2)+3/(4*x**4)-15/(8*x**6))
if node.outputs[0].dtype=='float32':
threshold = 10.0541949
elif node.outputs[0].dtype=='float64':
......@@ -2749,7 +2748,7 @@ def local_log_erfc(node):
#for float64: threshold=26.63; see the end of the fct for the explanation
#for float32: threshold=9.3; see the end of the fct for the explanation
#TODO: remove the constraint that there are only 2 inputs to mul and that exp(x**2) is the second.
#TODO: at the test point 10 in float32, there is instability in the original value.
# the original gives -30.0, the stab -20.1, and in float64 -18.1.
# Make sure the test doesn't generate an error in that case!
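The stabilized expression exists because `erfc(x)` underflows to 0 for large `x`, so `log(erfc(x))` is unusable there. A quick demonstration with the standard library's `math.erfc`, using the same asymptotic expansion as `stab_value` above (the choice of `x = 30.0` is illustrative, well past the float64 threshold):

```python
import math

x = 30.0
# erfc(30) ~ exp(-900)/(30*sqrt(pi)), far below the float64 range,
# so it underflows to exactly 0.0 and log(erfc(x)) would fail.
underflowed = math.erfc(x) == 0.0

# Stabilized value from the same expansion as stab_value:
# log(erfc(x)) ~ -x**2 - log(x) - 0.5*log(pi)
#                + log(1 - 1/(2*x**2) + 3/(4*x**4) - 15/(8*x**6))
stab = (-x**2 - math.log(x) - 0.5 * math.log(math.pi)
        + math.log(1 - 1/(2*x**2) + 3/(4*x**4) - 15/(8*x**6)))
```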
@register_stabilize
......@@ -2809,7 +2808,7 @@ def local_grad_log_erfc_neg(node):
new_inputs.append(i)
return new_inputs
mul_inputs = check_input(mul_neg.owner.inputs)
#put the constant first
for i in range(len(mul_inputs)):
if isinstance(i, Constant):
......@@ -2821,7 +2820,7 @@ def local_grad_log_erfc_neg(node):
mul_inputs[i]=tmp
break
mul_neg = T.mul(*mul_inputs)
try:
cst2 = get_constant_value(mul_neg.owner.inputs[0])
except TypeError:
......@@ -2840,25 +2839,25 @@ def local_grad_log_erfc_neg(node):
return False
if cst2!=-1:
if (not erfc_x.owner or erfc_x.owner.op != T.mul
or len(erfc_x.owner.inputs)!=2):
#todo implement that case
return False
if erfc_x.owner.inputs[1] is not mul_neg.owner.inputs[1]:
return False
x = erfc_x
try:
cst = get_constant_value(erfc_x.owner.inputs[0])
except TypeError:
return False
if cst2 != -cst*2:
return False
#The constant is valid. Must check that the
elif erfc_x is not x:
return False
else:
return False
......@@ -3014,7 +3013,7 @@ def local_elemwise_fusion_op(OP):
try:
s_new_out.owner.op.c_code(s_new_out.owner, "test_presence_of_c_code",
["x" for x in s_g],
"z",{})
except MethodNotDefined:
_logger.info("%s does not implement the c_code function. As well as being potentially slow, this disables loop fusion of this op." % str(s_new_out.owner.op))
return False
......@@ -3046,19 +3045,18 @@ def local_elemwise_fusion_op(OP):
return False
# print "local_elemwise_fusion: FUSED",nb_elemwise+1,"elemwise!"
#we fuse as many as we can at the same time to make debug mode faster
#debug mode will be faster as it won't test all intermediate steps.
while True:
ret = local_fuse(n)
if ret is not False and ret is not None:
#print n,ret
assert len(ret)==len(n.outputs)
assert len(ret)==1
n = ret[0].owner
else: break
return n.outputs
return local_fuse
......
......@@ -647,7 +647,7 @@ TanhInplaceTester = makeBroadcastTester(op = inplace.tanh_inplace,
grad = _grad_broadcast_unary_normal,
inplace = True)
#inplace ops when the input is integer and the output is float*
# don't have a well defined behavior. We don't test that case.
_good_broadcast_unary_normal_no_int = _good_broadcast_unary_normal.copy()
del _good_broadcast_unary_normal_no_int['integers']
......@@ -903,7 +903,7 @@ class T_max_and_argmax(unittest.TestCase):
def test_grad(self):
data = numpy.random.rand(2,3)
n = as_tensor_variable(data)
def check_grad_max(data, max_grad_data, axis=None):
#This works only for axis in [0, None]
assert axis in [0,None]
......@@ -915,7 +915,7 @@ class T_max_and_argmax(unittest.TestCase):
else:
for id,v in enumerate(argmax):
z[v*numpy.prod(data.shape[data.ndim-1:axis:-1])+id]+=1
z = z.reshape(data.shape)
assert numpy.all(max_grad_data == z)
......@@ -1053,7 +1053,7 @@ class T_argmin_argmax(unittest.TestCase):
def test_grad_argmin(self):
data = numpy.random.rand(2,3)
n = as_tensor_variable(data)
#test grad of argmin
utt.verify_grad(lambda v: argmin(v), [data])
......@@ -1072,7 +1072,7 @@ class T_argmin_argmax(unittest.TestCase):
def test_grad_argmax(self):
data = numpy.random.rand(2,3)
n = as_tensor_variable(data)
#test grad of argmax
utt.verify_grad(lambda v: argmax(v), [data])
......@@ -1172,7 +1172,7 @@ class T_min_max(unittest.TestCase):
v = eval_outputs(fct(n,-2))
self.failUnless(v.shape == (3,))
self.failUnless(numpy.all(v == nfct(n.value,-2)))
v = eval_outputs(fct(n,-1).shape)
assert v==(2)
v = eval_outputs(fct(n,-2).shape)
......@@ -1220,7 +1220,7 @@ class T_min_max(unittest.TestCase):
def test_grad_max(self):
data = numpy.random.rand(2,3)
n = as_tensor_variable(data)
def check_grad_max(data, max_grad_data, axis=None):
#This works only for axis in [0, None]
assert axis in [0,None]
......@@ -1232,7 +1232,7 @@ class T_min_max(unittest.TestCase):
else:
for id,v in enumerate(argmax):
z[v*numpy.prod(data.shape[data.ndim-1:axis:-1])+id]+=1
z = z.reshape(data.shape)
assert numpy.all(max_grad_data == z)
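The one-hot construction that check_grad_max verifies can be sketched in plain NumPy, outside the test suite (a hypothetical standalone illustration for the axis=0 case):

```python
import numpy as np

# Gradient of max along axis 0: a 1 at each argmax position, 0 elsewhere.
data = np.array([[0.1, 0.9],
                 [0.8, 0.2]])
idx = data.argmax(axis=0)               # argmax per column: [1, 0]
z = np.zeros_like(data)
z[idx, np.arange(data.shape[1])] = 1.0  # one-hot at the winning rows
# z == [[0., 1.], [1., 0.]]
```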
......@@ -1252,7 +1252,7 @@ class T_min_max(unittest.TestCase):
def test_grad_min(self):
data = numpy.random.rand(2,3)
n = as_tensor_variable(data)
def check_grad_min(data, min_grad_data, axis=None):
#This works only for axis in [0, None]
assert axis in [0,None]
......@@ -1264,7 +1264,7 @@ class T_min_max(unittest.TestCase):
else:
for id,v in enumerate(argmin):
z[v*numpy.prod(data.shape[data.ndim-1:axis:-1])+id]+=1
z = z.reshape(data.shape)
assert numpy.all(min_grad_data == z)
......@@ -1304,7 +1304,7 @@ class T_subtensor(unittest.TestCase):
try:
t = n[0]
except ValueError, e:
self.failUnless(e[0] is Subtensor.e_invalid)
self.failUnless(hasattr(e,'subtensor_invalid'))
return
self.fail()
......@@ -1356,7 +1356,7 @@ class T_subtensor(unittest.TestCase):
try:
t = n[0,0]
except ValueError, e:
self.failUnless(e[0] is Subtensor.e_invalid)
self.failUnless(hasattr(e,'subtensor_invalid'))
return
self.fail()
def test1_ok_elem(self):
......@@ -2561,7 +2561,7 @@ def test_flatten_outdim_invalid():
# TODO: write test case for Tile Op
def test_tile():
print >> sys.stderr, "WARNING: No testcase for Tile"
pass
class TestARange(unittest.TestCase):
......@@ -2724,7 +2724,7 @@ class TestARange(unittest.TestCase):
f = function([stop], out.shape, mode=mode)
assert len(f.maker.env.toposort())==2
#[Elemwise{Cast{int64}}(stop), MakeVector(Elemwise{Cast{int64}}.0)]
assert out.dtype == start.type.dtype
assert numpy.all(f(5) == len(numpy.arange(0,5)))
assert numpy.all(f(11) == len(numpy.arange(0,11)))
......@@ -2961,7 +2961,7 @@ class test_tensordot(unittest.TestCase):
self.failUnless(numpy.allclose(numpy.tensordot(aval,bval,axes),
f5(aval,bval)))
utt.verify_grad(TensorDot(axes), [aval,bval])
axes = (axes[1],axes[0])
c = tensordot(axes)(btens, atens)
f6 = inplace_func([btens,atens],c)
......@@ -3051,7 +3051,7 @@ class test_tensordot(unittest.TestCase):
def test_tensordot_grad(self):
#We test it manually as we recreate the op in the make_node
amat = matrix()
bmat = matrix()
gzmat = matrix()
......@@ -3245,17 +3245,17 @@ class test_broadcast(unittest.TestCase):
Test that the unbroadcast function doesn't insert unneeded Rebroadcast ops
and that consecutive Rebroadcast ops are fused.
"""
x=matrix()
assert unbroadcast(x,0) is x
assert unbroadcast(x,1) is x
assert unbroadcast(x,1,0) is x
assert unbroadcast(x,0,1) is x
assert addbroadcast(x,0) is not x
assert addbroadcast(x,1) is not x
assert addbroadcast(x,1,0).owner.inputs[0] is x
assert unbroadcast(addbroadcast(x,0),0) is x
assert addbroadcast(unbroadcast(x,0),0) is not x
x=row()
......@@ -3263,15 +3263,15 @@ class test_broadcast(unittest.TestCase):
assert unbroadcast(x,1) is x
assert unbroadcast(x,1,0) is not x
assert unbroadcast(x,0,1) is not x
assert addbroadcast(x,0) is x
assert addbroadcast(x,1).owner.inputs[0] is x
assert addbroadcast(x,1,0).owner.inputs[0] is x
assert addbroadcast(x,0,1).owner.inputs[0] is x
assert unbroadcast(addbroadcast(x,1),1) is x
assert addbroadcast(unbroadcast(x,1),1) is not x
#the first unbroadcast removes the broadcastable flag, so the second
#should not add one
assert unbroadcast(unbroadcast(x,0),0).owner.inputs[0] is x
......@@ -3281,10 +3281,10 @@ class test_broadcast(unittest.TestCase):
assert unbroadcast(unbroadcast(x,1),0).owner.inputs[0] is x
assert addbroadcast(unbroadcast(x,1),0).owner.inputs[0] is x
assert addbroadcast(unbroadcast(x,0),0) is x
def test_mod():
"""
We add this test because not all languages and C implementations give the
same sign for the result. This checks that the c_code of `Mod` is
implemented as in Python. That is what we want.
"""
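The sign discrepancy this test guards against can be shown directly in Python, whose `%` always takes the sign of the divisor, while C's `%` truncates toward zero:

```python
# Python's % follows the sign of the divisor.
assert (-7) % 3 == 2    # C's -7 % 3 yields -1
assert 7 % (-3) == -2   # C's 7 % -3 yields 1
```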
......@@ -3298,7 +3298,7 @@ def test_mod():
def test_mod_compile():
"""
This test generates an Elemwise of Composite such as:
Elemwise{Composite{Composite{Composite{Composite{mod,EQ},Switch},mul},add}}
The generated c_code did not compile as of 30 June 2010. The compilation is fixed in the same commit.
......@@ -3342,6 +3342,20 @@ def test_unalign():
if not should_raise:
raise Exception("Theano raised an exception when none was expected")
def test_dimshuffle_duplicate():
x = theano.tensor.vector()
success = False
try:
y = theano.tensor.DimShuffle((False, ), (0, 0))(x)
except ValueError, e:
assert str(e).find("may not appear twice") != -1
success = True
assert success
if __name__ == '__main__':
if 1:
unittest.main()
......
......@@ -2087,6 +2087,13 @@ if __name__ == '__main__':
# unittest.main()
test_fusion().tes_memory_leak()
def test_local_mul_to_neg():
"""
Test that a multiplication by -1 or -1.0 yields the appropriate data type
"""
a = T.imatrix()
f1 = theano.function([a], -1*a)
f2 = theano.function([a], -1.0*a)
aval = numpy.random.randint(0,10,(2,2))
assert f1(aval).dtype == a.dtype
assert f2(aval).dtype == 'float64'