Commit f4cecd01 authored by Olivier Delalleau

Merged

@@ -137,17 +137,23 @@ following methods:
the gradient of the Op's output but rather the gradient of some
other criterion C with respect to the Op's input.
If the outputs of your op are :math:`[f_1, ..., f_n]`, then
``output_derivatives`` gives
:math:`[grad_{f_1}(C), grad_{f_2}(C), ..., grad_{f_n}(C)]`.
If the inputs of your op are :math:`[x_1, ..., x_m]`, then your Op.grad
should return :math:`[grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C)]`,
where :math:`(grad_{y} z)_i = \frac{\partial z}{\partial y_i}`
(and :math:`i` can have any number of dimensions).
(Note: in the case where :math:`i` is 2-dimensional, this definition of grad
is different from the standard mathematical definition of the gradient
of a scalar with respect to a matrix, where you transpose the indices.)
In other words, :func:`grad` does not return
:math:`\frac{\partial f_i}{\partial x_j}`, but
:math:`\frac{\partial C}{\partial x_j} =
\sum_i \frac{\partial C}{\partial f_i} \cdot \frac{\partial f_i}{\partial x_j}`.
Both the partial differentiation and that multiplication (with the implied
sum over :math:`i`) have to be done by :func:`grad`.
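A numpy-only numeric sketch of this convention (all names and values are made up for illustration; this is not Theano code):

```python
import numpy as np

# Suppose a (hypothetical) Op computes f(x) = x ** 2 elementwise, and some
# external scalar cost is C = sum(f).  The caller hands Op.grad the gradient
# grad_f(C) (here a vector of ones), and Op.grad must return grad_x(C),
# i.e. the chain-rule product, not df/dx alone.
x = np.array([1.0, 2.0, 3.0])
grad_f_C = np.ones(3)           # dC/df for C = sum(f)
df_dx = 2.0 * x                 # elementwise derivative of f(x) = x ** 2
grad_x_C = grad_f_C * df_dx     # what Op.grad should return
print(grad_x_C)                 # [2. 4. 6.]
```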
At a bare minimum, a new Op must define ``make_node`` and ``perform``, which have no defaults.
......
@@ -8,7 +8,7 @@ arrays efficiently. Theano features:
* **tight integration with numpy** -- Use `numpy.ndarray` in Theano-compiled functions.
* **transparent use of a GPU** -- Perform data-intensive calculations up to 140x faster than on a CPU (float32 only).
* **efficient symbolic differentiation** -- Theano does your derivatives for functions with one or many inputs.
* **speed and stability optimizations** -- Get the right answer for ``log(1+x)`` even when ``x`` is really tiny.
* **dynamic C code generation** -- Evaluate expressions faster.
* **extensive unit-testing and self-verification** -- Detect and diagnose many types of mistakes.
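The ``log(1+x)`` stability claim can be checked numerically with plain numpy (a sketch; Theano performs the equivalent rewrite symbolically at compile time):

```python
import numpy as np

# For tiny x, 1 + x rounds to 1.0 in float64, so the naive formula loses x
# entirely, while a stable formulation (numpy's log1p here) keeps it.
x = 1e-20
naive = np.log(1.0 + x)    # 0.0: x vanished in the addition
stable = np.log1p(x)       # ~1e-20: correct to first order
print(naive, stable)
```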
......
@@ -272,28 +272,41 @@ that fail on your platform (use the ``theano-users@googlegroups.com`` mailing list,
but note that you must first register to it, by going to `theano-users`_).
Windows V1 (bigger install, but simpler instructions + tentative GPU instructions)
----------------------------------------------------------------------------------

- Install `Python(x,y) <http://www.pythonxy.com>`_. It is a single installation
  file that contains additional packages like Numpy, Scipy, IPython, Matplotlib,
  MinGW, Nose, etc. Note that this implies you do not already have a Python
  installation (if you do have one, then you will need to either remove it first,
  or install those additional packages manually as described in the V2 instructions).

- Install Mercurial. You can use either the
  `command-line version <http://mercurial.selenic.com/>`_ or the
  `GUI version (TortoiseHg) <http://mercurial.selenic.com/downloads/>`_ (for the
  purpose of simply downloading Theano, the command-line version is enough).

- Start a shell (hit the Start button and run the ``cmd`` command) and navigate to
  the directory where you want to install Theano (it is ok to just stay in the
  default directory, which should be your user profile directory). Then download
  Theano with:

  .. code-block:: bash

      hg clone http://hg.assembla.com/theano Theano

- Add (or edit) the PYTHONPATH environment variable (available through Control
  Panel / System / Advanced / Environment Variables), so that it contains the
  full installation directory of Theano. Restart a shell (``cmd``) to verify
  that it works:

  .. code-block:: bash

      C:\Users\login>echo %PYTHONPATH%
      C:\Users\login\Theano

- Create a new ``.theanorc`` text file (or ``.theanorc.txt``, which is easier
  to create under Windows) in your user profile directory, with the following
  two lines:

  .. code-block:: bash
@@ -301,32 +314,42 @@ Windows V1 (bigger install, but simpler instructions + tentative GPU instructions)
      [blas]
      ldflags =
- You are now ready to run Theano.
  It will use NumPy for dot products, which is still pretty fast (see below for
  optional instructions on how to compile your own BLAS library).
  To test that Theano reads your configuration file correctly, run Python (the
  easiest way is to just type ``python`` in a shell) and execute the following:

  .. code-block:: python

      import theano
      print theano.config.blas.ldflags

  This should print the same content as in your config file, i.e. nothing
  (if your config file was not read properly, it would print ``-lblas``).
Windows V1.5 (optional follow-up to V1 instructions)
----------------------------------------------------

- If you want a faster and/or multithreaded BLAS library, you can
  compile GotoBLAS2. We did not try to compile ATLAS because we read that
  it is slower than Goto and more difficult to compile (especially on
  Windows).

  GotoBLAS2 can be downloaded
  `here <http://www.tacc.utexas.edu/tacc-projects/gotoblas2/downloads>`_
  after registering on the website (we tested v1.13).
  To compile it, you also need to install MSYS and Perl (for instance
  ActivePerl).
  The GotoBLAS makefiles actually expect a full UNIX environment (like
  Cygwin) but the BLAS compilation seems to work with only MSYS and Perl.
  The LAPACK compilation fails, but is not needed anyway.

  (WORK-IN-PROGRESS, TO BE CONTINUED)

  Compilation steps:

  a) Unpack GotoBLAS2 (using `7-zip <http://www.7-zip.org/>`_ or the
     MSYS tar command).
  b) Open MSYS, change directory to GotoBLAS2 (``cd`` command).
@@ -354,7 +377,7 @@ Windows V1 (bigger install, but simpler instructions + tentative GPU instructions)
b) The Windows binaries of NumPy were compiled with ATLAS and are surprisingly fast.
c) GotoBLAS is even faster, in particular if you have several cores.

- (Optional) GPU on Windows. We are not sure it works! Can you report
  success/failure on the `theano-users
  <http://groups.google.com/group/theano-users>`_ mailing list?
  These instructions are for the 32-bit version of Python (the one that comes
  with Python(x,y) is 32-bit).
......
@@ -174,8 +174,8 @@ Config Attributes
A list of optimizer tags that we don't want included in the default Mode.
If there are multiple tags, separate them with ':'.
Ex: to remove the elemwise inplace optimizer (slow for big graphs),
use the flag ``optimizer_excluding:inplace_opt``, where
``inplace_opt`` is the name of that optimization.
.. attribute:: optimizer_including
......
@@ -18,11 +18,15 @@ awkward to use when :func:`tensor.grad` can do the job.

.. function:: grad_sources_inputs(sources, graph_inputs, warn_type=True)
A gradient source is a pair (``v``, ``g_v``), in which ``v`` is
a `Variable`, and ``g_v`` is a `Variable` that is a gradient wrt
``v``. More specifically, ``g_v`` is the gradient of an external
scalar cost, ``cost`` (that is not explicitly used), wrt ``v``.

This function traverses the graph backward from the ``v`` sources,
calling ``op.grad(...)`` for all ops with some non-None gradient
on an output, to compute the gradients of ``cost`` wrt intermediate
variables and ``graph_inputs``.
The ``op.grad(...)`` functions are called like this:

@@ -30,14 +34,20 @@ awkward to use when :func:`tensor.grad` can do the job.

op.grad(op.inputs[:], [total_gradient(v) for v in op.outputs])
This call to ``op.grad`` should return a list or tuple: one symbolic
gradient per input. These gradients represent the gradients of
the same implicit ``cost`` mentioned above, wrt ``op.inputs``. Note
that this is **not** the same as the gradient of ``op.outputs`` wrt
``op.inputs``.

If ``op`` has a single input, then ``op.grad`` should return a list
or tuple of length 1.

For each input wrt which ``op`` is not differentiable, it should
return ``None`` instead of a `Variable` instance.

If a source ``v`` receives a gradient from another source ``v2``,
then the effective gradient on ``v`` is the sum of both gradients.
:type sources: list of pairs of Variable: (v, gradient-on-v) to
    initialize the total_gradient dictionary
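A minimal numpy sketch of the summing behaviour just described (hypothetical variable names, not the real implementation):

```python
import numpy as np

# A variable r feeds two consumers, f1 = 2*r and f2 = 3*r, and C = f1 + f2.
# The backward traversal delivers one gradient contribution per consumer,
# and the effective gradient on r is their sum (dC/dr = 2 + 3 elementwise).
r = np.ones(2)
g_from_f1 = 2.0 * np.ones_like(r)    # dC/dr through f1
g_from_f2 = 3.0 * np.ones_like(r)    # dC/dr through f2

total_gradient = {}                  # variable name -> accumulated gradient
for g in (g_from_f1, g_from_f2):
    total_gradient['r'] = total_gradient.get('r', 0.0) + g
print(total_gradient['r'])           # [5. 5.]
```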
......
@@ -463,15 +463,6 @@ TensorVariable
(0, 'x', 1) -> AxB to Ax1xB
(1, 'x', 0) -> AxB to Bx1xA
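A numpy analogue of the two patterns above (a sketch; numpy's ``newaxis`` plays the role of ``'x'``):

```python
import numpy as np

a = np.zeros((4, 5))             # an AxB array with A=4, B=5
ax1xb = a[:, np.newaxis, :]      # pattern (0, 'x', 1): AxB -> Ax1xB
bx1xa = a.T[:, np.newaxis, :]    # pattern (1, 'x', 0): AxB -> Bx1xA
print(ax1xb.shape, bx1xa.shape)  # (4, 1, 5) (5, 1, 4)
```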
.. method:: flatten(ndim=1)
Returns a view of this tensor with `ndim` dimensions, whose shape for the first
@@ -500,11 +491,11 @@ Shaping and Shuffling
=====================

To re-order the dimensions of a variable, to insert or remove broadcastable
dimensions, see :meth:`_tensor_py_operators.dimshuffle`.
.. function:: shape(x)

    Returns an lvector representing the shape of `x`.

.. function:: reshape(x, newshape, ndim=None)
@@ -562,7 +553,8 @@ dimensions, see :meth:`_tensor_py_operators.dimshuffle`
Make `x` broadcastable in the specified axes `axes`. For
example, `unbroadcast(x, 0)` will make the first dimension of `x`
broadcastable. At evaluation time, if the length of `x`
along that dimension is not 1, a ``ValueError`` will be raised.
.. function:: flatten(x, outdim=1)
@@ -1106,9 +1098,13 @@ Gradient / Differentiation
Return symbolic gradients for one or more variables with respect to some
cost.

For more information about how automatic differentiation works in Theano,
see :mod:`gradient`. For information on how to implement the gradient of
a certain Op, see :func:`grad`.

:type cost: 0-d tensor variable
:type wrt: tensor variable or list of tensor variables
:type g_cost: same as type of `cost`
:type consider_constant: list of variables
:type warn_type: bool
@@ -1121,7 +1117,7 @@ Gradient / Differentiation
expression

:rtype: variable or list of variables (matching `wrt`)
:returns: gradients of the cost with respect to each of the `wrt` terms
......
.. _basictutaliasing:
===============
Memory Aliasing
===============
The aggressive reuse of memory is one of the ways Theano makes code fast, and
it's important for the correctness and speed of your program that you understand
which buffers Theano might alias to which others.
This file describes the principles for how Theano treats memory, and explains
when you might want to change the default behaviour of some functions and
methods for faster performance.
The memory model: 2 spaces
==========================
There are some simple principles that guide Theano's treatment of memory. The
main idea is that there is a pool of memory managed by Theano, and Theano tracks
changes to values in that pool.
1. Theano manages its own memory space, which typically does not overlap with
   the memory of normal python variables that non-theano code creates.

2. Theano functions only modify buffers that are in its memory space.

3. Theano's memory space includes the buffers allocated to store shared
   variables and the temporaries used to evaluate Functions.

4. Physically, Theano's memory space may be spread across the host, one or
   more GPU devices, and in the future may even include objects on a remote
   machine.

5. The memory allocated for a shared variable buffer is unique: it is never
   aliased to another shared variable.

6. Theano's managed memory is constant while Theano Functions are not running
   and theano library code is not running.

7. The default behaviour of Function is to return user-space values for
   outputs, and to expect user-space values for inputs.
The distinction between Theano-managed memory and user-managed memory can be
broken down by some theano functions (e.g. ``In``, ``Out``, ``shared``,
``get_value``) by using a ``borrow=True`` flag. This can make those methods
faster (by avoiding copy operations) at the expense of risking subtle bugs in
the overall program (by aliasing memory).
The rest of this section is aimed at helping you to understand when it is safe
to use the ``borrow=True`` argument and reap the benefit of faster code.
Borrowing when creating shared variables
========================================
A ``borrow`` argument can be provided to the shared-variable constructor.
.. code-block:: python
import numpy, theano
from theano import shared

np_array = numpy.ones(2, dtype='float32')
s_default = shared(np_array)
s_false = shared(np_array, borrow=False)
s_true = shared(np_array, borrow=True)
By default (``s_default``) and when explicitly setting ``borrow=False``, the
shared variable we construct gets a deep copy of ``np_array``. So changes we
subsequently make to ``np_array`` have no effect on our shared variable.
.. code-block:: python
np_array += 1     # now np_array is an array of 2.0's
s_default.value   # -> array([1.0, 1.0])
s_false.value     # -> array([1.0, 1.0])
s_true.value      # -> array([2.0, 2.0])

If we are running this with the CPU as the device,
then changes we make to ``np_array`` *right away* will show up in
``s_true.value``, because numpy arrays are mutable and ``s_true`` is using
the ``np_array`` object as its internal buffer.
However, this aliasing of ``np_array`` and ``s_true`` is *inconsistent and fragile*!
It is inconsistent because if Theano is using a GPU device, then the borrow flag
has no effect.
It is fragile because
if we call a theano function that updates the value of ``s_true``, the aliasing
relationship *may* or *may not* be broken (it depends on what the Theano
function does).
*Take home message:*
It is safe practice (and a good idea) to use ``borrow=True`` in a shared
variable constructor when the shared variable stands for a large object (in
terms of memory footprint) and you do not want to create copies of it in
memory.
It is not a reliable technique to use ``borrow=True`` to modify shared variables
by side-effect, because with some devices (e.g. GPU devices) this technique will
not work.
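The copy-versus-alias distinction can be illustrated with plain numpy (a sketch of the semantics only, not Theano code):

```python
import numpy as np

# borrow=False corresponds to storing a copy of the caller's array;
# borrow=True (on a CPU device) corresponds to storing the array itself.
np_array = np.ones(2, dtype='float32')
stored_copy = np_array.copy()    # like borrow=False (the default)
stored_alias = np_array          # like borrow=True on the CPU

np_array += 1                    # mutate the caller's array
print(stored_copy)               # [1. 1.] -- unaffected
print(stored_alias)              # [2. 2.] -- sees the change
```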
Borrowing when accessing value of shared variables
==================================================
Retrieving
----------
A ``borrow`` argument can also be used to control how a shared variable's value is retrieved.
.. code-block:: python
s = shared(np_array)
v_false = s.get_value(borrow=False) # N.B. borrow default is False
v_true = s.get_value(borrow=True)
When ``borrow=False`` is passed to ``get_value``, it means that the return value
may not be aliased to any part of Theano's internal memory.
When ``borrow=True`` is passed to ``get_value``, it means that the return value
*might* be aliased to some of Theano's internal memory.
But both of these calls might create copies of the internal memory.
The reason that ``borrow=True`` might still make a copy is that the internal
representation of a shared variable might not be what you expect. When you
create a variable by passing a numpy array for example, then ``get_value()``
must return a numpy array too. That's how Theano can make the GPU use
transparent. But when you are using a GPU (or in future perhaps a remote machine), then the numpy.ndarray
is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*,
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never copy the internal object (it always returns in
constant time), but it might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the numpy array).
.. code-block:: python
v_internal = s.get_value(borrow=True, return_internal_type=True)
It is possible to use ``borrow=False`` in conjunction with
``return_internal_type=True``, which will return a deep copy of the internal object.
This is primarily for internal debugging, not for typical use.
*Take home message:*
It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
your code does not modify the return value. *Do not use this to modify a shared
variable by side-effect* because it will make your code device-dependent.
Modification of GPU variables by this sort of side-effect is impossible.
Assigning
---------
Shared variables also have a ``set_value`` method that can accept an optional ``borrow=True`` argument.
The semantics are similar to those of creating a new shared variable -
``borrow=False`` is the default and ``borrow=True`` means that Theano *may*
reuse the buffer you provide as the internal storage for the variable.
A standard pattern for manually updating the value of a shared variable is as
follows.
.. code-block:: python
s.set_value(
some_inplace_fn(s.get_value(borrow=True)),
borrow=True)
This pattern works regardless of the compute device, and when the compute device
makes it possible to expose Theano's internal variables without a copy, then it
goes as fast as an in-place update.
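A numpy-only sketch of why this pattern avoids copies (``some_inplace_fn`` is the hypothetical function name used above):

```python
import numpy as np

s_buffer = np.arange(4, dtype='float32')   # stands in for the shared storage

def some_inplace_fn(a):
    a *= 2          # modifies its argument in place and returns it
    return a

# Read the buffer without copying, update it in place, hand it back:
s_buffer = some_inplace_fn(s_buffer)
print(s_buffer)     # [0. 2. 4. 6.]
```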
Borrowing when constructing Function objects
============================================
A ``borrow`` argument can also be provided to the ``In`` and ``Out`` objects
that control how ``theano.function`` handles its arguments and return values.
.. code-block:: python
import theano, theano.tensor
x = theano.tensor.matrix()
y = 2*x
f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))
Borrowing an input means that Theano will treat the argument you provide as if
it were part of Theano's pool of temporaries. Consequently, your input
may be reused as a buffer (and overwritten!) during the computation of other variables in the
course of evaluating that function (e.g. ``f``).
Borrowing an output means that Theano will not insist on allocating a fresh
output buffer every time you call the function. It will possibly reuse the same one as
a previous call, and overwrite the old contents. Consequently, it may overwrite
old return values by side effect.
It is also possible to pass a ``return_internal_type=True`` flag to the ``Out``
variable; it has the same interpretation as the ``return_internal_type`` flag
to the shared variable's ``get_value`` function.
*Take home message:*
When an input ``x`` to a function is not needed after the function returns and you
would like to make it available to Theano as additional workspace, then consider
marking it with ``In(x, borrow=True)``. It may make the function faster and
reduce its memory requirement.
When a return value ``y`` is large (in terms of memory footprint) and you only
need to read from it once, right away when it is returned, then consider
marking it with ``Out(y, borrow=True)``.
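A numpy-only sketch of the hazard that makes "read it once, right away" the safe usage (a made-up function standing in for a Theano function with a borrowed output):

```python
import numpy as np

_buffer = np.zeros(3)          # stands in for a reused output buffer

def f(x):
    _buffer[...] = 2 * x       # every call writes into the same buffer
    return _buffer

saved = f(np.array([1.0, 1.0, 1.0]))   # a reference, not a copy
second = f(np.array([5.0, 5.0, 5.0]))  # overwrites the buffer
print(saved)                           # [10. 10. 10.] -- the old result is gone
```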
@@ -144,6 +144,8 @@ array([[ 0.25 , 0.19661193],
The resulting function computes the gradient of its first argument
with respect to the second. In this way, Theano can be used for
`automatic differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_.
Contrary to what that page suggests, Theano performs efficient symbolic
differentiation even for functions with many inputs.
.. note::
......
@@ -510,7 +510,7 @@ class Function(object):
        # Set positional arguments
        i = 0
        for arg_index, arg in enumerate(args):
            #TODO: provide a Param option for skipping the filter if we
            #      really want speed.
            s = self.input_storage[i]
@@ -520,7 +520,7 @@ class Function(object):
            try:
                s.storage[0] = s.type.filter(arg, strict=s.strict)
            except Exception, e:
                e.args = tuple(list(e.args) + ["Bad input argument at index %d" % arg_index])
                raise
            s.provided += 1
            i += 1
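The motivation for this change can be demonstrated in isolation: ``list.index`` finds the *first* equal element, so duplicated arguments could be reported at the wrong index.

```python
args = ['a', 'b', 'a']   # suppose the argument at index 2 fails filtering
bad = args[2]

wrong_index = list(args).index(bad)      # 0 -- first occurrence, not index 2

right_index = None
for arg_index, arg in enumerate(args):   # what the fixed loop does instead
    if arg_index == 2:
        right_index = arg_index
print(wrong_index, right_index)          # 0 2
```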
......
import os
import subprocess
import logging

from theano.configparser import TheanoConfigParser, AddConfigVar, EnumStr, StrParam, IntParam, FloatParam, BoolParam

_logger = logging.getLogger('theano.configdefaults')
def warning(*msg):
    _logger.warning('WARNING theano.configdefaults: '+' '.join(msg))

config = TheanoConfigParser()

AddConfigVar('floatX',
@@ -24,10 +31,22 @@ AddConfigVar('mode',
    "Default compilation mode",
    EnumStr('Mode', 'ProfileMode', 'DebugMode', 'FAST_RUN', 'FAST_COMPILE', 'PROFILE_MODE', 'DEBUG_MODE'))

# Test whether or not gcc is present: disable C code if it is not
try:
    subprocess.Popen('gcc', stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Keep the default linker the same as the one for the mode FAST_RUN
    AddConfigVar('linker',
            "Default linker. If not None, will use this linker with the Mode " +
            "object (not ProfileMode or DebugMode)",
            EnumStr('c|py', 'py', 'c', 'c|py_nogc', 'c&py'))
except OSError:
    # gcc is not present, linker should default to python only
    AddConfigVar('linker',
            "Default linker. If not None, will use this linker with the Mode object (not ProfileMode or DebugMode)",
            EnumStr('py', 'c|py', 'c', 'c|py_nogc', 'c&py'))
    warning('GCC not detected ! Theano will be unable to execute optimized ' +
            'C-implementations (for both CPU and GPU) and will default to ' +
            'Python implementations. Performance will be severely degraded.')
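The detection logic above can be exercised on its own (a sketch; ``gcc_present`` is a hypothetical helper name, not part of Theano):

```python
import subprocess

def gcc_present():
    # Spawning a missing executable raises OSError, which is exactly the
    # condition the config code above uses to fall back to the 'py' linker.
    try:
        proc = subprocess.Popen('gcc', stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        proc.communicate()   # reap the child process
        return True
    except OSError:
        return False

default_linker = 'c|py' if gcc_present() else 'py'
print(default_linker)
```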
#Keep the default optimizer the same as the one for the mode FAST_RUN
AddConfigVar('optimizer',
@@ -88,19 +107,23 @@ AddConfigVar('experimental.mrg',
###
### To disable some warnings about old bugs that are fixed now.
###
AddConfigVar('warn.old_bug_default',
        "If False, the warnings about old Theano bugs are disabled by default. If you have never used Theano before, you can set this to False.",
        BoolParam(True))

default_warn = config.warn.old_bug_default

AddConfigVar('warn.argmax_pushdown_bug',
        "Warn if past versions of Theano generated a bug with the theano.tensor.nnet.nnet.local_argmax_pushdown optimization. Was fixed on 2010-05-27.",
        BoolParam(default_warn))

AddConfigVar('warn.gpusum_01_011_0111_bug',
        "Warn if we are in a case where old versions of Theano had a silent bug with the GpuSum patterns 01, 011 and 0111 when the first dimension was bigger than 4096. Was fixed on 2010-05-31.",
        BoolParam(default_warn))

AddConfigVar('warn.sum_sum_bug',
        "Warn if we are in a case where Theano versions between rev. 9923a40c7b7a and 2010-08-02 (fixed date) generated bad code. This happened when there were 2 consecutive sums in the graph.",
        BoolParam(default_warn))

AddConfigVar('warn.sum_div_dimshuffle_bug',
        "Warn if previous versions of Theano (between rev. 3bd9b789f5e8, 2010-06-16, and cfc6322e5ad4, 2010-08-03) would have given incorrect results. This bug was triggered by a sum of a division of dimshuffled tensors.",
        BoolParam(default_warn))
@@ -50,7 +50,7 @@ class CLinkerObject(object):
        - `MethodNotDefined`: Subclass does not implement this method

        """
        raise utils.MethodNotDefined("c_header_dirs", type(self), self.__class__.__name__)

    def c_libraries(self):
        """Optional: Return a list of libraries required by code returned by
......
@@ -23,7 +23,7 @@ def debugprint(obj, depth=-1, print_type=False, file=None):

:type file: None, 'str', or file-like object
:param file: print to this file ('str' means to return a string)
:returns: string if `file` == 'str', else file arg

Each line printed represents a Variable in the graph.
The indentation of each line corresponds to its depth in the symbolic graph.
......
(Diff collapsed.)
@@ -759,28 +759,47 @@ def test_many_arg_elemwise():
     rng = numpy.random.RandomState([1,2,3])
     for num_args in [25]:
-        rows = rng.randint(1,5)
-        cols = rng.randint(1,5)
         for op_to_test in [theano.tensor.add, theano.tensor.mul]:
-            args = [numpy.cast['float32'](rng.randn(rows,cols)) for arg in xrange(0,num_args)]
-            symb_args = [theano.tensor.fmatrix() for arg in xrange(0,num_args)]
+            for nb_dim in [2,3,4,5]:
+                shapes = [rng.randint(1,5) for i in range(nb_dim)]
+                args = [numpy.cast['float32'](rng.randn(*shapes)) for arg in xrange(0,num_args)]
+                symb_args = [theano.tensor.TensorType('float32', (False,)*nb_dim)() for arg in xrange(0,num_args)]
                 outputs = []
                 for mode in [mode_with_gpu, mode_without_gpu]:
-                    f = theano.function(symb_args, op_to_test(*symb_args), mode=mode)
-                    #theano.printing.debugprint(f)
+                    # test the optimization local_gpu_elemwise_0
+                    f = theano.function(symb_args, op_to_test(*symb_args),
+                                        mode=mode.excluding("local_gpu_elemwise_1"))
                     outputs.append(f(*args))
                     # assert that the test was done on the gpu.
                     if mode is mode_with_gpu:
                         assert any([isinstance(node.op, cuda.GpuElemwise) for node in f.maker.env.nodes])
+                    # test the optimization local_gpu_elemwise_1
+                    f = theano.function(symb_args,
+                                        cuda.gpu_from_host(op_to_test(*symb_args)),
+                                        mode=mode.excluding("local_gpu_elemwise_0"))
+                    out = f(*args)
+                    # assert that the test was done on the gpu.
+                    if mode is mode_with_gpu:
+                        assert any([isinstance(node.op, cuda.GpuElemwise) for node in f.maker.env.nodes])
+                    assert numpy.allclose(out, outputs[-1])
                 results_gpu, results_cpu = outputs
                 assert numpy.allclose(results_gpu, results_cpu)
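The updated test above loops over tensors of 2 to 5 dimensions instead of matrices only. Its core checking technique — build many random arrays, apply the elemwise op, compare against a reference — can be sketched with NumPy alone (no GPU; `num_args` and the shape bounds are the test's own arbitrary choices):

```python
import numpy
from functools import reduce

rng = numpy.random.RandomState([1, 2, 3])
num_args = 25
for nb_dim in [2, 3, 4, 5]:
    shapes = [rng.randint(1, 5) for _ in range(nb_dim)]
    args = [rng.randn(*shapes).astype('float32') for _ in range(num_args)]
    # Reference results: fold the elemwise op over all arguments.
    ref_add = reduce(numpy.add, args)
    ref_mul = reduce(numpy.multiply, args)
    # Any elemwise implementation must match the fold within float32 tolerance.
    assert numpy.allclose(sum(args), ref_add, atol=1e-5)
    assert numpy.allclose(numpy.prod(numpy.stack(args), axis=0), ref_mul, atol=1e-4)
```

This is only the CPU reference half of the check; the committed test additionally runs the same graph through both GPU optimizations and asserts the outputs agree.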
def test_duplicate_arg_elemwise():
    A = theano.tensor.fmatrix()
    B = A + A
    f = theano.function([A], B, mode=mode_with_gpu)
    Aval = numpy.random.RandomState([1,2,3]).randn(5,5)
    Bval = Aval + Aval
    assert numpy.allclose(Bval, f(Aval))
...
 import sys, time
-from theano.compile.sharedvalue import shared
-import numpy
-# Skip test if cuda_ndarray is not available.
-from nose.plugins.skip import SkipTest
 from theano.compile.pfunc import pfunc
 from theano import tensor
 import theano
+import numpy
+# Skip test if cuda_ndarray is not available.
+from nose.plugins.skip import SkipTest
 import theano.sandbox.cuda as cuda
 if cuda.cuda_available == False:
     raise SkipTest('Optional package cuda disabled')
+import theano.compile.mode
 from theano.sandbox.cuda.type import CudaNdarrayType
 if theano.config.mode=='FAST_COMPILE':
@@ -49,6 +49,9 @@ def test_int_pow():
     #theano.printing.debugprint(f)
 def test_softmax():
     x = tensor.fmatrix()
......
import sys, time
from theano import shared
from theano.compile.pfunc import pfunc
from theano import tensor
import numpy
import theano
import theano.tensor as TT
# Skip test if cuda_ndarray is not available.
from nose.plugins.skip import SkipTest
import theano.sandbox.cuda as cuda_ndarray
if cuda_ndarray.cuda_available == False:
raise SkipTest('Optional package cuda disabled')
import theano.sandbox.cuda as tcn
import theano.sandbox.cuda as cuda
import theano.sandbox.cuda.basic_ops as B
import theano.sandbox.cuda.blas as blasop
import theano.compile.mode
from theano.tests import unittest_tools as utt
### Tolerance factor used in these tests
atol = 1e-6
##########################
if theano.config.mode=='FAST_COMPILE':
mode_with_gpu = theano.compile.mode.get_mode('FAST_RUN').including('gpu')
mode_without_gpu = theano.compile.mode.get_mode('FAST_RUN').excluding('gpu')
else:
mode_with_gpu = theano.compile.mode.get_default_mode().including('gpu')
mode_without_gpu = theano.compile.mode.get_default_mode().excluding('gpu')
def test_dot_vm():
''' Test vector dot matrix '''
v = theano.shared( numpy.array(numpy.random.rand(2), dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2),
dtype='float32'))
no_gpu_f = theano.function([], theano.dot(v,m), mode = mode_without_gpu)
gpu_f = theano.function([], theano.dot(v,m), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuDot22) for node in
gpu_f.maker.env.toposort() ]) == 1
def test_dot_mv():
''' Test matrix dot vector '''
v = theano.shared( numpy.array(numpy.random.rand(2), dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2),
dtype='float32'))
no_gpu_f = theano.function([], theano.dot(m,v), mode = mode_without_gpu)
gpu_f = theano.function([], theano.dot(m,v), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuDot22) for node in
gpu_f.maker.env.toposort() ]) == 1
def test_gemv1():
''' Is this the same test as test_gemv2 ? '''
v1 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
v2 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2), dtype='float32'))
no_gpu_f = theano.function([], v2+theano.dot(m,v1), mode = mode_without_gpu)
gpu_f = theano.function([], v2+theano.dot(m,v1), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuGemm) for node in
gpu_f.maker.env.toposort() ]) == 1
def test_gemv2():
''' Is this the same test as test_gemv1 ? '''
v1 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
v2 = theano.shared( numpy.array(numpy.random.rand(2) , dtype='float32'))
m = theano.shared( numpy.array(numpy.random.rand(2,2), dtype='float32'))
no_gpu_f = theano.function([], v2+theano.dot(v1,m), mode = mode_without_gpu)
gpu_f = theano.function([], v2+theano.dot(v1,m), mode = mode_with_gpu)
# Assert they produce the same output
assert numpy.allclose(no_gpu_f(), gpu_f(), atol = atol)
# Assert that the gpu version actually uses gpu
assert sum([isinstance(node.op, blasop.GpuGemm) for node in
gpu_f.maker.env.toposort() ]) == 1
if __name__=='__main__':
test_dot_vm()
test_dot_mv()
test_gemv1()
test_gemv2()
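The four tests above compare CPU and GPU results for vector/matrix dot products. A plain NumPy reference for what `GpuGemm` must reproduce — the BLAS gemv contract `y = alpha*A@x + beta*y` — sketched with illustrative values (the shapes and seed are not the test's):

```python
import numpy

rng = numpy.random.RandomState(0)
v1 = rng.rand(2).astype('float32')
v2 = rng.rand(2).astype('float32')
m = rng.rand(2, 2).astype('float32')

# test_gemv1 computes v2 + dot(m, v1); in BLAS gemv form: y = alpha*A@x + beta*y
alpha, beta = numpy.float32(1.0), numpy.float32(1.0)
assert numpy.allclose(v2 + numpy.dot(m, v1), alpha * m.dot(v1) + beta * v2, atol=1e-6)

# test_gemv2 computes v2 + dot(v1, m), i.e. gemv on the transposed matrix
assert numpy.allclose(v2 + numpy.dot(v1, m), alpha * m.T.dot(v1) + beta * v2, atol=1e-6)
```

So the only difference between `test_gemv1` and `test_gemv2` is which side the vector multiplies on, which a gemm/gemv implementation handles via the transpose flag.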
@@ -15,7 +15,7 @@ class DebugLinker(gof.WrapLinker):
                  copy_originals = False,
                  check_types = True,
                  compare_variables = True,
-                 compare_fn = lambda x, y: x == y):
+                 compare_fn = (lambda x, y: x == y)):
         gof.WrapLinker.__init__(self,
                                 linkers = linkers,
                                 wrapper = self.wrapper)
...
@@ -46,9 +46,6 @@ class Images2Neibs(Op):
         return Apply(self, [ten4, neib_shape, neib_step], [T.matrix(dtype=ten4.type.dtype)])
-    def grad(self, (pvals, unis), (gz,)):
-        return [None, None]
     def c_code_cache_version(self):
         return (3,)
@@ -224,6 +221,11 @@ def neibs2images(neibs, neib_shape, original_shape):
     Return a 4d tensor of shape `original_shape`.
     """
+    # TODO: handle the case where patches overlap
+    # TODO: handle the case where patches are not directly adjacent
+    # TODO: at least separate these cases so that the following code does not
+    #       incorrectly handle them by accident.
+    raise NotImplementedError('check for overlapping patches or non-adjacent patches.')
     neibs = T.as_tensor_variable(neibs)
     neib_shape = T.as_tensor_variable(neib_shape)
     original_shape = T.as_tensor_variable(original_shape)
...
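`neibs2images` above now refuses cases it cannot invert (overlapping or non-adjacent patches). For the simple case it does support — non-overlapping, directly adjacent patches — the forward direction (what `Images2Neibs` produces) can be sketched in plain NumPy with reshape/transpose. This is an illustration of the layout, not the Op's actual C implementation, and `im2neibs` is a hypothetical name:

```python
import numpy

def im2neibs(ten4, neib_shape):
    """Cut each 2-d image of a (b, c, h, w) tensor into non-overlapping
    neib_shape patches; return a matrix with one flattened patch per row.
    Assumes h and w are exact multiples of the patch shape (the supported case)."""
    b, c, h, w = ten4.shape
    ph, pw = neib_shape
    assert h % ph == 0 and w % pw == 0
    t = ten4.reshape(b, c, h // ph, ph, w // pw, pw)
    # group the two patch axes last, then flatten each patch into a row
    t = t.transpose(0, 1, 2, 4, 3, 5)
    return t.reshape(-1, ph * pw)

x = numpy.arange(16).reshape(1, 1, 4, 4)
print(im2neibs(x, (2, 2)))
# first row is the top-left 2x2 patch: [0, 1, 4, 5]
```

Inverting this (the `neibs2images` direction) is just the same reshapes and transpose run backwards, which is exactly why overlap would make the inverse ill-defined.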
@@ -3,22 +3,27 @@ import unittest
 from nose.plugins.skip import SkipTest
 import numpy
-import scipy.sparse as sp
-import scipy.sparse
+try:
+    import scipy.sparse as sp
+    import scipy.sparse
+except ImportError:
+    pass  # the variable enable_sparse will be used to disable the test file.
 import theano
 from theano import compile
+from theano.sparse import enable_sparse
+if enable_sparse == False:
+    raise SkipTest('Optional package sparse disabled')
 from theano.sparse.basic import _is_dense, _is_sparse, _is_dense_variable, _is_sparse_variable
 from theano.sparse.basic import _mtypes
-from theano.sparse import as_sparse_variable, enable_sparse, CSC, CSR, CSM, CSMProperties, SparseType, StructuredDotCSC
+from theano.sparse import as_sparse_variable, CSC, CSR, CSM, CSMProperties, SparseType, StructuredDotCSC
 from theano.sparse import add, structured_dot, transpose
 from theano.sparse import csc_from_dense, csr_from_dense, dense_from_sparse
 from theano.tests import unittest_tools as utt
 from theano import tensor
-if enable_sparse == False:
-    raise SkipTest('Optional package sparse disabled')
 def eval_outputs(outputs):
     return compile.function([], outputs)()[0]
...
@@ -634,8 +634,6 @@ class TensorType(Type):
     def c_extract(self, name, sub):
         """Override `CLinkerOp.c_extract` """
-        # TODO: make the error message print out the dtype of the
-        # input received.
         return """
         %(name)s = NULL;
         if (py_%(name)s == Py_None) {
@@ -649,11 +647,13 @@ class TensorType(Type):
             PyErr_SetString(PyExc_ValueError, "expected an ndarray");
             %(fail)s
         }
+        type_num_%(name)s = ((PyArrayObject*)py_%(name)s)->descr->type_num; //we expect %(type_num)s
         if (!PyArray_ISALIGNED(py_%(name)s)) {
-            PyErr_SetString(PyExc_NotImplementedError, "expected an aligned array");
+            PyErr_Format(PyExc_NotImplementedError,
+                         "expected an aligned array of type %%d (%(type_num)s), got non-aligned array of type %%d",
+                         %(type_num)s, type_num_%(name)s);
             %(fail)s
         }
-        type_num_%(name)s = ((PyArrayObject*)py_%(name)s)->descr->type_num; //we expect %(type_num)s
         if (type_num_%(name)s != %(type_num)s) {
             PyErr_Format(PyExc_ValueError, "expected type_num %%d (%(type_num)s) got %%d", %(type_num)s, type_num_%(name)s);
             %(fail)s
@@ -2230,7 +2230,11 @@ class Subtensor(Op):
     can additionally be a Scalar instance, and slice components can also be Scalar instances
     too.
     """
-    e_invalid = 'The index list is longer than the number of dimensions of the tensor.'
+    e_invalid = ('The index list is longer (size %d) than the number of '
+                 'dimensions of the tensor (namely %d). You are asking for '
+                 'a dimension of the tensor that does not exist! You might '
+                 'need to use dimshuffle to add extra dimensions to your '
+                 'tensor.')
     e_subslice = 'nested slicing is not supported'
     e_indextype = "Invalid index type or slice for Subtensor"
     debug = 0
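The enriched error message above tells the user to add dimensions when indexing with more indices than the tensor has. The NumPy analogue of both the error and the suggested fix, as a hedged illustration (NumPy's `newaxis` plays the role of dimshuffle's `'x'`):

```python
import numpy

m = numpy.zeros((3, 4))          # a 2-d tensor
try:
    m[0, 0, 0]                   # 3 indices into a 2-d array: too many
except IndexError as e:
    print("too many indices:", e)

# The fix the message suggests: add a broadcastable dimension first
# (NumPy's equivalent of dimshuffle(0, 1, 'x')).
m3 = m[:, :, numpy.newaxis]      # shape (3, 4, 1)
print(m3[0, 0, 0])               # now a valid index
```

Attaching `subtensor_invalid = True` to the exception (rather than comparing `e[0] is Subtensor.e_invalid`) lets the tests recognize the error even though the message is now formatted with the actual sizes.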
@@ -2246,7 +2250,7 @@ class Subtensor(Op):
         elif isinstance(entry, slice):
             helper(entry.start)
             helper(entry.stop)
-            helper(entry.step)
+            helper( entry.step)
         for idx in idxs:
             helper(idx)
         return ret
@@ -2315,8 +2319,10 @@ class Subtensor(Op):
         idx_list = list(self.idx_list)
         if len(idx_list) > x.type.ndim:
-            raise ValueError(Subtensor.e_invalid,
-                             (len(idx_list), x.type.ndim))
+            exception = ValueError(Subtensor.e_invalid % (len(idx_list),
+                                                          x.type.ndim))
+            exception.subtensor_invalid = True
+            raise exception
         #infer the broadcasting pattern
         padded = idx_list + [slice(0,sys.maxint,1)] * (x.type.ndim - len(idx_list))
@@ -2595,8 +2601,10 @@ class IncSubtensor(Op):
         idx_list = list(self.idx_list)
         if len(idx_list) > x.type.ndim:
-            raise ValueError(Subtensor.e_invalid,
-                             (len(idx_list), x.type.ndim))
+            exception = ValueError(Subtensor.e_invalid % (len(idx_list),
+                                                          x.type.ndim))
+            exception.subtensor_invalid = True
+            raise exception
         #infer the broadcasting pattern
         padded = idx_list + [slice(0,sys.maxint,1)] * (x.type.ndim - len(idx_list))
...
@@ -101,6 +101,11 @@ class DimShuffle(Op):
         self.new_order = new_order
         self.inplace = inplace
+        for i in xrange(len(new_order)-1):
+            j = new_order[i]
+            if j != 'x' and j in new_order[i+1:]:
+                raise ValueError("The same input dimension may not appear twice in the list of output dimensions", (new_order))
         # list of dimensions of the input to drop
         self.drop = []
         i2j = {}  # this maps i before dropping dimensions to j after dropping dimensions so self.shuffle can be set properly later on
...
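The new constructor check above rejects a `new_order` that names the same input dimension twice, while still allowing `'x'` (a broadcastable dimension) to appear any number of times. The validation logic, isolated as a small runnable sketch (`check_new_order` is a hypothetical name for illustration):

```python
def check_new_order(new_order):
    """Raise ValueError if any non-'x' entry appears more than once,
    mirroring the check added to the DimShuffle constructor above."""
    for i in range(len(new_order) - 1):
        j = new_order[i]
        if j != 'x' and j in new_order[i + 1:]:
            raise ValueError(
                "The same input dimension may not appear twice "
                "in the list of output dimensions", new_order)

check_new_order((0, 'x', 1))        # fine: 'x' may repeat, real dims may not
check_new_order(('x', 'x', 0))      # also fine
try:
    check_new_order((0, 0))
except ValueError as e:
    print(e.args[0])
```

Without this check, a duplicated dimension like `(0, 0)` would silently produce a graph whose gradient is wrong, which is why it is rejected eagerly at construction time.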
@@ -848,8 +848,38 @@ class ConvOp(Op):
         using namespace std;
         """ + tensor.blas.blas_header_text()
+    def use_blas(self):
+        """Return True if we will generate code that uses gemm."""
+        # the gemm version only supports this case
+        if self.out_mode == 'valid' and self.dx == 0 and self.dy == 0:
+            # We use a faster version in those cases.
+            if (self.imshp != self.imshp_logical or self.kshp != self.kshp_logical
+                    or self.unroll_patch or self.unroll_batch > 0 or self.unroll_kern > 0):
+                return False
+            return True
+        return False
     def c_libraries(self):
-        return tensor.blas.ldflags()
+        if self.use_blas():
+            return tensor.blas.ldflags()
+        return []
+    def c_compile_args(self):
+        if self.use_blas():
+            return tensor.blas.ldflags(libs=False, flags=True)
+        return []
+    def c_lib_dirs(self):
+        if self.use_blas():
+            return tensor.blas.ldflags(libs=False, libs_dir=True)
+        return []
+    def c_header_dirs(self):
+        if self.use_blas():
+            return tensor.blas.ldflags(libs=False, include_dir=True)
+        return []
     def c_code(self, node, name, (img2d, filtersflipped), (z, ), sub):
         if node.inputs[0].type.dtype != node.inputs[1].type.dtype:
...
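`use_blas` centralizes the decision of whether the generated C code links against gemm, and all four `c_*` methods defer to it. The pattern — one predicate, several flag getters that return nothing when it is false — sketched with hypothetical flag values (the strings are illustrative, not Theano's real `ldflags` output):

```python
class ConvFlags:
    """Sketch of the use_blas gating pattern in ConvOp."""
    def __init__(self, out_mode='valid', dx=0, dy=0, unroll_patch=False):
        self.out_mode, self.dx, self.dy = out_mode, dx, dy
        self.unroll_patch = unroll_patch

    def use_blas(self):
        # The gemm code path only supports non-subsampled 'valid'
        # convolution without manual unrolling.
        if self.out_mode == 'valid' and self.dx == 0 and self.dy == 0:
            return not self.unroll_patch
        return False

    def c_libraries(self):
        return ['blas'] if self.use_blas() else []

    def c_compile_args(self):
        return ['-O3'] if self.use_blas() else []

print(ConvFlags().c_libraries())                  # ['blas']
print(ConvFlags(out_mode='full').c_libraries())   # []
```

Gating the link flags this way avoids pulling a BLAS dependency into builds of the Op that never call gemm.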
@@ -1983,7 +1983,6 @@ def local_sum_alloc(node):
     if summed.owner and isinstance(summed.owner.op, T.Alloc):
         input = summed.owner.inputs[0]
         shapes = summed.owner.inputs[1:]
-        #import pdb;pdb.set_trace()
         if node.op.axis is None or node.op.axis == tuple(range(input.ndim)):
             try:
                 val = get_constant_value(input)
@@ -3053,7 +3052,6 @@ def local_elemwise_fusion_op(OP):
         ret = local_fuse(n)
         if ret is not False and ret is not None:
             #print n,ret
-            #import pdb;pdb.set_trace()
             assert len(ret)==len(n.outputs)
             assert len(ret)==1
             n = ret[0].owner
...
@@ -1304,7 +1304,7 @@ class T_subtensor(unittest.TestCase):
         try:
             t = n[0]
         except ValueError, e:
-            self.failUnless(e[0] is Subtensor.e_invalid)
+            self.failUnless(hasattr(e, 'subtensor_invalid'))
             return
         self.fail()
@@ -1356,7 +1356,7 @@ class T_subtensor(unittest.TestCase):
         try:
             t = n[0,0]
         except ValueError, e:
-            self.failUnless(e[0] is Subtensor.e_invalid)
+            self.failUnless(hasattr(e, 'subtensor_invalid'))
             return
         self.fail()
     def test1_ok_elem(self):
@@ -3342,6 +3342,20 @@ def test_unalign():
     if not should_raise:
         raise Exception("Theano raised an exception when none was expected")
+def test_dimshuffle_duplicate():
+    x = theano.tensor.vector()
+    success = False
+    try:
+        y = theano.tensor.DimShuffle((False, ), (0, 0))(x)
+    except ValueError, e:
+        assert str(e).find("may not appear twice") != -1
+        success = True
+    assert success
 if __name__ == '__main__':
     if 1:
         unittest.main()
...
@@ -2087,6 +2087,13 @@ if __name__ == '__main__':
 #    unittest.main()
     test_fusion().tes_memory_leak()
+def test_local_mul_to_neg():
+    """
+    Test that a multiplication by -1 or -1.0 yields the appropriate data type
+    """
+    a = T.imatrix()
+    f1 = theano.function([a], -1*a)
+    f2 = theano.function([a], -1.0*a)
+    aval = numpy.random.randint(0,10,(2,2))
+    assert f1(aval).dtype == a.dtype
+    assert f2(aval).dtype == 'float64'
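`test_local_mul_to_neg` checks that rewriting `-1 * a` as a negation preserves the integer dtype, while `-1.0 * a` still upcasts to float64. NumPy follows the same casting rule, which can be checked directly:

```python
import numpy

a = numpy.random.randint(0, 10, (2, 2)).astype('int32')
print((-1 * a).dtype)    # int32: an integer scalar keeps the integer dtype
print((-1.0 * a).dtype)  # float64: a float scalar upcasts the result
```

This is exactly why the optimization must distinguish the two constants: replacing `-1.0 * a` with `-a` would silently change the output dtype.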