Commit a3282db2 authored by abalkin

Merge remote-tracking branch 'upstream/master' into take-op-c-code-clean

# Prevent git from showing duplicate names with commands like "git shortlog"
# See the manpage of git-shortlog for details.
# The syntax is:
# Name that should be used <email that should be used> Bad name <bad email>
#
# You can skip Bad name if it is the same as the one that should be used, and is unique.
#
# This file is up-to-date if the command git log --format="%aN <%aE>" | sort -u
# gives no duplicates.
<abergeron@gmail.com> <anakha@kami.(none)>
David Warde-Farley <wardefar@iro.umontreal.ca> David Warde-Farley <dwf@cs.toronto.edu>
David Warde-Farley <wardefar@iro.umontreal.ca> David Warde Farley <dwf@cs.toronto.edu>
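The comments above describe an up-to-date check. A small Python sketch of the same idea (a hypothetical helper, not part of git): group the unique `Name <email>` entries from `git log --format="%aN <%aE>"` by name, and flag names that still map to several emails as candidates for a .mailmap entry.

```python
def mailmap_candidates(log_lines):
    """Group unique 'Name <email>' entries by name; a name mapped to
    several emails is a candidate for a .mailmap entry."""
    by_name = {}
    for entry in sorted(set(log_lines)):
        name, _, email = entry.partition(' <')
        by_name.setdefault(name, []).append(email.rstrip('>'))
    return {n: emails for n, emails in by_name.items() if len(emails) > 1}

# Illustrative input, mimicking `git log --format="%aN <%aE>"` output.
log = [
    "David Warde-Farley <wardefar@iro.umontreal.ca>",
    "David Warde-Farley <dwf@cs.toronto.edu>",
    "abalkin <abergeron@gmail.com>",
]
dups = mailmap_candidates(log)
```

Note this only catches the same name used with different emails; different-name aliases (like "David Warde Farley") still need a manual .mailmap line.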
......
...@@ -4,6 +4,131 @@
Release Notes
=============
Theano in the development version since 0.6rc2
==============================================
up to merged PR gh-1220
Highlights:
* Speed-ups.
* Crash fixes.
* A few small interface changes.
* GPU memory leak fix.
* A few corner-case fixes with no user-visible impact.
* More Theano determinism.
* tensor.{dot,tensordot} more complete/faster/more GPU friendly.
  * tensor.tensordot now supports Rop/Lop
  * tensor.dot supports n-dimensional inputs, as in NumPy
* Support for more NumPy syntax:
  * Add theano.tensor.take()
  * Add a_tensor_variable.{sort,dot,std,argmin,argmax,argsort,clip,conj,conjugate,repeat,round,trace,real,imag,take}
Committers for this rc2 only:
Bug fix:
* Fix memory leak on the GPU in some corner cases with the Theano flag `allow_gc=False`. (Frederic B., reported by Jonas Gehring)
* Fix copy of random state between graphs. (Guillaume D.)
  http://deeplearning.net/software/theano/tutorial/examples.html#copying-random-state-between-theano-graphs
* Fix wrong dtype in sandbox.linalg.ExtractDiag with shape of 0. (Frederic B., reported by abalkin)
* Correctly support arrays with more than 2*10e32 elements in AdvancedSubtensor1. (Abalkin)
* Fix wrong broadcast dimensions of the output of the Repeat op. (Abalkin)
  We were using the input's broadcasting pattern in some cases when we should not have.
* Fix theano.sandbox.linalg.eigh grad, which did not always return the right dtype. (Frederic B., Olivier D.)
New Features:
* More Theano determinism (Ian G., Olivier D., Pascal L.)
  * Add and use a new class OrderedSet.
  * Modify theano.grad to be deterministic.
  * Warn when using a dict as the updates argument to theano.compile.function, since this makes the returned function non-deterministic.
  * The Updates class was not appropriate for representing updates because it is non-deterministic; it was replaced by the OrderedUpdates class.
* Implemented GpuContiguous.grad. (Ian G.)
* tensor.tensordot now supports Rop/Lop. (Jeremiah Lowin)
  This removes the TensorDot and TensorDotGrad classes; the Dot/Elemwise ops are used instead.
* tensor.dot supports n-dimensional inputs, as in NumPy. (Jeremiah Lowin)
  Works on the GPU too.
* The Theano flag `nvcc.flags` now accepts `-ftz=true`, `--prec-div=false` and `--prec-sqrt=false` as values. (Frederic B.)
  To enable all of them, use the Theano flag `nvcc.flags=--use_fast_math`.
* New op theano.sparse.ConstructSparseFromList. (Rami Al-Rfou', Vivek Kulkarni)
* Make Theano work with Anaconda on Windows. (Pascal L.)
* Add tensor_var.diagonal and theano.tensor.{diag,diagonal}. (abalkin)
* AdvancedSubtensor1 can now have a sparse gradient. (Rami Al-Rfou', Vivek Kulkarni)
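The determinism items above hinge on replacing plain dicts with ordered containers for updates. A minimal sketch of the idea in plain Python (no Theano; the update expressions are illustrative placeholders):

```python
from collections import OrderedDict

# An updates mapping whose iteration order is part of the contract:
# compiling the same graph twice must visit the updates in the same
# order, which a plain dict did not guarantee in the Python versions
# Theano supported at the time.
updates = OrderedDict()
updates['w'] = 'w - lr * gw'  # illustrative placeholder expression
updates['b'] = 'b - lr * gb'  # illustrative placeholder expression

# Iteration order is the insertion order, on every run.
assert list(updates) == ['w', 'b']
```

This is why theano.function warns on a plain dict and why Updates was replaced by OrderedUpdates.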
Interface Deprecation (a warning is printed):
* theano.misc.strutil.renderString -> render_string (Ian G.)
* A warning is printed when a dictionary is used in certain places, as this makes Theano non-deterministic.
Interface Change:
* Raise an error when theano.shared is called with a Theano variable. (Frederic B.)
* Don't print warnings about pre-0.5 Theano bugs by default. (Frederic B.)
* Theano functions now always have a `name` field, defaulting to None. (Frederic B.)
* A Theano function's fct.fgraph has a copy of the function's name field. (Ian G.)
  This is needed so the fgraph knows it.
* The grad method, when asked to raise an error if there is no path between the variables, did not always do so; it returned the mathematically correct answer, 0. (Ian G.)
* get_constant_value() was renamed get_scalar_constant_value() and raises a new exception, tensor.basic.NotScalarConstantError. (Ian G.)
* theano.function raises an error when trying to replace inputs with the given parameter. (Olivier D.)
  This was doing nothing; the error message tells the user what they probably want to do.
New Interface (reuse existing functionality):
* tensor_var.sort() as a shortcut for theano.tensor.sort. (Jeremiah Lowin)
  We were already doing this for argsort.
* Add theano.tensor.take() and a_tensor_var.take() to support NumPy syntax. (abalkin)
* Add a_tensor_variable.{dot,std,argmin,argmax,argsort,clip,conj,conjugate,repeat,round,trace,real,imag}. (abalkin)
New debug feature:
* DebugMode prints more info when there is an error. (Frederic B.)
* Better profiling of test time with `theano-nose --time-profile`. (Frederic B.)
* Detection of infinite loops in the global optimizer. (Pascal L.)
* DebugMode.check_preallocated_output now also works on Theano function outputs. (Pascal L.)
Speed-ups:
* c_code for SpecifyShape op. (Frederic B.)
* The cross-entropy optimization now works when specify_shape is used. (Pascal L.)
* The Scan optimizations ScanSaveMem and PushOutDot1 are applied more frequently. (Razvan P., reported by Abalkin)
  A skipped-optimization warning was printed.
* dot(vector, vector) is now faster with some BLAS implementations. (Eric Hunsberger)
  OpenBLAS and others did not call {s,d}dot internally when we called {s,d}gemv; MKL did.
* Compilation speed-up: take the compiledir lock only for ops that generate c_code. (Frederic B)
* More scan optimizations. (Razvan P.)
  * Optimizations to make RNNs fast in Theano.
  * Optimize some cases of dot by moving them outside of Scan.
  * Move some sequences outside of scan too.
  * Merge more scan inputs, mostly a byproduct of other Scan optimizations.
* c_code for theano.sparse.AddSD. (Rami Al-Rfou', Vivek Kulkarni)
Crash Fixes:
* Fix crash related to dimshuffle. (abalkin)
* Fix crash at compilation. (Olivier D.)
* Fix openmp detection. (Pascal L.)
  It resulted in a crash with EPD on Windows.
* Fix for the new BLAS interface in SciPy. (Olivier D.)
  This fixes a crash with some development versions of SciPy.
* GpuSum works with bigger shapes when summing over the first dimension of a 3d tensor. (Frederic B., reported by Chris Currivan)
* Windows compilation crash fix. (Frederic B.)
* Make CrossentropySoftmax1HotWithBiasDx and CrossentropySoftmaxArgmax1HotWithBias support uint* dtype. (Frederic B., reported by Mark Fenner)
* Fix GpuSoftmax and GpuSoftmaxWithBias crash on GTX285. (Frederic B.)
* Fix crash due to a race condition when importing theano. (Ian G.)
* Fix crash from path problem with `theano-nose --batch`. (Abalkin)
* Fix crash with tensor.roll(Var, iscalar). (Frederic B., reported by Jeremiah Lowin)
* Fix compilation crash with llvm on Mac. (Abalkin)
* Fix the grad of Scan, which wrongly reported that there is no connection between cost and parameters. (Razvan P.)
* The infer-shape mechanism now forces broadcasted dimensions to have a shape known to be equivalent to one during compilation.
  Sometimes we were not able to know this before run time, and it resulted in a crash. (Frederic B.)
* Fix compilation problems on GPU on Windows. (Frederic B.)
Theoretical bugfix (a bug that won't happen with current Theano code, but that could have affected you if you messed with the internals):
* GpuContiguous now checks the preallocated output's strides before using it. (Pascal L.)
Others:
* Fix race condition when determining if g++ is available. (Abalkin)
* Documentation improvements. (Many people including David W-F, abalkin, Amir Elaguizy, Olivier D., Frederic B.)
* The current GPU back-end has a new function CudaNdarray_prep_output(CudaNdarray ** arr, int nd, const int * dims). (Ian G)
=============
Release Notes
=============
Theano 0.6rc2 (November 21st, 2012)
===================================
......
...@@ -105,7 +105,7 @@ Brian Vandenberg emailed `installation instructions on Gentoo
<http://groups.google.com/d/msg/theano-dev/-8WCMn2FMR0/bJPasoZXaqoJ>`_,
focusing on how to install the appropriate dependencies.
Nicolas Pinto provides `ebuild scripts <https://github.com/npinto/sekyfsr-gentoo-overlay/tree/master/sci-libs/Theano>`_.
Alternative installation on Mandriva 2010.2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -657,9 +657,9 @@ Theano dependencies is easy, but be aware that it will take a long time
Homebrew
~~~~~~~~
There are some `instructions
<https://github.com/samueljohn/homebrew-python>`__ by Samuel John on how to install
Theano dependencies with Homebrew instead of MacPort.
.. _gpu_macos:
......
...@@ -39,7 +39,7 @@ probably do something similar on older computer.
Installation steps
~~~~~~~~~~~~~~~~~~
Ubuntu 11.10/12.04/12.10:
1) ``sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git``
2) ``sudo pip install Theano``
...@@ -70,7 +70,7 @@ Theano/BLAS speed test:
.. code-block:: bash

    python `python -c "import os, theano; print os.path.dirname(theano.__file__)"`/misc/check_blas.py

This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Theano/NumPy
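The backtick substitution in the shell command above just locates the installed package directory from its `__file__` attribute. The same idea in plain Python, using the stdlib `logging` package as a stand-in since `theano` may not be installed:

```python
import os
import logging  # stand-in for theano; any importable package works

# Directory containing the package's files; the shell command above
# appends misc/check_blas.py to the equivalent path for theano.
pkg_dir = os.path.dirname(logging.__file__)
assert os.path.isdir(pkg_dir)
```

This avoids hard-coding a site-packages path like `/usr/lib/python2.*/site-packages`, which breaks when the package is installed elsewhere.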
...@@ -163,6 +163,8 @@ Test GPU configuration
Ubuntu 12.04 LTS: default gcc version 4.6.3. gcc 4.4.7 and 4.5.3 available.
Ubuntu 12.10: default gcc version 4.7.2. gcc 4.4.7, 4.5.4 and 4.6.3 available.
......
...@@ -1229,6 +1229,7 @@ Linear Algebra
If an integer i, it is converted to an array containing
the last i dimensions of the first tensor and the first
i dimensions of the second tensor:

    axes = [range(a.ndim - i, a.ndim), range(i)]

If an array, its two elements must contain compatible axes
...@@ -1251,6 +1252,8 @@ Linear Algebra
are compatible. The resulting tensor will have shape (2, 5, 6) -- the
dimensions that are not being summed:

.. code-block:: python

    a = np.random.random((2,3,4))
    b = np.random.random((5,6,4,3))
...@@ -1284,6 +1287,8 @@ Linear Algebra
In an extreme case, no axes may be specified. The resulting tensor
will have shape equal to the concatenation of the shapes of a and b:

.. code-block:: python

    c = np.tensordot(a, b, 0)
    print(a.shape)  # (2,3,4)
    print(b.shape)  # (5,6,4,3)
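The axis bookkeeping described above can be checked without NumPy. This hypothetical helper computes only the output shape implied by tensordot's `axes` argument, under the rules the docs state (summed axes must match; the output is the non-summed dimensions of `a` followed by those of `b`):

```python
def tensordot_shape(a_shape, b_shape, axes):
    # Resolve the axes argument as described in the docs above.
    if isinstance(axes, int):
        a_axes = list(range(len(a_shape) - axes, len(a_shape)))
        b_axes = list(range(axes))
    else:
        a_axes, b_axes = list(axes[0]), list(axes[1])
    for i, j in zip(a_axes, b_axes):
        assert a_shape[i] == b_shape[j], "summed-over axes must match"
    # Output shape: non-summed dimensions of a, then those of b.
    out = [d for i, d in enumerate(a_shape) if i not in a_axes]
    out += [d for j, d in enumerate(b_shape) if j not in b_axes]
    return tuple(out)

print(tensordot_shape((2, 3, 4), (5, 6, 4, 3), [[1, 2], [3, 2]]))  # (2, 5, 6)
print(tensordot_shape((2, 3, 4), (5, 6, 4, 3), 0))  # (2, 3, 4, 5, 6, 4, 3)
```

The first call reproduces the (2, 5, 6) example above; the second reproduces the no-axes "extreme case".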
......
...@@ -7,8 +7,11 @@
.. note::

    Two similar implementations exist for conv2d:
    :func:`signal.conv2d <theano.tensor.signal.conv.conv2d>` and
    :func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`.
    The former implements a traditional
    2D convolution, while the latter implements the convolutional layers
    present in convolutional neural networks (where filters are 3D and pool
    over several input channels).
......
...@@ -74,11 +74,11 @@ cross-entropy (note that this assumes that x will contain values between 0 and
.. code-block:: python

    x, y, b = T.dvectors('x', 'y', 'b')
    W = T.dmatrix('W')
    h = T.nnet.sigmoid(T.dot(W, x) + b)
    x_recons = T.nnet.sigmoid(T.dot(V, h) + c)
    recon_cost = T.nnet.binary_crossentropy(x_recons, x).mean()
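For reference, the elementwise binary cross-entropy -(t*log(o) + (1 - t)*log(1 - o)) used above can be sketched in plain Python (a stand-in for illustration, not Theano's implementation):

```python
import math

def binary_crossentropy(output, target):
    # Elementwise -(t*log(o) + (1 - t)*log(1 - o)); outputs in (0, 1).
    return [-(t * math.log(o) + (1 - t) * math.log(1 - o))
            for o, t in zip(output, target)]

# Confident, correct predictions give a small cost.
costs = binary_crossentropy([0.9, 0.1], [1.0, 0.0])
mean_cost = sum(costs) / len(costs)  # mirrors the .mean() above
```

Both elements here cost exactly -log(0.9), since each prediction assigns probability 0.9 to the correct target.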
.. function:: categorical_crossentropy(coding_dist,true_dist)

...@@ -87,7 +87,7 @@ cross-entropy (note that this assumes that x will contain values between 0 and
    needed to identify an event from a set of possibilities, if a coding scheme is used based
    on a given probability distribution q, rather than the "true" distribution p. Mathematically, this
    function computes :math:`H(p,q) = - \sum_x p(x) \log(q(x))`, where
    p=true_dist and q=coding_dist.

    :Parameters:
...@@ -108,6 +108,6 @@ cross-entropy (note that this assumes that x will contain values between 0 and
.. code-block:: python

    y = T.nnet.softmax(T.dot(W, x) + b)
    cost = T.nnet.categorical_crossentropy(y, o)
    # o is either the above-mentioned 1-of-N vector or 2D tensor
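The formula :math:`H(p,q) = - \sum_x p(x) \log(q(x))` with p = true_dist and q = coding_dist can be sketched in plain Python for a single distribution pair (an illustration, not Theano's implementation):

```python
import math

def categorical_crossentropy(coding_dist, true_dist):
    # H(p, q) = -sum_x p(x) * log(q(x)), with p = true_dist, q = coding_dist.
    # Terms with p(x) == 0 contribute nothing, so skip them.
    return -sum(p * math.log(q)
                for q, p in zip(coding_dist, true_dist) if p > 0)

# With a 1-of-N (one-hot) true distribution, the cost reduces to the
# negative log-probability the coding distribution assigns to the
# correct class: here, -log(0.7).
cost = categorical_crossentropy([0.2, 0.7, 0.1], [0.0, 1.0, 0.0])
```

This also shows why the p/q roles matter: the log is taken of the model's coding distribution, weighted by the true one — the fix applied in this diff.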
...@@ -7,8 +7,11 @@
.. note::

    Two similar implementations exist for conv2d:
    :func:`signal.conv2d <theano.tensor.signal.conv.conv2d>` and
    :func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`.
    The former implements a traditional
    2D convolution, while the latter implements the convolutional layers
    present in convolutional neural networks (where filters are 3D and pool
    over several input channels).
......
...@@ -284,13 +284,13 @@ Tips for Improving Performance on GPU
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
This can tell you if not enough of your graph is on the GPU or if there
is too much memory transfer.
* Use nvcc options. nvcc supports these options to speed up some
  computations: `-ftz=true` to `flush denormal values to
  zeros <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
  and the `--prec-div=false` and `--prec-sqrt=false` options to speed up
  division and square root operations by being less precise. You can
  enable all of them with the `nvcc.flags=--use_fast_math` Theano
  flag, or you can enable them individually as in this example:
  `nvcc.flags=-ftz=true --prec-div=false`.

.. _gpu_async:
......
...@@ -5,7 +5,7 @@
"""
__docformat__ = "restructuredtext en"

import copy, sys, copy_reg, gc
from itertools import izip
from StringIO import StringIO
......
...@@ -3,13 +3,10 @@
__docformat__ = "restructuredtext en"

import logging
import sys
import traceback

_logger = logging.getLogger('theano.compile.function')

from io import In
from function_module import orig_function
from profiling import ProfileStats
from pfunc import pfunc

from numpy import any  # to work in python 2.4
import warnings
...@@ -164,8 +161,9 @@ def function(inputs, outputs=None, mode=None, updates=None, givens=None,
    if updates is None:
        updates = []

    if (isinstance(updates, dict) and
            not isinstance(updates, gof.python25.OrderedDict) and
            len(updates) > 1):
        warnings.warn(
            "The parameter 'updates' of theano.function()"
            " expects an OrderedDict,"
...@@ -186,8 +184,8 @@ def function(inputs, outputs=None, mode=None, updates=None, givens=None,
    # compute some features of the arguments:
    uses_In = any([isinstance(i, In) for i in inputs])  # N.B. the square brackets are necessary
    uses_tuple = any([isinstance(i, (list, tuple)) for i in inputs])  # N.B. the square brackets are necessary
    uses_updates = bool(updates)
    uses_givens = bool(givens)

    # See if we have any mutable / borrow inputs
    check_for_aliased_inputs = False
...@@ -201,7 +199,9 @@ def function(inputs, outputs=None, mode=None, updates=None, givens=None,
    if profile:
        raise NotImplementedError('profiling not supported in old-style function')
    if uses_updates or uses_givens:
        raise NotImplementedError(
            "In() instances and tuple inputs trigger the old "
            "semantics, which disallow using updates and givens")
    fn = orig_function(inputs, outputs,
                       mode=mode,
                       accept_inplace=accept_inplace, name=name)
......
...@@ -9,7 +9,7 @@ from theano import config
from theano.compile import orig_function, In, Out
from theano.compile import UnusedInputError
from theano.compile.sharedvalue import SharedVariable, shared
from theano.gof import Variable, Constant
from theano.gof.python25 import any
import logging
...@@ -233,8 +233,8 @@ def rebuild_collect_shared(outputs,
            cloned_outputs.append(Out(cloned_v, borrow=v.borrow))
        else:
            raise TypeError('Outputs must be theano Variable or '
                            'Out instances. Received ' + str(v)
                            + ' of type ' + str(type(v)))
        #computed_list.append(cloned_v)
    else:
        if isinstance(outputs, Variable):
...@@ -278,7 +278,8 @@ class Param(object):
    def __init__(self, variable, default=None, name=None, mutable=False,
                 strict=False, allow_downcast=None, implicit=None, borrow=None):
        """
        :param variable: A variable in an expression graph to use as a
            compiled-function parameter

        :param default: The default value to use at call-time (can also be a Container where
            the function will find a value at call-time.)
...@@ -290,10 +291,11 @@ class Param(object):
        :param borrow: Whether the function is allowed to alias some output to
            this input. Using None (default) means we re-use the same value as the
            `mutable` flag.
            False: do not permit any output to be aliased to the input

        :param strict: False -> function arguments may be copied or cast to match the
            type required by the parameter `variable`.
            True -> function arguments must exactly match the type
            required by `variable`.

        :param allow_downcast: Only applies if `strict` is False.
...@@ -452,6 +454,27 @@ def pfunc(params, outputs=None, mode=None, updates=None, givens=None,
                "provided for it being ignored. Please do not duplicate "
                "variables in the inputs list." % (v, i, dup_v_i)))
    # Check that we are not using `givens` to replace input variables, because
    # this typically does nothing, contrary to what one may expect.
    in_var_set = set(in_variables)
    try:
        givens_pairs = givens.items()
    except AttributeError:
        givens_pairs = givens
    for x, y in givens_pairs:
        if x in in_var_set:
            raise RuntimeError(
                'You are trying to replace variable \'%s\' through the '
                '`givens` parameter, but this variable is an input to your '
                'function. Replacing inputs is currently forbidden because it '
                'has no effect. One way to modify an input `x` to a function '
                'evaluating f(x) is to define a new input `y` and use '
                '`theano.function([y], f(x), givens={x: g(y)})`. Another '
                'solution consists in using `theano.clone`, e.g. like this: '
                '`theano.function([x], '
                'theano.clone(f(x), replace={x: g(x)}))`.'
                % x)
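The try/except on `.items()` above is simply how the code accepts either a dict or a list of pairs for `givens`. Sketched standalone (the helper name is hypothetical):

```python
def as_replacement_pairs(givens):
    # Accept a mapping or an iterable of (variable, replacement) pairs,
    # mirroring the try/except normalization in the diff above.
    try:
        return list(givens.items())  # dict-like input
    except AttributeError:
        return list(givens)          # already a sequence of pairs

print(as_replacement_pairs({'x': 'y'}))   # [('x', 'y')]
print(as_replacement_pairs([('x', 'y')])) # [('x', 'y')]
```

Duck-typing on `.items()` keeps the check working for OrderedDict and plain pair lists alike, without an isinstance ladder.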
    output_vars = rebuild_collect_shared(outputs,
                                         in_variables,
                                         replace=givens,
......
...@@ -386,6 +386,14 @@ class T_function(unittest.TestCase):
        self.assertRaises(UnusedInputError, function, [m, mt], mt*2)
        f = function([m, mt], mt*2, on_unused_input='ignore')
    def test_givens_input_var(self):
        """
        Ensure an error is raised when trying to replace an input variable.
        """
        x = T.scalar('x')
        y = x * 2
        self.assertRaises(RuntimeError, function, [x], y, givens={x: x + 1})
class T_picklefunction(unittest.TestCase):
...@@ -680,6 +688,18 @@ class SomethingToPickle(object):
        self.f2 = function([x, In(a, value=1.0, name='a'), In(s, value=self.f1.container[s], update=s+a*x, mutable=True)], s+a*x)
def test_empty_givens_updates():
    """
    Regression test for bug fixed in 8625e03.
    """
    # Empty givens / updates dictionaries were not properly detected before,
    # triggering useless crashes at compile time.
    x = T.scalar()
    y = x * 2
    function([theano.In(x)], y, givens={})
    function([theano.In(x)], y, updates={})
if __name__ == '__main__':
    if 1:
......
...@@ -420,6 +420,11 @@ else:
                 " want theano to use.")
    default_openmp = count > 1
# Disable it by default for now, as currently only ConvOp supports it,
# and it causes a slowdown by default since we do not disable it for
# too-small convolutions.
default_openmp = False
AddConfigVar('openmp',
             "Allow (or not) parallel computation on the CPU with OpenMP. "
             "This is the default value used when creating an Op that "
......
import cPickle, logging

_logger = logging.getLogger("theano.gof.callcache")
......
...@@ -892,8 +892,8 @@ class ModuleCache(object):
            key_data = None
            # We have never seen this key before.
            # We acquire the lock later only if we were able to
            # generate C code. Otherwise, we would take the lock for ops
            # that have only a perform().
            lock_taken = False

            # This try/finally block ensures that the lock is released once we
...@@ -920,11 +920,14 @@ class ModuleCache(object):
                src_code = compile_steps.next()
                module_hash = get_module_hash(src_code, key)

                # The op has c_code, so take the lock.
                compilelock.get_lock()
                lock_taken = True
                if not os.path.exists(location):
                    # Temporary fix; we should make sure it does not
                    # get deleted by the clear*() functions.
                    os.makedirs(location)
                if module_hash in self.module_hash_to_key_data:
                    _logger.debug("Duplicated module! Will re-use the "
...@@ -1469,7 +1472,7 @@ class GCC_compiler(object):
            #cxxflags.append("-D NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION")
            numpy_ver = [int(n) for n in numpy.__version__.split('.')[:2]]

            # numpy 1.7 deprecated the following macro, but the new one
            # did not exist in the past.
            if bool(numpy_ver < [1, 7]):
                cxxflags.append("-D NPY_ARRAY_ENSURECOPY=NPY_ENSURECOPY")
...@@ -1483,7 +1486,7 @@ class GCC_compiler(object):
    @staticmethod
    def compile_str(module_name, src_code, location=None,
                    include_dirs=None, lib_dirs=None, libs=None,
                    preargs=None, py_module=True):
        """
        :param module_name: string (this has been embedded in the src_code
...@@ -1503,7 +1506,11 @@ class GCC_compiler(object):
        :param preargs: a list of extra compiler arguments

        :param py_module: if False, compile to a shared library, but do not
            import it as a Python module.

        :returns: dynamically-imported python module of the compiled code,
            unless py_module is False, in which case None is returned.
        """
        #TODO: Do not do the dlimport in this function
...@@ -1628,6 +1635,7 @@ class GCC_compiler(object):
            # Print errors just below the command line.
            print compile_stderr

        if py_module:
            # touch the __init__ file
            file(os.path.join(location, "__init__.py"), 'w').close()
            return dlimport(lib_filename)
......
...@@ -42,7 +42,7 @@ compiledir_format_dict = {"platform": platform.platform(),
                          "numpy_version": numpy.__version__,
                          "gxx_version": gcc_version_str.replace(" ", "_"),
                          }
compiledir_format_keys = ", ".join(sorted(compiledir_format_dict.keys()))
default_compiledir_format = \
    "compiledir_%(platform)s-%(processor)s-%(python_version)s"
......
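Sorting the keys ties in with the "More Theano determinism" theme of this release: the advertised key list becomes independent of dict insertion order (and of hash randomization across runs). A quick illustration:

```python
# Two dicts with the same keys inserted in different orders
d1 = {"platform": 1, "processor": 2, "python_version": 3}
d2 = {"python_version": 3, "platform": 1, "processor": 2}

# sorted() yields one canonical ordering, independent of insertion order
assert ", ".join(sorted(d1.keys())) == ", ".join(sorted(d2.keys()))
assert ", ".join(sorted(d1.keys())) == "platform, processor, python_version"
```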
@@ -2,7 +2,6 @@
 # same compilation directory (which can cause crashes).
 from theano import config
-import compiledir
 import os, random, time, atexit
 import socket  # only used for gethostname()
 import logging
...
@@ -2,12 +2,6 @@
 Classes and functions for validating graphs that contain view
 and inplace operations.
 """
-import sys
-if sys.version_info[:2] >= (2,5):
-    from collections import defaultdict
-# otherwise it's implemented in python25.py
 import theano
 import toolbox
 import graph
...
...@@ -12,7 +12,6 @@ from python25 import all ...@@ -12,7 +12,6 @@ from python25 import all
from theano import config from theano import config
import warnings import warnings
NullType = None NullType = None
import theano
from python25 import OrderedDict from python25 import OrderedDict
from theano.misc.ordered_set import OrderedSet from theano.misc.ordered_set import OrderedSet
......
@@ -764,6 +764,7 @@ class OpenMPOp(Op):
         self.openmp = openmp

     def c_compile_args(self):
+        self.update_self_openmp()
         if self.openmp:
             return ['-fopenmp']
         return []
@@ -808,7 +809,10 @@ class OpenMPOp(Op):
                 return False
         return default_openmp

-    def make_thunk(self, node, storage_map, compute_map, no_recycling):
+    def update_self_openmp(self):
+        """
+        Make sure self.openmp is not True if there is no support in gxx.
+        """
         if self.openmp:
             if OpenMPOp.gxx_support_openmp is None:
                 OpenMPOp.gxx_support_openmp = OpenMPOp.test_gxx_support()
@@ -819,9 +823,13 @@ class OpenMPOp(Op):
                     " know this happen with some version of the EPD mingw"
                     " compiler. We disable openmp everywhere in Theano."
                     " To remove this warning set the theano flags `openmp`"
-                    " to False.")
+                    " to False.",
+                    stacklevel=3)
             if OpenMPOp.gxx_support_openmp is False:
                 self.openmp = False
                 theano.config.openmp = False

+    def make_thunk(self, node, storage_map, compute_map, no_recycling):
+        self.update_self_openmp()
         return super(OpenMPOp, self).make_thunk(node, storage_map,
                                                 compute_map, no_recycling)
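The refactoring above factors the "probe gxx once, cache the answer at class level, then downgrade `self.openmp`" logic into `update_self_openmp` so both `c_compile_args` and `make_thunk` can run it. A self-contained sketch of that memoized capability check, with a stand-in probe (`test_gxx_support` here always reports no support, which the real op determines by compiling a tiny OpenMP program):

```python
import warnings

class OpenMPOpSketch(object):
    # Class-level cache: None = not yet probed, True/False afterwards.
    # Stand-in for OpenMPOp.gxx_support_openmp in the diff above.
    gxx_support_openmp = None

    def __init__(self, openmp=True):
        self.openmp = openmp

    @staticmethod
    def test_gxx_support():
        # Hypothetical probe result; the real op compiles a test program.
        return False

    def update_self_openmp(self):
        """Make sure self.openmp is not True if there is no gxx support."""
        if self.openmp:
            cls = OpenMPOpSketch
            if cls.gxx_support_openmp is None:
                # Probe only once per process, then reuse the cached answer
                cls.gxx_support_openmp = cls.test_gxx_support()
                if cls.gxx_support_openmp is False:
                    warnings.warn("Your gxx does not support OpenMP; "
                                  "disabling it everywhere.", stacklevel=3)
            if cls.gxx_support_openmp is False:
                self.openmp = False

    def c_compile_args(self):
        # Run the check before emitting flags, as in the diff
        self.update_self_openmp()
        return ['-fopenmp'] if self.openmp else []

op = OpenMPOpSketch(openmp=True)
assert op.c_compile_args() == []                    # probe said "no support"
assert OpenMPOpSketch.gxx_support_openmp is False   # cached for later ops
```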
...@@ -2,11 +2,9 @@ ...@@ -2,11 +2,9 @@
__docformat__ = "restructuredtext en" __docformat__ = "restructuredtext en"
import copy
import utils import utils
from utils import MethodNotDefined, object2 from utils import MethodNotDefined, object2
import graph import graph
from theano import config
######## ########
# Type # # Type #
......
@@ -3,7 +3,7 @@
 # import variable
 from theano import config
-import re, os, traceback
+import re, traceback

 def add_tag_trace(thing):
     """Add tag.trace to a node or variable.
...
@@ -21,10 +21,8 @@ from itertools import izip

 from theano import gof
 from theano.gof import Variable
 from theano.gof.python25 import OrderedDict
-from theano.gof.python25 import all
-import theano.gof.utils
 from theano.gof.null_type import NullType
-from theano.printing import min_informative_str

 # we can't do "import theano.tensor"
 # tensor depends on theano.compile
 # theano.compile depends on theano.gradient (this file)
...
@@ -194,41 +194,28 @@ if __name__ == "__main__":
             goto2 1.13/16   3.16s

         Test time in float32
-        (cuda version 3.2RC and up have a faster gemm on the Fermi/GTX[45]??)
-        gpu/cuda version
-        M2050(Amazon)/5.0  0.25s
-        GTX680/4.2         0.154s
-        GTX580/4.2         0.164s
-        GTX480/4.2         0.192s
-        GTX470/4.2         0.238s
-        C2075/4.2          0.25s
-        GTX285/4.2         0.452s  #cuda 3.0 seam faster? driver version?
-        GT520/4.2          2.68s
-        GTX560/4.2         0.30s
-        GTX460/4.0         0.45s
-        GTX580/3.2         0.203s
-        GTX680/3.2         0.218s
-        GTX480/3.2         0.237s
-        GTX470/3.2         0.297s
-        GTX285/3.2         0.452s  #cuda 3.0 seam faster? driver version?
-        GTX480/3.0         0.27s
-        M2070/4.1          0.27s
-        GTX470/3.2         0.29s
-        M2070/3.2          0.32s
-        GTX470/3.0         0.34s
-        GTX285/3.0         0.40s
-        C1060/3.2          0.46s
-        GTX550Ti/4.0       0.57s
-        520/3.2            3.06s
-        520M/3.2           3.19s  with bumblebee on Ubuntu 12.04
-        GT220/3.2RC        3.80s
-        GT210/4.0          6.35s
-        8500GT/3.0         10.68s
+        cuda version   5.0     4.2     4.1     4.0     3.2     3.0    # note
+        gpu
+        M2070                  0.25s   0.27s           0.32s
+        M2050(Amazon)  0.25s
+        C2075                  0.25s
+        C1060                                          0.46s
+        GTX680                 0.154s                  0.218s
+        GTX580                 0.164s                  0.203s
+        GTX480                 0.192s                  0.237s  0.27s
+        GTX470                 0.238s                  0.297s  0.34s
+        GTX660                 0.24s
+        GTX560                 0.30s
+        GTX460                 0.37s           0.45s
+        GTX285                 0.452s          0.452s          0.40s  # cuda 3.0 seem faster? driver version?
+        GTX550Ti                               0.57s
+        GT520                  2.68s                   3.06s
+        520M                                           3.19s  # with bumblebee on Ubuntu 12.04
+        GT220                                          3.80s
+        GT210                          6.35s
+        8500GT                                                 10.68s
         """
     t, impl = execute(not options.print_only, not options.quiet,
...
-def renderString(string, dict):
+import warnings
+
+
+def render_string(string, sub):
+    """
+    string: a string, containing formatting instructions
+    sub: a dictionary containing keys and values to substitute for
+         them.
+
+    returns: string % sub
+
+    The only difference between this function and the % operator
+    is that it raises an exception with a more informative error
+    message than the % operator does.
+    """
     try:
-        finalCode = string % dict
+        finalCode = string % sub
     except Exception , E:
-        #print 'could not render C code due to exception with message "'+str(E)+'", trying to find out why...'
+        # If unable to render the string, render longer and longer
+        # initial substrings until we find the minimal initial substring
+        # that causes an error
         i = 0
         while i <= len(string):
             try:
-                finalCode = string[0:i] % dict
+                finalCode = string[0:i] % sub
             except Exception, F:
                 if str(F) == str(E):
                     raise Exception(string[0:i]+"<<<< caused exception "+str(F))
             i+=1
         assert False
     return finalCode
-#

+
+def renderString(string, dict):
+    warnings.warn("renderString is deprecated. It is now called render_string",
+                  stacklevel=2)
+    return render_string(string, dict)
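The minimal-prefix search that `render_string` performs can be sketched standalone (a simplified re-implementation, not the Theano function itself): on a formatting failure it re-renders ever longer prefixes of the template until it reproduces the same error, so the offending `%` directive is pinpointed in the exception message.

```python
def render_string_sketch(template, sub):
    """Like `template % sub`, but on failure point at the directive
    that caused the error by finding the shortest failing prefix."""
    try:
        return template % sub
    except Exception as e:
        for i in range(len(template) + 1):
            try:
                template[0:i] % sub
            except Exception as f:
                # The first prefix reproducing the original error ends at
                # the offending directive
                if str(f) == str(e):
                    raise Exception(template[0:i]
                                    + "<<<< caused exception " + str(f))
        raise AssertionError("could not reproduce the formatting error")

assert render_string_sketch("%(a)s + %(b)s", {"a": 1, "b": 2}) == "1 + 2"
try:
    render_string_sketch("%(a)s + %(missing)s", {"a": 1})
except Exception as e:
    assert "<<<<" in str(e)  # error message points at the bad directive
```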
 def pretty_format(string):
     lines = string.split('\n')
@@ -34,11 +53,8 @@ def pretty_format(string):
     rval = '\n'.join(lines)
     return rval
-#

 def strip_leading_white_space(line):
     while len(line) >0 and (line[0]==' ' or line[0]=='\t'):
         line = line[1:]
-#
     return line
-#
...@@ -13,5 +13,9 @@ def call_subprocess_Popen(command, **params): ...@@ -13,5 +13,9 @@ def call_subprocess_Popen(command, **params):
startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
except AttributeError: except AttributeError:
startupinfo.dwFlags |= subprocess._subprocess.STARTF_USESHOWWINDOW startupinfo.dwFlags |= subprocess._subprocess.STARTF_USESHOWWINDOW
# Under Windows 7 64-bits, Anaconda's g++ is not found unless
# specifying "shell=True".
params['shell'] = True
proc = subprocess.Popen(command, startupinfo=startupinfo, **params) proc = subprocess.Popen(command, startupinfo=startupinfo, **params)
return proc return proc
@@ -220,7 +220,7 @@ if(!work_complete){
 }}}}}}} //extra scope so error handler jumps don't cross declarations
 ///////////// < /code generated by GpuConv3D >
 """
-        return strutil.renderString(codeSource,locals())
+        return strutil.render_string(codeSource,locals())

     def c_support_code_apply(self, node, nodename):
         # This code is not sensitive to the ignore_border flag.
@@ -279,7 +279,7 @@ conv_rows_stack( float* img, float* kern, float* bias, float* out,
 """
-        return codeSource#renderString(codeSource,locals())
+        return codeSource

 gpu_convd = GpuConv3D()
...
@@ -336,7 +336,7 @@ convgrad_rows_stack( float* img, float* dCdH, float* dCdW,
           dCdW[j,z,k,l,m] += dCdH[i,j,p,q,r] * V[i,z,dr*p+k,dc*q+l,dt*r+m]
 */
 """
-        return codeSource#renderString(codeSource,locals())
+        return codeSource

 gpu_conv_grad3d = GpuConvGrad3D()
...
@@ -263,7 +263,7 @@ if(!work_complete){
 }}}}}} // for fail
 ///////////// < /code generated by GpuConvTransp3D >
 """
-        return strutil.renderString(codeSource,locals())
+        return strutil.render_string(codeSource,locals())

     def c_support_code_apply(self, node, nodename):
         # This code is not sensitive to the ignore_border flag.
...
@@ -218,7 +218,7 @@ if cuda_available:
         atexit.register(gpu_shutdown)
 except EnvironmentError, e:
     cuda_available = False
-    cuda_initialization_error_message = e.message
+    cuda_initialization_error_message = " ".join(e.args)

 class GpuOp(theano.gof.Op):
...
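Switching from `e.message` to `" ".join(e.args)` is future-proofing: `message` was deprecated in Python 2.6 and removed in Python 3, while `args` always holds everything passed to the exception constructor. A quick illustration (assuming string arguments, as in the CUDA initialization error above):

```python
e = EnvironmentError("CUDA is installed, but device is unavailable")
# e.message existed in Python 2 only (deprecated since 2.6); e.args is
# portable and holds everything passed to the constructor.
assert " ".join(e.args) == "CUDA is installed, but device is unavailable"

# Joining also covers exceptions raised with several string arguments
e2 = EnvironmentError("nvcc not found", "check your PATH")
assert " ".join(e2.args) == "nvcc not found check your PATH"
```

Note that `" ".join` assumes every element of `args` is a string; an `EnvironmentError` built with an integer errno would need `str()` applied first.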
...@@ -13,15 +13,20 @@ scal = scalar # somewhere scalar gets reassigned to be a function ...@@ -13,15 +13,20 @@ scal = scalar # somewhere scalar gets reassigned to be a function
from theano.gof.python25 import all, any from theano.gof.python25 import all, any
from theano.sandbox.cuda import GpuOp, device_properties try:
# We must be able to import this file to create the full doc when nvcc
# is not available
from theano.sandbox.cuda import filter as type_support_filter
from theano.sandbox.cuda import device_properties
import cuda_ndarray
except ImportError:
pass
from theano.sandbox.cuda import GpuOp
from theano.sandbox.cuda.type import CudaNdarrayType from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda import filter as type_support_filter
from theano.sandbox.cuda.elemwise import NaiveAlgo from theano.sandbox.cuda.elemwise import NaiveAlgo
import cuda_ndarray
_logger_name = 'theano.sandbox.cuda.basic_ops' _logger_name = 'theano.sandbox.cuda.basic_ops'
_logger = logging.getLogger(_logger_name) _logger = logging.getLogger(_logger_name)
_logger.setLevel(logging.INFO) _logger.setLevel(logging.INFO)
@@ -2267,9 +2272,17 @@ class GpuSubtensor(GpuOp, tensor.Subtensor):
                               set_dim='CudaNdarray_set_dim',
                               set_stride='CudaNdarray_set_stride',
                               update_flags="", strides_mul=4)

+        finish_view = ""
+        # For broadcasted dimensions, set the stride to 0. We can't do this
+        # only for dimensions flagged as broadcastable, as it can also
+        # happen for dimensions of size 0 that are rebroadcasted later.
+        for idx in range(node.outputs[0].ndim):
+            finish_view += """
+            if(CudaNdarray_HOST_DIMS(xview)[%(idx)s]==1)
+                CudaNdarray_set_stride(xview, %(idx)s, 0);
+            """ % locals()
-        finish_view = """
+        finish_view += """
         //Set the base only now
         if(CudaNdarray_set_device_data(xview, CudaNdarray_DEV_DATA(xview),
@@ -2287,6 +2300,13 @@ class GpuSubtensor(GpuOp, tensor.Subtensor):
         return build_view + "{" + get_xview + "}" + finish_view

+    def c_code_cache_version(self):
+        hv = self.helper_c_code_cache_version()
+        # If `helper_c_code_cache_version` is not versioned we do not want
+        # to have a versioned version of this op's C code.
+        if len(hv) == 0:
+            return ()
+        return (3, hv)

 class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
     """
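The zero-stride trick used above is the standard way broadcasting is implemented over strided storage: with a stride of 0, every index along that dimension maps to the same element, so a size-1 dimension can be read as if it had any size. A tiny pure-Python strided view (a sketch, not `CudaNdarray`) makes this concrete:

```python
class StridedView(object):
    """Minimal strided view over a flat buffer (illustration only)."""
    def __init__(self, data, dims, strides):
        self.data, self.dims, self.strides = data, dims, strides

    def __getitem__(self, idx):
        # Element address = sum of index*stride over all dimensions
        offset = sum(i * s for i, s in zip(idx, self.strides))
        return self.data[offset]

buf = [10.0, 20.0, 30.0]
# View the 3-element buffer as a (4, 3) array by giving the first
# dimension a stride of 0: every row index maps to the same data,
# i.e. the single row is broadcasted 4 times without copying.
v = StridedView(buf, dims=(4, 3), strides=(0, 1))
assert [v[(r, 1)] for r in range(4)] == [20.0] * 4
assert v[(3, 2)] == 30.0
```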
class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp): class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
""" """
...@@ -2455,7 +2475,7 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp): ...@@ -2455,7 +2475,7 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp):
:return: C code expression to make a copy of x :return: C code expression to make a copy of x
Base class uses PyArrayObject *, subclasses may override for Base class uses `PyArrayObject *`, subclasses may override for
different types of arrays. different types of arrays.
""" """
return """(CudaNdarray*) CudaNdarray_Copy(%(x)s)""" % locals() return """(CudaNdarray*) CudaNdarray_Copy(%(x)s)""" % locals()
......
...@@ -53,7 +53,13 @@ struct table_struct{ ...@@ -53,7 +53,13 @@ struct table_struct{
}; };
table_struct _alloc_size_table[TABLE_SIZE]; table_struct _alloc_size_table[TABLE_SIZE];
#endif #endif
void * device_malloc(size_t size) void * device_malloc(size_t size)
{
return device_malloc(size, VERBOSE_DEVICE_MALLOC);
}
void * device_malloc(size_t size, int verbose)
{ {
void * rval=NULL; void * rval=NULL;
cudaError_t err = cudaMalloc(&rval, size); cudaError_t err = cudaMalloc(&rval, size);
...@@ -64,11 +70,14 @@ void * device_malloc(size_t size) ...@@ -64,11 +70,14 @@ void * device_malloc(size_t size)
// it returns something else I still don't see why we should ignore // it returns something else I still don't see why we should ignore
// it. All we want to do here is reset the flag. // it. All we want to do here is reset the flag.
cudaGetLastError(); cudaGetLastError();
if (verbose)
{
#if COMPUTE_GPU_MEM_USED #if COMPUTE_GPU_MEM_USED
fprintf(stderr, "Error allocating %li bytes of device memory (%s). new total bytes allocated: %d\n", (long)size, cudaGetErrorString(err),_allocated_size); fprintf(stderr, "Error allocating %li bytes of device memory (%s). new total bytes allocated: %d\n", (long)size, cudaGetErrorString(err),_allocated_size);
#else #else
fprintf(stderr, "Error allocating %li bytes of device memory (%s).\n", (long)size, cudaGetErrorString(err)); fprintf(stderr, "Error allocating %li bytes of device memory (%s).\n", (long)size, cudaGetErrorString(err));
#endif #endif
}
PyErr_Format(PyExc_MemoryError, PyErr_Format(PyExc_MemoryError,
"Error allocating %li bytes of device memory (%s).", (long)size, cudaGetErrorString(err)); "Error allocating %li bytes of device memory (%s).", (long)size, cudaGetErrorString(err));
return NULL; return NULL;
......
...@@ -42,6 +42,9 @@ typedef float real; ...@@ -42,6 +42,9 @@ typedef float real;
#define SHARED_SIZE (16*1024) #define SHARED_SIZE (16*1024)
#endif #endif
#define VERBOSE_DEVICE_MALLOC 1
#define NO_VERBOSE_DEVICE_MALLOC 0
/** /**
* Allocation and freeing of device memory should go through these functions so that the lib can track memory usage. * Allocation and freeing of device memory should go through these functions so that the lib can track memory usage.
* *
...@@ -49,6 +52,7 @@ typedef float real; ...@@ -49,6 +52,7 @@ typedef float real;
* device_free will return nonzero on failure (after setting the python error message) * device_free will return nonzero on failure (after setting the python error message)
*/ */
DllExport void * device_malloc(size_t size); DllExport void * device_malloc(size_t size);
DllExport void * device_malloc(size_t size, int verbose);
DllExport int device_free(void * ptr); DllExport int device_free(void * ptr);
template <typename T> template <typename T>
...@@ -162,7 +166,8 @@ CudaNdarray_set_dim(CudaNdarray * self, int idx, int d) ...@@ -162,7 +166,8 @@ CudaNdarray_set_dim(CudaNdarray * self, int idx, int d)
{ {
if ((idx >= self->nd) || (idx < 0) || (d < 0)) if ((idx >= self->nd) || (idx < 0) || (d < 0))
{ {
fprintf(stderr, "WARNING: probably bad CudaNdarray_set_dim arguments: %i %i\n", idx, d); fprintf(stderr, "WARNING: probably bad CudaNdarray_set_dim arguments: self->ndim=%i, idx=%i stride=%i\n",
self->nd, idx, d);
} }
if (d != self->host_structure[idx]) if (d != self->host_structure[idx])
......
 # This is work in progress
+import theano
 from theano import Op, Apply
+import theano.tensor as T
 from theano.gof import local_optimizer
 from theano.sandbox.cuda import cuda_available, GpuOp
...
...@@ -164,7 +164,7 @@ class NVCC_compiler(object): ...@@ -164,7 +164,7 @@ class NVCC_compiler(object):
def compile_str( def compile_str(
module_name, src_code, module_name, src_code,
location=None, include_dirs=[], lib_dirs=[], libs=[], preargs=[], location=None, include_dirs=[], lib_dirs=[], libs=[], preargs=[],
rpaths=rpath_defaults): rpaths=rpath_defaults, py_module=True):
""":param module_name: string (this has been embedded in the src_code """:param module_name: string (this has been embedded in the src_code
:param src_code: a complete c or c++ source listing for the module :param src_code: a complete c or c++ source listing for the module
:param location: a pre-existing filesystem directory where the :param location: a pre-existing filesystem directory where the
...@@ -178,8 +178,11 @@ class NVCC_compiler(object): ...@@ -178,8 +178,11 @@ class NVCC_compiler(object):
:param preargs: a list of extra compiler arguments :param preargs: a list of extra compiler arguments
:param rpaths: list of rpaths to use with Xlinker. :param rpaths: list of rpaths to use with Xlinker.
Defaults to `rpath_defaults`. Defaults to `rpath_defaults`.
:param py_module: if False, compile to a shared library, but
do not import as a Python module.
:returns: dynamically-imported python module of the compiled code. :returns: dynamically-imported python module of the compiled code.
(unless py_module is False, in that case returns None.)
:note 1: On Windows 7 with nvcc 3.1 we need to compile in the :note 1: On Windows 7 with nvcc 3.1 we need to compile in the
real directory Otherwise nvcc never finish. real directory Otherwise nvcc never finish.
...@@ -393,6 +396,7 @@ class NVCC_compiler(object): ...@@ -393,6 +396,7 @@ class NVCC_compiler(object):
# this doesn't happen to my knowledge # this doesn't happen to my knowledge
print >> sys.stderr, "DEBUG: nvcc STDOUT", nvcc_stdout print >> sys.stderr, "DEBUG: nvcc STDOUT", nvcc_stdout
if py_module:
#touch the __init__ file #touch the __init__ file
file(os.path.join(location, "__init__.py"), 'w').close() file(os.path.join(location, "__init__.py"), 'w').close()
return dlimport(lib_filename) return dlimport(lib_filename)
......
...@@ -288,7 +288,9 @@ class CudaNdarrayType(Type): ...@@ -288,7 +288,9 @@ class CudaNdarrayType(Type):
//std::cerr << "c_extract " << %(name)s << '\\n'; //std::cerr << "c_extract " << %(name)s << '\\n';
if (%(name)s->nd != %(nd)s) if (%(name)s->nd != %(nd)s)
{ {
PyErr_Format(PyExc_RuntimeError, "Some CudaNdarray has rank %%i, it was supposed to have rank %(nd)s", %(name)s->nd); PyErr_Format(PyExc_RuntimeError,
"c_extract: Some CudaNdarray has rank %%i, it was supposed to have rank %(nd)s",
%(name)s->nd);
%(name)s = NULL; %(name)s = NULL;
%(fail)s; %(fail)s;
} }
...@@ -299,7 +301,9 @@ class CudaNdarrayType(Type): ...@@ -299,7 +301,9 @@ class CudaNdarrayType(Type):
print >> sio, """ print >> sio, """
if (CudaNdarray_HOST_DIMS(%(name)s)[%(i)s] != 1) if (CudaNdarray_HOST_DIMS(%(name)s)[%(i)s] != 1)
{ {
PyErr_Format(PyExc_RuntimeError, "Some CudaNdarray has dim %%i on broadcastable dimension %%i", CudaNdarray_HOST_DIMS(%(name)s)[%(i)s], %(i)s); PyErr_Format(PyExc_RuntimeError,
"c_extract: Some CudaNdarray has dim %%i on broadcastable dimension %%i",
CudaNdarray_HOST_DIMS(%(name)s)[%(i)s], %(i)s);
%(name)s = NULL; %(name)s = NULL;
%(fail)s; %(fail)s;
} }
...@@ -309,7 +313,9 @@ class CudaNdarrayType(Type): ...@@ -309,7 +313,9 @@ class CudaNdarrayType(Type):
if (CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s]) if (CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s])
{ {
//std::cerr << "c_extract bad stride detected...\\n"; //std::cerr << "c_extract bad stride detected...\\n";
PyErr_Format(PyExc_RuntimeError, "Some CudaNdarray has a nonzero stride %%i on a broadcastable dimension %%i", CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s], %(i)s); PyErr_Format(PyExc_RuntimeError,
"c_extract: Some CudaNdarray has a nonzero stride %%i on a broadcastable dimension %%i",
CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s], %(i)s);
%(name)s = NULL; %(name)s = NULL;
%(fail)s; %(fail)s;
} }
......
+import numpy
 import theano
 from theano.gof import Op, Apply
 from theano import tensor
...
...@@ -12,7 +12,7 @@ from theano.tensor.opt import (register_stabilize, ...@@ -12,7 +12,7 @@ from theano.tensor.opt import (register_stabilize,
register_specialize, register_canonicalize) register_specialize, register_canonicalize)
from theano.gof import local_optimizer from theano.gof import local_optimizer
from theano.gof.opt import Optimizer from theano.gof.opt import Optimizer
from theano.gradient import grad_not_implemented, DisconnectedType from theano.gradient import DisconnectedType
try: try:
import scipy.linalg import scipy.linalg
...@@ -433,16 +433,14 @@ class CholeskyGrad(Op): ...@@ -433,16 +433,14 @@ class CholeskyGrad(Op):
return Apply(self, [x, l, dz], [x.type()]) return Apply(self, [x, l, dz], [x.type()])
def perform(self, node, inputs, outputs): def perform(self, node, inputs, outputs):
""" """Implements the "reverse-mode" gradient [1]_ for the
Implements the "reverse-mode" gradient for the Cholesky factorization Cholesky factorization of a positive-definite matrix.
of a positive-definite matrix.
References
----------
.. [1] S. P. Smith. "Differentiation of the Cholesky Algorithm". .. [1] S. P. Smith. "Differentiation of the Cholesky Algorithm".
Journal of Computational and Graphical Statistics, Journal of Computational and Graphical Statistics,
Vol. 4, No. 2 (Jun.,1995), pp. 134-147 Vol. 4, No. 2 (Jun.,1995), pp. 134-147
http://www.jstor.org/stable/1390762 http://www.jstor.org/stable/1390762
""" """
x = inputs[0] x = inputs[0]
L = inputs[1] L = inputs[1]
......
...@@ -12,27 +12,18 @@ __authors__ = ("Razvan Pascanu " ...@@ -12,27 +12,18 @@ __authors__ = ("Razvan Pascanu "
__copyright__ = "(c) 2010, Universite de Montreal" __copyright__ = "(c) 2010, Universite de Montreal"
__contact__ = "Razvan Pascanu <r.pascanu@gmail>" __contact__ = "Razvan Pascanu <r.pascanu@gmail>"
import itertools
import logging import logging
import time
from itertools import izip from itertools import izip
import numpy import numpy
import theano import theano
from theano.compile import function, Param, Out
from theano import compile from theano import compile
from theano import gradient
from theano.gof.python25 import any from theano.gof.python25 import any
from theano.gof import PureOp, Apply from theano.gof import PureOp, Apply
from theano import gof from theano import gof
from theano.tensor import TensorType from theano.tensor import TensorType
from theano import tensor
from theano.tensor.opt import Shape_i from theano.tensor.opt import Shape_i
#from theano.sandbox import cuda
from theano.compile.profiling import ScanProfileStats
import scan_utils
# Logging function for sending warning or info # Logging function for sending warning or info
_logger = logging.getLogger('theano.scan_module.scan_op') _logger = logging.getLogger('theano.scan_module.scan_op')
......
...@@ -561,6 +561,9 @@ class ScalarVariable(_scalar_py_operators, Variable): ...@@ -561,6 +561,9 @@ class ScalarVariable(_scalar_py_operators, Variable):
class ScalarConstant(_scalar_py_operators, Constant): class ScalarConstant(_scalar_py_operators, Constant):
pass pass
# Register ScalarConstant as the type of Constant corresponding to Scalar
Scalar.Constant = ScalarConstant
# Easy constructors # Easy constructors
......
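Attaching the constant class to the type (`Scalar.Constant = ScalarConstant`, and similarly `TensorType.Constant = TensorConstant` later in this commit) lets generic code build a constant of the right class from a type object alone, without importing the concrete class. A toy sketch of that registry pattern (stand-in classes, not the Theano ones):

```python
class Scalar(object):
    # Filled in below; generic code reads this attribute
    Constant = None

class ScalarConstant(object):
    def __init__(self, value):
        self.value = value

# Register ScalarConstant as the Constant class corresponding to Scalar
Scalar.Constant = ScalarConstant

def make_constant(type_cls, value):
    # Generic code: no direct reference to ScalarConstant needed
    return type_cls.Constant(value)

c = make_constant(Scalar, 3)
assert isinstance(c, ScalarConstant) and c.value == 3
```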
...@@ -22,7 +22,7 @@ __contact__ = "theano-dev <theano-dev@googlegroups.com>" ...@@ -22,7 +22,7 @@ __contact__ = "theano-dev <theano-dev@googlegroups.com>"
__docformat__ = "restructuredtext en" __docformat__ = "restructuredtext en"
import numpy import numpy
from theano.compile import shared_constructor, SharedVariable from theano.compile import SharedVariable
from basic import Scalar, _scalar_py_operators from basic import Scalar, _scalar_py_operators
class ScalarSharedVariable(_scalar_py_operators, SharedVariable): class ScalarSharedVariable(_scalar_py_operators, SharedVariable):
......
...@@ -520,7 +520,6 @@ def get_scalar_constant_value(v): ...@@ -520,7 +520,6 @@ def get_scalar_constant_value(v):
if isinstance(v, numpy.ndarray): if isinstance(v, numpy.ndarray):
return numpy_scalar(v) return numpy_scalar(v)
if isinstance(v, Constant): if isinstance(v, Constant):
if getattr(v.tag, 'unique_value', None) is not None: if getattr(v.tag, 'unique_value', None) is not None:
data = v.tag.unique_value data = v.tag.unique_value
...@@ -529,11 +528,9 @@ def get_scalar_constant_value(v): ...@@ -529,11 +528,9 @@ def get_scalar_constant_value(v):
return numpy_scalar(data) return numpy_scalar(data)
if v.owner: if v.owner:
if isinstance(v.owner.op, Alloc): if isinstance(v.owner.op, (Alloc, DimShuffle, Rebroadcast,
return get_scalar_constant_value(v.owner.inputs[0]) compile.ops.OutputGuard,
if isinstance(v.owner.op, DimShuffle): compile.DeepCopyOp)):
return get_scalar_constant_value(v.owner.inputs[0])
if isinstance(v.owner.op, Rebroadcast):
return get_scalar_constant_value(v.owner.inputs[0]) return get_scalar_constant_value(v.owner.inputs[0])
if isinstance(v.owner.op, Elemwise) and \ if isinstance(v.owner.op, Elemwise) and \
isinstance(v.owner.op.scalar_op, scal.Second): isinstance(v.owner.op.scalar_op, scal.Second):
...@@ -604,11 +601,33 @@ def get_scalar_constant_value(v): ...@@ -604,11 +601,33 @@ def get_scalar_constant_value(v):
# This is needed when we take the grad as the Shape op # This is needed when we take the grad as the Shape op
# are not already changed into MakeVector # are not already changed into MakeVector
if (v.owner.inputs[0].owner and owner = v.owner
isinstance(v.owner.inputs[0].owner.op, leftmost_parent = owner.inputs[0]
if (leftmost_parent.owner and
isinstance(leftmost_parent.owner.op,
theano.tensor.Shape)): theano.tensor.Shape)):
if v.owner.inputs[0].owner.inputs[0].type.broadcastable[ op = owner.op
v.owner.op.idx_list[0]]: idx_list = op.idx_list
idx = idx_list[0]
grandparent = leftmost_parent.owner.inputs[0]
gp_broadcastable = grandparent.type.broadcastable
ndim = grandparent.type.ndim
assert ndim == len(gp_broadcastable)
if not (idx < len(gp_broadcastable)):
msg = "get_scalar_constant_value detected " + \
"deterministic IndexError: x.shape[%d] " + \
"when x.ndim=%d." % (ndim, idx)
if config.exception_verbosity == 'high':
msg += 'x=%s' % min_informative_str(x)
else:
msg += 'x=%s' % str(x)
raise ValueError(msg)
if gp_broadcastable[idx]:
return numpy.asarray(1) return numpy.asarray(1)
raise NotScalarConstantError(v) raise NotScalarConstantError(v)
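Merging the separate `isinstance` checks into one tuple works because all of those ops (Alloc, DimShuffle, Rebroadcast, OutputGuard, DeepCopyOp) are pass-through for this purpose: the constant value is determined by their first input. A toy version of the recursive walk (stand-in classes, not the Theano graph types):

```python
class Node(object):
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs

class PassThrough(object):
    # Stand-in for Alloc / DimShuffle / Rebroadcast / OutputGuard / DeepCopyOp
    pass

class Constant(object):
    def __init__(self, value):
        self.value = value

PASS_THROUGH_OPS = (PassThrough,)  # one tuple instead of repeated `if`s

def get_constant_value(v):
    if isinstance(v, Constant):
        return v.value
    if isinstance(v, Node) and isinstance(v.op, PASS_THROUGH_OPS):
        # These ops forward their first input, so recurse through it
        return get_constant_value(v.inputs[0])
    raise ValueError("not a scalar constant")

c = Constant(7)
wrapped = Node(PassThrough(), [Node(PassThrough(), [c])])
assert get_constant_value(wrapped) == 7  # folded through two wrappers
```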
...@@ -1986,6 +2005,13 @@ class TensorConstant(_tensor_py_operators, Constant): ...@@ -1986,6 +2005,13 @@ class TensorConstant(_tensor_py_operators, Constant):
def signature(self): def signature(self):
return TensorConstantSignature((self.type, self.data)) return TensorConstantSignature((self.type, self.data))
def equals(self, other):
# Override Contant.equals to allow to compare with numpy.ndarray
if isinstance(other, numpy.ndarray):
# Make a TensorConstant to be able to compare
other = constant(other)
return (isinstance(other, TensorConstant) and
self.signature() == other.signature())
TensorType.Constant = TensorConstant TensorType.Constant = TensorConstant
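The `equals` override is a coerce-then-compare pattern: wrap the raw value in a constant of the same class, then compare signatures. A self-contained sketch (the `TConst` class and tuple signature are stand-ins, not the Theano implementation):

```python
class TConst(object):
    """Sketch of TensorConstant.equals: coerce a raw value to a constant
    first, then compare signatures (here simply the wrapped value)."""
    def __init__(self, data):
        self.data = data

    def signature(self):
        return ("TConst", self.data)

    def equals(self, other):
        if not isinstance(other, TConst):
            # Like `constant(ndarray)` in the diff above
            other = TConst(other)
        return self.signature() == other.signature()

c = TConst(5)
assert c.equals(5)            # raw value is coerced before comparing
assert c.equals(TConst(5))
assert not c.equals(TConst(6))
```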
...@@ -3620,6 +3646,10 @@ def var(input, axis=None, keepdims=False): ...@@ -3620,6 +3646,10 @@ def var(input, axis=None, keepdims=False):
:param keepdims: If this is set to True, the axes which are reduced are :param keepdims: If this is set to True, the axes which are reduced are
left in the result as dimensions with size one. With this option, left in the result as dimensions with size one. With this option,
the result will broadcast correctly against the original tensor. the result will broadcast correctly against the original tensor.
:note: It use the two-pass algorithm for more stable results.
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm
It exist other implementation that are even more stable, but probably slower.
""" """
input_ndim = input.type.ndim input_ndim = input.type.ndim
...@@ -3655,6 +3685,10 @@ def std(input, axis=None, keepdims=False): ...@@ -3655,6 +3685,10 @@ def std(input, axis=None, keepdims=False):
With this option, With this option,
the result will broadcast correctly against the the result will broadcast correctly against the
original tensor. original tensor.
:note: It call var and var use the two-pass algorithm for more stable results.
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm
It exist other implementation that are even more stable, but probably slower.
""" """
return sqrt(var(input=input, axis=axis, keepdims=keepdims)) return sqrt(var(input=input, axis=axis, keepdims=keepdims))
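The two-pass algorithm referenced in the notes above is easy to state in plain Python: one pass to compute the mean, a second pass to average the squared deviations. This avoids the catastrophic cancellation of the naive one-pass `E[x^2] - E[x]^2` formulation (a sketch of the algorithm, not Theano's symbolic implementation):

```python
def two_pass_variance(xs):
    """Population variance via the two-pass algorithm."""
    n = len(xs)
    mean = sum(xs) / float(n)                            # pass 1: mean
    return sum((x - mean) ** 2 for x in xs) / float(n)   # pass 2: spread

assert two_pass_variance([1.0, 2.0, 3.0, 4.0]) == 1.25
```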
...@@ -6510,12 +6544,12 @@ class AdvancedSubtensor1(Op): ...@@ -6510,12 +6544,12 @@ class AdvancedSubtensor1(Op):
else: else:
o = None o = None
# If i.dtype is more precise than numpy.intc (int32 on 32-bit machines, # If i.dtype is more precise than numpy.intp (int32 on 32-bit machines,
# int64 on 64-bit machines), numpy may raise the following error: # int64 on 64-bit machines), numpy may raise the following error:
# TypeError: array cannot be safely cast to required type. # TypeError: array cannot be safely cast to required type.
# Since we will probably not have an array with more than 2**31 items # Since we will probably not have an array with more than 2**31 items
# on a 32-bit arch, I suppose it is safe to cast i into intc. # on a 32-bit arch, I suppose it is safe to cast i into intp.
i = theano._asarray(i, dtype=numpy.intc) i = theano._asarray(i, dtype=numpy.intp)
out[0] = x.take(i, axis=0, out=o) out[0] = x.take(i, axis=0, out=o)
......
@@ -492,11 +492,19 @@ def gemv_c_code(aa, xx, yy, zz, alpha, beta, destructive, fail):
{
if (PyArray_DESCR(%(xx)s)->type_num == NPY_FLOAT)
{
//fprintf(stderr, "B %%i %%i %%i %%i\\n",
//        Nz0, Nz1, Sz0, Sz1);
float alpha = ((dtype_%(alpha)s*)PyArray_DATA(%(alpha)s))[0];
//fprintf(stderr, "alpha=%%f\\n", alpha);
//fprintf(stderr, "sx sy %%i %%i\\n", Sx, Sy);
// Check for vector-vector dot (Nx0 == 1). The code may work
// for Sx1 != 1 as well, but has not been tested for this case,
// so Sx1 == 1 is required for safety.
if (Nx0 == 1 && Sx1 == 1)
{
zz_data[0] = fbeta*zz_data[0] + alpha*sdot_(&Nx1,
(float*)(PyArray_DATA(%(xx)s)), &Sx1,
(float*)yy_data, &Sy);
}
else
{
sgemv_(&TRANS, &Nx1, &Nx0,
&alpha,
(float*)(PyArray_DATA(%(xx)s)), &Sx0,
@@ -504,9 +512,22 @@ def gemv_c_code(aa, xx, yy, zz, alpha, beta, destructive, fail):
&fbeta,
(float*)zz_data, &Sz);
}
}
else if (PyArray_DESCR(%(xx)s)->type_num == NPY_DOUBLE)
{
double alpha = ((dtype_%(alpha)s*)PyArray_DATA(%(alpha)s))[0];
// Check for vector-vector dot (Nx0 == 1). The code may work
// for Sx1 != 1 as well, but has not been tested for this case,
// so Sx1 == 1 is required for safety.
if (Nx0 == 1 && Sx1 == 1)
{
zz_data[0] = dbeta*zz_data[0] + alpha*ddot_(&Nx1,
(double*)(PyArray_DATA(%(xx)s)), &Sx1,
(double*)yy_data, &Sy);
}
else
{
dgemv_(&TRANS, &Nx1, &Nx0,
&alpha,
(double*)(PyArray_DATA(%(xx)s)), &Sx0,
@@ -514,6 +535,7 @@ def gemv_c_code(aa, xx, yy, zz, alpha, beta, destructive, fail):
&dbeta,
(double*)zz_data, &Sz);
}
}
else
{
PyErr_SetString(PyExc_AssertionError,
@@ -556,7 +578,7 @@ class CGemv(BaseBLAS, Gemv):
return code
def c_code_cache_version(self):
return (10,)
@local_optimizer([gemv_inplace, gemv_no_inplace])
......
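The special case being added is the degenerate GEMV where the matrix has a single row (`Nx0 == 1`): the product collapses to a scaled dot product, which BLAS `sdot_`/`ddot_` computes faster than `sgemv_`/`dgemv_`. A NumPy sketch of the equivalence the fast path relies on (names here are illustrative, not the generated C):

```python
import numpy

def gemv(alpha, A, y, beta, z):
    # General matrix-vector product: z <- beta*z + alpha*(A . y)
    return beta * z + alpha * A.dot(y)

rng = numpy.random.RandomState(0)
A = rng.rand(1, 5)       # single-row matrix: the Nx0 == 1 case
y = rng.rand(5)
z = numpy.array([0.5])

full = gemv(2.0, A, y, 0.3, z)
# Fast path: treat the lone row as a vector and use a plain dot.
fast = 0.3 * z[0] + 2.0 * numpy.dot(A[0], y)
print(numpy.allclose(full[0], fast))
```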
import theano
import numpy
import math
from theano import gof, tensor
from theano.sandbox.linalg.ops import diag
class Fourier(gof.Op):
......
from basic import _scal_elemwise #, _transpose_inplace
from theano import scalar as scal
import elemwise
from theano import printing
from theano.printing import pprint
from theano.gof.python25 import any
def _scal_inplace(symbol):
"""Replace a symbol definition with an elementwise version of the corresponding scalar Op"""
......
@@ -545,7 +545,7 @@ class Conv3D(theano.Op):
///////////// < /code generated by Conv3D >
"""
return strutil.render_string(codeSource, locals())
global conv3D
conv3D = Conv3D()
......
@@ -271,7 +271,7 @@ class ConvGrad3D(theano.Op):
///////////// < /code generated by ConvGradW3D >
"""
return strutil.render_string(codeSource, locals())
convGrad3D = ConvGrad3D()
......
@@ -324,7 +324,7 @@ class ConvTransp3D(theano.Op):
///////////// < /code generated by ConvTransp3D >
"""
return strutil.render_string(codeSource, locals())
convTransp3D = ConvTransp3D()
......
@@ -813,7 +813,21 @@ class ShapeFeature(object):
"for a variable with %d dimensions." % (
len(s), r.ndim))
shape_vars = []
for i in range(r.ndim):
if (hasattr(r.type, 'broadcastable') and
r.type.broadcastable[i]):
shape_vars.append(self.lscalar_one)
else:
shape_vars.append(self.unpack(s[i]))
assert all([not hasattr(r.type, "broadcastable") or
not r.type.broadcastable[i] or
# The two following comparisons are a speed optimization,
# but we never timed this speed optimization!
self.lscalar_one.equals(shape_vars[i]) or
self.lscalar_one.equals(
T.extract_constant(shape_vars[i]))
for i in range(r.ndim)])
self.shape_of[r] = tuple(shape_vars)
for sv in shape_vars:
self.shape_of_reverse_index.setdefault(sv, set()).add(r)
@@ -855,6 +869,15 @@ class ShapeFeature(object):
merged_shape.append(r_shape[i])
else:
merged_shape.append(other_shape[i])
assert all([(not hasattr(r.type, "broadcastable") or
not r.type.broadcastable[i] and
not other_r.type.broadcastable[i]) or
# The two following comparisons are a speed optimization,
# but we never timed this speed optimization!
self.lscalar_one.equals(merged_shape[i]) or
self.lscalar_one.equals(
T.extract_constant(merged_shape[i]))
for i in range(r.ndim)])
self.shape_of[r] = tuple(merged_shape)
for sv in self.shape_of[r]:
self.shape_of_reverse_index.setdefault(sv, set()).add(r)
@@ -871,6 +894,13 @@ class ShapeFeature(object):
new_shape.append(self.unpack(s_i))
else:
new_shape.append(s_j)
assert all([not hasattr(r.type, "broadcastable") or
not r.type.broadcastable[i] or
# The two following comparisons are a speed optimization,
# but we never timed this speed optimization!
self.lscalar_one.equals(new_shape[i]) or
self.lscalar_one.equals(T.extract_constant(new_shape[i]))
for i in range(r.ndim)])
self.shape_of[r] = tuple(new_shape)
for sv in self.shape_of[r]:
self.shape_of_reverse_index.setdefault(sv, set()).add(r)
......
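The intent of the ShapeFeature changes above is simple: for a broadcastable dimension the shape is known to be 1, so a constant (what `self.lscalar_one` stands for) can be recorded instead of a symbolic entry, and the new assertions check that invariant. A hedged pure-Python sketch of the selection logic (the function name and plain-string stand-ins for symbolic shapes are ours):

```python
def shape_vars_for(broadcastable, symbolic_shape):
    """For each dim, record the constant 1 if it is broadcastable,
    otherwise keep the (possibly symbolic) shape entry."""
    out = []
    for bcast, s_i in zip(broadcastable, symbolic_shape):
        out.append(1 if bcast else s_i)
    return out

# A (1, n, 1)-shaped tensor: dims 0 and 2 are broadcastable.
print(shape_vars_for((True, False, True), ('s0', 's1', 's2')))
```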
@@ -28,16 +28,10 @@ Also, we should make the fgraph refuse optimization that break the canonization
import logging
_logger = logging.getLogger('theano.tensor.opt')
import operator
import itertools
import sys
import theano
from theano import gof
from elemwise import CAReduce
import basic as T
from theano.gof.python25 import any, all
from theano.gof.opt import Optimizer
from theano.gof import InconsistencyError, toolbox
......
@@ -4,11 +4,8 @@ graphs.
__docformat__ = "restructuredtext en"
import copy
import sys
import numpy
from theano.gof import Container
from theano.compile.sharedvalue import (SharedVariable, shared_constructor,
shared)
import raw_random
......
@@ -5,11 +5,7 @@ generic 2D convolution.
__docformat__ = "restructuredtext en"
import numpy
import theano
import theano.tensor as tensor
import theano.tensor.nnet as nnet
from theano import gof, Op, tensor, config
from theano.tensor.nnet import conv
import logging
......
@@ -5456,8 +5456,9 @@ class test_tensordot(unittest.TestCase):
f1 = inplace_func([avec, bvec], c)
aval = rand(5)
bval = rand(5)
out0 = numpy.tensordot(aval, bval, axes)
out1 = f1(aval, bval)
self.assertTrue(numpy.allclose(out0, out1), (out0, out1))
utt.verify_grad(self.TensorDot(axes), [aval, bval])
# Test matrix-vector
......
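The test fix above swaps an exact `==` for `numpy.allclose` because the Theano/BLAS path and `numpy.tensordot` may round differently. A minimal demonstration of why exact float comparison is fragile:

```python
import numpy

b = numpy.array([0.1, 0.2, 0.3]).sum()
c = 0.6
# Exact equality can fail on rounding alone: b is 0.6 plus one ulp here.
print(b == c)                # may be False
print(numpy.allclose(b, c))  # True within default tolerances
```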
@@ -9,7 +9,7 @@ class T_load_tensor(unittest.TestCase):
def setUp(self):
self.data = numpy.arange(5, dtype=numpy.int32)
self.filename = os.path.join(
theano.config.compiledir,
"_test.npy")
numpy.save(self.filename, self.data)
@@ -52,5 +52,5 @@ class T_load_tensor(unittest.TestCase):
def tearDown(self):
os.remove(os.path.join(
theano.config.compiledir,
"_test.npy"))
@@ -2475,6 +2475,57 @@ class test_shapeoptimizer(unittest.TestCase):
assert len(topo) == 1
assert topo[0].op == deep_copy_op
@staticmethod
def max_pool_c01b(c01b, pool_shp, pool_stride, img_shp):
"""Like max_pool but with input using axes ('c', 0, 1, 'b')
(Alex Krizhevsky format)
pool_shp, pool_stride and img_shp are ints that represent
the same shape in x and y.
"""
mx = None
# Compute index in pooled space of last needed pool
# (needed = each input pixel must appear in at least one pool)
def last_pool(im_shp, p_shp, p_strd):
rval = int(numpy.ceil(float(im_shp - p_shp) / p_strd))
assert p_strd * rval + p_shp >= im_shp
assert p_strd * (rval - 1) + p_shp < im_shp
return rval
# Compute starting row of the last pool
last_pool_r = last_pool(img_shp, pool_shp, pool_stride) * pool_stride
# Compute number of rows needed in img for all indexes to work out
required_r = last_pool_r + pool_shp
last_pool_c = last_pool(img_shp, pool_shp, pool_stride) * pool_stride
required_c = last_pool_c + pool_shp
wide_infinity = T.alloc(-numpy.inf, c01b.shape[0],
required_r, required_c, c01b.shape[3])
c01b = T.set_subtensor(wide_infinity[:, 0:img_shp, 0:img_shp, :], c01b)
for row_within_pool in xrange(pool_shp):
row_stop = last_pool_r + row_within_pool + 1
for col_within_pool in xrange(pool_shp):
col_stop = last_pool_c + col_within_pool + 1
cur = c01b[:, row_within_pool:row_stop:pool_stride,
col_within_pool:col_stop:pool_stride, :]
if mx is None:
mx = cur
else:
mx = T.maximum(mx, cur)
return mx
def test_broadcasted_dims(self):
# This tests a case that caused a crash during optimization.
shp = (1, 1, 1, 1)
rng = numpy.random.RandomState(utt.fetch_seed())
a = shared(rng.rand(*shp).astype(config.floatX))
out = self.max_pool_c01b(a, 1, 1, 1)
f = theano.function([], out)
f()
def test_local_track_shape_i(self):
class IdentityNoShape(gof.Op):
'''Op that does not infer the output shape from the input one'''
......
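The `max_pool_c01b` helper in the test above pads the image with `-inf` so every pixel lands in at least one pool, then takes elementwise maxima over strided slices. The same idea in one dimension, as a self-contained NumPy sketch (the function name is ours):

```python
import numpy

def max_pool_1d(v, pool, stride):
    # Pad with -inf so every input element lands in at least one pool.
    n = len(v)
    last = int(numpy.ceil(float(n - pool) / stride)) * stride
    padded = numpy.full(last + pool, -numpy.inf)
    padded[:n] = v
    # Elementwise maximum over one strided slice per offset in the pool.
    mx = None
    for off in range(pool):
        cur = padded[off:last + off + 1:stride]
        mx = cur if mx is None else numpy.maximum(mx, cur)
    return mx

print(max_pool_1d(numpy.array([1.0, 3.0, 2.0, 5.0, 4.0]), 2, 2))
```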
import theano
import numpy
from elemwise import Elemwise
......
@@ -55,10 +55,12 @@ nosetests.
import cPickle
import datetime
import os
import subprocess
import sys
import time
import theano
from theano.misc.windows import call_subprocess_Popen
@@ -261,8 +263,8 @@ def run(stdout, stderr, argv, theano_nose, batch_size, time_profile,
n_tests + 1)):
# Print the test we will start in the raw log to help
# debug tests that are too long.
f_rawlog.write("\n%s Will run test #%d %s\n" % (
time.ctime(), test_id, data["ids"][test_id]))
f_rawlog.flush()
proc = call_subprocess_Popen(
......
@@ -64,7 +64,8 @@ class OrderedUpdates(OrderedDict):
# Warn about non-determinism.
warnings.warn('Updating an `OrderedUpdates` with a '
'non-ordered dictionary with 2+ elements could '
'make your code non-deterministic',
stacklevel=2)
for key, val in OrderedDict(other).iteritems():
if key in self:
if self[key] == val:
......
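The `stacklevel=2` addition makes the warning point at the code that called `update()` with a plain dict, rather than at the `warnings.warn` line itself. A small standalone sketch (the `update` function here is illustrative, not Theano's `OrderedUpdates.update`):

```python
import warnings

def update(other):
    if len(other) > 1 and not hasattr(other, 'move_to_end'):
        # stacklevel=2 attributes the warning to update()'s caller,
        # which is where the non-ordered dict was actually passed in.
        warnings.warn('non-ordered dictionary with 2+ elements could '
                      'make your code non-deterministic', stacklevel=2)

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter('always')
    update({'a': 1, 'b': 2})
print(len(w), w[0].category.__name__)
```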