Commit 44a3e92c authored by abalkin

Merge remote-tracking branch 'upstream/master' into no-relative-imports

Conflicts: theano/compile/function.py theano/compile/pfunc.py theano/gof/fg.py theano/gof/type.py
# Prevent git from showing duplicate names with commands like "git shortlog"
# See the manpage of git-shortlog for details.
# The syntax is:
# Name that should be used <email that should be used> Bad name <bad email>
#
# You can skip Bad name if it is the same as the one that should be used, and is unique.
#
# This file is up-to-date if the command git log --format="%aN <%aE>" | sort -u
# gives no duplicates.
<abergeron@gmail.com> <anakha@kami.(none)>
David Warde-Farley <wardefar@iro.umontreal.ca> David Warde-Farley <dwf@cs.toronto.edu>
David Warde-Farley <wardefar@iro.umontreal.ca> David Warde Farley <dwf@cs.toronto.edu>
......
......@@ -4,6 +4,131 @@
Release Notes
=============
Theano in the development version since 0.6rc2
==============================================
up to merged PR gh-1220
Highlights:
* Speed-ups.
* Crash fixes.
* A few small interface changes.
* GPU memory leak fix.
* A few corner case fixes with no user-visible impact.
* More Theano determinism.
* tensor.{dot,tensordot} more complete/faster/more GPU friendly.
* tensor.tensordot now supports Rop/Lop.
* tensor.dot supports n-dimensional inputs, as in NumPy.
* To support more NumPy syntax:
* Add theano.tensor.take()
* Add a_tensor_variable.{sort,dot,std,argmin,argmax,argsort,clip,conj,conjugate,repeat,round,trace,real,imag,take}
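These additions mirror the NumPy API they are named after. As an illustration with NumPy itself (not Theano), `take` selects entries along an axis:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
# take([0, 2], axis=1) picks columns 0 and 2, like a[:, [0, 2]]
assert np.array_equal(a.take([0, 2], axis=1), a[:, [0, 2]])
assert a.take([0, 2], axis=1).shape == (3, 2)
```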
Committers for this rc2 only:
Bug fix:
* Fix memory leak on the GPU in some corner cases with the Theano flag `allow_gc=False`. (Frederic B., reported by Jonas Gehring)
* Fix copy of random state between graphs. (Guillaume D.)
http://deeplearning.net/software/theano/tutorial/examples.html#copying-random-state-between-theano-graphs
* Fix wrong dtype in sandbox.linalg.ExtractDiag with shape of 0. (Frederic B., reported by abalkin)
* Correctly support arrays with more than 2*10e32 elements in AdvancedSubtensor1. (Abalkin)
* Fix wrong broadcast dimensions of the output of the Repeat op. (Abalkin)
We were using the input's broadcasting pattern in some cases when we shouldn't have.
* Fix theano.sandbox.linalg.eigh grad that didn't always return the right dtype. (Frederic B., Olivier D.)
New Features:
* More Theano determinism (Ian G., Olivier D., Pascal L.)
* Add and use a new class OrderedSet.
* Modify theano.grad to be deterministic.
* Warn when using a dict as the updates argument to theano.compile.function, since this makes the returned function non-deterministic.
* The Updates class was not appropriate for representing updates because it is non-deterministic; replaced it with the OrderedUpdates class.
* Implemented GpuContiguous.grad. (Ian G.)
* tensor.tensordot now supports Rop/Lop (Jeremiah Lowin)
This removes the TensorDot and TensorDotGrad classes; the Dot/Elemwise ops are used instead.
* tensor.dot supports n-dimensional inputs, as in NumPy (Jeremiah Lowin)
Works on the GPU too.
* The Theano flag `nvcc.flags` now accepts `-ftz=true`, `--prec-div=false` and `--prec-sqrt=false` as values. (Frederic B.)
To enable all of them, use the Theano flag `nvcc.flags=--use_fast_math`.
* New op theano.sparse.ConstructSparseFromList (Rami Al-Rfou' Vivek Kulkarni)
* Make Theano work with Anaconda on Windows. (Pascal L.)
* Add tensor_var.diagonal and theano.tensor.{diag,diagonal}. (abalkin)
* AdvancedSubtensor1 can now have a sparse gradient. (Rami Al-Rfou', Vivek Kulkarni)
Interface Deprecation (a warning is printed):
* theano.misc.strutil.renderString -> render_string (Ian G.)
* A warning is printed when using a dictionary in some places, as this makes Theano non-deterministic.
Interface Change:
* Raise an error when theano.shared is called with a theano variable. (Frederic B.)
* Don't print warnings for bugs predating Theano 0.5 by default. (Frederic B.)
* Theano functions now always have a name field, defaulting to None. (Frederic B.)
* A Theano function's fct.fgraph has a copy of the Theano function's name field. (Ian G.)
This is needed for the fgraph to know it.
* In the grad method, when asked to raise an error if there is no path between the variables, we didn't always raise one. (Ian G.)
We returned the mathematically correct answer 0 instead.
* get_constant_value() was renamed to get_scalar_constant_value() and raises a new exception, tensor.basic.NotScalarConstantError. (Ian G.)
* theano.function raises an error when trying to replace inputs with the givens parameter. (Olivier D.)
This was doing nothing; the error message tells what the user probably wants to do.
New Interface (reuse existing functionality):
* tensor_var.sort() as a shortcut for theano.tensor.sort. (Jeremiah Lowin)
We were already doing this for argsort.
* Add theano.tensor.take() and a_tensor_var.take() to support NumPy syntax. (abalkin)
* Add a_tensor_variable.{dot,std,argmin,argmax,argsort,clip,conj,conjugate,repeat,round,trace,real,imag}. (abalkin)
New debug feature:
* DebugMode prints more info when there is an error. (Frederic B.)
* Better profiling of test time with `theano-nose --time-profile`. (Frederic B.)
* Detection of infinite loops in the global optimizer. (Pascal L.)
* DebugMode.check_preallocated_output now also works on Theano function outputs. (Pascal L.)
Speed-ups:
* c_code for SpecifyShape op. (Frederic B.)
* The cross-entropy optimization now works when specify_shape is used. (Pascal L.)
* The Scan optimizations ScanSaveMem and PushOutDot1 are applied more frequently. (Razvan P., reported by Abalkin)
Previously, a skipped-optimization warning was printed.
* dot(vector, vector) is now faster with some BLAS implementations. (Eric Hunsberger)
OpenBLAS and others didn't call {s,d}dot internally when we called {s,d}gemv.
MKL did.
* Compilation speed-up: take the compiledir lock only for ops that generate c_code. (Frederic B.)
* More scan optimization (Razvan P.)
* Optimizations to make RNNs fast in Theano.
* Optimize some cases of dot by moving them outside of Scan.
* Move some sequences outside of scan too.
* Merge more scan inputs, mostly byproduct of other Scan optimizations.
* c_code for theano.sparse.AddSD. (Rami Al-Rfou', Vivek Kulkarni)
Crash Fixes:
* Fix crash about dimshuffle. (abalkin)
* Fix crash at compilation. (Olivier D.)
* Fix openmp detection. (Pascal L.)
Resulted in a crash with EPD on Windows.
* Fix for new BLAS interface in SciPy. (Olivier D.)
Fix crash with some development version of SciPy.
* GpuSum works with bigger shapes when summing on the first dim of a 3d tensor. (Frederic B., reported by Chris Currivan)
* Windows compilation crash fix. (Frederic B.)
* Make CrossentropySoftmax1HotWithBiasDx and CrossentropySoftmaxArgmax1HotWithBias support uint* dtype. (Frederic B., reported by Mark Fenner)
* Fix GpuSoftmax and GpuSoftmaxWithBias crash on GTX285. (Frederic B.)
* Fix crash due to a race condition when importing theano. (Ian G.)
* Fix crash from path problem with `theano-nose --batch`. (Abalkin)
* Fix crash with tensor.roll(Var, iscalar). (Frederic B., reported by Jeremiah Lowin)
* Fix compilation crash with llvm on Mac. (Abalkin)
* Fix the grad of Scan that wrongly reported that there is no connection between the cost and the parameters. (Razvan P.)
* The infer_shape mechanism now forces broadcasted dimensions to have a shape known to be equal to one during compilation.
Sometimes we could not know this before run time, and this resulted in a crash. (Frederic B.)
* Fix compilation problems on GPU on Windows. (Frederic B.)
Theoretical bugfix (a bug that won't happen with current Theano code, but that could have affected you if you messed with the internals):
* GpuContiguous now checks the preallocated output's strides before using it. (Pascal L.)
Others:
* Fix race condition when determining if g++ is available. (Abalkin)
* Documentation improvements. (Many people including David W-F, abalkin, Amir Elaguizy, Olivier D., Frederic B.)
* The current GPU back-end has a new function CudaNdarray_prep_output(CudaNdarray ** arr, int nd, const int * dims) (Ian G)
=============
Release Notes
=============
Theano 0.6rc2 (November 21st, 2012)
===================================
......
......@@ -657,8 +657,8 @@ Theano dependencies is easy, but be aware that it will take a long time
Homebrew
~~~~~~~~
There are some :ref:`instructions
<https://github.com/samueljohn/homebrew-python>` by Samuel John on how to install
There are some `instructions
<https://github.com/samueljohn/homebrew-python>`__ by Samuel John on how to install
Theano dependencies with Homebrew instead of MacPort.
......
......@@ -39,7 +39,7 @@ probably do something similar on older computer.
Installation steps
~~~~~~~~~~~~~~~~~~
Ubuntu 11.10/12.04:
Ubuntu 11.10/12.04/12.10:
1) ``sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git``
2) ``sudo pip install Theano``
......@@ -70,7 +70,7 @@ Theano/BLAS speed test:
.. code-block:: bash
python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
python `python -c "import os, theano; print os.path.dirname(theano.__file__)"`/misc/check_blas.py
This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Theano/NumPy
......@@ -163,6 +163,8 @@ Test GPU configuration
Ubuntu 12.04 LTS: default gcc version 4.6.3. gcc 4.4.7 and 4.5.3 available.
Ubuntu 12.10: default gcc version 4.7.2. gcc 4.4.7, 4.5.4 and 4.6.3 available.
......
......@@ -1229,6 +1229,7 @@ Linear Algebra
If an integer i, it is converted to an array containing
the last i dimensions of the first tensor and the first
i dimensions of the second tensor:
axes = [range(a.ndim - i, a.ndim), range(i)]
If an array, its two elements must contain compatible axes
......@@ -1251,6 +1252,8 @@ Linear Algebra
are compatible. The resulting tensor will have shape (2, 5, 6) -- the
dimensions that are not being summed:
.. code-block:: python
a = np.random.random((2,3,4))
b = np.random.random((5,6,4,3))
......@@ -1284,6 +1287,8 @@ Linear Algebra
In an extreme case, no axes may be specified. The resulting tensor
will have shape equal to the concatenation of the shapes of a and b:
.. code-block:: python
c = np.tensordot(a, b, 0)
print(a.shape) #(2,3,4)
print(b.shape) #(5,6,4,3)
......
......@@ -7,8 +7,11 @@
.. note::
Two similar implementations exist for conv2d:
:func:`signal.conv2d <theano.tensor.signal.conv.conv2d>` and
:func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`. The former implements a traditional
:func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`.
The former implements a traditional
2D convolution, while the latter implements the convolutional layers
present in convolutional neural networks (where filters are 3D and pool
over several input channels).
......
......@@ -74,11 +74,11 @@ cross-entropy (note that this assumes that x will contain values between 0 and
.. code-block:: python
x,y,b = T.dvectors('x','y','b')
x, y, b = T.dvectors('x', 'y', 'b')
W = T.dmatrix('W')
h = T.nnet.sigmoid(T.dot(W,x) + b)
x_recons = T.nnet.sigmoid(T.dot(V,h) + c)
recon_cost = T.nnet.binary_crossentropy(x_recons,x).mean()
h = T.nnet.sigmoid(T.dot(W, x) + b)
x_recons = T.nnet.sigmoid(T.dot(V, h) + c)
recon_cost = T.nnet.binary_crossentropy(x_recons, x).mean()
.. function:: categorical_crossentropy(coding_dist,true_dist)
......@@ -87,7 +87,7 @@ cross-entropy (note that this assumes that x will contain values between 0 and
needed to identify an event from a set of possibilities, if a coding scheme is used based
on a given probability distribution q, rather than the "true" distribution p. Mathematically, this
function computes :math:`H(p,q) = - \sum_x p(x) \log(q(x))`, where
p=coding_dist and q=true_dist
p=true_dist and q=coding_dist.
:Parameters:
......@@ -108,6 +108,6 @@ cross-entropy (note that this assumes that x will contain values between 0 and
.. code-block:: python
y = T.nnet.softmax(T.dot(W,x) + b)
cost = T.nnet.categorical_crossentropy(y,o)
y = T.nnet.softmax(T.dot(W, x) + b)
cost = T.nnet.categorical_crossentropy(y, o)
# o is either the above-mentioned 1-of-N vector or 2D tensor
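The corrected convention (p = true_dist, q = coding_dist) can be sketched with plain NumPy; this is an illustrative re-implementation, not Theano's code:

```python
import numpy as np

def categorical_crossentropy(coding_dist, true_dist):
    # H(p, q) = -sum_x p(x) * log(q(x)), with p = true_dist, q = coding_dist
    return -(true_dist * np.log(coding_dist)).sum(axis=-1)

p = np.array([0.0, 1.0, 0.0])   # "true" one-hot distribution
q = np.array([0.1, 0.8, 0.1])   # predicted coding distribution
# Only the coordinate where p(x) = 1 contributes: -log(q) at that index
assert np.isclose(categorical_crossentropy(q, p), -np.log(0.8))
```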
......@@ -7,8 +7,11 @@
.. note::
Two similar implementations exist for conv2d:
:func:`signal.conv2d <theano.tensor.signal.conv.conv2d>` and
:func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>. The former implements a traditional
:func:`nnet.conv2d <theano.tensor.nnet.conv.conv2d>`.
The former implements a traditional
2D convolution, while the latter implements the convolutional layers
present in convolutional neural networks (where filters are 3D and pool
over several input channels).
......
......@@ -161,8 +161,9 @@ def function(inputs, outputs=None, mode=None, updates=None, givens=None,
if updates is None:
updates = []
if isinstance(updates, dict) and \
not isinstance(updates, gof.python25.OrderedDict):
if (isinstance(updates, dict) and
not isinstance(updates, gof.python25.OrderedDict) and
len(updates) > 1):
warnings.warn(
"The parameter 'updates' of theano.function()"
" expects an OrderedDict,"
......@@ -183,8 +184,8 @@ def function(inputs, outputs=None, mode=None, updates=None, givens=None,
# compute some features of the arguments:
uses_In = any([isinstance(i, In) for i in inputs])  # N.B. the square brackets are necessary
uses_tuple = any([isinstance(i, (list, tuple)) for i in inputs])  # N.B. the square brackets are necessary
uses_updates = (updates != [])
uses_givens = (givens != [])
uses_updates = bool(updates)
uses_givens = bool(givens)
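The switch from `updates != []` to `bool(updates)` matters because an empty dict is not equal to an empty list, so empty-dict arguments were wrongly treated as real updates:

```python
updates = {}   # an empty updates/givens argument passed as a dict
# Old test: compares against a list literal, so an empty dict passes it
assert (updates != []) is True
# New test: truthiness treats every empty container the same way
assert bool(updates) is False
assert bool([]) is False
```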
# See if we have any mutable / borrow inputs
check_for_aliased_inputs = False
......@@ -198,7 +199,9 @@ def function(inputs, outputs=None, mode=None, updates=None, givens=None,
if profile:
raise NotImplementedError('profiling not supported in old-style function')
if uses_updates or uses_givens:
raise NotImplementedError("In() instances and tuple inputs triggers the old semantics, which disallow using updates and givens")
raise NotImplementedError(
"In() instances and tuple inputs trigger the old "
"semantics, which disallow using updates and givens")
fn = orig_function(inputs, outputs,
mode=mode,
accept_inplace=accept_inplace, name=name)
......
......@@ -232,8 +232,8 @@ def rebuild_collect_shared(outputs,
cloned_outputs.append(Out(cloned_v, borrow=v.borrow))
else:
raise TypeError('Outputs must be theano Variable or '
'Out instances. Received ' + str(v)\
+ ' of type '+str(type(v)))
'Out instances. Received ' + str(v)
+ ' of type ' + str(type(v)))
#computed_list.append(cloned_v)
else:
if isinstance(outputs, Variable):
......@@ -277,7 +277,8 @@ class Param(object):
def __init__(self, variable, default=None, name=None, mutable=False,
strict=False, allow_downcast=None, implicit=None, borrow=None):
"""
:param variable: A variable in an expression graph to use as a compiled-function parameter
:param variable: A variable in an expression graph to use as a
compiled-function parameter
:param default: The default value to use at call-time (can also be a Container where
the function will find a value at call-time.)
......@@ -289,10 +290,12 @@ class Param(object):
:param borrow: Whether the function is allowed to alias some output to
this input. Using None (default) means we re-use the same value as the
`mutable` flag.
False: do not permit any output to be aliased to the input
False: do not permit any output to be aliased to the input
:param strict: False -> function arguments may be copied or cast to match the
type required by the parameter `variable`. True -> function arguments must exactly match the type
type required by the parameter `variable`.
True -> function arguments must exactly match the type
required by `variable`.
:param allow_downcast: Only applies if `strict` is False.
......@@ -451,6 +454,27 @@ def pfunc(params, outputs=None, mode=None, updates=None, givens=None,
"provided for it being ignored. Please do not duplicate "
"variables in the inputs list." % (v, i, dup_v_i)))
# Check that we are not using `givens` to replace input variables, because
# this typically does nothing, contrary to what one may expect.
in_var_set = set(in_variables)
try:
givens_pairs = givens.items()
except AttributeError:
givens_pairs = givens
for x, y in givens_pairs:
if x in in_var_set:
raise RuntimeError(
'You are trying to replace variable \'%s\' through the '
'`givens` parameter, but this variable is an input to your '
'function. Replacing inputs is currently forbidden because it '
'has no effect. One way to modify an input `x` to a function '
'evaluating f(x) is to define a new input `y` and use '
'`theano.function([y], f(x), givens={x: g(y)})`. Another '
'solution consists in using `theano.clone`, e.g. like this: '
'`theano.function([x], '
'theano.clone(f(x), replace={x: g(x)}))`.'
% x)
output_vars = rebuild_collect_shared(outputs,
in_variables,
replace=givens,
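The try/except around `givens.items()` above is duck typing: `givens` may be either a mapping or an iterable of (variable, replacement) pairs. A minimal standalone sketch of that normalization:

```python
def as_pairs(givens):
    """Normalize `givens` to a list of (variable, replacement) pairs."""
    try:
        return list(givens.items())   # mapping case
    except AttributeError:
        return list(givens)           # already an iterable of pairs

assert as_pairs({"x": 1}) == [("x", 1)]
assert as_pairs([("x", 1), ("y", 2)]) == [("x", 1), ("y", 2)]
```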
......
......@@ -386,6 +386,14 @@ class T_function(unittest.TestCase):
self.assertRaises(UnusedInputError, function, [m, mt], mt*2)
f = function([m, mt], mt*2, on_unused_input='ignore')
def test_givens_input_var(self):
"""
Ensure error is raised when trying to replace an input variable.
"""
x = T.scalar('x')
y = x * 2
self.assertRaises(RuntimeError, function, [x], y, givens={x: x + 1})
class T_picklefunction(unittest.TestCase):
......@@ -680,6 +688,18 @@ class SomethingToPickle(object):
self.f2 = function([x, In(a, value=1.0,name='a'), In(s, value=self.f1.container[s], update=s+a*x, mutable=True)], s+a*x)
def test_empty_givens_updates():
"""
Regression test for bug fixed in 8625e03.
"""
# Empty givens / updates dictionaries were not properly detected before,
# triggering useless crashes at compile time.
x = T.scalar()
y = x * 2
function([theano.In(x)], y, givens={})
function([theano.In(x)], y, updates={})
if __name__ == '__main__':
if 1:
......
......@@ -420,6 +420,11 @@ else:
" want theano to use.")
default_openmp = count > 1
# Disable it by default for now, as currently only the ConvOp supports
# it, and this causes a slowdown by default since we do not disable it
# for too-small convolutions.
default_openmp = False
AddConfigVar('openmp',
"Allow (or not) parallel computation on the CPU with OpenMP. "
"This is the default value used when creating an Op that "
......
......@@ -1472,7 +1472,7 @@ class GCC_compiler(object):
#cxxflags.append("-D NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION")
numpy_ver = [int(n) for n in numpy.__version__.split('.')[:2]]
# numpy 1.7 deprecated the following macro but the didn't
# numpy 1.7 deprecated the following macro but the new one didn't
# exist in the past
if bool(numpy_ver < [1, 7]):
cxxflags.append("-D NPY_ARRAY_ENSURECOPY=NPY_ENSURECOPY")
......
......@@ -2,12 +2,6 @@
Classes and functions for validating graphs that contain view
and inplace operations.
"""
import sys
if sys.version_info[:2] >= (2,5):
from collections import defaultdict
# otherwise it's implemented in python25.py
import theano
import toolbox
import graph
......
......@@ -763,6 +763,7 @@ class OpenMPOp(Op):
self.openmp = openmp
def c_compile_args(self):
self.update_self_openmp()
if self.openmp:
return ['-fopenmp']
return []
......@@ -807,7 +808,10 @@ class OpenMPOp(Op):
return False
return default_openmp
def make_thunk(self, node, storage_map, compute_map, no_recycling):
def update_self_openmp(self):
"""
Make sure self.openmp is not True when gxx does not support OpenMP.
"""
if self.openmp:
if OpenMPOp.gxx_support_openmp is None:
OpenMPOp.gxx_support_openmp = OpenMPOp.test_gxx_support()
......@@ -818,9 +822,13 @@ class OpenMPOp(Op):
" know this happen with some version of the EPD mingw"
" compiler. We disable openmp everywhere in Theano."
" To remove this warning set the theano flags `openmp`"
" to False.")
" to False.",
stacklevel=3)
if OpenMPOp.gxx_support_openmp is False:
self.openmp = False
theano.config.openmp = False
def make_thunk(self, node, storage_map, compute_map, no_recycling):
self.update_self_openmp()
return super(OpenMPOp, self).make_thunk(node, storage_map,
compute_map, no_recycling)
......@@ -22,7 +22,7 @@ from theano import gof
from theano.gof import Variable
from theano.gof.python25 import OrderedDict
from theano.gof.null_type import NullType
from theano.printing import min_informative_str
# we can't do "import theano.tensor"
# tensor depends on theano.compile
# theano.compile depends on theano.gradient (this file)
......
......@@ -194,41 +194,28 @@ if __name__ == "__main__":
goto2 1.13/16 3.16s
Test time in float32
(cuda version 3.2RC and up have a faster gemm on the Fermi/GTX[45]??)
gpu/cuda version
M2050(Amazon)/5.0 0.25s
GTX680/4.2 0.154s
GTX580/4.2 0.164s
GTX480/4.2 0.192s
GTX470/4.2 0.238s
C2075/4.2 0.25s
GTX285/4.2 0.452s #cuda 3.0 seems faster? driver version?
GT520/4.2 2.68s
GTX560/4.2 0.30s
GTX460/4.0 0.45s
GTX580/3.2 0.203s
GTX680/3.2 0.218s
GTX480/3.2 0.237s
GTX470/3.2 0.297s
GTX285/3.2 0.452s #cuda 3.0 seems faster? driver version?
GTX480/3.0 0.27s
M2070/4.1 0.27s
GTX470/3.2 0.29s
M2070/3.2 0.32s
GTX470/3.0 0.34s
GTX285/3.0 0.40s
C1060/3.2 0.46s
GTX550Ti/4.0 0.57s
520/3.2 3.06s
520M/3.2 3.19s with bumblebee on Ubuntu 12.04
GT220/3.2RC 3.80s
GT210/4.0 6.35s
8500GT/3.0 10.68s
cuda version 5.0 4.2 4.1 4.0 3.2 3.0 # note
gpu
M2070 0.25s 0.27s 0.32s
M2050(Amazon) 0.25s
C2075 0.25s
C1060 0.46s
GTX680 0.154s 0.218s
GTX580 0.164s 0.203s
GTX480 0.192s 0.237s 0.27s
GTX470 0.238s 0.297s 0.34s
GTX660 0.24s
GTX560 0.30s
GTX460 0.37s 0.45s
GTX285 0.452s 0.452s 0.40s # cuda 3.0 seems faster? driver version?
GTX550Ti 0.57s
GT520 2.68s 3.06s
520M 3.19s # with bumblebee on Ubuntu 12.04
GT220 3.80s
GT210 6.35s
8500GT 10.68s
"""
t, impl = execute(not options.print_only, not options.quiet,
......
def renderString(string, dict):
import warnings
def render_string(string, sub):
"""
string: a string, containing formatting instructions
sub: a dictionary containing keys and values to substitute for
them.
returns: string % sub
The only difference between this function and the % operator
is that it raises an exception with a more informative error
message than the % operator does.
"""
try:
finalCode = string % dict
finalCode = string % sub
except Exception, E:
#print 'could not render C code due to exception with message "'+str(E)+'", trying to find out why...'
# If unable to render the string, render longer and longer
# initial substrings until we find the minimal initial substring
# that causes an error
i = 0
while i <= len(string):
try:
finalCode = string[0:i] % dict
finalCode = string[0:i] % sub
except Exception, F:
if str(F) == str(E):
raise Exception(string[0:i]+"<<<< caused exception "+str(F))
i += 1
assert False
return finalCode
#
def renderString(string, dict):
warnings.warn("renderString is deprecated. It is now called render_string",
stacklevel=2)
return render_string(string, dict)
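The error-localization trick in render_string (re-render growing prefixes until the same exception reappears) ports directly to Python 3; this is an illustrative sketch, not the shipped Python 2 code:

```python
def render_string(template, sub):
    """Like `template % sub`, but points at the directive that failed."""
    try:
        return template % sub
    except Exception as exc:
        # Render longer and longer prefixes; the first prefix that raises
        # the *same* error ends right after the offending directive.
        for i in range(len(template) + 1):
            try:
                template[:i] % sub
            except Exception as inner:
                if str(inner) == str(exc):
                    raise Exception(
                        template[:i] + "<<<< caused exception " + str(inner))
        raise

assert render_string("x=%(x)d", {"x": 3}) == "x=3"
```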
def pretty_format(string):
lines = string.split('\n')
......@@ -34,11 +53,8 @@ def pretty_format(string):
rval = '\n'.join(lines)
return rval
#
def strip_leading_white_space(line):
while len(line) > 0 and (line[0] == ' ' or line[0] == '\t'):
line = line[1:]
#
return line
#
......@@ -13,5 +13,9 @@ def call_subprocess_Popen(command, **params):
startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
except AttributeError:
startupinfo.dwFlags |= subprocess._subprocess.STARTF_USESHOWWINDOW
# Under Windows 7 64-bits, Anaconda's g++ is not found unless
# specifying "shell=True".
params['shell'] = True
proc = subprocess.Popen(command, startupinfo=startupinfo, **params)
return proc
......@@ -220,7 +220,7 @@ if(!work_complete){
}}}}}}} //extra scope so error handler jumps don't cross declarations
///////////// < /code generated by GpuConv3D >
"""
return strutil.renderString(codeSource,locals())
return strutil.render_string(codeSource,locals())
def c_support_code_apply(self, node, nodename):
# This code is not sensitive to the ignore_border flag.
......@@ -279,7 +279,7 @@ conv_rows_stack( float* img, float* kern, float* bias, float* out,
"""
return codeSource#renderString(codeSource,locals())
return codeSource
gpu_convd = GpuConv3D()
......
......@@ -336,7 +336,7 @@ convgrad_rows_stack( float* img, float* dCdH, float* dCdW,
dCdW[j,z,k,l,m] += dCdH[i,j,p,q,r] * V[i,z,dr*p+k,dc*q+l,dt*r+m]
*/
"""
return codeSource#renderString(codeSource,locals())
return codeSource
gpu_conv_grad3d = GpuConvGrad3D()
......
......@@ -263,7 +263,7 @@ if(!work_complete){
}}}}}} // for fail
///////////// < /code generated by GpuConvTransp3D >
"""
return strutil.renderString(codeSource,locals())
return strutil.render_string(codeSource,locals())
def c_support_code_apply(self, node, nodename):
# This code is not sensitive to the ignore_border flag.
......
......@@ -218,7 +218,7 @@ if cuda_available:
atexit.register(gpu_shutdown)
except EnvironmentError, e:
cuda_available = False
cuda_initialization_error_message = e.message
cuda_initialization_error_message = " ".join(e.args)
class GpuOp(theano.gof.Op):
......
......@@ -13,15 +13,20 @@ scal = scalar # somewhere scalar gets reassigned to be a function
from theano.gof.python25 import all, any
from theano.sandbox.cuda import GpuOp, device_properties
try:
# We must be able to import this file to create the full doc when nvcc
# is not available
from theano.sandbox.cuda import filter as type_support_filter
from theano.sandbox.cuda import device_properties
import cuda_ndarray
except ImportError:
pass
from theano.sandbox.cuda import GpuOp
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda import filter as type_support_filter
from theano.sandbox.cuda.elemwise import NaiveAlgo
import cuda_ndarray
_logger_name = 'theano.sandbox.cuda.basic_ops'
_logger = logging.getLogger(_logger_name)
_logger.setLevel(logging.INFO)
......@@ -2267,9 +2272,17 @@ class GpuSubtensor(GpuOp, tensor.Subtensor):
set_dim='CudaNdarray_set_dim',
set_stride='CudaNdarray_set_stride',
update_flags="", strides_mul=4)
finish_view = ""
# For broadcasted dimensions, set the strides to 0.
# We can't do that only for broadcasted dimensions, as this can also
# happen for dimensions of size 0 that are rebroadcasted later.
for idx in range(node.outputs[0].ndim):
finish_view += """
if(CudaNdarray_HOST_DIMS(xview)[%(idx)s]==1)
CudaNdarray_set_stride(xview, %(idx)s, 0);
""" % locals()
finish_view = """
finish_view += """
//Set the base only now
if(CudaNdarray_set_device_data(xview, CudaNdarray_DEV_DATA(xview),
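Setting the stride of a size-1 dimension to 0 is the standard broadcasting trick: every index along that dimension aliases the same data. NumPy does the same thing on the host, which makes it easy to observe:

```python
import numpy as np

x = np.arange(4.0)                  # shape (4,), contiguous strides
b = np.broadcast_to(x, (3, 4))      # the broadcast dim gets stride 0:
assert b.strides[0] == 0            # every "row" aliases the same bytes
assert b.strides[1] == x.strides[0]
```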
......@@ -2287,6 +2300,13 @@ class GpuSubtensor(GpuOp, tensor.Subtensor):
return build_view + "{" + get_xview + "}" + finish_view
def c_code_cache_version(self):
hv = self.helper_c_code_cache_version()
# If `helper_c_code_cache_version` is not versioned we do not want to
# have a versioned version of this op's C code.
if len(hv) == 0:
return ()
return (3, hv)
class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
"""
......@@ -2455,7 +2475,7 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp):
:return: C code expression to make a copy of x
Base class uses PyArrayObject *, subclasses may override for
Base class uses `PyArrayObject *`, subclasses may override for
different types of arrays.
"""
return """(CudaNdarray*) CudaNdarray_Copy(%(x)s)""" % locals()
......
......@@ -166,7 +166,8 @@ CudaNdarray_set_dim(CudaNdarray * self, int idx, int d)
{
if ((idx >= self->nd) || (idx < 0) || (d < 0))
{
fprintf(stderr, "WARNING: probably bad CudaNdarray_set_dim arguments: %i %i\n", idx, d);
fprintf(stderr, "WARNING: probably bad CudaNdarray_set_dim arguments: self->ndim=%i, idx=%i stride=%i\n",
self->nd, idx, d);
}
if (d != self->host_structure[idx])
......
# This is work in progress
import theano
from theano import Op, Apply
import theano.tensor as T
from theano.gof import local_optimizer
from theano.sandbox.cuda import cuda_available, GpuOp
......
......@@ -288,7 +288,9 @@ class CudaNdarrayType(Type):
//std::cerr << "c_extract " << %(name)s << '\\n';
if (%(name)s->nd != %(nd)s)
{
PyErr_Format(PyExc_RuntimeError, "Some CudaNdarray has rank %%i, it was supposed to have rank %(nd)s", %(name)s->nd);
PyErr_Format(PyExc_RuntimeError,
"c_extract: Some CudaNdarray has rank %%i, it was supposed to have rank %(nd)s",
%(name)s->nd);
%(name)s = NULL;
%(fail)s;
}
......@@ -299,7 +301,9 @@ class CudaNdarrayType(Type):
print >> sio, """
if (CudaNdarray_HOST_DIMS(%(name)s)[%(i)s] != 1)
{
PyErr_Format(PyExc_RuntimeError, "Some CudaNdarray has dim %%i on broadcastable dimension %%i", CudaNdarray_HOST_DIMS(%(name)s)[%(i)s], %(i)s);
PyErr_Format(PyExc_RuntimeError,
"c_extract: Some CudaNdarray has dim %%i on broadcastable dimension %%i",
CudaNdarray_HOST_DIMS(%(name)s)[%(i)s], %(i)s);
%(name)s = NULL;
%(fail)s;
}
......@@ -309,7 +313,9 @@ class CudaNdarrayType(Type):
if (CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s])
{
//std::cerr << "c_extract bad stride detected...\\n";
PyErr_Format(PyExc_RuntimeError, "Some CudaNdarray has a nonzero stride %%i on a broadcastable dimension %%i", CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s], %(i)s);
PyErr_Format(PyExc_RuntimeError,
"c_extract: Some CudaNdarray has a nonzero stride %%i on a broadcastable dimension %%i",
CudaNdarray_HOST_STRIDES(%(name)s)[%(i)s], %(i)s);
%(name)s = NULL;
%(fail)s;
}
......
import numpy
import theano
from theano.gof import Op, Apply
from theano import tensor
......
......@@ -12,7 +12,7 @@ from theano.tensor.opt import (register_stabilize,
register_specialize, register_canonicalize)
from theano.gof import local_optimizer
from theano.gof.opt import Optimizer
from theano.gradient import grad_not_implemented, DisconnectedType
from theano.gradient import DisconnectedType
try:
import scipy.linalg
......@@ -433,16 +433,14 @@ class CholeskyGrad(Op):
return Apply(self, [x, l, dz], [x.type()])
def perform(self, node, inputs, outputs):
"""
Implements the "reverse-mode" gradient for the Cholesky factorization
of a positive-definite matrix.
"""Implements the "reverse-mode" gradient [1]_ for the
Cholesky factorization of a positive-definite matrix.
References
----------
.. [1] S. P. Smith. "Differentiation of the Cholesky Algorithm".
Journal of Computational and Graphical Statistics,
Vol. 4, No. 2 (Jun.,1995), pp. 134-147
http://www.jstor.org/stable/1390762
"""
x = inputs[0]
L = inputs[1]
......
......@@ -12,27 +12,18 @@ __authors__ = ("Razvan Pascanu "
__copyright__ = "(c) 2010, Universite de Montreal"
__contact__ = "Razvan Pascanu <r.pascanu@gmail>"
import itertools
import logging
import time
from itertools import izip
import numpy
import theano
from theano.compile import function, Param, Out
from theano import compile
from theano import gradient
from theano.gof.python25 import any
from theano.gof import PureOp, Apply
from theano import gof
from theano.tensor import TensorType
from theano import tensor
from theano.tensor.opt import Shape_i
#from theano.sandbox import cuda
from theano.compile.profiling import ScanProfileStats
import scan_utils
# Logging function for sending warning or info
_logger = logging.getLogger('theano.scan_module.scan_op')
......
......@@ -561,6 +561,9 @@ class ScalarVariable(_scalar_py_operators, Variable):
class ScalarConstant(_scalar_py_operators, Constant):
pass
# Register ScalarConstant as the type of Constant corresponding to Scalar
Scalar.Constant = ScalarConstant
# Easy constructors
......
......@@ -22,7 +22,7 @@ __contact__ = "theano-dev <theano-dev@googlegroups.com>"
__docformat__ = "restructuredtext en"
import numpy
from theano.compile import shared_constructor, SharedVariable
from theano.compile import SharedVariable
from basic import Scalar, _scalar_py_operators
class ScalarSharedVariable(_scalar_py_operators, SharedVariable):
......
......@@ -519,7 +519,6 @@ def get_scalar_constant_value(v):
if isinstance(v, numpy.ndarray):
return numpy_scalar(v)
if isinstance(v, Constant):
if getattr(v.tag, 'unique_value', None) is not None:
data = v.tag.unique_value
......@@ -528,11 +527,9 @@ def get_scalar_constant_value(v):
return numpy_scalar(data)
if v.owner:
if isinstance(v.owner.op, Alloc):
return get_scalar_constant_value(v.owner.inputs[0])
if isinstance(v.owner.op, DimShuffle):
return get_scalar_constant_value(v.owner.inputs[0])
if isinstance(v.owner.op, Rebroadcast):
if isinstance(v.owner.op, (Alloc, DimShuffle, Rebroadcast,
compile.ops.OutputGuard,
compile.DeepCopyOp)):
return get_scalar_constant_value(v.owner.inputs[0])
if isinstance(v.owner.op, Elemwise) and \
isinstance(v.owner.op.scalar_op, scal.Second):
......@@ -2007,6 +2004,13 @@ class TensorConstant(_tensor_py_operators, Constant):
def signature(self):
return TensorConstantSignature((self.type, self.data))
def equals(self, other):
# Override Constant.equals to allow comparison with a numpy.ndarray.
if isinstance(other, numpy.ndarray):
# Make a TensorConstant to be able to compare
other = constant(other)
return (isinstance(other, TensorConstant) and
self.signature() == other.signature())
TensorType.Constant = TensorConstant
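A minimal pure-Python sketch of the `equals` override added in the hunk above: a raw value is first wrapped as a constant so the signature comparison applies. Here `FakeConstant` and the tuple signature are illustrative stand-ins, with plain lists playing the role of `numpy.ndarray`:

```python
# Sketch: compare a constant against a raw value by wrapping the raw
# value as a constant first, then comparing signatures.

class FakeConstant:
    def __init__(self, data):
        self.data = list(data)

    def signature(self):
        # Stand-in for TensorConstantSignature: type tag plus values.
        return ('list', tuple(self.data))

    def equals(self, other):
        if isinstance(other, list):
            # Make a FakeConstant to be able to compare,
            # mirroring `other = constant(other)` in the diff.
            other = FakeConstant(other)
        return (isinstance(other, FakeConstant) and
                self.signature() == other.signature())
```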
......@@ -3641,6 +3645,10 @@ def var(input, axis=None, keepdims=False):
:param keepdims: If this is set to True, the axes which are reduced are
left in the result as dimensions with size one. With this option,
the result will broadcast correctly against the original tensor.
:note: It uses the two-pass algorithm for more stable results.
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm
Other implementations exist that are even more stable, but probably slower.
"""
input_ndim = input.type.ndim
......@@ -3676,6 +3684,10 @@ def std(input, axis=None, keepdims=False):
With this option,
the result will broadcast correctly against the
original tensor.
:note: It calls var, and var uses the two-pass algorithm for more stable results.
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass_algorithm
Other implementations exist that are even more stable, but probably slower.
"""
return sqrt(var(input=input, axis=axis, keepdims=keepdims))
......
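The notes added to `var` and `std` above refer to the two-pass variance algorithm. A minimal pure-Python sketch of the idea: first compute the mean, then sum squared deviations from it, which avoids the catastrophic cancellation the naive E[x^2] - E[x]^2 formula suffers on data with a large mean:

```python
# Two-pass variance: numerically stable for data with a large mean.
def two_pass_var(xs):
    n = len(xs)
    mean = sum(xs) / n                             # pass 1: mean
    return sum((x - mean) ** 2 for x in xs) / n    # pass 2: deviations

# Naive one-pass formula: subject to catastrophic cancellation when
# the mean is much larger than the spread of the data.
def naive_var(xs):
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2
```

On well-conditioned data both agree; on data like `[1e9 + 1, 1e9 + 2, 1e9 + 3]` the two-pass version stays accurate while the naive one loses most of its significant digits.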
......@@ -492,11 +492,19 @@ def gemv_c_code(aa, xx, yy, zz, alpha, beta, destructive, fail):
{
if (PyArray_DESCR(%(xx)s)->type_num == NPY_FLOAT)
{
//fprintf(stderr, "B %%i %%i %%i %%i\\n",
// Nz0, Nz1, Sz0, Sz1);
float alpha = ((dtype_%(alpha)s*)PyArray_DATA(%(alpha)s))[0];
//fprintf(stderr, "alpha=%%f\\n", alpha);
//fprintf(stderr, "sx sy %%i %%i\\n", Sx, Sy);
// Check for vector-vector dot (Nx0 == 1). The code may work
// for Sx1 != 1 as well, but has not been tested for this case,
// so Sx1 == 1 is required for safety.
if (Nx0 == 1 && Sx1 == 1)
{
zz_data[0] = fbeta*zz_data[0] + alpha*sdot_(&Nx1,
(float*)(PyArray_DATA(%(xx)s)), &Sx1,
(float*)yy_data, &Sy);
}
else
{
sgemv_(&TRANS, &Nx1, &Nx0,
&alpha,
(float*)(PyArray_DATA(%(xx)s)), &Sx0,
......@@ -504,9 +512,22 @@ def gemv_c_code(aa, xx, yy, zz, alpha, beta, destructive, fail):
&fbeta,
(float*)zz_data, &Sz);
}
}
else if (PyArray_DESCR(%(xx)s)->type_num == NPY_DOUBLE)
{
double alpha = ((dtype_%(alpha)s*)PyArray_DATA(%(alpha)s))[0];
// Check for vector-vector dot (Nx0 == 1). The code may work
// for Sx1 != 1 as well, but has not been tested for this case,
// so Sx1 == 1 is required for safety.
if (Nx0 == 1 && Sx1 == 1)
{
zz_data[0] = dbeta*zz_data[0] + alpha*ddot_(&Nx1,
(double*)(PyArray_DATA(%(xx)s)), &Sx1,
(double*)yy_data, &Sy);
}
else
{
dgemv_(&TRANS, &Nx1, &Nx0,
&alpha,
(double*)(PyArray_DATA(%(xx)s)), &Sx0,
......@@ -514,6 +535,7 @@ def gemv_c_code(aa, xx, yy, zz, alpha, beta, destructive, fail):
&dbeta,
(double*)zz_data, &Sz);
}
}
else
{
PyErr_SetString(PyExc_AssertionError,
......@@ -556,7 +578,7 @@ class CGemv(BaseBLAS, Gemv):
return code
def c_code_cache_version(self):
return (9,)
return (10,)
@local_optimizer([gemv_inplace, gemv_no_inplace])
......
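The C changes above add a special case to gemv: when the matrix has a single row (`Nx0 == 1`), `z = beta*z + alpha*A.dot(y)` degenerates to one dot product, so the BLAS `sdot_`/`ddot_` routine can replace `sgemv_`/`dgemv_`. A pure-Python sketch of the equivalence (illustrative helpers, not the real BLAS calls):

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gemv(alpha, A, y, beta, z):
    # z <- beta*z + alpha*A.dot(y), with A given as a list of rows.
    return [beta * z_i + alpha * dot(row, y) for z_i, row in zip(z, A)]

def gemv_1row(alpha, A, y, beta, z):
    # Special case for a single-row A: one dot call suffices.
    return [beta * z[0] + alpha * dot(A[0], y)]
```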
import theano
import numpy
import math
from theano import gof, tensor, function, scalar
from theano.sandbox.linalg.ops import diag
from theano import gof, tensor
class Fourier(gof.Op):
......
from basic import _scal_elemwise #, _transpose_inplace
from theano import scalar as scal
import elemwise
from theano import printing
from theano.printing import pprint
from theano.gof.python25 import any
def _scal_inplace(symbol):
"""Replace a symbol definition with an elementwise version of the corresponding scalar Op"""
......
......@@ -545,7 +545,7 @@ class Conv3D(theano.Op):
///////////// < /code generated by Conv3D >
"""
return strutil.renderString(codeSource,locals())
return strutil.render_string(codeSource,locals())
global conv3D
conv3D = Conv3D()
......
......@@ -271,7 +271,7 @@ class ConvGrad3D(theano.Op):
///////////// < /code generated by ConvGradW3D >
"""
return strutil.renderString(codeSource, locals())
return strutil.render_string(codeSource, locals())
convGrad3D = ConvGrad3D()
......
......@@ -324,7 +324,7 @@ class ConvTransp3D(theano.Op):
///////////// < /code generated by ConvTransp3D >
"""
return strutil.renderString(codeSource, locals())
return strutil.render_string(codeSource, locals())
convTransp3D = ConvTransp3D()
......
......@@ -813,7 +813,21 @@ class ShapeFeature(object):
"for a variable with %d dimensions." % (
len(s), r.ndim))
shape_vars = [self.unpack(s_i) for s_i in s]
shape_vars = []
for i in range(r.ndim):
if (hasattr(r.type, 'broadcastable') and
r.type.broadcastable[i]):
shape_vars.append(self.lscalar_one)
else:
shape_vars.append(self.unpack(s[i]))
assert all([not hasattr(r.type, "broadcastable") or
not r.type.broadcastable[i] or
# The two following comparisons are a speed optimization,
# but we never timed this speed optimization!
self.lscalar_one.equals(shape_vars[i]) or
self.lscalar_one.equals(
T.extract_constant(shape_vars[i]))
for i in range(r.ndim)])
self.shape_of[r] = tuple(shape_vars)
for sv in shape_vars:
self.shape_of_reverse_index.setdefault(sv, set()).add(r)
......@@ -855,6 +869,15 @@ class ShapeFeature(object):
merged_shape.append(r_shape[i])
else:
merged_shape.append(other_shape[i])
assert all([(not hasattr(r.type, "broadcastable") or
not r.type.broadcastable[i] and
not other_r.type.broadcastable[i]) or
# The two following comparisons are a speed optimization,
# but we never timed this speed optimization!
self.lscalar_one.equals(merged_shape[i]) or
self.lscalar_one.equals(
T.extract_constant(merged_shape[i]))
for i in range(r.ndim)])
self.shape_of[r] = tuple(merged_shape)
for sv in self.shape_of[r]:
self.shape_of_reverse_index.setdefault(sv, set()).add(r)
......@@ -871,6 +894,13 @@ class ShapeFeature(object):
new_shape.append(self.unpack(s_i))
else:
new_shape.append(s_j)
assert all([not hasattr(r.type, "broadcastable") or
not r.type.broadcastable[i] or
# The two following comparisons are a speed optimization,
# but we never timed this speed optimization!
self.lscalar_one.equals(new_shape[i]) or
self.lscalar_one.equals(T.extract_constant(new_shape[i]))
for i in range(r.ndim)])
self.shape_of[r] = tuple(new_shape)
for sv in self.shape_of[r]:
self.shape_of_reverse_index.setdefault(sv, set()).add(r)
......
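The assertions added to `ShapeFeature` above all enforce the same invariant: a dimension marked broadcastable must be recorded with the constant shape 1. A minimal standalone sketch of that check (names and the plain-int comparison are illustrative; the real code compares against `lscalar_one`):

```python
# Sketch of the ShapeFeature invariant: every broadcastable dimension
# must have shape entry 1 in the recorded shape tuple.
def check_shape_entries(broadcastable, shape_vars):
    for bcast, sv in zip(broadcastable, shape_vars):
        if bcast and sv != 1:
            raise AssertionError(
                "broadcastable dimension recorded with shape %r" % (sv,))
    return True
```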
......@@ -28,16 +28,10 @@ Also, we should make the fgraph refuse optimization that break the canonization
import logging
_logger = logging.getLogger('theano.tensor.opt')
import operator
import itertools
import sys
import theano
from theano import gof
from elemwise import CAReduce
import basic as T
from theano.gof.python25 import any, all
from theano.gof.opt import Optimizer
from theano.gof import InconsistencyError, toolbox
......
......@@ -4,11 +4,8 @@ graphs.
__docformat__ = "restructuredtext en"
import copy
import sys
import numpy
from theano.gof import Container
from theano.compile.sharedvalue import (SharedVariable, shared_constructor,
shared)
import raw_random
......
......@@ -5,11 +5,7 @@ generic 2D convolution.
__docformat__ = "restructuredtext en"
import numpy
import theano
import theano.tensor as tensor
import theano.tensor.nnet as nnet
from theano import gof, Op, tensor, config
from theano.tensor.nnet import conv
import logging
......
......@@ -5456,8 +5456,9 @@ class test_tensordot(unittest.TestCase):
f1 = inplace_func([avec, bvec], c)
aval = rand(5)
bval = rand(5)
self.assertTrue(numpy.tensordot(aval, bval, axes) == \
f1(aval, bval))
out0 = numpy.tensordot(aval, bval, axes)
out1 = f1(aval, bval)
self.assertTrue(numpy.allclose(out0, out1), (out0, out1))
utt.verify_grad(self.TensorDot(axes), [aval, bval])
# Test matrix-vector
......
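The test fix above replaces an exact `==` comparison of a tensordot result with `numpy.allclose`. A sketch of why, using `math.isclose` on plain floats: two mathematically equal reductions can differ in their last bits because of accumulated rounding:

```python
import math

a = sum([0.1] * 10)   # accumulates rounding error
b = 1.0

exact_equal = (a == b)                           # False
close_equal = math.isclose(a, b, rel_tol=1e-9)   # True
```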
......@@ -2475,6 +2475,57 @@ class test_shapeoptimizer(unittest.TestCase):
assert len(topo) == 1
assert topo[0].op == deep_copy_op
@staticmethod
def max_pool_c01b(c01b, pool_shp, pool_stride, img_shp):
"""Like max_pool but with input using axes ('c', 0, 1, 'b')
(Alex Krizhevsky format)
pool_shp, pool_stride and img_shp are int that represent
the same shp in x and y.
"""
mx = None
# Compute index in pooled space of last needed pool
# (needed = each input pixel must appear in at least one pool)
def last_pool(im_shp, p_shp, p_strd):
rval = int(numpy.ceil(float(im_shp - p_shp) / p_strd))
assert p_strd * rval + p_shp >= im_shp
assert p_strd * (rval - 1) + p_shp < im_shp
return rval
# Compute starting row of the last pool
last_pool_r = last_pool(img_shp, pool_shp, pool_stride) * pool_stride
# Compute number of rows needed in img for all indexes to work out
required_r = last_pool_r + pool_shp
last_pool_c = last_pool(img_shp, pool_shp, pool_stride) * pool_stride
required_c = last_pool_c + pool_shp
wide_infinity = T.alloc(-numpy.inf, c01b.shape[0],
required_r, required_c, c01b.shape[3])
c01b = T.set_subtensor(wide_infinity[:, 0:img_shp, 0:img_shp, :], c01b)
for row_within_pool in xrange(pool_shp):
row_stop = last_pool_r + row_within_pool + 1
for col_within_pool in xrange(pool_shp):
col_stop = last_pool_c + col_within_pool + 1
cur = c01b[:, row_within_pool:row_stop:pool_stride,
col_within_pool:col_stop:pool_stride, :]
if mx is None:
mx = cur
else:
mx = T.maximum(mx, cur)
return mx
def test_broadcasted_dims(self):
# This tests a case that caused a crash during optimization.
shp = (1, 1, 1, 1)
rng = numpy.random.RandomState(utt.fetch_seed())
a = shared(rng.rand(*shp).astype(config.floatX))
out = self.max_pool_c01b(a, 1, 1, 1)
f = theano.function([], out)
f()
def test_local_track_shape_i(self):
class IdentityNoShape(gof.Op):
'''Op that does not infer the output shape from the input one'''
......
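The `last_pool` helper in the test hunk above computes the index of the last pool needed so that every input pixel is covered by at least one pool window. A standalone sketch (using `math.ceil` in place of `numpy.ceil`), with the same sanity assertions:

```python
import math

def last_pool(im_shp, p_shp, p_strd):
    # Index of the last pool window: ceil((im_shp - p_shp) / p_strd).
    rval = int(math.ceil(float(im_shp - p_shp) / p_strd))
    # The last pool reaches (or passes) the end of the image...
    assert p_strd * rval + p_shp >= im_shp
    # ...and the one before it does not, so rval is minimal.
    assert p_strd * (rval - 1) + p_shp < im_shp
    return rval
```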
import theano
import numpy
from elemwise import Elemwise
......
......@@ -55,10 +55,12 @@ nosetests.
import cPickle
import datetime
import os
import subprocess
import sys
import time
import theano
from theano.misc.windows import call_subprocess_Popen
......@@ -261,8 +263,8 @@ def run(stdout, stderr, argv, theano_nose, batch_size, time_profile,
n_tests + 1)):
# Print the test we will start in the raw log to help
# debug tests that are too long.
f_rawlog.write("\nWill run test #%d %s\n" % (test_id,
data["ids"][test_id]))
f_rawlog.write("\n%s Will run test #%d %s\n" % (
time.ctime(), test_id, data["ids"][test_id]))
f_rawlog.flush()
proc = call_subprocess_Popen(
......
......@@ -64,7 +64,8 @@ class OrderedUpdates(OrderedDict):
# Warn about non-determinism.
warnings.warn('Updating an `OrderedUpdates` with a '
'non-ordered dictionary with 2+ elements could '
'make your code non-deterministic')
'make your code non-deterministic',
stacklevel=2)
for key, val in OrderedDict(other).iteritems():
if key in self:
if self[key] == val:
......
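The hunk above adds `stacklevel=2` to the `OrderedUpdates` warning. A sketch of what that changes: the warning is attributed to the caller of the warning function rather than to the function itself, pointing users at their own code (the `update` helper and message here are illustrative):

```python
import warnings

def update(other):
    # With stacklevel=2, the reported location is the caller's line,
    # not this warnings.warn call.
    warnings.warn('non-ordered dictionary could make your code '
                  'non-deterministic', stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    update({})   # the warning is attributed to this line
```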