Merge remote-tracking branch 'central/master' into rc1

Conflicts: NEWS.txt

6dbe4953 · Frederic · 5f9efce1 · 63a59ac1 · 6dbe4953 · 6dbe4953
--- a/NEWS.txt
+++ b/NEWS.txt
-Theano 0.5rc1
-
-TODO for final 0.5 release:
- test python 2.4
- test theano-cache with "pip install Theano": issue 101
+TODO for final release:
 - Re-write this NEWS.txt file!

-if time check issue: 98.
-
-Modifications in the trunk since the 0.4.1 release (12 August 2011) up to 2 Dec 2011
+Modifications in the trunk since the 0.4.1 release (12 August 2011) up to 5 Dec 2011


-Every body is recommented to update Theano to 0.5 when released after
-they checked there code don't return deprecation warning. Otherwise,
-in one case the result can change. In other case, the warning are
-transformed to error. See bellow.
+Everybody is encouraged to update to Theano 0.5 when it is released, after
+checking that their code does not raise deprecation warnings. Otherwise,
+in one case the result can change. In other cases, the warnings are
+transformed into errors. See below.


-Important change:
+Important changes:
 * Moved to github: https://github.com/Theano/Theano/
 * Old trac ticket moved to assembla ticket: https://www.assembla.com/spaces/theano/tickets
 * Theano vision: https://deeplearning.net/software/theano/introduction.html#theano-vision (Many people)
- * 
+ * Theano with GPU works in some cases on Windows now. Still experimental. (Sebastian Urban)
+ * See the Interface changes.


 Interface Behavior Change (was deprecated and generated a warning since Theano 0.3 released the 23 Nov 2010):
 * The current default value of the parameter axis of
  theano.{max,min,argmax,argmin,max_and_argmax} is now the same as
-  numpy: None. i.e. operate on all dimensions of the tensor.
+  numpy: None. i.e. operate on all dimensions of the tensor. (Frédéric Bastien, Olivier Delalleau)


 Interface Feature Removed (was deprecated):
 * The string mode FAST_RUN_NOGC and STABILIZE are not accepted. It was accepted only by theano.function(). Use Mode(linker='c|py_nogc') or Mode(optimizer='stabilize') instead.
- * tensor.grad(cost, wrt) now return an object of the "same type" as wrt 
-   (list/tuple/TensorVariable).
- * a few tag.shape and Join.vec_length left.
+ * tensor.grad(cost, wrt) now returns an object of the "same type" as wrt
+   (list/tuple/TensorVariable). (Ian Goodfellow, Olivier)
+ * The few remaining uses of tag.shape and Join.vec_length have been removed. (Frederic)

- * scan interface change: RP
+ * scan interface change: (Razvan Pascanu)
   * The use of `return_steps` for specifying how many entries of the output
-     scan has been deprecated
-
-     * The same thing can be done by applying a subtensor on the output
-       return by scan to select a certain slice
+     to return has been removed. Instead, apply a subtensor to the output
+     returned by scan to select a certain slice.
   * The inner function (that scan receives) should return its outputs and
     updates following this order:

        [outputs], [updates], [condition]. One can skip any of the three if not
        used, but the order has to stay unchanged.
- * shared.value is moved, use shared.set_value() or shared.get_value() instead.
+ * shared.value is removed, use shared.set_value() or shared.get_value() instead. (Olivier D.)
+

+Interface bug fixes:
+ * Rop in some cases should have returned a list of one Theano variable, but returned the variable itself. (Razvan)
+ * The Theano flag "home" is no longer used, as it was a duplicate. If you use it, Theano should raise an error. (Olivier D.)

-New Deprecation (will be removed in Theano 0.6, warning generated if you use them):
- * tensor.shared() renamed to tensor._shared (Olivier D.)
-   * You probably want to call theano.shared()!

+New deprecation (will be removed in Theano 0.6, warning generated if you use them):
+ * tensor.shared() renamed to tensor._shared. You probably want to call theano.shared()! (Olivier D.)

-Interface Bug Fix:
- * Rop in some case should have returned a list of 1 theano varible, but returned directly that variable.
- * Theano flags "home" is not used anymore as it was a duplicate. If you use it, theano should raise an error.

 New features:
- * adding 1d advanced indexing support to inc_subtensor and set_subtensor (James
- * tensor.{zeros,ones}_like now support the dtype param as numpy (Fred)
- * config flags "exception_verbosity" to control the verbosity of exception (Ian 
- * theano-cache list: list the content of the theano cache(Fred)
- * tensor.ceil_int_div FB
- * MaxAndArgMax.grad now work with any axis(The op support only 1 axis) FB
+ * Adding 1d advanced indexing support to inc_subtensor and set_subtensor (James Bergstra)
+ * tensor.{zeros,ones}_like now support the dtype parameter, as in numpy. (Frederic)
+ * Added configuration flag "exception_verbosity" to control the verbosity of exceptions (Ian)
+ * theano-cache list: list the content of the theano cache (Frederic)
+ * theano-cache unlock: remove the Theano lock (Olivier)
+ * tensor.ceil_int_div (Frederic)
+ * MaxAndArgMax.grad now works with any axis (the op supports only 1 axis). (Frederic)
   * used by tensor.{max,min,max_and_argmax}
- * tensor.{all,any} RP
- * tensor.roll as numpy: (Matthew Rocklin, DWF)
- * on Windows work. Still experimental. (Sebastian Urban)
+ * tensor.{all,any} (Razvan)
+ * tensor.roll, as in numpy. (Matthew Rocklin, David Warde-Farley)
+ * Theano with GPU works in some cases on Windows now. Still experimental. (Sebastian Urban)
 * IfElse now allow to have a list/tuple as the result of the if/else branches.
-   * They must have the same length and correspondig type) RP
- * argmax dtype as int64. OD
-
-
-
-New Optimizations:
- * AdvancedSubtensor1 reuse preallocated memory if available(scan, c|py_nogc linker)(Fred)
- * tensor_variable.size (as numpy) product of the shape elements OD
- * sparse_variable.size (as scipy) the number of stored value.OD
- * dot22, dot22scalar work with complex(Fred)
- * Doc how to wrap in Theano an existing python function(in numpy, scipy, ...) Fred
- * added arccos IG
- * sparse dot with full output. (Yann Dauphin)
-   * Optimized to Usmm and UsmmCscDense in some case (YD)
-   * Note: theano.dot, sparse.dot return a structured_dot grad(
- * Generate Gemv/Gemm more often JB
- * scan move computation outside the inner loop when the remove everything from the inner loop RP
- * scan optimization done earlier. This allow other optimization to be applied FB, RP, GD
- * exp(x) * sigmoid(-x) is now correctly optimized to a more stable form.
-
-
-GPU:
- * GpuAdvancedSubtensor1 support broadcasted dimensions
-
-
-Bugs fixed:
- * On cpu, if the convolution had received explicit shape information, they where not checked at run time. This caused wrong result if the input shape was not the one expected. (Fred, reported by Sander Dieleman)
- * Scan grad when the input of scan has sequence of different length. (RP reported by Michael Forbes)
- * Scan.infer_shape now work correctly when working with a condition for the number of loop. In the past, it returned n_stepts as the shape, witch is not always true. RP
- * Theoritic bug: in some case we could have GPUSum return bad value. Was not able to produce the error.. 
+   * They must have the same length and corresponding types. (Razvan)
+ * Argmax output dtype is now int64. (Olivier)
+ * Added the element-wise operation arccos. (Ian)
+ * sparse dot with full grad output. (Yann Dauphin)
+   * Optimized to Usmm and UsmmCscDense in some cases. (Yann)
+   * Note: theano.dot, sparse.dot return a structured_dot grad.
+     This means that the old grad returned a grad value with the same sparsity pattern as the inputs.
+ * GpuAdvancedSubtensor1 supports broadcasted dimensions. (Frederic)
+
+
+New optimizations:
+ * AdvancedSubtensor1 reuses preallocated memory if available (scan, c|py_nogc linker). (Frederic)
+ * tensor_variable.size (as numpy): product of the shape elements. (Olivier)
+ * sparse_variable.size (as scipy): the number of stored values. (Olivier)
+ * dot22, dot22scalar work with complex. (Frederic)
+ * Generate Gemv/Gemm more often. (James)
+ * Remove scan when all computations can be moved outside the loop. (Razvan)
+ * scan optimization done earlier. This allows other optimizations to be applied. (Frederic, Guillaume, Razvan)
+ * exp(x) * sigmoid(-x) is now correctly optimized to a more stable form. (Olivier)
+ * Added Subtensor(Rebroadcast(x)) => Rebroadcast(Subtensor(x)) optimization. (Guillaume)
+ * Make the optimization process faster. (James)
+ * Allow fusion of elemwise when the scalar op needs support code. (James)
+ * Better opt that lifts transpose around dot. (James)
+
+
+Bug fixes (the results change):
+ * On CPU, if the convolution received explicit shape information, it was not checked at runtime.
+   This caused wrong results if the input shape was not the one expected. (Frederic, reported by Sander Dieleman)
+ * Scan grad when the input of scan has sequences of different lengths. (Razvan, reported by Michael Forbes)
+ * Scan.infer_shape now works correctly when working with a condition for the number of loops.
+   In the past, it returned n_steps as the shape, which is not always true. (Razvan)
+ * Theoretical bug: in some cases GPUSum could return a bad value. We were not able to reproduce the error.
   * pattern affected({0,1}*nb dim, 0 no reduction on this dim, 1 reduction on this dim )
-     01, 011, 0111, 010, 10, 001, 0011, 0101: FB
- * div by zeros in verify_grad. This hidded a bug in the grad of Images2Neibs. (JB)
- * theano.sandbox.neighbors.Images2Neibs grad was returning wrong value. The grad is now disabled and return an error. FB
-
+     01, 011, 0111, 010, 10, 001, 0011, 0101: (Frederic)
+ * Division by zero in verify_grad. This hid a bug in the grad of Images2Neibs. (James)
+ * theano.sandbox.neighbors.Images2Neibs grad was returning wrong values.
+   The grad is now disabled and returns an error. (Frederic)


 Crash fixed:
- * T.mean crash at graph building timeby Ian G.
- * "Interactive debugger" crash fix (Ian, Fred)
- * "Interactive Debugger" renamed to "Using Test Values"
- * Do not call gemm with strides 0, some blas refuse it. (PL)
- * optimization crash with gemm and complex.(Fred
- * Gpu crash with elemwise Fred
- * compilation crash with amdlibm and the gpu. Fred
- * IfElse crash Fred
- * Execution crash fix in AdvancedSubtensor1 on 32 bits computer(PL)
- * gpu compilation crash on MacOS X OD
- * gpu compilation crash on MacOS X Fred
- * Support for OSX Enthought Python Distribution 7.x (Graham Taylor, OD)
- * When the subtensor inputs had 0 dimensions and the outputs 0 dimensions
- * Crash when the step to subtensor was not 1 in conjonction with some optimization
-
-
-Optimization:
- * Added Subtensor(Rebroadcast(x)) => Rebroadcast(Subtensor(x)) optimization (GD)
- * Scan optimization are executed earlier. This make other optimization being applied(like blas optimization, gpu optimization...)(GD, Fred, RP)
- * Make the optimization process faster JB
- * Allow fusion of elemwise when the scalar op need support code. JB
+ * T.mean crash at graph building time. (Ian)
+ * "Interactive debugger" crash fix. (Ian, Frederic)
+ * Do not call gemm with strides 0, some blas refuse it. (Pascal Lamblin)
+ * Optimization crash with gemm and complex. (Frederic)
+ * GPU crash with elemwise. (Frederic)
+ * Compilation crash with amdlibm and the GPU. (Frederic)
+ * IfElse crash. (Frederic)
+ * Execution crash fix in AdvancedSubtensor1 on 32 bit computers. (Pascal)
+ * GPU compilation crash on MacOS X. (Olivier)
+ * Support for OSX Enthought Python Distribution 7.x. (Graham Taylor, Olivier)
+ * When the subtensor inputs had 0 dimensions and the outputs 0 dimensions. (Frederic)
+ * Crash when the step to subtensor was not 1 in conjunction with some optimization. (Frederic, reported by Olivier Chapelle)
+ * fix dot22scalar cast (Justin Bayer, Frédéric, Olivier)


 Know bug:
- * CAReduce with nan in inputs don't return the good output (`Ticket <http://trac-hg.assembla.com/theano/ticket/763>`_).
+ * CAReduce with nan in inputs doesn't return the correct output (`Ticket <https://www.assembla.com/spaces/theano/tickets/763>`_).

   * This is used in tensor.{max,mean,prod,sum} and in the grad of PermuteRowElements.
 * If you do grad of grad of scan you can have wrong number in some case.


 Sandbox:
- * cvm, interface more consistent with current linker (James)
- * vm linker have a callback parameter (JB)
- * review/finish/doc: diag/extract_diag AB,FB,GD
- * review/finish/doc: AllocDiag/diag AB,FB,GD
- * review/finish/doc: MatrixInverse, matrix_inverse RP
- * review/finish/doc: matrix_dot RP
- * review/finish/doc: det PH determinent op
- * review/finish/doc: Cholesky David determinent op
- * review/finish/doc: ensure_sorted_indices Li Yao
- * review/finish/doc: spectral_radius_boud Xavier Glorot
- * review/finish/doc: sparse sum Valentin Bisson
+ * cvm, interface more consistent with current linker. (James)
+ * vm linker has a callback parameter. (James)
+ * review/finish/doc: diag/extract_diag. (Arnaud Bergeron, Frederic, Olivier)
+ * review/finish/doc: AllocDiag/diag. (Arnaud, Frederic, Guillaume)
+ * review/finish/doc: MatrixInverse, matrix_inverse. (Razvan)
+ * review/finish/doc: matrix_dot. (Razvan)
+ * review/finish/doc: det (determinant) op. (Philippe Hamel)
+ * review/finish/doc: Cholesky determinant op. (David)
+ * review/finish/doc: ensure_sorted_indices. (Li Yao)
+ * review/finish/doc: spectral_radius_bound. (Xavier Glorot)
+ * review/finish/doc: sparse sum. (Valentin Bisson)


 Sandbox New features(not enabled by default):
- * CURAND_RandomStreams for uniform and normal(not pickable, gpu only)(James)
+ * CURAND_RandomStreams for uniform and normal. (not picklable, GPU only) (James)


 Documentation:
- * Many update by many people: Olivier Delalleau, Fred, RP, David, 
- * Updates to install doc on MacOS (OD)
- * Updates to install doc on Windows(DWF, OD)
- * Doc how to use scan to loop with a condition as the number of iteration RP
+ * Many updates. (Many people)
+ * Updates to install doc on MacOS. (Olivier)
+ * Updates to install doc on Windows. (David, Olivier)
+ * Added how to use scan to loop with a condition as the number of iterations. (Razvan)
+ * Added how to wrap in Theano an existing Python function (in numpy, scipy, ...). (Frederic)
+ * Refactored GPU installation of Theano. (Olivier)


 Others:
- * Better error message at many places: David Warde-Farley, Ian, Fred, Olivier D.
- * pep8: James, 
- * min_informative_str to print graph: Ian G.
- * Fix catching of exception. (Sometimes we catched interupt): Fred, David, Ian, OD,
- * Better support for uft string(David WF)
- * Fix pydotprint with a function compiled with a ProfileMode (Fred)
+ * Better error messages in many places. (David, Ian, Frederic, Olivier)
+ * pep8 fix. (Many people)
+ * New min_informative_str() function to print graph. (Ian)
+ * Fix catching of exceptions. (Sometimes we caught interrupts) (Frederic, David, Ian, Olivier)
+ * Better support for utf strings. (David)
+ * Fix pydotprint with a function compiled with a ProfileMode (Frederic)
   * Was broken with change to the profiler.
- * warning when people have old cache entry (OD)
- * More test for join on the gpu and cpu.
- * Don't request to load the gpu module by default in scan module. RP
- * Better opt that lift transpose around dot JB
- * Fix some import problem
- * Filtering update JB
-
-
-Reviewers:
- James, David, Ian, Fred, Razvan, delallea
+ * Warning when people have old cache entries. (Olivier)
+ * More tests for join on the GPU and CPU. (Frederic)
+ * Don't request to load the GPU module by default in scan module. (Razvan)
+ * Fix some import problems.
+ * Filtering update. (James)
+ * The buildbot raises optimization errors instead of printing a warning. (Frederic)
+ * On Windows, the default compiledir changed to be local to the computer/user and not transferred with roaming profile. (Sebastian Urban)
+
+Reviewers (alphabetical order):
+ * David, Frederic, Ian, James, Olivier, Razvan
--- a/bin/theano-cache
+++ b/bin/theano-cache
@@ -28,10 +28,14 @@ elif sys.argv[1] in ('clear'):
                      (len(items), ', '.join(items)))
 elif sys.argv[1] in ('list'):
    theano.gof.compiledir.print_compiledir_content()
+elif sys.argv[1] == 'unlock':
+    theano.gof.compilelock.force_unlock()
+    print 'Lock successfully removed!'
 else:
    print 'command "%s" not recognized' % sys.argv[1]
    print 'Type "theano-cache" to print the cache location'
    print 'Type "theano-cache clear" to erase the cache'
    print 'Type "theano-cache list" to print the cache content'
+    print 'Type "theano-cache unlock" to unlock the cache directory'
    sys.exit(1)
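
The new `unlock` branch compares with `==`, while the older branches use `in ('clear')` and `in ('list')`. Without a trailing comma those parentheses do not build a tuple, so `in` performs a substring test on a plain string. A minimal sketch of an exact-match dispatch follows; the `dispatch` helper, the `handlers` table and the returned strings are hypothetical and are not part of `bin/theano-cache`.

```python
def dispatch(argv, handlers):
    """Run the handler named by argv[1]; return None if it is unknown."""
    if len(argv) < 2:
        # No sub-command: fall back to printing the cache location.
        return handlers["location"]()
    handler = handlers.get(argv[1])  # exact match on the command name
    if handler is None:
        return None
    return handler()


# Hypothetical handlers that only return a description of what they would do.
handlers = {
    "location": lambda: "cache location",
    "clear": lambda: "cache cleared",
    "list": lambda: "cache content",
    "unlock": lambda: "lock removed",
}

result = dispatch(["theano-cache", "unlock"], handlers)
unknown = dispatch(["theano-cache", "lea"], handlers)

# ('clear') is just the string 'clear', so this is a substring test.
substring_match = "lea" in ('clear')
```

With the dictionary lookup, a partial name such as `lea` is rejected, whereas the substring test accepts it.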

--- a/doc/install.txt
+++ b/doc/install.txt
@@ -726,7 +726,7 @@ Currently, due to memory fragmentation issue in Windows, the
 test-suite breaks at some point when using ``nosetests``, with many error
 messages looking
 like: ``DLL load failed: Not enough storage is available to process this
-command``. As a result, you should instead run
+command``. As a workaround, you can instead run:

    .. code-block:: bash

@@ -736,6 +736,13 @@ This will run tests in batches of 100, which should avoid memory errors.
 Note that this script calls ``nosetests``, which may require being run from
 within a MinGW shell if you installed Nose manually as described above.

+.. note::
+
+    The above workaround to run tests with the ``run_tests_in_batch.py`` script
+    is currently imperfect: some tests are not properly collected by nosetests
+    in this mode. This may result in some weird test failures starting with
+    ``ERROR: Failure: OSError``. We do not yet have a fix for this problem.
+
 Editing code in Visual Studio
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


--- a/theano/gof/compilelock.py
+++ b/theano/gof/compilelock.py
@@ -21,9 +21,27 @@ timeout_before_override = 120
 # 'refresh_every' seconds.
 refresh_every = 60

-def get_lock():
+
+def force_unlock():
+    """
+    Delete the compilation lock if someone else has it.
+    """
+    global timeout_before_override
+    timeout_backup = timeout_before_override
+    timeout_before_override = 0
+    try:
+        get_lock(min_wait=0, max_wait=0.001)
+        release_lock()
+    finally:
+        timeout_before_override = timeout_backup
+
+
+def get_lock(**kw):
    """
    Obtain lock on compilation directory.
+
+    :param kw: Additional arguments to be forwarded to the `lock` function when
+    acquiring the lock.
    """
    if not hasattr(get_lock, 'n_lock'):
        # Initialization.
@@ -47,7 +65,7 @@ def get_lock():
    if get_lock.lock_is_enabled:
        # Only really try to acquire the lock if we do not have it already.
        if get_lock.n_lock == 0:
-            lock(get_lock.lock_dir, timeout = timeout_before_override)
+            lock(get_lock.lock_dir, timeout=timeout_before_override, **kw)
            atexit.register(Unlocker.unlock, get_lock.unlocker)
            # Store time at which the lock was set.
            get_lock.start_time = time.time()
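
The pattern used by `force_unlock` (temporarily set the override timeout to zero, take the lock, release it, and restore the timeout in a `finally` clause) can be sketched in isolation. This is a toy model under stated assumptions: the directory-based `acquire` and `release` helpers are hypothetical stand-ins, not the `lock` function from `compilelock.py`.

```python
import os
import tempfile
import time

# Stand-in for the module-level setting of the same name.
timeout_before_override = 120


def acquire(lock_dir, timeout):
    """Toy lock: create lock_dir, taking over one older than `timeout` seconds."""
    try:
        os.mkdir(lock_dir)
    except FileExistsError:
        age = time.time() - os.path.getmtime(lock_dir)
        if timeout > 0 and age < timeout:
            raise RuntimeError("lock is held by someone else")
        # The existing lock counts as stale: take it over.
        os.utime(lock_dir, None)


def release(lock_dir):
    os.rmdir(lock_dir)


def force_unlock(lock_dir):
    """Remove the lock even if another process created it."""
    global timeout_before_override
    backup = timeout_before_override
    timeout_before_override = 0  # any existing lock is now considered stale
    try:
        acquire(lock_dir, timeout=timeout_before_override)
        release(lock_dir)
    finally:
        # Restored even if acquire() or release() raises.
        timeout_before_override = backup


base = tempfile.mkdtemp()
lock_dir = os.path.join(base, "lock_dir")
os.mkdir(lock_dir)  # simulate a lock left behind by another process
force_unlock(lock_dir)
```

Restoring the setting in `finally` matters because later lock acquisitions in the same process should keep the normal timeout.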

--- a/theano/misc/do_nightly_build
+++ b/theano/misc/do_nightly_build
@@ -5,6 +5,12 @@ START=`date +%s`
 NOSETESTS=nosetests
 ARGS=$@
 PROFILING=""
+RELEASE=""
+if [ "$1" == "--release" ]; then
+    RELEASE="True"
+    shift
+    ARGS=$@
+fi
 if [ "$1" == "--buildbot" ]; then
    ROOT_CWD=/Tmp/nightly_build
    FLAGS=compiledir=/Tmp/lisa_theano_compile_dir_theano
@@ -17,7 +23,10 @@ fi

 echo "Number of elements in the compiledir:"
 ls ${COMPILEDIR}|wc -l
+# We don't want warnings for already-fixed errors in the buildbot
 FLAGS=${THEANO_FLAGS},warn.argmax_pushdown_bug=False,warn.gpusum_01_011_0111_bug=False,warn.sum_sum_bug=False,warn.sum_div_dimshuffle_bug=False,$FLAGS
+# We want optimization errors to be clearly visible, so make them raise an error
+FLAGS=on_opt_error=raise,$FLAGS
 # Ignore user device and floatX config, because:
 #   1. Tests are intended to be run with device=cpu.
 #   2. We explicitly add 'floatX=float32' in one run of the test suite below,
@@ -25,14 +34,23 @@ FLAGS=${THEANO_FLAGS},warn.argmax_pushdown_bug=False,warn.gpusum_01_011_0111_bug
 FLAGS=${FLAGS},device=cpu,floatX=float64
 export PYTHONPATH=${ROOT_CWD}:$PYTHONPATH

+if [ "$RELEASE" ]; then
+    echo "Executing nosetests with default mode and compute_test_value"
+    THEANO_FLAGS=${FLAGS},compute_test_value=ignore ${NOSETESTS} ${ARGS}
+    echo "Number of elements in the compiledir:"
+    ls ${COMPILEDIR}|wc -l
+fi
+
 echo "Executing nosetests with mode=FAST_COMPILE"
 THEANO_FLAGS=${FLAGS},mode=FAST_COMPILE ${NOSETESTS} ${ARGS}
 echo "Number of elements in the compiledir:"
 ls ${COMPILEDIR}|wc -l
+
 echo "Executing nosetests with mode=FAST_RUN"
 THEANO_FLAGS=${FLAGS},mode=FAST_RUN ${NOSETESTS} ${PROFILING} ${ARGS}
 echo "Number of elements in the compiledir:"
 ls ${COMPILEDIR}|wc -l
+
 echo "Executing nosetests with mode=FAST_RUN,floatX=float32"
 THEANO_FLAGS=${FLAGS},mode=FAST_RUN,floatX=float32 ${NOSETESTS} ${ARGS}
 echo "Number of elements in the compiledir:"

--- a/theano/sparse/tests/test_basic.py
+++ b/theano/sparse/tests/test_basic.py
@@ -555,8 +555,14 @@ class test_structureddot(unittest.TestCase):

 class DotTests(unittest.TestCase):
    def setUp(self):
-        x_size = (10, 1000)
-        y_size = (1000, 10000)
+        # On 32-bit platforms we use smaller matrices to avoid running out of
+        # memory during tests.
+        if theano.gof.cmodule.local_bitwidth() <= 32:
+            x_size = (10, 100)
+            y_size = (100, 1000)
+        else:
+            x_size = (10, 1000)
+            y_size = (1000, 10000)

        self.x_csr = scipy.sparse.csr_matrix(
            numpy.random.binomial(1, 0.5, x_size), dtype=theano.config.floatX)
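
The size selection above can be sketched without Theano. The `pointer_bitwidth` helper below is an assumption: it uses the pointer size reported by `struct` as a stand-in for `theano.gof.cmodule.local_bitwidth()`, and the two may not agree on every platform.

```python
import struct


def pointer_bitwidth():
    """Width of a native pointer in bits (stand-in for local_bitwidth())."""
    return struct.calcsize("P") * 8


def dot_test_sizes(bitwidth):
    """Pick smaller matrices on 32-bit platforms to limit memory use."""
    if bitwidth <= 32:
        return (10, 100), (100, 1000)
    return (10, 1000), (1000, 10000)


x_size, y_size = dot_test_sizes(pointer_bitwidth())
```

In both branches the inner dimensions agree, so the dot product in the tests stays well defined.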

--- a/theano/tensor/blas.py
+++ b/theano/tensor/blas.py
@@ -890,21 +890,28 @@ def res_is_a(node, op, maxclients=None):
              and retval


-def _as_scalar(res):
+def _as_scalar(res, dtype=None):
    """Return None or a TensorVariable whose type is in T.float_scalar_types"""
+    if dtype is None:
+        dtype = config.floatX
    if numpy.all(res.type.broadcastable):
        while res.owner and isinstance(res.owner.op, T.DimShuffle):
            res = res.owner.inputs[0]
-        if res.type.broadcastable: # may still have some number of True's
+        # may still have some number of True's
+        if res.type.broadcastable:
            rval = res.dimshuffle()
        else:
            rval = res
-
        if rval.type.dtype[:3] in ('int', 'uin'):
-            rval = cast(rval, theano.config.floatX) #may lose precision !?
-
-        #if isinstance(rval, T.Constant):
-            #rval = rval.data.flatten()[0]
+            # We check that the upcast of res and dtype won't change dtype.
+            # If dtype is float64, we will cast int64 to float64.
+            # This is valid when res is a scalar used as input to a dot22,
+            # as the cast of the scalar can be done before or after the dot22
+            # and this will give the same result.
+            if theano.scalar.upcast(res.dtype, dtype) == dtype:
+                return T.cast(rval, dtype)
+            else:
+                return None

        return rval

@@ -1567,7 +1574,7 @@ def local_dot22_to_dot22scalar(node):
        #return False #TODO fix
    dot22_idx = i_dot22.index(True)
    d = node.inputs[dot22_idx]
-    i_scalar = [_as_scalar(x) for x in node.inputs]
+    i_scalar = [_as_scalar(x, dtype=d.dtype) for x in node.inputs]
    if not any(i_scalar):
        i_mul = [x.owner and x.owner.op ==T.mul for x in node.inputs]
        if not any(i_mul):
@@ -1581,10 +1588,10 @@ def local_dot22_to_dot22scalar(node):
        mul_idx = i_mul.index(True)#we take the first mul!
        m = node.inputs[mul_idx]

-        if len(m.owner.inputs)==2 and any([_as_scalar(x) for x in m.owner.inputs]):
+        if len(m.owner.inputs)==2 and any([_as_scalar(x, dtype=d.dtype) for x in m.owner.inputs]):
            scalar_idx = -1
            for i,x in enumerate(m.owner.inputs):
-                if _as_scalar(x) and (theano.scalar.upcast(x.type.dtype,d.type.dtype)
+                if _as_scalar(x, dtype=d.dtype) and (theano.scalar.upcast(x.type.dtype,d.type.dtype)
                                      == d.type.dtype):
                    scalar_idx = i
                    break
@@ -1594,7 +1601,7 @@ def local_dot22_to_dot22scalar(node):
                             'of the scalar cannot be upcasted to the matrix type',
                             node.inputs, [x.type for x in node.inputs])
                return False
-            a = T.cast(_as_scalar(m.owner.inputs[scalar_idx]), d.type.dtype)
+            a = T.cast(_as_scalar(m.owner.inputs[scalar_idx], dtype=d.dtype), d.type.dtype)
            assert not a.type.ndim
            dot=_dot22scalar(d.owner.inputs[0], d.owner.inputs[1], a)
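
The new check in `_as_scalar` only casts an integer scalar when upcasting it with the matrix dtype leaves the matrix dtype unchanged. A minimal sketch of that rule follows. The `_UPCAST` table is hypothetical: it hard-codes NumPy's promotion results for signed integers and floats, which `theano.scalar.upcast` is assumed to mirror here, and `scalar_cast_target` is an illustrative name.

```python
# Hypothetical promotion table (signed integer dtype, float dtype) -> result.
_UPCAST = {
    ("int8", "float32"): "float32",
    ("int16", "float32"): "float32",
    ("int32", "float32"): "float64",
    ("int64", "float32"): "float64",
    ("int8", "float64"): "float64",
    ("int16", "float64"): "float64",
    ("int32", "float64"): "float64",
    ("int64", "float64"): "float64",
}


def scalar_cast_target(scalar_dtype, matrix_dtype):
    """Return the dtype to cast the scalar to, or None to skip the rewrite."""
    if _UPCAST[(scalar_dtype, matrix_dtype)] == matrix_dtype:
        return matrix_dtype
    # Casting would widen the result beyond the matrix dtype.
    return None


float64_targets = [scalar_cast_target(t, "float64")
                   for t in ("int8", "int16", "int32", "int64")]
float32_targets = [scalar_cast_target(t, "float32")
                   for t in ("int8", "int16", "int32", "int64")]
```

Under these assumptions a float32 matrix multiplied by an int32 or int64 scalar is left alone, which matches the expectations in `test_dot22scalar_cast`.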


--- a/theano/tensor/opt.py
+++ b/theano/tensor/opt.py
@@ -920,6 +920,28 @@ class ShapeFeature(object):
                    + ' != len(node.outputs) = '
                    + str(len(node.outputs)))

+        # Ensure shapes are in 'int64'. This is to make sure the assert
+        # found in the `local_useless_subtensor` optimization does not fail.
+        new_shape = []
+        for sh_idx, sh in enumerate(o_shapes):
+            if sh is None:
+                continue
+            for i, d in enumerate(sh):
+                # Note: we ignore any shape element that is not typed (i.e. does
+                # not have a 'dtype' attribute). This means there may still
+                # remain int elements that are int32 on 32-bit platforms, but
+                # this works with `local_useless_subtensor`, so for now we
+                # keep it this way. See #266 for a better long-term fix.
+                if getattr(d, 'dtype', 'int64') != 'int64':
+                    assert d.dtype in theano.tensor.int_dtypes
+                    new_shape += sh[len(new_shape):i + 1]
+                    new_shape[i] = theano.tensor.cast(d, 'int64')
+            if new_shape:
+                # We replace the shape with wrong dtype by the one with 'int64'.
+                new_shape += sh[len(new_shape):]
+                o_shapes[sh_idx] = tuple(new_shape)
+                new_shape = []
+
        for r, s in izip(node.outputs, o_shapes):
            self.set_shape(r, s)
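
The shape normalization added above can be illustrated without Theano. This is a simplified sketch: `Dim`, `cast` and `shapes_as_int64` are hypothetical stand-ins, and unlike the patch it rebuilds every shape tuple instead of only those that contain an element needing a cast.

```python
INT_DTYPES = ("int8", "int16", "int32", "int64")


class Dim:
    """Hypothetical typed shape element (stands in for a symbolic variable)."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype


def cast(d, dtype):
    """Stand-in for theano.tensor.cast: same element, new dtype."""
    return Dim(d.name, dtype)


def shapes_as_int64(o_shapes):
    """Return shapes whose typed elements all have dtype 'int64'."""
    result = []
    for sh in o_shapes:
        if sh is None:
            result.append(None)  # unknown shape: nothing to convert
            continue
        new_sh = []
        for d in sh:
            # Untyped elements (e.g. plain Python ints) are left as-is.
            if getattr(d, "dtype", "int64") != "int64":
                assert d.dtype in INT_DTYPES
                d = cast(d, "int64")
            new_sh.append(d)
        result.append(tuple(new_sh))
    return result


shapes = [None, (3, Dim("n", "int32")), (Dim("m", "int64"),)]
converted = shapes_as_int64(shapes)
```

The `getattr(d, 'dtype', 'int64')` default is what lets untyped elements pass through unchanged, as the comment in the patch describes.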


--- a/theano/tensor/tests/test_basic.py
+++ b/theano/tensor/tests/test_basic.py
@@ -372,7 +372,8 @@ def rand_of_dtype(shape, dtype):


 def makeBroadcastTester(op, expected, checks={}, name=None, **kwargs):
-    name = str(op)
+    if name is None:
+        name = str(op)
    # Here we ensure the test name matches the name of the variable defined in
    # this script. This is needed to properly identify the test e.g. with the
    # --with-id option of nosetests, or simply to rerun a specific test that
@@ -628,6 +629,7 @@ CeilIntDivTester = makeBroadcastTester(
                 uinteger=(randint(2, 3).astype("uint8"),
                           randint_nonzero(2, 3).astype("uint8")),
                 ),
+    name='CeilIntDiv',
    # As we implement this function with neq, the gradient returned is always 0.
 #    grad=_grad_broadcast_div_mod_normal,
 #    grad_rtol=div_grad_rtol,
@@ -674,10 +676,13 @@ _grad_broadcast_pow_normal = dict(same_shapes = (rand_ranged(1, 5, (2, 3)), rand
 _good_broadcast_pow_normal_float_pow = copy(_good_broadcast_pow_normal_float)
 del _good_broadcast_pow_normal_float_pow["empty2"]

-PowTester = makeBroadcastTester(op = pow,
-                                  expected = lambda x, y: check_floatX((x, y), x ** y),
-                                  good = _good_broadcast_pow_normal_float,
-                                  grad = _grad_broadcast_pow_normal)
+PowTester = makeBroadcastTester(
+        op=pow,
+        expected=lambda x, y: check_floatX((x, y), x ** y),
+        good=_good_broadcast_pow_normal_float,
+        grad=_grad_broadcast_pow_normal,
+        name='Pow')
+
 PowInplaceTester = makeBroadcastTester(op = inplace.pow_inplace,
                                       expected = lambda x, y: x ** y,
                                       good = _good_broadcast_pow_normal_float_pow,
@@ -1090,15 +1095,19 @@ ErfcInplaceTester = makeBroadcastTester(op = inplace.erfc_inplace,
                                        inplace = True,
                                        skip = skip_scipy)

-ZerosLikeTester =  makeBroadcastTester(op = tensor.zeros_like,
-                                        expected = numpy.zeros_like,
-                                        good = _good_broadcast_unary_normal,
-                                        grad = _grad_broadcast_unary_normal)
+ZerosLikeTester = makeBroadcastTester(
+        op=tensor.zeros_like,
+        expected=numpy.zeros_like,
+        good=_good_broadcast_unary_normal,
+        grad=_grad_broadcast_unary_normal,
+        name='ZerosLike')

-OnesLikeTester =  makeBroadcastTester(op = tensor.ones_like,
-                                        expected = numpy.ones_like,
-                                        good = _good_broadcast_unary_normal,
-                                        grad = _grad_broadcast_unary_normal)
+OnesLikeTester = makeBroadcastTester(
+        op=tensor.ones_like,
+        expected=numpy.ones_like,
+        good=_good_broadcast_unary_normal,
+        grad=_grad_broadcast_unary_normal,
+        name='OnesLike')

 DotTester = makeTester(name = 'DotTester',
                        op = dot,

--- a/theano/tensor/tests/test_blas.py
+++ b/theano/tensor/tests/test_blas.py
@@ -821,6 +821,27 @@ def test_dot22scalar():
                    cmp((0,4),(4,0),(0,0))
                    cmp((0,0),(0,0),(0,0))

+
+def test_dot22scalar_cast():
+    """
+    Test that in `dot22_to_dot22scalar` we properly cast integers to floats.
+    """
+    # Note that this test was failing before d5ff6904.
+    A = T.dmatrix()
+    for scalar_int_type in T.int_dtypes:
+        y = T.scalar(dtype=scalar_int_type)
+        f = theano.function([A, y], T.dot(A, A) * y, mode=mode_blas_opt)
+        assert _dot22scalar in [x.op for x in f.maker.env.toposort()]
+    A = T.fmatrix()
+    for scalar_int_type in T.int_dtypes:
+        y = T.scalar(dtype=scalar_int_type)
+        f = theano.function([A, y], T.dot(A, A) * y, mode=mode_blas_opt)
+        if scalar_int_type in ['int32', 'int64']:
+            assert _dot22 in [x.op for x in f.maker.env.toposort()]
+        else:
+            assert _dot22scalar in [x.op for x in f.maker.env.toposort()]
+
+
 def test_dot_w_self():
    # This can trigger problems in the optimization because what would normally be a gemm must
    # not be because the output is aliased to one of the inputs.