testgroup / pytensor / Commits / 72a7214a

Commit 72a7214a, authored Aug 21, 2012 by lamblin

Merge pull request #863 from nouiz/mixed2

Mixed2

Parents: 7ebae191, 43b81a93

Showing 9 changed files with 227 additions and 261 deletions (+227, -261)
NEWS.txt (+1, -142)
bin/theano-nose (+14, -0)
theano/gof/compiledir.py (+1, -1)
theano/sandbox/cuda/basic_ops.py (+141, -79)
theano/sandbox/cuda/cuda_ndarray.cu (+35, -23)
theano/sandbox/cuda/nvcc_compiler.py (+15, -1)
theano/scan_module/tests/test_scan.py (+13, -12)
theano/tensor/__init__.py (+2, -0)
theano/tensor/extra_ops.py (+5, -3)
NEWS.txt — View file @ 72a7214a

@@ -2,148 +2,7 @@
Updates in the Trunk since the last release:
Bug fixes
* Outputs of Scan nodes could contain corrupted values: some parts of the
output would be repeated a second time, instead of the correct values.
It happened randomly, and quite infrequently, but the bug has been present
(both in Python and Cython) since April 2011. (Pascal L.)
* In Sparse sandbox, fix the grad of theano.sparse.sandbox.sp.row_scale.
It did not return the right number of elements. (Frederic B.)
* set_subtensor(x[int vector], new_value) when moved to the GPU
was transformed into inc_subtensor on the GPU. Now we have a correct
(but slow) GPU implementation.
Note 1: set_subtensor(x[slice[,...]], new_value) was working correctly
in all cases as well as inc_subtensor(*, *).
Note 2: If your code was affected by the incorrect behavior, we now print
a warning by default (Frederic B.)
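In NumPy terms the distinction between the two operations can be sketched as follows (this is an illustration of the semantics, not Theano's actual implementation):

```python
import numpy as np

x = np.zeros(5, dtype="float32")
idx = np.array([1, 3, 3])  # integer vector, with a repeated index

# set_subtensor(x[idx], 1): plain assignment; a repeated index just overwrites
set_result = x.copy()
set_result[idx] = 1.0

# inc_subtensor(x[idx], 1): accumulation; repeated indices must add up,
# so `+=` is not enough -- np.add.at handles the duplicates correctly
inc_result = x.copy()
np.add.at(inc_result, idx, 1.0)
```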
* Fixed an issue whereby config values were used as default arguments,
with those defaults then stuck at old values if the config variables were
changed during program execution. (David W-F)
* Fixed many subtle bugs involving mutable default arguments which may have
led to unexpected behaviour, such as objects sharing instance variables
they were not supposed to share. (David W-F)
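The class of bug described in the two bullets above is the standard Python mutable-default pitfall; a minimal illustration (not Theano code):

```python
def broken(item, bucket=[]):      # the SAME list object is reused across calls
    bucket.append(item)
    return bucket

def fixed(item, bucket=None):     # create a fresh list on each call instead
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

a = broken(1)
b = broken(2)      # a and b are the same object, now [1, 2]
c = fixed(1)
d = fixed(2)       # independent lists
```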
* Correctly record the GPU device number used when we let the driver select it.
(Frederic B.)
Documentation
* Added documentation in the tutorial on how to extend Theano.
This explains how to make a Theano Op from a Python function.
http://deeplearning.net/software/theano/tutorial/extending_theano.html
(Frédéric B.)
* New installation instructions for Windows using EPD (Pascal L.)
Interface changes
* In 0.5, we removed the deprecated sharedvar.value property.
Now we raise an error if you access it. (Frederic B.)
* theano.function does not accept duplicate inputs, so function([x, x], ...)
does not work anymore. (Pascal L.)
* theano.function now raises an error if some of the provided inputs are
not part of the computational graph needed to compute the output, for
instance, function([x, y], [y]). You can use the kwarg
``on_unused_input={'raise', 'warn', 'ignore'}`` to control this.
(Pascal L.)
* New Theano flag "on_unused_input" that defines the default value of the
previous point. (Frederic B.)
* tensor.alloc() now raises an error at graph build time
when we try to create fewer dimensions than the number of dimensions
the provided value has. In the past, the error was raised at run time.
(Frederic B.)
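The rule being enforced is the usual broadcasting rule: a value may be broadcast into more dimensions, never fewer. A NumPy sketch of the same check (an analogy, not the Theano code):

```python
import numpy as np

value = np.ones((2, 3))

# Broadcasting a value into MORE dimensions is fine...
out = np.broadcast_to(value, (4, 2, 3))

# ...but asking for FEWER dimensions than the value has is an error,
# and it is better to detect this at graph-build time than at run time.
try:
    np.broadcast_to(value, (3,))
    failed = False
except ValueError:
    failed = True
```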
Speed up
* Convolution on the GPU now checks the generation of the card to make
it faster in some cases (especially medium/big output images) (Frédéric B.)
(We hardcoded 512 as the maximum number of threads per block. Newer cards
support up to 1024 threads per block.)
* CPU convolutions are now parallelized (Frédéric B.)
By default, all cores/hyper-threads are used.
To control this, use the OMP_NUM_THREADS=N environment variable.
New Features
* debugprint new param ids=["CHAR", "id", "int", ""]
This makes the identifier printed to be the python id, a unique char, a
unique int, or not have it printed. We changed the default to be "CHAR"
as this is more readable. (Frederic B.)
* debugprint new param stop_on_name=[False, True]. If True, we don't print
anything below an intermediate variable that has a name. Defaults to False.
(Frederic B.)
* debugprint no longer prints the "|" symbol in a column after the last input. (Frederic B.)
* If you use Enthought Python Distribution (EPD) now we use its blas
implementation by default (tested on Linux and Windows)
(Frederic B., Simon McGregor)
* MRG random now raises an error with a clear message when the passed shape
contains a dimension with a bad value, such as 0. (Frédéric B., reported by Ian G.)
* "CudaNdarray[*] = ndarray" works in more cases (Frederic B.)
* "CudaNdarray[*] += ndarray" works in more cases (Frederic B.)
* We add dimensions to CudaNdarray to automatically broadcast more frequently.
(Frederic B.)
* theano.tensor.argsort that wraps numpy.argsort (Hani Almousli).
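For reference, the NumPy behavior that the new op wraps:

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0])
order = np.argsort(x)   # indices that would sort x
sorted_x = x[order]     # applying them yields the sorted array
```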
* New theano flag cmodule.warn_no_version. Default False. If True,
will print a warning when compiling one or more Op with C code that
can't be cached because there is no c_code_cache_version() function
associated to at least one of those Ops. (Frederic B.)
* CPU alloc now always generates C code (Pascal L.)
* New Theano flag cmodule.warn_no_version=False. When True, warn when an Op
with C code is not versioned (which forces recompiling it every time).
(Frédéric B.)
* Made a few Ops with C code versioned to reduce compilation time.
(Frédéric B, Pascal L.)
* C code reuses preallocated outputs (only done by Scan) (Pascal L.)
* Garbage collection of intermediate results during Theano function calls
for Ops with C code (Pascal L.)
* The Theano flag compiledir_format now supports the parameter numpy_version.
* Theano GPU variables, shared variables and constants now support <, <=,
> and >=, like those not on the GPU.
Sparse
* Implement theano.sparse.mul(sparse1, sparse2) when both inputs don't
have the same sparsity pattern. (Frederic B.)
Sparse Sandbox graduate
* Remove0 op: it removes stored elements with value 0. (Frederic B.)
Sparse Sandbox Additions (not reviewed/documented/tested, but used by some people)
* They are all in the theano.sparse.sandbox.sp2 module
* Op class: Cast, Poisson, Multinomial, EliminateZeros, Sum, Binomial
* Op class: SamplingDot, SamplingDotCsr (inserted automatically)
* Op function: structured_sigmoid, structured_exp, structured_pow, structured_minimum
* Op class: StructuredAddSV, StrucutedAddSVCSR (inserted automatically)
* opt: local_sampling_dot_csr, local_structured_add_s_v
Internal changes
* Define new exceptions MissingInputError and UnusedInputError, and use them
in theano.function, instead of TypeError and ValueError. (Pascal L.)
* Better handling of bitwidth and max values of integers and pointers
across platforms (Pascal L.)
Crash Fix
* Do not try to use the BLAS library when blas.ldflags is manually set to an
empty string (Frederic B.)
* When importing theano on a computer without GPU with the Theano
flags 'device' or 'init_gpu_device' set to gpu* (Frederic B., reported by Luo Heng)
* Optimization printed a useless error when scipy was not available. (Frederic B.)
* GPU conv crash/slowdown on newer hardware (James B.)
* Better error handling in GPU conv (Frederic B.)
* GPU optimization that moves element-wise Ops to the GPU. Crash happened in
a particular execution order of this optimization and the
element-wise fusion optimization when upcasting some inputs to
float32 (to compute them on the GPU).
(Frederic B., reported by Sander Dieleman)
* GpuReshape in some particular case when the input is not contiguous
(Frederic B., reported by Sander Dieleman)
* GpuSoftmaxWithBias with shape (0, N) with N > 1.
(Frédéric B., reported by Razvan P.)
* Fix crash under 64-bit Windows, when taking subtensors of the form a[n:]
(Pascal L., reported by Simon McGregor)
* Fixed issue with the MaxAndArgmax Op not properly preserving broadcastable
dimensions, which could typically result in optimization crashes (Olivier D.)
* Fixed crash when concatenating some arrays with specific broadcasting
patterns (Olivier D.)
* Work around a known issue with nvcc 4.1 on MacOS X. (Graham Taylon)
* In advanced indexing, if some inputs are constant, no need to call constant(...)
on their value any more. (Pascal L., reported by John Salvatier)
* Fix crash on GPU when GpuSubtensor didn't set the right stride
when the result tensor had a dimension of size 1. (Pascal L.,
reported by Graham T.)
https://github.com/Theano/Theano/wiki/Devnews
=============
Release Notes
...
...
bin/theano-nose — View file @ 72a7214a
@@ -26,6 +26,9 @@
 with the option time_profile=True to conduct time-profiling of the tests.
 option will be interpreted as an indication of the number of tests to be run
 between notifications of progress to standard output.
+If the '--theano' option is used, it is replaced with the path to theano.
+Useful if you don't know where it was installed.
+`run_tests_in_batch.py` will in turn call back this script in another process.
 """
...
...
@@ -39,6 +42,12 @@
 import sys
 from nose.plugins import Plugin


 def main():
+    # Handle the --theano arguments
+    if "--theano" in sys.argv:
+        i = sys.argv.index("--theano")
+        import theano
+        sys.argv[i] = theano.__path__[0]
+
     # Handle --batch[=n] arguments
     batch_args = [arg for arg in sys.argv if arg.startswith('--batch')]
     for arg in batch_args:
...
...
@@ -137,6 +146,11 @@ def help():
     --without-knownfailure: Do not load the KnownFailure plugin.
+    --theano: This parameter is replaced with the path to the theano library.
+              As theano-nose is a wrapper to nosetests, it expects a path
+              to the tests to run.
+              If you don't know where theano is installed, use this option
+              to have it inserted automatically.

     The other options will be passed to nosetests, see ``nosetests -h``.
     """
...
...
theano/gof/compiledir.py — View file @ 72a7214a
@@ -37,7 +37,7 @@
 compiledir_format_dict = {"platform": platform.platform(),
                           "python_version": platform.python_version(),
                           "theano_version": theano.__version__,
                           "numpy_version": numpy.__version__,
-                          "g++": gcc_version_str.replace(" ", "_"),
+                          "gxx_version": gcc_version_str.replace(" ", "_"),
                           }
 compiledir_format_keys = ", ".join(compiledir_format_dict.keys())
 default_compiledir_format = \
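The compiledir_format mechanism above is plain %-style string interpolation over that dict. A self-contained sketch (the version strings below are made-up stand-ins for the real gcc/numpy values):

```python
import platform

# Hypothetical values standing in for the real detected versions
compiledir_format_dict = {
    "platform": platform.platform(),
    "python_version": platform.python_version(),
    "numpy_version": "1.6.2",
    "gxx_version": "gcc 4.6.3".replace(" ", "_"),  # spaces are not path-safe
}

fmt = "compiledir_%(platform)s-%(python_version)s-%(numpy_version)s"
compiledir_name = fmt % compiledir_format_dict
```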
theano/sandbox/cuda/basic_ops.py — View file @ 72a7214a
@@ -11,7 +11,7 @@
 from theano import tensor, scalar, config
 from theano.gof.python25 import all, any
-from theano.sandbox.cuda import GpuOp
+from theano.sandbox.cuda import GpuOp, device_properties
 from theano.sandbox.cuda.type import CudaNdarrayType
 from theano.sandbox.cuda import filter as type_support_filter

@@ -641,7 +641,9 @@ class GpuSum(GpuOp):
                 printf("running kernel_reduce_sum_%(pattern)s_%(name)s\\n");
             int n_shared = sizeof(float) * n_threads.x * n_threads.y * n_threads.z;
             if (verbose>1)
-                printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d, nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d, nb_block=%%d, n_shared=%%d\\n",
+                printf("n_threads.x=%%d, n_threads.y=%%d, n_threads.z=%%d,"
+                       " nb_threads=%%d, n_blocks.x=%%d, n_blocks.y=%%d,"
+                       " nb_block=%%d, n_shared=%%d\\n",
                        n_threads.x,n_threads.y,n_threads.z,
                        n_threads.x*n_threads.y*n_threads.z,
                        n_blocks.x,n_blocks.y,
...
...
@@ -673,7 +675,8 @@ class GpuSum(GpuOp):
             if (cudaSuccess != sts)
             {
                 PyErr_Format(PyExc_RuntimeError,
-                    "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_%(pattern)s_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,

@@ -876,7 +879,8 @@ class GpuSum(GpuOp):
                          std::min(CudaNdarray_SIZE(%(x)s),
                                   NUM_VECTOR_OP_THREADS_PER_BLOCK));
             dim3 n_blocks(1);
-            if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s n_threads.x=%%d, size=%%d, ndim=%%d\\n",
+            if (verbose) printf("running kernel_reduce_sum_ccontig_%(name)s"
+                                " n_threads.x=%%d, size=%%d, ndim=%%d\\n",
                                 n_threads.x,CudaNdarray_SIZE(%(x)s),%(x)s->nd);
             int n_shared = sizeof(float) * n_threads.x;
             kernel_reduce_sum_ccontig_%(name)s<<<n_blocks, n_threads, n_shared>>>(

@@ -887,7 +891,9 @@ class GpuSum(GpuOp):
             cudaError_t sts = cudaGetLastError();
             if (cudaSuccess != sts)
             {
-                PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                PyErr_Format(PyExc_RuntimeError,
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_ccontig_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,
...
...
@@ -937,11 +943,13 @@ class GpuSum(GpuOp):
         :param N: the number of 1 in the pattern N=1 -> 01, N=2 -> 011 N=3 ->0111
         Work for N=1,2,3
         """
-        assert N in [1,2,3]
+        assert N in [1, 2, 3]
         makecall = self._makecall(node, name, x, z, fail)
-        N_pattern = ''.join(['1'] * N)
-        param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals() for i in xrange(N + 1)])
-        strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]" % locals() for i in xrange(N + 1)])
+        N_pattern = ''.join(['1'] * N)
+        param_dim = ",".join(["CudaNdarray_HOST_DIMS(%(x)s)[%(i)s]" % locals()
+                              for i in xrange(N + 1)])
+        strides_dim = ",".join(["CudaNdarray_HOST_STRIDES(%(x)s)[%(i)s]"
+                                % locals() for i in xrange(N + 1)])
         threads_y = """
         //get as many y threads as we can fit
         while (n_threads.x * (n_threads.y+1) <= NUM_VECTOR_OP_THREADS_PER_BLOCK)
...
...
@@ -962,10 +970,10 @@ class GpuSum(GpuOp):
                 break;
             }
         """ % locals()
-        if len(self.reduce_mask)==2:
+        if len(self.reduce_mask) == 2:
             threads_y = ''
             threads_z = ''
-        if len(self.reduce_mask)==3:
+        if len(self.reduce_mask) == 3:
             threads_z = ''
         print >> sio, """
         {

@@ -975,15 +983,18 @@ class GpuSum(GpuOp):
                 NUM_VECTOR_OP_THREADS_PER_BLOCK));
             %(threads_y)s
             %(threads_z)s
-            dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],NUM_VECTOR_OP_BLOCKS));
+            dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
+                                   NUM_VECTOR_OP_BLOCKS));
             %(makecall)s
         }
         """ % locals()

     def c_code_reduce_01(self, sio, node, name, x, z, fail):
         self.c_code_reduce_01X(sio, node, name, x, z, fail, 1)

     def c_code_reduce_011(self, sio, node, name, x, z, fail):
         self.c_code_reduce_01X(sio, node, name, x, z, fail, 2)

     def c_code_reduce_0111(self, sio, node, name, x, z, fail):
         self.c_code_reduce_01X(sio, node, name, x, z, fail, 3)
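c_code_reduce_01X covers the patterns 01, 011 and 0111: every axis marked 1 in reduce_mask is summed away. In NumPy terms the semantics can be sketched as (an analogy for the CUDA kernels above, not their implementation):

```python
import numpy as np

def reduce_by_mask(x, reduce_mask):
    """Sum x over every axis whose reduce_mask entry is 1 (GpuSum semantics)."""
    axes = tuple(i for i, m in enumerate(reduce_mask) if m)
    return x.sum(axis=axes)

x = np.arange(24.0).reshape(2, 3, 4)
out = reduce_by_mask(x, (0, 1, 1))   # pattern 011: only axis 0 survives
```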
...
...
@@ -1021,7 +1032,9 @@ class GpuSum(GpuOp):
             cudaError_t sts = cudaGetLastError();
             if (cudaSuccess != sts)
             {
-                PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                PyErr_Format(PyExc_RuntimeError,
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_010_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,

@@ -1033,9 +1046,11 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()

     def c_code_reduce_010(self, sio, node, name, x, z, fail):
         makecall = self._makecall(node, name, x, z, fail)
-        makecall_inner = self._makecall(node, name, x, z, fail, pattern="010_inner")
+        makecall_inner = self._makecall(node, name, x, z, fail,
+                                        pattern="010_inner")
         pattern = ''.join(str(i) for i in self.reduce_mask)
         print >> sio, """
         {
...
...
@@ -1085,7 +1100,9 @@ class GpuSum(GpuOp):
             cudaError_t sts = cudaGetLastError();
             if (cudaSuccess != sts)
             {
-                PyErr_Format(PyExc_RuntimeError, "Cuda error: %%s: %%s. (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
+                PyErr_Format(PyExc_RuntimeError,
+                    "Cuda error: %%s: %%s."
+                    " (grid: %%i x %%i; block: %%i x %%i x %%i)\\n",
                     "kernel_reduce_sum_010_%(name)s",
                     cudaGetErrorString(sts),
                     n_blocks.x,

@@ -1233,6 +1250,7 @@ class GpuSum(GpuOp):
             %(makecall)s
         }
         """ % locals()

     def c_code_reduce_111(self, sio, node, name, x, z, fail):
         makecall = self._makecall(node, name, x, z, fail)
         print >> sio, """

@@ -1275,7 +1293,8 @@ class GpuSum(GpuOp):
                 std::min(CudaNdarray_HOST_DIMS(%(x)s)[0],
                          NUM_VECTOR_OP_BLOCKS));
-            while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS && n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
+            while (n_blocks.x * n_blocks.y <= NUM_VECTOR_OP_BLOCKS &&
+                   n_blocks.y < CudaNdarray_HOST_DIMS(%(x)s)[1])
             {
                 n_blocks.y += 1;
             }
...
...
@@ -1356,7 +1375,7 @@ class GpuSum(GpuOp):
     def c_support_code_apply(self, node, nodename):
         sio = StringIO.StringIO()
         nd_in = len(self.reduce_mask)
-        if all(i==1 for i in self.reduce_mask):
+        if all(i == 1 for i in self.reduce_mask):
             #this kernel is ok for up to a few thousand elements, but
             # it only runs on ONE multiprocessor
             reducebuf = self._k_reduce_buf('Z[0]')

@@ -1411,7 +1430,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (1,1):
+        if self.reduce_mask == (1, 1):
             #this kernel is ok for up to a few thousand elements, but
             # it only runs on ONE multiprocessor
             reducebuf = self._k_reduce_buf('Z[0]')
...
...
@@ -1444,29 +1463,33 @@ class GpuSum(GpuOp):
         }
         """ % locals()
         #01, 011, 0111
-        if 0 == self.reduce_mask[0] and all(self.reduce_mask[1:]) and nd_in in [2, 3, 4]:
+        if (0 == self.reduce_mask[0] and
+            all(self.reduce_mask[1:]) and
+            nd_in in [2, 3, 4]):
             # this kernel uses one block for each row.
             # threads per block for each element per row.
-            N_pattern = ''.join(['1'] * (nd_in - 1))
-            if nd_in == 2:
+            N_pattern = ''.join(['1'] * (nd_in - 1))
+            if nd_in == 2:
                 for_i1 = "for (int i1 = threadIdx.x; i1 < d1; i1 += blockDim.x)"
-                for_i2 = "int i2=0, sA2=0;"
-                for_i3 = "int i3=0, sA3=0;"
-            if nd_in == 3:
+                for_i2 = "int i2=0, sA2=0;"
+                for_i3 = "int i3=0, sA3=0;"
+            if nd_in == 3:
                 for_i1 = "for (int i1 = threadIdx.y; i1 < d1; i1 += blockDim.y)"
                 for_i2 = "for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x)"
-                for_i3 = "int i3=0, sA3=0;"
-            if nd_in == 4:
+                for_i3 = "int i3=0, sA3=0;"
+            if nd_in == 4:
                 for_i1 = "for (int i1 = threadIdx.z; i1 < d1; i1 += blockDim.z)"
                 for_i2 = "for (int i2 = threadIdx.y; i2 < d2; i2 += blockDim.y)"
                 for_i3 = "for (int i3 = threadIdx.x; i3 < d3; i3 += blockDim.x)"

             reducebuf = self._k_reduce_buf('Z[i0 * sZ0]')
-            param_dim = ",".join(["const int d%(i)s" % locals() for i in xrange(nd_in)])
-            param_strides = ",".join(["const int sA%(i)s" % locals() for i in xrange(nd_in)])
-            decl = self._k_decl(node, nodename)
-            init = self._k_init(node, nodename)
+            param_dim = ",".join(["const int d%(i)s" % locals()
+                                  for i in xrange(nd_in)])
+            param_strides = ",".join(["const int sA%(i)s" % locals()
+                                      for i in xrange(nd_in)])
+            decl = self._k_decl(node, nodename)
+            init = self._k_init(node, nodename)
             print >> sio, """
             %(decl)s{
                 %(init)s
...
...
@@ -1484,7 +1507,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (0,1,0) or self.reduce_mask == (1,0):
+        if self.reduce_mask == (0, 1, 0) or self.reduce_mask == (1, 0):
             # this kernel uses one block for each column,
             # threads per block for each element per column.

@@ -1497,7 +1520,8 @@ class GpuSum(GpuOp):
                 const int d0,
                 const int d1,
                 const int d2,
-                const float *A, const int sA0, const int sA1, const int sA2,
+                const float *A, const int sA0,
+                const int sA1, const int sA2,
                 float * Z, const int sZ0, const int sZ1)
             {
                 const int threadCount = blockDim.x;

@@ -1525,7 +1549,7 @@ class GpuSum(GpuOp):
             }
         """ % locals()
-        if self.reduce_mask == (0,1,0):
+        if self.reduce_mask == (0, 1, 0):
             print >> sio, """
             static __global__ void kernel_reduce_sum_010_AD_%(nodename)s(
                 const int A,

@@ -1533,7 +1557,8 @@ class GpuSum(GpuOp):
                 const int C,
                 const int D,
                 //const int E, // THIS is 32
-                const float *X, const int sX0, const int sX1, const int sX2,
+                const float *X, const int sX0,
+                const int sX1, const int sX2,
                 float * Z, const int sZ0, const int sZ1)
             {
                 const int threadCount = blockDim.x;

@@ -1564,9 +1589,10 @@ class GpuSum(GpuOp):
             }
         """ % locals()
-        if self.reduce_mask == (0,1,0):
+        if self.reduce_mask == (0, 1, 0):
             #
-            # This kernel is optimized when the inner most dimensions have the smallest stride.
+            # This kernel is optimized when the inner most dimensions
+            # have the smallest stride.
             # this kernel uses one block for multiple column(up to 32TODO),
             # threads per block for each element per column.
...
...
@@ -1575,10 +1601,12 @@ class GpuSum(GpuOp):
             #thread.y = dim 1
             #block.x = dim 0
             #block.y = dim 1 rest
-            init = self._k_init(node, nodename)
+            init = self._k_init(node, nodename)
             decl = self._k_decl(node, nodename, pattern="010_inner")
-            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]', 'blockDim.x')
-            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]', 'blockDim.x')
+            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
+                                                    'blockDim.x')
+            reducebuf = self._k_reduce_buf_multiple('Z[i0 * sZ0 + i2*sZ1]',
+                                                    'blockDim.x')
             print >> sio, """
             %(decl)s
             {

@@ -1602,7 +1630,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (1,1,0):
+        if self.reduce_mask == (1, 1, 0):
             # this kernel uses one block for each column,
             # threads per block for each element per column.
...
...
@@ -1615,7 +1643,8 @@ class GpuSum(GpuOp):
                 const int d0,
                 const int d1,
                 const int d2,
-                const float *A, const int sA0, const int sA1, const int sA2,
+                const float *A, const int sA0,
+                const int sA1, const int sA2,
                 float * Z, const int sZ0)
             {
                 const int threadCount = blockDim.x * blockDim.y;

@@ -1642,7 +1671,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (1,0,0):
+        if self.reduce_mask == (1, 0, 0):
             reducebuf = self._k_reduce_buf('Z[i1 * sZ0 + i2 * sZ1]')
             decl = self._k_decl(node, nodename)
             init = self._k_init(node, nodename)

@@ -1664,7 +1693,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (1,1,1):
+        if self.reduce_mask == (1, 1, 1):
             reducebuf = self._k_reduce_buf('Z[0]')
             decl = self._k_decl(node, nodename)
             init = self._k_init(node, nodename)
...
...
@@ -1686,7 +1715,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (0,0,1):
+        if self.reduce_mask == (0, 0, 1):
             # this kernel uses one block for each row,
             # threads per block for each element per row.
             reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')

@@ -1695,7 +1724,8 @@ class GpuSum(GpuOp):
                 const int d0,
                 const int d1,
                 const int d2,
-                const float *A, const int sA0, const int sA1, const int sA2,
+                const float *A, const int sA0,
+                const int sA1, const int sA2,
                 float * Z, const int sZ0, const int sZ1)
             {
                 const int threadCount = blockDim.x;

@@ -1721,7 +1751,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (0,0,1,1):
+        if self.reduce_mask == (0, 0, 1, 1):
             # this kernel uses one block for each row,
             # threads per block for each element per row.
             reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i1 * sZ1]')
...
...
@@ -1749,7 +1779,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (0,1,0,1):
+        if self.reduce_mask == (0, 1, 0, 1):
             # this kernel uses one block for each row,
             # threads per block for each element per row.
             reducebuf = self._k_reduce_buf('Z[i0 * sZ0 + i2 * sZ1]')

@@ -1777,7 +1807,7 @@ class GpuSum(GpuOp):
             }
         }
         """ % locals()
-        if self.reduce_mask == (1,1,1,1):
+        if self.reduce_mask == (1, 1, 1, 1):
             reducebuf = self._k_reduce_buf('Z[0]')
             decl = self._k_decl(node, nodename)
             init = self._k_init(node, nodename)

@@ -1800,7 +1830,7 @@ class GpuSum(GpuOp):
             %(reducebuf)s
         }
         """ % locals()
-        if self.reduce_mask == (1,0,1,1):
+        if self.reduce_mask == (1, 0, 1, 1):
             reducebuf = self._k_reduce_buf('Z[blockIdx.x*sZ0]')
             print >> sio, """
             static __global__ void kernel_reduce_sum_1011_%(nodename)s(
...
...
@@ -1808,7 +1838,8 @@ class GpuSum(GpuOp):
                 const unsigned int d1,
                 const unsigned int d2,
                 const unsigned int d3,
-                const float *A, const int sA0, const int sA1, const int sA2, const int sA3,
+                const float *A, const int sA0, const int sA1,
+                const int sA2, const int sA3,
                 float * Z, const int sZ0)
             {
                 const int threadCount = blockDim.x * blockDim.y * blockDim.z;

@@ -1867,7 +1898,7 @@ class GpuSubtensor(tensor.Subtensor, GpuOp):
         assert isinstance(x.type, CudaNdarrayType)
         rval = tensor.Subtensor.make_node(self, x, *inputs)
         otype = CudaNdarrayType(rval.outputs[0].type.broadcastable)
-        return Apply(self, [x] + rval.inputs[1:], [otype()])
+        return Apply(self, [x] + rval.inputs[1:], [otype()])

     def perform(self, node, inputs, out_):
         out, = out_
...
...
@@ -1907,6 +1938,7 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
     #If True or False, we assert that we use the take version or not
     #If None, we choose the best one applicable
     perform_using_take = None
+    max_threads = 0

     def make_node(self, x, ilist):
         x_ = as_cuda_ndarray_variable(x)

@@ -1946,9 +1978,18 @@ class GpuAdvancedSubtensor1(tensor.AdvancedSubtensor1, GpuOp):
             idx = idx.view("float32")
             idx = cuda_ndarray.cuda_ndarray.CudaNdarray(idx)
+            if self.max_threads == 0:
+                num = theano.sandbox.cuda.use.device_number
+                if device_properties(num)['regsPerBlock'] < (8192 * 2):
+                    self.max_threads = 256
+                else:
+                    self.max_threads = 512
             o = x.take(idx,
                        0,  # axis
-                       out_[0][0])  # return
+                       out_[0][0],  # return
+                       "raise",
+                       self.max_threads)
         if x is not x_orig:
             o = o.reshape(out_shape)
         out[0] = o
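The take-based path above mirrors numpy.take along axis 0; a CPU sketch of the same semantics (not the GPU code):

```python
import numpy as np

x = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
idx = np.array([2, 0, 0])

# Gather rows along axis 0; mode="raise" errors out on out-of-bounds
# indices, the behaviour requested by the GPU code above
rows = x.take(idx, axis=0, mode="raise")
```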
...
...
@@ -2033,14 +2074,14 @@ class GpuIncSubtensor(tensor.IncSubtensor, GpuOp):
         assert isinstance(x.type, CudaNdarrayType)
         assert isinstance(y.type, CudaNdarrayType)
         rval = tensor.IncSubtensor.make_node(self, x, y, *inputs)
-        return Apply(self, [x, y] + rval.inputs[2:], [x.type()])
+        return Apply(self, [x, y] + rval.inputs[2:], [x.type()])


 class GpuFlatten(tensor.Flatten, GpuOp):
     """
     Implement Flatten on the gpu.
     """
-    def make_node(self, x):
+    def make_node(self, x):
         assert isinstance(x.type, CudaNdarrayType)
         rval = tensor.Flatten.make_node(self, x)
         host_out_broadcastable = rval.outputs[0].type.broadcastable

@@ -2096,10 +2137,12 @@ class GpuJoin(tensor.Join, GpuOp):
             # dimension in "axis" can be different, so make equal for ==
             tmp_shape[axis] = template_shape[axis]
             if tuple(tmp_shape) != template_shape:
-                raise ValueError, "Shape of input CudaNdarrays must agree except for the 'axis' dimension"
+                raise ValueError("Shape of input CudaNdarrays must"
+                                 " agree except for the 'axis' dimension")
         if len(template_shape) != node.outputs[0].type.ndim:
-            raise ValueError, "Number of dimension of input tensors disagree with dimensions passed at graph creation time."
+            raise ValueError("Number of dimension of input tensors disagree"
+                             " with dimensions passed at graph creation time.")
         # final shape must be the same as all input tensors
         # except for the "axis" dimension, so we can simply
...
...
@@ -2110,7 +2153,8 @@ class GpuJoin(tensor.Join, GpuOp):
         # just to be explicit, check that dim=1 for broadcastable
         # dimensions
         for i, bcastable in enumerate(node.outputs[0].type.broadcastable):
-            assert not bcastable or final_shape[i] == 1, "Broadcastable dimension but dim != 1, this is invalid"
+            assert not bcastable or final_shape[i] == 1, (
+                "Broadcastable dimension but dim != 1, this is invalid")
         rval = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(final_shape)

@@ -2120,9 +2164,9 @@ class GpuJoin(tensor.Join, GpuOp):
         # except for 'axis'
         def construct_slices(curlen):
-            slices = [slice(None, None, None) for i in \
+            slices = [slice(None, None, None) for i in \
                       range(len(template_shape))]
-            slices[axis] = slice(curpos, curpos + curlen, None)
+            slices[axis] = slice(curpos, curpos + curlen, None)
             return tuple(slices)

         for i, cnda in enumerate(cndas):
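construct_slices builds, for each input, the slice of the output it occupies along `axis`. A pure-NumPy sketch of the same allocate-then-slice-assign join loop (the `join` helper is illustrative, not the Theano implementation):

```python
import numpy as np

def join(axis, *arrays):
    """Concatenate by allocating the output and slice-assigning each input."""
    final_shape = list(arrays[0].shape)
    final_shape[axis] = sum(a.shape[axis] for a in arrays)
    out = np.zeros(final_shape, dtype=arrays[0].dtype)
    curpos = 0
    for a in arrays:
        curlen = a.shape[axis]
        slices = [slice(None)] * a.ndim          # full slice on every axis...
        slices[axis] = slice(curpos, curpos + curlen)  # ...except 'axis'
        out[tuple(slices)] = a
        curpos += curlen
    return out

a = np.ones((2, 2))
b = np.zeros((2, 3))
joined = join(1, a, b)
```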
...
...
@@ -2157,7 +2201,9 @@ class GpuAlloc(GpuOp):
         v = as_cuda_ndarray_variable(value)
         sh = [tensor.as_tensor_variable(s) for s in shape]
         if v.ndim != len(shape):
-            raise TypeError('GpuAlloc requires value of same dimensions as shape', value, len(shape))
+            raise TypeError('GpuAlloc requires value of same dimensions as '
+                            'shape', value,
+                            len(shape))
         bcast = []
         for s in sh:

@@ -2170,7 +2216,7 @@ class GpuAlloc(GpuOp):
                 const_shp = None
             bcast.append(numpy.all(1 == const_shp))
         otype = CudaNdarrayType(dtype='float32', broadcastable=bcast)
-        return Apply(self, [v] + sh, [otype()])
+        return Apply(self, [v] + sh, [otype()])

     def perform(self, node, inputs, out_):
         out, = out_
...
...
@@ -2178,7 +2224,7 @@ class GpuAlloc(GpuOp):
         sh = tuple([int(i) for i in inputs[1:]])
         if out[0] is None or out[0].shape != sh:
             out[0] = cuda_ndarray.cuda_ndarray.CudaNdarray.zeros(sh)
-        out[0][...] = v  # broadcast v to fill us up
+        out[0][...] = v  # broadcast v to fill us up

     def c_code(self, node, name, inputs, out_, sub):
         out, = out_
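The perform path above does three things: reuse the output buffer when its shape already matches, otherwise allocate zeros, then broadcast-fill. A NumPy sketch of the same logic (`alloc_perform` is an illustrative stand-in, not the Op's code):

```python
import numpy as np

def alloc_perform(v, shape, out_buffer=None):
    """Allocate (or reuse) an output of `shape` and broadcast v into it."""
    if out_buffer is None or out_buffer.shape != shape:
        out_buffer = np.zeros(shape, dtype="float32")
    out_buffer[...] = v   # broadcast v to fill us up
    return out_buffer

out = alloc_perform(np.float32(7.0), (2, 3))
reused = alloc_perform(np.float32(1.0), (2, 3), out)  # same buffer reused
```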
...
...
@@ -2186,12 +2232,12 @@ class GpuAlloc(GpuOp):
         value = inputs[0]
         shps = inputs[1:]
         nd = len(shps)
-        str = "int dims[%(nd)s];\n" % locals()
-        for idx, sh in enumerate(shps):
+        str = "int dims[%(nd)s];\n" % locals()
+        for idx, sh in enumerate(shps):
             str += "dims[%(idx)s] = PyInt_AsLong((PyObject*)%(sh)s);\n" % locals()
         str += "if(%(out)s==NULL\n" % locals()
-        for idx, sh in enumerate(shps):
+        for idx, sh in enumerate(shps):
             str += "||CudaNdarray_HOST_DIMS(%(out)s)[%(idx)s]!=dims[%(idx)s]" % locals()
         str += """){
             Py_XDECREF(%(out)s);
...
...
@@ -2350,10 +2396,9 @@ def tensordot(a, b, axes=2):
                 "Axes should be scalar valued or a list/tuple of len 2.",
                 axes)

 # Those are predifined CudaNdarrayType as done in tensor.basic
 # Useful mostly for test as the gpu op are inserted automatically...
-fscalar = CudaNdarrayType(dtype='float32', broadcastable=())

 def scalar(name=None, dtype=None):
     """Return a symbolic scalar variable.
     :param dtype: numeric type (None means to use theano.config.floatX)

@@ -2363,8 +2408,9 @@ def scalar(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=())
     return type(name)
+fscalar = CudaNdarrayType(dtype='float32', broadcastable=())

-fvector = CudaNdarrayType(dtype='float32', broadcastable=(False,))

 def vector(name=None, dtype=None):
     """Return a symbolic vector variable.
     :param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2374,8 +2420,9 @@ def vector(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(False,))
     return type(name)
+fvector = CudaNdarrayType(dtype='float32', broadcastable=(False,))

-fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))

 def matrix(name=None, dtype=None):
     """Return a symbolic matrix variable.
     :param dtype: numeric type (None means to use theano.config.floatX)

@@ -2385,8 +2432,9 @@ def matrix(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(False, False))
     return type(name)
+fmatrix = CudaNdarrayType(dtype='float32', broadcastable=(False, False))

-frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))

 def row(name=None, dtype=None):
     """Return a symbolic row variable (ndim=2, broadcastable=[True,False]).
     :param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2396,8 +2444,9 @@ def row(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(True, False))
     return type(name)
+frow = CudaNdarrayType(dtype='float32', broadcastable=(True, False))

-fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))

 def col(name=None, dtype=None):
     """Return a symbolic column variable (ndim=2, broadcastable=[False,True]).
     :param dtype: numeric type (None means to use theano.config.floatX)

@@ -2407,8 +2456,9 @@ def col(name=None, dtype=None):
         dtype = config.floatX
     type = CudaNdarrayType(dtype=dtype, broadcastable=(False, True))
     return type(name)
+fcol = CudaNdarrayType(dtype='float32', broadcastable=(False, True))

-ftensor3 = CudaNdarrayType(dtype='float32', broadcastable=(False,) * 3)

 def tensor3(name=None, dtype=None):
     """Return a symbolic 3-D variable.
     :param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2418,8 +2468,9 @@ def tensor3(name=None, dtype=None):
dtype
=
config
.
floatX
type
=
CudaNdarrayType
(
dtype
=
dtype
,
broadcastable
=
(
False
,
False
,
False
))
return
type
(
name
)
ftensor3
=
CudaNdarrayType
(
dtype
=
'float32'
,
broadcastable
=
(
False
,)
*
3
)
ftensor4
=
CudaNdarrayType
(
dtype
=
'float32'
,
broadcastable
=
(
False
,)
*
4
)
def
tensor4
(
name
=
None
,
dtype
=
None
):
"""Return a symbolic 4-D variable.
:param dtype: numeric type (None means to use theano.config.floatX)
...
...
@@ -2430,6 +2481,7 @@ def tensor4(name=None, dtype=None):
type
=
CudaNdarrayType
(
dtype
=
dtype
,
broadcastable
=
(
False
,
False
,
False
,
False
))
return
type
(
name
)
ftensor4
=
CudaNdarrayType
(
dtype
=
'float32'
,
broadcastable
=
(
False
,)
*
4
)
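The factory functions added above all follow one pattern: resolve a default dtype, build a `CudaNdarrayType` with the right `broadcastable` pattern, and call the type to make a named variable, with a predefined float32 alias alongside each factory. Below is a minimal pure-Python sketch of that pattern; `FakeCudaNdarrayType` is a hypothetical stand-in (not Theano code) so the sketch runs without a GPU.

```python
# Sketch of the factory pattern used in the diff above. FakeCudaNdarrayType
# is a made-up stand-in for CudaNdarrayType: one broadcastable flag per
# dimension, and calling the type instance builds a (type, name) "variable".
class FakeCudaNdarrayType(object):
    def __init__(self, dtype='float32', broadcastable=()):
        self.dtype = dtype
        self.broadcastable = tuple(broadcastable)
        self.ndim = len(self.broadcastable)

    def __call__(self, name=None):
        # In Theano, calling a type builds a Variable; here we just record it.
        return (self, name)


def matrix(name=None, dtype=None):
    """Return a symbolic matrix variable (ndim=2, no broadcastable dims)."""
    if dtype is None:
        dtype = 'float32'  # stand-in for theano.config.floatX
    typ = FakeCudaNdarrayType(dtype=dtype, broadcastable=(False, False))
    return typ(name)


# Predefined alias, as the diff does for fscalar, fvector, fmatrix, ...
fmatrix = FakeCudaNdarrayType(dtype='float32', broadcastable=(False, False))

typ, name = matrix('W')
print(typ.ndim, name)  # 2 W
```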
@theano.compile.profilemode.register_profiler_printer
...
...
@@ -2446,22 +2498,24 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
     gpu = 0
     trans = 0
     for (_, node), t in apply_time.items():
         if isinstance(node.op.__class__.__name__, (HostFromGpu, GpuFromHost)):
             trans += t
         elif node.op.__class__.__name__.lower().startswith("gpu"):
             gpu += t
         else:
             cpu += t
     print
     print "    Spent %.3fs(%.3f%%) in cpu Op, %.3fs(%.3f%%) in gpu Op and %.3fs(%.3f%%) transfert Op" % (
         cpu, cpu / local_time * 100, gpu, gpu / local_time * 100,
         trans, trans / local_time * 100)
     print
     print "    Theano function input that are float64"
     print "    <fct name> <input name> <input type> <str input>"
     for fct in fct_call.keys():
         for i in fct.input_storage:
             if hasattr(i.type, 'dtype') and i.type.dtype == 'float64':
                 print '        ', fct.name, i.name, i.type, i
     print
...
...
@@ -2470,5 +2524,13 @@ def profile_printer(fct_name, compile_time, fct_call_time, fct_call,
     print '    <Apply> <Apply position> <fct name> <inputs type> <outputs type>'
     for fct in fct_call.keys():
         for idx, node in enumerate(fct.maker.fgraph.toposort()):
-            if any(hasattr(i, 'dtype') and i.dtype == 'float64'
-                   for i in node.outputs) and not any(hasattr(i, 'dtype') and i.dtype == 'float64' for i in node.inputs):
-                print '        ', str(node), idx, fct.name, str([getattr(i, 'dtype', None) for i in node.inputs]), str([getattr(i, 'dtype', None) for i in node.outputs])
+            if (any(hasattr(i, 'dtype') and i.dtype == 'float64'
+                    for i in node.outputs) and
+                not any(hasattr(i, 'dtype') and i.dtype == 'float64'
+                        for i in node.inputs)):
+                print '        ', str(node), idx, fct.name,
+                print str([getattr(i, 'dtype', None) for i in node.inputs]),
+                print str([getattr(i, 'dtype', None) for i in node.outputs])
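The profiler loop above buckets each Apply node's runtime into cpu, gpu, or transfer totals by looking at the op's class name. Here is a hedged sketch of that bookkeeping; the real code iterates a `{(fgraph, node): time}` dict and inspects `node.op`, while this toy version just maps made-up op class names to seconds.

```python
# Simplified sketch of profile_printer's time bucketing: transfer ops
# (HostFromGpu / GpuFromHost) are counted separately, anything whose class
# name starts with "gpu" ran on the device, the rest ran on the host.
def classify(apply_time):
    cpu = gpu = trans = 0.0
    for op_name, t in apply_time.items():
        if op_name in ('HostFromGpu', 'GpuFromHost'):
            trans += t   # host <-> device transfer ops
        elif op_name.lower().startswith('gpu'):
            gpu += t     # ops that ran on the device
        else:
            cpu += t     # everything else ran on the host
    return cpu, gpu, trans

times = {'Elemwise': 1.0, 'GpuDot22': 2.0, 'HostFromGpu': 0.5}
print(classify(times))  # (1.0, 2.0, 0.5)
```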
theano/sandbox/cuda/cuda_ndarray.cu
...
...
@@ -758,8 +758,10 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     PyObject * axis_obj = Py_None;
     PyObject * out_obj = Py_None;
     PyObject * clipmode_obj = NULL;
-    if (! PyArg_ParseTuple(args, "O|OOO", &indices_obj, &axis_obj,
-                           &out_obj, &clipmode_obj))
+    int max_threads = 1; // max threads per blocks
+    if (! PyArg_ParseTuple(args, "O|OOOi", &indices_obj, &axis_obj,
+                           &out_obj, &clipmode_obj, &max_threads))
         return NULL;

     //Check argument indices
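The signature change above extends the `PyArg_ParseTuple` format from `"O|OOO"` to `"O|OOOi"`: one required object plus four optionals, the new last one an int, so existing callers keep working while new callers can pass `max_threads`. A hypothetical Python rendering of the same calling convention:

```python
# Python analogue of the extended C signature: everything after the "|" in
# "O|OOOi" is optional, and max_threads defaults to 1 when omitted.
def take_from(indices, axis=None, out=None, clipmode=None, max_threads=1):
    return (indices, axis, out, clipmode, max_threads)

print(take_from([0, 2]))                             # ([0, 2], None, None, None, 1)
print(take_from([0, 2], 0, None, 'raise', 256)[4])   # 256
```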
...
...
@@ -839,14 +841,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     PyObject * axis_iobj = PyNumber_Long(axis_obj);
     if (!axis_iobj) {
         PyErr_SetString(PyExc_NotImplementedError,
                         "CudaNdarray_TakeFrom: axis must be convertable to a long");
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         return NULL;
     }
     long axis = PyInt_AsLong(axis_iobj);
     Py_DECREF(axis_iobj); axis_iobj = NULL;
     if (axis != 0) {
         PyErr_SetString(PyExc_NotImplementedError,
                         "CudaNdarray_TakeFrom: only axis=0 is currently supported");
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         return NULL;
     }
...
...
@@ -869,13 +871,13 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     if (!out)
     {
         out = (CudaNdarray*) CudaNdarray_New();
         if (!out){
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             free(dims);
             return NULL;
         }
         if (CudaNdarray_alloc_contiguous(out, self->nd, dims))
         {
             Py_DECREF(out);
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             free(dims);
             return NULL;
         }
...
...
@@ -887,19 +889,20 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     if (clipmode_obj)
     {
         char * clipmode = PyString_AsString(clipmode_obj);
         if (!clipmode){
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
         }
         if (strcmp(clipmode, "raise") != 0)
         {
-            PyErr_SetString(PyExc_NotImplementedError,
-                            "CudaNdarray_TakeFrom: only the raise mode is currently supported");
-            Py_DECREF(indices_obj);
+            PyErr_Format(PyExc_NotImplementedError,
+                         "CudaNdarray_TakeFrom: only the raise mode is currently supported. Got '%s'",
+                         clipmode);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
         }
         Py_DECREF(clipmode_obj);
     }
     void (*k3)(const int, const int, const int,
                const npy_int64*,
...
...
@@ -913,7 +916,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     if (err_var == NULL)
     {
         err_var = (int*) device_malloc(sizeof(int));
         if (!err_var)
         {
             // PyErr set by device_malloc
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
...
...
@@ -928,7 +931,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
             PyErr_Format(PyExc_RuntimeError,
                          "Error setting device error code to 0. %s",
                          cudaGetErrorString(err));
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             free(dims);
             return NULL;
...
...
@@ -936,13 +939,16 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
     }
     dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(out)[0], 65535), 1, 1);
     switch (self->nd) {
         case 1:
             {
                 dim3 n_threads(1, 1, 1);
                 if (verbose)
-                    printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
+                    printf("cudaGetLastError=%d, nd=%d"
+                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                            " n_threads.x=%i, n_threads.y=%i)\n",
+                           self->nd, cudaGetLastError(),
                            n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                 k3<<<n_blocks, n_threads>>>(
                     dims[0],
...
...
@@ -963,11 +969,15 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
             break;
         case 2:
             {
-                dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], 512), 1, 1);
+                dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads), 1, 1);
                 if (verbose)
-                    printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
+                    printf("cudaGetLastError=%d, nd=%d"
+                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                            " n_threads.x=%i, n_threads.y=%i)\n",
+                           cudaGetLastError(), self->nd,
                            n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                 k3<<<n_blocks, n_threads>>>(
                     dims[0], //dimensions
                     dims[1],
...
@@ -987,12 +997,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
             break;
         case 3:
             {
-                int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], 512);
-                int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], 512 / ty);
+                int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], max_threads);
+                int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads / ty);
                 dim3 n_threads(tx, ty, 1);
                 if (verbose)
-                    printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
+                    printf("cudaGetLastError=%d, nd=%d"
+                           " kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
                            " n_threads.x=%i, n_threads.y=%i)\n",
+                           self->nd, cudaGetLastError(),
                            n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
                 k3<<<n_blocks, n_threads>>>(
                     dims[0], //dimensions
...
...
@@ -1025,7 +1037,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
                      "Cuda error: %s: %s.\n",
                      "CudaNdarray_TakeFrom",
                      cudaGetErrorString(err));
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         Py_DECREF(out);
         return NULL;
     }
...
...
@@ -1040,7 +1052,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
                      "Cuda error: %s: %s when trying to get the error value.\n",
                      "CudaNdarray_TakeFrom",
                      cudaGetErrorString(err));
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         Py_DECREF(out);
         return NULL;
     }
...
...
@@ -1055,17 +1067,17 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
         err = cudaMemset((void*) err_var, 0, sizeof(int));
         if (cudaSuccess != err) {
             PyErr_Format(PyExc_MemoryError,
                          "Error setting device error code to 0 after having an index error. %s",
                          cudaGetErrorString(err));
-            Py_DECREF(indices_obj);
+            Py_DECREF(indices);
             Py_DECREF(out);
             return NULL;
         }
-        Py_DECREF(indices_obj);
+        Py_DECREF(indices);
         Py_DECREF(out);
         return NULL;
     }
-    Py_DECREF(indices_obj);
+    Py_DECREF(indices);
     if (verbose) printf("TAKE SUCCEDED\n");
     return (PyObject*) out;
...
...
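The 3-d case above sizes the thread block from the new `max_threads` argument instead of the hardcoded 512: `ty` is capped by the budget, then `tx` by what remains, so `tx * ty` never exceeds `max_threads`. A hedged sketch of that arithmetic, with illustrative dimension values:

```python
# Sketch of the 3-d launch-shape computation in the diff above:
#   ty = min(dims[2], max_threads)
#   tx = min(dims[1], max_threads / ty)
# which keeps the total threads per block within the device limit.
def thread_shape(dims, max_threads):
    ty = min(dims[2], max_threads)
    tx = min(dims[1], max_threads // ty)  # integer division, as in the C code
    return tx, ty

tx, ty = thread_shape((8, 100, 16), 512)
print(tx, ty, tx * ty <= 512)  # 32 16 True
```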
theano/sandbox/cuda/nvcc_compiler.py
...
...
@@ -7,6 +7,7 @@ import subprocess
 import sys
 import warnings

+import theano
 from theano.gof.cc import hash_from_file
 from theano.gof.cmodule import (std_libs, std_lib_dirs, std_include_dirs, dlimport,
...
...
@@ -119,6 +120,16 @@ class NVCC_compiler(object):
         cuda_ndarray_cuh_hash = hash_from_file(
             os.path.join(os.path.split(__file__)[0], 'cuda_ndarray.cuh'))
         flags.append('-DCUDA_NDARRAY_CUH=' + cuda_ndarray_cuh_hash)

+        # We compile cuda_ndarray.cu during import.
+        # We should not add device properties at that time.
+        # As the device is not selected yet!
+        # TODO: compile cuda_ndarray when we bind to a GPU?
+        import theano.sandbox.cuda
+        if hasattr(theano.sandbox, 'cuda'):
+            n = theano.sandbox.cuda.use.device_number
+            p = theano.sandbox.cuda.device_properties(n)
+            flags.append('-arch=sm_' + str(p['major']) + str(p['minor']))

         return flags

     @staticmethod
...
...
@@ -217,7 +228,9 @@ class NVCC_compiler(object):
         # '--gpu-code=compute_13',

         #nvcc argument
         preargs1 = [pa for pa in preargs
-                    if pa.startswith('-O') or pa.startswith('--maxrregcount=')]
+                    if pa.startswith('-O') or pa.startswith('--maxrregcount=')
+                    or pa.startswith('-arch=')]
         preargs2 = [pa for pa in preargs if pa not in preargs1]
         # other arguments
...
...
@@ -337,6 +350,7 @@ class NVCC_compiler(object):
                 pass
             print >> sys.stderr, l
         print nvcc_stdout
+        print cmd
         raise Exception('nvcc return status', p.returncode,
                         'for cmd', ' '.join(cmd))
     elif config.cmodule.compilation_warning and nvcc_stdout:
...
...
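The `preargs1`/`preargs2` change above extends the set of flags kept for nvcc itself (`-O*`, `--maxrregcount=`, and now `-arch=`); everything else ends up in `preargs2`, labeled "other arguments" in the source. A hedged sketch of that split, with made-up flag values:

```python
# Sketch of the flag-splitting list comprehensions from the diff above.
def split_preargs(preargs):
    preargs1 = [pa for pa in preargs
                if pa.startswith('-O') or pa.startswith('--maxrregcount=')
                or pa.startswith('-arch=')]
    # other arguments
    preargs2 = [pa for pa in preargs if pa not in preargs1]
    return preargs1, preargs2

p1, p2 = split_preargs(['-O3', '-arch=sm_20', '-fPIC', '--maxrregcount=32'])
print(p1)  # ['-O3', '-arch=sm_20', '--maxrregcount=32']
print(p2)  # ['-fPIC']
```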
theano/scan_module/tests/test_scan.py
...
...
@@ -410,7 +410,8 @@ class T_Scan(unittest.TestCase):
         for step in xrange(1, 4):
             v_out[step] = v_u[step] * W_in + v_out[step - 1] * W

         theano_values = f2(v_u, v_x0, W_in, W)
-        assert numpy.allclose(theano_values, v_out)
+        assert numpy.allclose(theano_values, v_out), \
+            (theano_values, v_out, theano_values - v_out)

         # TO DEL
         topo = f2.maker.fgraph.toposort()
...
...
@@ -591,8 +592,8 @@ class T_Scan(unittest.TestCase):
             v_y[i] = numpy.dot(v_x[i - 1], vWout)

         (theano_x, theano_y) = f4(v_u1, v_u2, v_x0, v_y0, vW_in1)
-        assert numpy.allclose(theano_x, v_x)
-        assert numpy.allclose(theano_y, v_y)
+        assert numpy.allclose(theano_x, v_x), (theano_x, v_x, theano_x - v_x)
+        assert numpy.allclose(theano_y, v_y), (theano_y, v_y, theano_y - v_y)

     def test_multiple_outs_taps(self):
         l = 5
...
...
@@ -683,14 +684,13 @@ class T_Scan(unittest.TestCase):
         ny1[4] = (ny1[3] + ny1[1]) * numpy.dot(ny0[3], vWout)
         ny2[4] = numpy.dot(v_u1[4], vW_in1)

     def test_using_taps_sequence(self):
         # this test refers to a bug reported by Nicolas
         # Boulanger-Lewandowski June 6th
         x = theano.tensor.dvector()
         y, updates = theano.scan(lambda x: [x],
                                  sequences=dict(input=x, taps=[-1]),
                                  outputs_info=[None])
         inp = numpy.arange(5).astype('float64')
         rval = theano.function([x], y, updates=updates)(inp)
         assert numpy.all(rval == inp[:-1])
...
...
@@ -840,8 +840,10 @@ class T_Scan(unittest.TestCase):
         # equivalent is done
         (theano_x0, theano_x1) = f9(vu0, vu1, vu2, vx0, vx1)

         # assert that theano does what it should
-        assert numpy.allclose(theano_x0, numpy_x0)
-        assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1, theano_x1 - numpy_x1)
+        assert numpy.allclose(theano_x0, numpy_x0), \
+            (theano_x0, numpy_x0, theano_x0 - numpy_x0)
+        assert numpy.allclose(theano_x1, numpy_x1), \
+            (theano_x1, numpy_x1, theano_x1 - numpy_x1)
         # assert that it was done in place
         # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
...
...
@@ -940,11 +942,11 @@ class T_Scan(unittest.TestCase):
         vx1 = asarrayX(rng.uniform())
         x0 = theano.shared(vx0)
         x1 = theano.shared(vx1)
         outputs, updates = theano.scan(lambda x, y: (x + asarrayX(1),
                                                      y + asarrayX(1)),
                                        [],
                                        [x0, x1],
                                        n_steps=3)
         x0 = asarrayX(numpy.zeros((3,)))
         x0[0] = vx0
         x0 = theano.tensor.constant(x0)
...
...
@@ -2447,7 +2449,6 @@ class T_Scan(unittest.TestCase):
         v_eW = numpy.array(rng.uniform(size=(5, 5)) - .5, dtype=floatX)
         v_eh0 = numpy.array(rng.uniform(size=(5,)) - .5, dtype=floatX)

         def rnn_fn(_u, _y, _W):
             srng = theano.tensor.shared_randomstreams.RandomStreams(seed)
...
...
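The test changes above all apply one pattern: attach a tuple message to each `assert numpy.allclose(...)` so that a failure reports the two arrays and their difference instead of a bare `AssertionError`. A hedged sketch of the pattern, using a tiny `allclose` stand-in so it runs without numpy:

```python
# Sketch of the assert-with-message pattern from the test diff above.
# allclose here is a simplified stand-in for numpy.allclose.
def allclose(a, b, tol=1e-8):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

a = [1.0, 2.0]
b = [1.0, 2.5]
try:
    # On failure, the tuple becomes the AssertionError payload, so the
    # values and their element-wise difference show up in the traceback.
    assert allclose(a, b), (a, b, [x - y for x, y in zip(a, b)])
except AssertionError as e:
    values, expected, diff = e.args[0]
    print(diff)  # [0.0, -0.5]
```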
theano/tensor/__init__.py
...
...
@@ -55,3 +55,5 @@ from theano.gradient import Rop, Lop, grad, numeric_grad, verify_grad, \
     jacobian, hessian
 from theano.tensor.sort import sort
+from extra_ops import (DiffOp, bincount, squeeze, repeat, bartlett,
+                       fill_diagonal)
theano/tensor/extra_ops.py
...
...
@@ -3,8 +3,8 @@ import numpy

 import theano
-import basic
-from theano import gof, tensor, scalar
-from theano.sandbox.linalg.ops import diag
+from theano import gof, scalar
+import basic as tensor


 class DiffOp(theano.Op):
...
...
@@ -446,7 +446,9 @@ class FillDiagonal(gof.Op):
             raise NotImplementedError('%s: gradient is currently implemented'
                                       ' for matrices only' %
                                       self.__class__.__name__)
         wr_a = fill_diagonal(grad, 0)  # valid for any number of dimensions
-        wr_val = diag(grad).sum()  # diag is only valid for matrices
+        # diag is only valid for matrices
+        import theano.sandbox.linalg
+        wr_val = theano.sandbox.linalg.ops.diag(grad).sum()
         return [wr_a, wr_val]


 fill_diagonal_ = FillDiagonal()
...
...
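The extra_ops.py change above replaces a module-level `from theano.sandbox.linalg.ops import diag` with an import inside the `grad` method: deferring the import means it only resolves when a gradient is actually computed, by which time both modules have finished loading, which is the usual way to sidestep a circular import. A hedged toy sketch of the technique (`json` below merely stands in for the late-imported module; it is not the real dependency):

```python
# Sketch of the deferred-import technique: the import statement lives in the
# function body, so it runs on first call rather than at module load time.
def grad_sketch(values):
    import json  # stand-in for theano.sandbox.linalg, imported lazily
    return json.dumps(values)

print(grad_sketch([1, 2]))  # [1, 2]
```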