Commit 72a7214a authored by lamblin

Merge pull request #863 from nouiz/mixed2

Mixed2
......@@ -2,148 +2,7 @@
Updates in the Trunk since the last release:
Bug fixes
* Outputs of Scan nodes could contain corrupted values: some parts of the
output would be repeated a second time, instead of the correct values.
It happened randomly, and quite infrequently, but the bug has been present
(both in Python and Cython) since April 2011. (Pascal L.)
* In the Sparse sandbox, fixed the grad of theano.sparse.sandbox.sp.row_scale:
it did not return the right number of elements. (Frederic B.)
* set_subtensor(x[int vector], new_value) when moved to the GPU
was transformed into inc_subtensor on the GPU. Now we have a correct
(but slow) GPU implementation.
Note 1: set_subtensor(x[slice[,...]], new_value) was working correctly
in all cases as well as inc_subtensor(*, *).
Note 2: If your code was affected by the incorrect behavior, we now print
a warning by default (Frederic B.)
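The intended semantics can be sketched with a plain NumPy analogue (the Theano Ops work on symbolic variables, so this is only an illustration of the element-wise behaviour):

```python
import numpy as np

x = np.ones(5)
idx = np.array([1, 3])             # an "int vector" index
new_value = np.array([10.0, 20.0])

# set_subtensor semantics: the indexed positions are overwritten.
x_set = x.copy()
x_set[idx] = new_value

# inc_subtensor semantics: the indexed positions are incremented.
x_inc = x.copy()
x_inc[idx] += new_value

print(x_set.tolist())  # [1.0, 10.0, 1.0, 20.0, 1.0]
print(x_inc.tolist())  # [1.0, 11.0, 1.0, 21.0, 1.0]
```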
* Fixed an issue whereby config values were used as default arguments,
with those defaults then stuck at old values if the config variables were
changed during program execution. (David W-F)
* Fixed many subtle bugs involving mutable default arguments which may have
led to unexpected behaviour, such as objects sharing instance variables
they were not supposed to share. (David W-F)
* Correctly record the GPU device number used when we let the driver select it.
(Frederic B.)
Documentation
* Added tutorial documentation on how to extend Theano.
This explains how to make a Theano Op from a Python function.
http://deeplearning.net/software/theano/tutorial/extending_theano.html
(Frédéric B.)
* New installation instructions for Windows using EPD (Pascal L.)
Interface changes
* In 0.5, we removed the deprecated sharedvar.value property.
Now we raise an error if you access it. (Frederic B.)
* theano.function does not accept duplicate inputs, so function([x, x], ...)
does not work anymore. (Pascal L.)
* theano.function now raises an error if some of the provided inputs are
not part of the computational graph needed to compute the output, for
instance, function([x, y], [y]). You can use the kwarg
``on_unused_input={'raise', 'warn', 'ignore'}`` to control this.
(Pascal L.)
* New Theano flag "on_unused_input" that defines the default value of the
previous point. (Frederic B.)
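A minimal sketch of what the three modes mean; `check_unused_inputs` is a hypothetical helper for illustration only, not Theano's actual code:

```python
import warnings

# Hypothetical sketch of the unused-input check (not Theano's real code).
def check_unused_inputs(inputs, needed, on_unused_input='raise'):
    # Inputs that are not needed to compute any requested output.
    unused = [i for i in inputs if i not in needed]
    if unused:
        msg = 'theano.function was given unused input(s): %s' % unused
        if on_unused_input == 'raise':
            raise ValueError(msg)
        elif on_unused_input == 'warn':
            warnings.warn(msg)
        # 'ignore': silently accept the unused inputs
    return unused

print(check_unused_inputs(['x', 'y'], ['y'], on_unused_input='ignore'))  # ['x']
```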
* tensor.alloc() now raises an error at graph build time
when we try to create fewer dimensions than the number of dimensions
the provided value has. Previously, the error was raised at run time.
(Frederic B.)
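The rule being enforced can be sketched as follows; `check_alloc_ndim` is a hypothetical helper, not Theano's actual code:

```python
# Hypothetical sketch of the build-time check (not Theano's real code):
# alloc must request at least as many dimensions as the value being
# broadcast already has.
def check_alloc_ndim(value_ndim, n_requested_dims):
    if n_requested_dims < value_ndim:
        raise TypeError('alloc: cannot create a tensor with %d dimensions '
                        'from a value that has %d dimensions'
                        % (n_requested_dims, value_ndim))
```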
Speed up
* Convolution on the GPU now checks the generation of the card to make
it faster in some cases (especially for medium/big output images). (Frédéric B.)
(We had hardcoded 512 as the maximum number of threads per block; newer
cards support up to 1024 threads per block.)
* CPU convolutions are now parallelized. (Frédéric B.)
By default, all cores/hyper-threads are used.
To control this, use the OMP_NUM_THREADS=N environment variable.
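The variable has to reach the OpenMP runtime before worker threads start, so it is normally exported in the shell; a minimal sketch of setting it from Python instead, assuming it runs before theano is imported and before any OpenMP threads exist:

```python
import os

# Limit OpenMP parallelism to 2 threads; must happen before the OpenMP
# runtime is initialized (e.g. before importing theano).
os.environ['OMP_NUM_THREADS'] = '2'
print(os.environ['OMP_NUM_THREADS'])  # 2
```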
New Features
* debugprint new param ids=["CHAR", "id", "int", ""]
This controls whether the printed identifier is the Python id, a unique
char, a unique int, or not printed at all. We changed the default to
"CHAR" as this is more readable. (Frederic B.)
* debugprint new param stop_on_name=[False, True]. If True, we don't print
anything below an intermediate variable that has a name. Defaults to False.
(Frederic B.)
* debugprint no longer prints the "|" symbol in a column after the last input. (Frederic B.)
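A hypothetical sketch of how the four `ids` modes could assign identifiers (this is not Theano's implementation, only an illustration of what each mode prints):

```python
import itertools
import string

# Hypothetical sketch of debugprint's `ids` modes (not Theano's code).
def make_id_printer(mode='CHAR'):
    counter = itertools.count()
    seen = {}  # maps id(var) -> assigned identifier

    def get_id(var):
        if mode == '':
            return ''              # no identifier printed at all
        if mode == 'id':
            return str(id(var))    # the raw Python id
        if id(var) not in seen:
            n = next(counter)
            seen[id(var)] = (string.ascii_uppercase[n % 26]  # unique char
                             if mode == 'CHAR' else str(n))  # unique int
        return seen[id(var)]       # same variable -> same identifier

    return get_id
```

For example, with `mode='CHAR'` the first variable seen is labelled "A", the second "B", and a repeated variable keeps its original label.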
* If you use the Enthought Python Distribution (EPD), we now use its BLAS
implementation by default (tested on Linux and Windows).
(Frederic B., Simon McGregor)
* MRG random now raises an error with a clear message when the passed shape
contains dimensions with an invalid value such as 0. (Frédéric B., reported by Ian G.)
* "CudaNdarray[*] = ndarray" works in more cases (Frederic B.)
* "CudaNdarray[*] += ndarray" works in more cases (Frederic B.)
* We now add broadcastable dimensions to CudaNdarray so that broadcasting
happens automatically in more cases. (Frederic B.)
* theano.tensor.argsort that wraps numpy.argsort (Hani Almousli).
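Since it wraps numpy.argsort, the semantics can be checked directly against NumPy: argsort returns the indices that would sort the array.

```python
import numpy as np

# theano.tensor.argsort follows numpy.argsort semantics.
a = np.array([3, 1, 2])
order = np.argsort(a)
print(order.tolist())     # [1, 2, 0]
print(a[order].tolist())  # [1, 2, 3] -- indexing by the order sorts a
```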
* New Theano flag cmodule.warn_no_version. Default False. If True,
will print a warning when compiling one or more Ops with C code that
can't be cached because there is no c_code_cache_version() function
associated with at least one of those Ops, which forces them to be
recompiled every time. (Frederic B.)
* CPU alloc now always generates C code (Pascal L.)
* Made a few Ops with C code versioned to reduce compilation time.
(Frédéric B, Pascal L.)
* C code reuses preallocated outputs (only done by Scan) (Pascal L.)
* Garbage collection of intermediate results during Theano function calls
for Ops with C code (Pascal L.)
* The Theano flag compiledir_format now supports the parameter numpy_version.
* Theano GPU variables, shared variables and constants now support <, <=,
> and >=, just like those not on the GPU.
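These comparisons are element-wise, matching the behaviour of CPU tensors; the NumPy analogue:

```python
import numpy as np

# Element-wise comparison semantics that GPU variables now share with
# CPU tensors (shown here on plain NumPy arrays).
a = np.array([1., 2., 3.])
b = np.array([2., 2., 2.])
print((a < b).tolist())   # [True, False, False]
print((a >= b).tolist())  # [False, True, True]
```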
Sparse
* Implement theano.sparse.mul(sparse1, sparse2) even when the two inputs
do not have the same sparsity pattern. (Frederic B.)
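A dense NumPy analogue of the element-wise product: the result is nonzero only where the two sparsity patterns intersect.

```python
import numpy as np

# Dense analogue of theano.sparse.mul with two different sparsity
# patterns: nonzeros survive only where both inputs are nonzero.
a = np.array([[1., 0.], [0., 2.]])
b = np.array([[3., 4.], [0., 0.]])
c = a * b
print(c.tolist())  # [[3.0, 0.0], [0.0, 0.0]]
```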
Sparse Sandbox graduate
* Remove0 op: it removes stored elements with value 0. (Frederic B.)
Sparse Sandbox Additions (not reviewed/documented/tested, but used by some people)
* They are all in the theano.sparse.sandbox.sp2 module
* Op class: Cast, Poisson, Multinomial, EliminateZeros, Sum, Binomial
* Op class: SamplingDot, SamplingDotCsr (inserted automatically)
* Op function: structured_sigmoid, structured_exp, structured_pow, structured_minimum
* Op class: StructuredAddSV, StructuredAddSVCSR (inserted automatically)
* opt: local_sampling_dot_csr, local_structured_add_s_v
Internal changes
* Define new exceptions MissingInputError and UnusedInputError, and use them
in theano.function, instead of TypeError and ValueError. (Pascal L.)
* Better handling of bitwidth and max values of integers and pointers
across platforms (Pascal L.)
Crash Fix
* Do not try to use the BLAS library when blas.ldflags is manually set to an
empty string (Frederic B.)
* When importing theano on a computer without GPU with the Theano
flags 'device' or 'init_gpu_device' set to gpu* (Frederic B., reported by Luo Heng)
* Optimization printed a useless error when scipy was not available. (Frederic B.)
* GPU conv crash/slowdown on newer hardware (James B.)
* Better error handling in GPU conv (Frederic B.)
* GPU optimization that moves element-wise Ops to the GPU. Crash happened in
a particular execution order of this optimization and the
element-wise fusion optimization when upcasting some inputs to
float32 (to compute them on the GPU).
(Frederic B., reported by Sander Dieleman)
* GpuReshape in some particular case when the input is not contiguous
(Frederic B., reported by Sander Dieleman)
* GpuSoftmaxWithBias with shape (0, N) with N > 1.
(Frédéric B., reported by Razvan P.)
* Fix crash under 64-bit Windows, when taking subtensors of the form a[n:]
(Pascal L., reported by Simon McGregor)
* Fixed issue with the MaxAndArgmax Op not properly preserving broadcastable
dimensions, which could typically result in optimization crashes (Olivier D.)
* Fixed crash when concatenating some arrays with specific broadcasting
patterns (Olivier D.)
* Work around a known issue with nvcc 4.1 on MacOS X. (Graham Taylor)
* In advanced indexing, if some inputs are constant, no need to call constant(...)
on their value any more. (Pascal L., reported by John Salvatier)
* Fix crash on GPU when GpuSubtensor didn't set the right stride
when the result tensor had a dimension with size of 1. (Pascal L.,
reported by Graham T.)
https://github.com/Theano/Theano/wiki/Devnews
=============
Release Notes
......
......@@ -26,6 +26,9 @@ with the option time_profile=True to conduct time-profiling of the tests.
option will be interpreted as an indication of the number of tests to be run
between notifications of progress to standard output.
If the '--theano' option is used, it is replaced with the path to theano.
Useful if you don't know where it was installed.
`run_tests_in_batch.py` will in turn call back this script in another process.
"""
......@@ -39,6 +42,12 @@ import sys
from nose.plugins import Plugin
def main():
# Handle the --theano arguments
if "--theano" in sys.argv:
i = sys.argv.index("--theano")
import theano
sys.argv[i] = theano.__path__[0]
# Handle --batch[=n] arguments
batch_args = [arg for arg in sys.argv if arg.startswith('--batch')]
for arg in batch_args:
......@@ -137,6 +146,11 @@ def help():
--without-knownfailure: Do not load the KnownFailure plugin.
--theano: This parameter is replaced with the path to the theano library.
As theano-nose is a wrapper to nosetests, it expects a path to the tests to run.
If you don't know where theano is installed, use this option
to have it inserted automatically.
The other options will be passed to nosetests, see ``nosetests -h``.
"""
......
......@@ -37,7 +37,7 @@ compiledir_format_dict = {"platform": platform.platform(),
"python_version": platform.python_version(),
"theano_version": theano.__version__,
"numpy_version": numpy.__version__,
"g++": gcc_version_str.replace(" ", "_"),
"gxx_version": gcc_version_str.replace(" ", "_"),
}
compiledir_format_keys = ", ".join(compiledir_format_dict.keys())
default_compiledir_format =\
......
......@@ -758,8 +758,10 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
PyObject * axis_obj = Py_None;
PyObject * out_obj = Py_None;
PyObject * clipmode_obj = NULL;
if (! PyArg_ParseTuple(args, "O|OOO", &indices_obj, &axis_obj,
&out_obj, &clipmode_obj))
int max_threads = 1; // max threads per blocks
if (! PyArg_ParseTuple(args, "O|OOOi", &indices_obj, &axis_obj,
&out_obj, &clipmode_obj, &max_threads))
return NULL;
//Check argument indices
......@@ -839,14 +841,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
PyObject * axis_iobj = PyNumber_Long(axis_obj);
if (!axis_iobj) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: axis must be convertable to a long");
Py_DECREF(indices_obj);
Py_DECREF(indices);
return NULL;
}
long axis = PyInt_AsLong(axis_iobj);
Py_DECREF(axis_iobj); axis_iobj=NULL;
if (axis != 0) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: only axis=0 is currently supported");
Py_DECREF(indices_obj);
Py_DECREF(indices);
return NULL;
}
......@@ -869,13 +871,13 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (!out) {
out = (CudaNdarray*)CudaNdarray_New();
if (!out){
Py_DECREF(indices_obj);
Py_DECREF(indices);
free(dims);
return NULL;
}
if (CudaNdarray_alloc_contiguous(out, self->nd, dims)) {
Py_DECREF(out);
Py_DECREF(indices_obj);
Py_DECREF(indices);
free(dims);
return NULL;
}
......@@ -887,19 +889,20 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (clipmode_obj) {
char * clipmode = PyString_AsString(clipmode_obj);
if (! clipmode){
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
}
if (strcmp(clipmode, "raise") != 0) {
PyErr_SetString(PyExc_NotImplementedError,"CudaNdarray_TakeFrom: only the raise mode is currently supported");
Py_DECREF(indices_obj);
PyErr_Format(PyExc_NotImplementedError,
"CudaNdarray_TakeFrom: only the raise mode is currently supported. Got '%s'",
clipmode);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
}
Py_DECREF(clipmode_obj);
}
void (*k3)(const int, const int, const int,
const npy_int64*,
......@@ -913,7 +916,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
if (err_var == NULL) {
err_var = (int*)device_malloc(sizeof(int));
if (!err_var) { // PyErr set by device_malloc
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
......@@ -928,7 +931,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
PyErr_Format(PyExc_RuntimeError,
"Error setting device error code to 0. %s",
cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
free(dims);
return NULL;
......@@ -936,13 +939,16 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
}
dim3 n_blocks(std::min(CudaNdarray_HOST_DIMS(out)[0],65535),1,1);
switch (self->nd) {
case 1:
{
dim3 n_threads(1, 1, 1);
if (verbose)
printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
printf("cudaGetLastError=%d, nd=%d"
" kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
" n_threads.x=%i, n_threads.y=%i)\n",
self->nd, cudaGetLastError(),
n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
k3<<<n_blocks, n_threads>>>(
dims[0],
......@@ -963,11 +969,15 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
break;
case 2:
{
dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], 512), 1, 1);
dim3 n_threads(std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads), 1, 1);
if (verbose)
printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
printf("cudaGetLastError=%d, nd=%d"
" kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
" n_threads.x=%i, n_threads.y=%i)\n",
cudaGetLastError(), self->nd,
n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
k3<<<n_blocks, n_threads>>>(
dims[0], //dimensions
dims[1],
......@@ -987,12 +997,14 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
break;
case 3:
{
int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], 512);
int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], 512 / ty);
int ty = std::min(CudaNdarray_HOST_DIMS(out)[2], max_threads);
int tx = std::min(CudaNdarray_HOST_DIMS(out)[1], max_threads / ty);
dim3 n_threads(tx, ty, 1);
if (verbose)
printf("kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
printf("cudaGetLastError=%d, nd=%d"
" kernel config: (n_blocks.x=%d, n_blocks.y=%d,"
" n_threads.x=%i, n_threads.y=%i)\n",
self->nd, cudaGetLastError(),
n_blocks.x, n_blocks.y, n_threads.x, n_threads.y);
k3<<<n_blocks, n_threads>>>(
dims[0], //dimensions
......@@ -1025,7 +1037,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
"Cuda error: %s: %s.\n",
"CudaNdarray_TakeFrom",
cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
......@@ -1040,7 +1052,7 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
"Cuda error: %s: %s when trying to get the error value.\n",
"CudaNdarray_TakeFrom",
cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
......@@ -1055,17 +1067,17 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
err = cudaMemset((void*)err_var, 0, sizeof(int));
if (cudaSuccess != err) {
PyErr_Format(PyExc_MemoryError, "Error setting device error code to 0 after having an index error. %s", cudaGetErrorString(err));
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
Py_DECREF(indices_obj);
Py_DECREF(indices);
Py_DECREF(out);
return NULL;
}
Py_DECREF(indices_obj);
Py_DECREF(indices);
if (verbose) printf("TAKE SUCCEDED\n");
return (PyObject *)out;
......
......@@ -7,6 +7,7 @@ import subprocess
import sys
import warnings
import theano
from theano.gof.cc import hash_from_file
from theano.gof.cmodule import (std_libs, std_lib_dirs,
std_include_dirs, dlimport,
......@@ -119,6 +120,16 @@ class NVCC_compiler(object):
cuda_ndarray_cuh_hash = hash_from_file(
os.path.join(os.path.split(__file__)[0], 'cuda_ndarray.cuh'))
flags.append('-DCUDA_NDARRAY_CUH=' + cuda_ndarray_cuh_hash)
# We compile cuda_ndarray.cu during import.
# We should not add device properties at that time.
# As the device is not selected yet!
# TODO: compile cuda_ndarray when we bind to a GPU?
import theano.sandbox.cuda
if hasattr(theano.sandbox, 'cuda'):
n = theano.sandbox.cuda.use.device_number
p = theano.sandbox.cuda.device_properties(n)
flags.append('-arch=sm_' + str(p['major']) + str(p['minor']))
return flags
@staticmethod
......@@ -217,7 +228,9 @@ class NVCC_compiler(object):
# '--gpu-code=compute_13',
#nvcc argument
preargs1 = [pa for pa in preargs
if pa.startswith('-O') or pa.startswith('--maxrregcount=')]
if pa.startswith('-O') or
pa.startswith('--maxrregcount=') or
pa.startswith('-arch=')]
preargs2 = [pa for pa in preargs
if pa not in preargs1] # other arguments
......@@ -337,6 +350,7 @@ class NVCC_compiler(object):
pass
print >> sys.stderr, l
print nvcc_stdout
print cmd
raise Exception('nvcc return status', p.returncode,
'for cmd', ' '.join(cmd))
elif config.cmodule.compilation_warning and nvcc_stdout:
......
......@@ -410,7 +410,8 @@ class T_Scan(unittest.TestCase):
for step in xrange(1, 4):
v_out[step] = v_u[step] * W_in + v_out[step - 1] * W
theano_values = f2(v_u, v_x0, W_in, W)
assert numpy.allclose(theano_values, v_out)
assert numpy.allclose(theano_values, v_out), (theano_values, v_out,
theano_values - v_out)
# TO DEL
topo = f2.maker.fgraph.toposort()
......@@ -591,8 +592,8 @@ class T_Scan(unittest.TestCase):
v_y[i] = numpy.dot(v_x[i - 1], vWout)
(theano_x, theano_y) = f4(v_u1, v_u2, v_x0, v_y0, vW_in1)
assert numpy.allclose(theano_x, v_x)
assert numpy.allclose(theano_y, v_y)
assert numpy.allclose(theano_x, v_x), (theano_x, v_x, theano_x - v_x)
assert numpy.allclose(theano_y, v_y), (theano_y, v_y, theano_y - v_y)
def test_multiple_outs_taps(self):
l = 5
......@@ -683,14 +684,13 @@ class T_Scan(unittest.TestCase):
ny1[4] = (ny1[3] + ny1[1]) * numpy.dot(ny0[3], vWout)
ny2[4] = numpy.dot(v_u1[4], vW_in1)
def test_using_taps_sequence(self):
# this test refers to a bug reported by Nicolas
# Boulanger-Lewandowski June 6th
x = theano.tensor.dvector()
y, updates = theano.scan(lambda x: [x],
sequences=dict(input=x, taps=[-1]),
outputs_info = [None])
outputs_info=[None])
inp = numpy.arange(5).astype('float64')
rval = theano.function([x], y, updates=updates)(inp)
assert numpy.all(rval == inp[:-1])
......@@ -840,8 +840,10 @@ class T_Scan(unittest.TestCase):
# equivalent is done
(theano_x0, theano_x1) = f9(vu0, vu1, vu2, vx0, vx1)
# assert that theano does what it should
assert numpy.allclose(theano_x0, numpy_x0)
assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1, theano_x1 - numpy_x1)
assert numpy.allclose(theano_x0, numpy_x0), (theano_x0, numpy_x0,
theano_x0 - numpy_x0)
assert numpy.allclose(theano_x1, numpy_x1), (theano_x1, numpy_x1,
theano_x1 - numpy_x1)
# assert that it was done in place
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
......@@ -940,11 +942,11 @@ class T_Scan(unittest.TestCase):
vx1 = asarrayX(rng.uniform())
x0 = theano.shared(vx0)
x1 = theano.shared(vx1)
outputs, updates = theano.scan(lambda x,y: (x + asarrayX(1),
y + asarrayX(1)),
outputs, updates = theano.scan(lambda x, y: (x + asarrayX(1),
y + asarrayX(1)),
[],
[x0,x1],
n_steps = 3)
[x0, x1],
n_steps=3)
x0 = asarrayX(numpy.zeros((3,)))
x0[0] = vx0
x0 = theano.tensor.constant(x0)
......@@ -2447,7 +2449,6 @@ class T_Scan(unittest.TestCase):
v_eW = numpy.array(rng.uniform(size=(5, 5)) - .5, dtype=floatX)
v_eh0 = numpy.array(rng.uniform(size=(5,)) - .5, dtype=floatX)
def rnn_fn(_u, _y, _W):
srng = theano.tensor.shared_randomstreams.RandomStreams(seed)
......
......@@ -55,3 +55,5 @@ from theano.gradient import Rop, Lop, grad, numeric_grad, verify_grad, \
jacobian, hessian
from theano.tensor.sort import sort
from extra_ops import (DiffOp, bincount, squeeze,
repeat, bartlett, fill_diagonal)
......@@ -3,8 +3,8 @@ import numpy
import theano
import basic
from theano import gof, tensor, scalar
from theano.sandbox.linalg.ops import diag
from theano import gof, scalar
import basic as tensor
class DiffOp(theano.Op):
......@@ -446,7 +446,9 @@ class FillDiagonal(gof.Op):
raise NotImplementedError('%s: gradient is currently implemented'
' for matrices only' % self.__class__.__name__)
wr_a = fill_diagonal(grad, 0) # valid for any number of dimensions
wr_val = diag(grad).sum() # diag is only valid for matrices
# diag is only valid for matrices
import theano.sandbox.linalg
wr_val = theano.sandbox.linalg.ops.diag(grad).sum()
return [wr_a, wr_val]
fill_diagonal_ = FillDiagonal()
......