Update doc with instructions for using new gpu backend

bd544674 · slefrancois · 319382b5
--- a/.gitignore
+++ b/.gitignore
@@ -37,3 +37,4 @@ Theano.suo
 .ipynb_checkpoints
 .pydevproject
 .ropeproject
+core
\ No newline at end of file
--- a/doc/extending/extending_theano.txt
+++ b/doc/extending/extending_theano.txt
@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
 Testing GPU Ops
 ^^^^^^^^^^^^^^^
-Ops to be executed on the GPU should inherit from the
+When using the old GPU backend, Ops to be executed on the GPU should inherit
-``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
+from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
 Theano to distinguish them. Currently, we use this to test if the
 NVIDIA driver works correctly with our sum reduction code on the GPU.

--- a/doc/install.txt
+++ b/doc/install.txt
@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add
    If you want GPU-related tests to run on a specific GPU device, and not
    the default one, you should use :attr:`~config.init_gpu_device`.
-    For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=gpu1``.
+    For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=cuda1``.
    See :ref:`libdoc_config` for more information on how to change these
    configuration options.
@@ -508,25 +508,25 @@ Any one of them is enough.
    :ref:`Ubuntu instructions <install_ubuntu_gpu>`.
+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
 Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
 computer, and set the default floating point computations to float32.
-For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'``.
+For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
 You can also set these options in the .theanorc file's ``[global]`` section:
     .. code-block:: cfg
        [global]
-        device = gpu
+        device = cuda
        floatX = float32
 Note that:
-    * If your computer has multiple GPUs and you use 'device=gpu', the driver
+    * If your computer has multiple GPUs and you use 'device=cuda', the driver
      selects the one to use (usually gpu0).
    * You can use the program nvida-smi to change this policy.
-    * You can choose one specific GPU by specifying 'device=gpuX', with X the
+    * You can choose one specific GPU by specifying 'device=cudaX', with X the
      the corresponding GPU index (0, 1, 2, ...)
    * By default, when ``device`` indicates preference for GPU computations,
      Theano will fall back to the CPU if there is a problem with the GPU.
@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats:
     toggle your GPU on, which can be done with
     `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
 Once your setup is complete, head to :ref:`using_gpu` to find how to verify
 everything is working properly.

--- a/doc/install_ubuntu.txt
+++ b/doc/install_ubuntu.txt
@@ -303,7 +303,7 @@ Test GPU configuration
 .. code-block:: bash
-    THEANO_FLAGS=floatX=float32,device=gpu python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
+    THEANO_FLAGS=floatX=float32,device=cuda python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
 .. note::

--- a/doc/install_windows.txt
+++ b/doc/install_windows.txt
@@ -445,6 +445,8 @@ routine for matrix multiplication)
 Configure Theano for GPU use
 ############################
+Install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_ if you have not already done so.
 Theano can be configured with a ``.theanorc`` text file (or
 ``.theanorc.txt``, whichever is easier for you to create under
 Windows). It should be placed in the directory pointed to by the
@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file:
 .. code-block:: cfg
   [global]
-   device = gpu
+   device = cuda
   floatX = float32
   [nvcc]

--- a/doc/optimizations.txt
+++ b/doc/optimizations.txt
@@ -32,6 +32,7 @@ Optimization                                              FAST_RUN  FAST_COMPILE
 ========================================================= ========= ============ =============
 :term:`merge`                                             x         x
 :term:`constant folding<constant folding>`                x         x
+:term:`GPU transfer`                                      x         x
 :term:`shape promotion<shape promotion>`                  x
 :term:`fill cut<fill cut>`                                x
 :term:`inc_subtensor srlz.<inc_subtensor serialization>`  x
@@ -52,7 +53,6 @@ Optimization                                              FAST_RUN  FAST_COMPILE
 :term:`inplace_elemwise`                                  x
 :term:`inplace_random`                                    x
 :term:`elemwise fusion`                                   x
-:term:`GPU transfer`                                      x
 :term:`local_log_softmax`                                 x                      x
 :term:`local_remove_all_assert`                                                   
 ========================================================= ========= ============ =============

--- a/doc/tutorial/aliasing.txt
+++ b/doc/tutorial/aliasing.txt
@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
 hints that give more flexibility to the compilation and optimization of the
 graph.
-For GPU graphs, this borrowing can have a major speed impact.  See the following code:
-.. code-block:: python
-   from theano import function, config, shared, sandbox, tensor, Out
-   import numpy
-   import time
-   vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
-   iters = 1000
-   rng = numpy.random.RandomState(22)
-   x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-   f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
-   f2 = function([],
-                 Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
-                     borrow=True))
-   t0 = time.time()
-   for i in range(iters):
-       r = f1()
-   t1 = time.time()
-   no_borrow = t1 - t0
-   t0 = time.time()
-   for i in range(iters):
-       r = f2()
-   t1 = time.time()
-   print(
-       "Looping %s times took %s seconds without borrow "
-       "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
-   )
-   if numpy.any([isinstance(x.op, tensor.Elemwise) and
-                 ('Gpu' not in type(x.op).__name__)
-                 for x in f1.maker.fgraph.toposort()]):
-       print('Used the cpu')
-   else:
-       print('Used the gpu')
-Which produces this output:
-.. code-block:: none
-   $ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
-   Using gpu device 0: GeForce GTX 275
-   Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
-   Used the gpu
 *Take home message:*
 When an input *x* to a function is not needed after the function
@@ -317,4 +271,3 @@ requirement.  When a return value *y* is large (in terms of memory
 footprint), and you only need to read from it once, right away when
 it's returned, then consider marking it with an ``Out(y,
 borrow=True)``.
--- a/doc/tutorial/using_gpu.txt
+++ b/doc/tutorial/using_gpu.txt
@@ -15,28 +15,41 @@ about how to carry out those computations.  One of the ways we take
 advantage of this flexibility is in carrying out calculations on a
 graphics card.
-There are two ways currently to use a gpu, one of which only supports NVIDIA cards (:ref:`cuda`) and the other, in development, that should support any OpenCL device as well as NVIDIA cards (:ref:`gpuarray`).
+There are two ways currently to use a gpu, one that should support any OpenCL
+device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend which
+only supports NVIDIA cards (:ref:`cuda`).
-.. _cuda:
+.. _gpuarray:
-CUDA backend
+GpuArray Backend
-------------
+----------------
-If you have not done so already, you will need to install Nvidia's
+If you have not done so already, you will need to install libgpuarray
-GPU-programming toolchain (CUDA) and configure Theano to use it.
+as well as at least one computing toolkit.  Instructions for doing so
-We provide installation instructions for :ref:`Linux <gpu_linux>`,
+are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
-:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
+While all types of devices are supported if using OpenCL, for the
+remainder of this section, whatever compute device you are using will
+be referred to as GPU.
+.. warning::
+  The backend was designed to support OpenCL, however current support is
+  incomplete. A lot of very useful ops still do not support it because they
+  were ported from the old backend with minimal change.
 Testing Theano with GPU
 ~~~~~~~~~~~~~~~~~~~~~~~
-To see if your GPU is being used, cut and paste the following program into a
+To see if your GPU is being used, cut and paste the following program
-file and run it.
+into a file and run it.
+Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
+``device=cuda{0,1,...}`` to specify which GPU to use.
 .. testcode::
-    from theano import function, config, shared, sandbox
+  from theano import function, config, shared, tensor
-    import theano.tensor as T
  import numpy
  import time
@@ -45,7 +58,7 @@ file and run it.
  rng = numpy.random.RandomState(22)
  x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-    f = function([], T.exp(x))
+  f = function([], tensor.exp(x))
  print(f.maker.fgraph.toposort())
  t0 = time.time()
  for i in range(iters):
@@ -53,20 +66,16 @@ file and run it.
  t1 = time.time()
  print("Looping %d times took %f seconds" % (iters, t1 - t0))
  print("Result is %s" % (r,))
-    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
+  if numpy.any([isinstance(x.op, tensor.Elemwise) and
+                ('Gpu' not in type(x.op).__name__)
+                for x in f.maker.fgraph.toposort()]):
      print('Used the cpu')
  else:
      print('Used the gpu')
-The program just computes the ``exp()`` of a bunch of random numbers.
+The program just computes ``exp()`` of a bunch of random numbers.  Note
-Note that we use the ``shared`` function to
+that we use the :func:`theano.shared` function to make sure that the
-make sure that the input *x* is stored on the graphics device.
+input *x* is stored on the GPU.
-.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
-If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
-whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
-same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
 .. testoutput::
   :hide:
@@ -79,40 +88,36 @@ same floating-point numbers as the CPU. As a benchmark, a loop that calls ``nump
 .. code-block:: none
-    $ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
+  $ THEANO_FLAGS=device=cpu python check1.py
-    [Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
+  [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
-    Looping 1000 times took 3.06635117531 seconds
+  Looping 1000 times took 2.6071999073 seconds
-    Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
+  Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
-      1.62323284]
+    1.62323285]
  Used the cpu
-    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
+  $ THEANO_FLAGS=device=cuda0 python check1.py
-    Using gpu device 0: GeForce GTX 580
+  Using device cuda0: GeForce GTX 275
-    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
+  [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
-    Looping 1000 times took 0.638810873032 seconds
+  Looping 1000 times took 2.28562092781 seconds
-    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
+  Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
-      1.62323296]
+    1.62323285]
  Used the gpu
-Note that GPU operations in Theano require for now ``floatX`` to be *float32* (see also below).
 Returning a Handle to Device-Allocated Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The speedup is not greater in the preceding example because the function is
+By default functions that execute on the GPU still return a standard
-returning its result as a NumPy ndarray which has already been copied from the
+numpy ndarray.  A transfer operation is inserted just before the
-device to the host for your convenience.  This is what makes it so easy to swap in ``device=gpu``, but
+results are returned to ensure a consistent interface with CPU code.
-if you don't mind less portability, you might gain a bigger speedup by changing
+This allows changing the device some code runs on by only replacing
-the graph to express a computation with a GPU-stored result.  The ``gpu_from_host``
+the value of the ``device`` flag without touching the code.
-op means "copy the input from the host to the GPU" and it is optimized away
-after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
+If you don't mind a loss of flexibility, you can ask theano to return
+the GPU object directly.  The following code is modified to do just that.
 .. testcode::
-    from theano import function, config, shared, sandbox
+  from theano import function, config, shared, tensor, gpuarray
-    import theano.sandbox.cuda.basic_ops
-    import theano.tensor as T
  import numpy
  import time
@@ -120,139 +125,146 @@ after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
  iters = 1000
  rng = numpy.random.RandomState(22)
-    x = shared(numpy.asarray(rng.rand(vlen), 'float32'))
+  x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-    f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
+  f = function([], gpuarray.basic_ops.GpuFromHost(None)(tensor.exp(x)))
  print(f.maker.fgraph.toposort())
  t0 = time.time()
  for i in range(iters):
      r = f()
  t1 = time.time()
  print("Looping %d times took %f seconds" % (iters, t1 - t0))
-    print("Result is %s" % (r,))
+  print("Result is %s" % (numpy.asarray(r),))
-    print("Numpy result is %s" % (numpy.asarray(r),))
+  if numpy.any([isinstance(x.op, tensor.Elemwise) and
-    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
+                ('Gpu' not in type(x.op).__name__)
+                for x in f.maker.fgraph.toposort()]):
      print('Used the cpu')
  else:
      print('Used the gpu')
-The output from this program is
+Here the :func:`theano.gpuarray.basic_ops.GpuFromHost(None)` call
+means "copy input to the GPU", with ``None`` the default GPU context when not
+explicitly given. However during the optimization phase,
+since the result will already be on the gpu, it will be removed.  It is
+used here to tell theano that we want the result on the GPU.
+The output is
 .. testoutput::
   :hide:
   :options: +ELLIPSIS, +SKIP
-   Using gpu device 0: GeForce GTX 580
+   Using device cuda0: ...
-   [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
+   [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
   Looping 1000 times took ... seconds
-   Result is <CudaNdarray object at 0x...>
+   Result is ...
-   Numpy result is ...
   Used the gpu
 .. code-block:: none
-    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
+  $ THEANO_FLAGS=device=cuda0 python check2.py
-    Using gpu device 0: GeForce GTX 580
+  Using device cuda0: GeForce GTX 275
-    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
+  [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
-    Looping 1000 times took 0.34898686409 seconds
+  Looping 1000 times took 0.455810785294 seconds
-    Result is <CudaNdarray object at 0x6a7a5f0>
+  Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
-    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
+    1.62323285]
-      1.62323296]
  Used the gpu
-Here we've shaved off about 50% of the run-time by simply not copying
-the resulting array back to the host.  The object returned by each
-function call is now not a NumPy array but a "CudaNdarray" which can
-be converted to a NumPy ndarray by the normal NumPy casting mechanism
-using something like ``numpy.asarray()``.
-For even more speed you can play with the ``borrow`` flag.  See
+While the time per call appears to be much lower than the two previous
+invocations (and should indeed be lower, since we avoid a transfer)
+the massive speedup we obtained is in part due to the asynchronous nature
+of execution on GPUs, meaning that the work isn't completed yet, just
+'launched'.  We'll talk about that later.
+The object returned is a GpuArray from pygpu.  It mostly acts as a
+numpy ndarray with some exceptions due to its data being on the GPU.
+You can copy it to the host and convert it to a regular ndarray by
+using usual numpy casting such as ``numpy.asarray()``.
+For even more speed, you can play with the ``borrow`` flag.  See
 :ref:`borrowfunction`.
-What Can Be Accelerated on the GPU
+What Can be Accelerated on the GPU
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The performance characteristics will change as we continue to optimize our
+The performance characteristics will of course vary from device to
-implementations, and vary from device to device, but to give a rough idea of
+device, and also as we refine our implementation:
-what to expect right now:
+* In general, matrix multiplication, convolution, and large element-wise
-* Only computations
+  operations can be accelerated a lot (5-50x) when arguments are large enough
-  with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but
+  to keep 30 processors busy.
-  *float64* computations are still relatively slow (Jan 2010).
+* Indexing, dimension-shuffling and constant-time reshaping will be equally fast
-* Matrix
+  on GPU as on CPU.
-  multiplication, convolution, and large element-wise operations can be
+* Summation over rows/columns of tensors can be a little slower on the
-  accelerated a lot (5-50x) when arguments are large enough to keep 30
+  GPU than on the CPU.
-  processors busy.
+* Copying of large quantities of data to and from a device is relatively slow,
-* Indexing,
+  and often cancels most of the advantage of one or two accelerated functions
-  dimension-shuffling and  constant-time reshaping will be equally fast on GPU
+  on that data. Getting GPU performance largely hinges on making data transfer
-  as on CPU.
+  to the device pay off.
-* Summation
-  over rows/columns of tensors can be a little slower on the GPU than on the CPU.
+The backend supports all regular theano data types (float32, float64,
-* Copying
+int, ...), however GPU support varies and some units can't deal with
-  of large quantities of data to and from a device is relatively slow, and
+double (float64) or small (less than 32 bits like int16) data types.
-  often cancels most of the advantage of one or two accelerated functions on
+You will get an error at compile time or runtime if this is the case.
-  that data.  Getting GPU performance largely hinges on making data transfer to
-  the device pay off.
+By default all inputs will get transferred to GPU. You can prevent an
+input from getting transferred by setting its tag.target attribute to
+'cpu'.
+Complex support is untested and most likely completely broken.
 Tips for Improving Performance on GPU
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-* Consider
+* Consider adding ``floatX=float32`` (or the type you are using) to your
-  adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
+  ``.theanorc`` file if you plan to do a lot of GPU work.
-  GPU work.
 * Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`
-* Prefer
+* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
-  constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
+  ``dmatrix``, ``dvector`` and ``dscalar`` because the former will give
-  ``dscalar`` because the former will give you *float32* variables when
+  you *float32* variables when ``floatX=float32``.
-  ``floatX=float32``.
+* Ensure that your output variables have a *float32* dtype and not *float64*.
-* Ensure
+  The more *float32* variables are in your graph, the more work the GPU can do for
-  that your output variables have a *float32* dtype and not *float64*.  The
-  more *float32* variables are in your graph, the more work the GPU can do for
  you.
-* Minimize
+* Minimize transfers to the GPU device by using ``shared`` *float32* variables
-  tranfers to the GPU device by using ``shared`` *float32* variables to store
+  to store frequently-accessed data (see :func:`shared()<shared.shared>`).
-  frequently-accessed data (see :func:`shared()<shared.shared>`).  When using
+  When using the GPU, *float32* tensor ``shared`` variables are stored on
-  the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
+  the GPU by default to eliminate transfer time for GPU ops using those variables.
-  eliminate transfer time for GPU ops using those variables.
 * If you aren't happy with the performance you see, try running your script with
  ``profile=True`` flag. This should print some timing information at program
  termination. Is time being used sensibly?   If an op or Apply is
  taking more time than its share, then if you know something about GPU
  programming, have a look at how it's implemented in theano.sandbox.cuda.
-  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
+  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
-  This can tell you if not enough of your graph is on the GPU or if there
+  Xs(X%) in transfer op*. This can tell you if not enough of your graph is
-  is too much memory transfer.
+  on the GPU or if there is too much memory transfer.
-* Use nvcc options. nvcc supports those options to speed up some
+* Use nvcc options. nvcc supports those options to speed up some computations:
-  computations: `-ftz=true` to `flush denormals values to
+  `-ftz=true` to `flush denormals values to zeros.
-  zeros. <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
+  <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
  `--prec-div=false` and `--prec-sqrt=false` options to speed up
  division and square root operation by being less precise. You can
  enable all of them with the `nvcc.flags=--use_fast_math` Theano
  flag or you can enable them individually as in this example:
  `nvcc.flags=-ftz=true --prec-div=false`.
-* To investigate whether if all the Ops in the computational graph are running on GPU.
+* To investigate whether all the Ops in the computational graph are
-  It is possible to debug or check your code by providing a value to `assert_no_cpu_op`
+  running on GPU, it is possible to debug or check your code by providing
-  flag, i.e. `warn`, for warning `raise` for raising an error or `pdb` for putting a breakpoint
+  a value to the `assert_no_cpu_op` flag, i.e. `warn` for warning, `raise` for
-  in the computational graph if there is a CPU Op.
+  raising an error or `pdb` for putting a breakpoint in the computational
+  graph if there is a CPU Op.
-.. _gpu_async:
-GPU Async capabilities
+GPU Async Capabilities
 ~~~~~~~~~~~~~~~~~~~~~~
-Ever since Theano 0.6 we started to use the asynchronous capability of
+By default, all operations on the GPU are run asynchronously.  This
-GPUs. This allows us to be faster but with the possibility that some
+means that they are only scheduled to run and the function returns.
-errors may be raised later than when they should occur. This can cause
+This is handled somewhat transparently by the underlying libgpuarray.
-difficulties when profiling Theano apply nodes. There is a NVIDIA
-driver feature to help with these issues. If you set the environment
+A forced synchronization point is introduced when doing memory
-variable CUDA_LAUNCH_BLOCKING=1 then all kernel calls will be
+transfers between device and host.
-automatically synchronized. This reduces performance but provides good
-profiling and appropriately placed error messages.
+It is possible to force synchronization for a particular GpuArray by
+calling its ``sync()`` method.  This is useful to get accurate timings
-This feature interacts with Theano garbage collection of intermediate
+when doing benchmarks.
-results. To get the most of this feature, you need to disable the gc
-as it inserts synchronization points in the graph. Set the Theano flag
-``allow_gc=False`` to get even faster speed! This will raise the memory
-usage.
 Changing the Value of Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -261,9 +273,8 @@ To change the value of a ``shared`` variable, e.g. to provide new data to proces
 use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
 see :ref:`aliasing`.
 Exercise
-++++++++
+~~~~~~~~
 Consider again the logistic regression:
@@ -343,15 +354,29 @@ Where does it come from? (Use ``profile=True`` flag.)
 What can be done to further increase the speed of the GPU version? Put your ideas to test.
+:download:`Solution<using_gpu_solution_1.py>`
+-------------------------------------------
+.. _cuda:
+CUDA backend
+------------
+If you have not done so already, you will need to install Nvidia's
+GPU-programming toolchain (CUDA) and configure Theano to use it.
+We provide installation instructions for :ref:`Linux <gpu_linux>`,
+:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
+The old CUDA backend can be activated using the flags ``device=gpu`` or
+``device=gpu{0,1,...}``.
 .. Note::
-   * Only 32 bit floats are currently supported (development is in progress).
+   * Only 32 bit floats are supported.
   * ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
   * There is a limit of one GPU per process.
-   * Use the Theano flag ``device=gpu`` to require use of the GPU device.
-   * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
   * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
   * ``Cast`` inputs before storing them into a ``shared`` variable.
@@ -361,211 +386,6 @@ What can be done to further increase the speed of the GPU version? Put your idea
     * Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
     * Notice that a new casting mechanism is being developed.
-:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
-.. _gpuarray:
-GpuArray Backend
-----------------
-If you have not done so already, you will need to install libgpuarray
-as well as at least one computing toolkit.  Instructions for doing so
-are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
-While all types of devices are supported if using OpenCL, for the
-remainder of this section, whatever compute device you are using will
-be referred to as GPU.
-.. warning::
-  While it is fully our intention to support OpenCL, as of May 2014
-  this support is still in its infancy.  A lot of very useful ops
-  still do not support it because they were ported from the old
-  backend with minimal change.
-Testing Theano with GPU
-~~~~~~~~~~~~~~~~~~~~~~~
-To see if your GPU is being used, cut and paste the following program
-into a file and run it.
-.. testcode::
-  from theano import function, config, shared, tensor
-  import numpy
-  import time
-  vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
-  iters = 1000
-  rng = numpy.random.RandomState(22)
-  x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-  f = function([], tensor.exp(x))
-  print(f.maker.fgraph.toposort())
-  t0 = time.time()
-  for i in range(iters):
-      r = f()
-  t1 = time.time()
-  print("Looping %d times took %f seconds" % (iters, t1 - t0))
-  print("Result is %s" % (r,))
-  if numpy.any([isinstance(x.op, tensor.Elemwise) and
-                ('Gpu' not in type(x.op).__name__)
-                for x in f.maker.fgraph.toposort()]):
-      print('Used the cpu')
-  else:
-      print('Used the gpu')
-The program just compute ``exp()`` of a bunch of random numbers.  Note
-that we use the :func:`theano.shared` function to make sure that the
-input *x* is stored on the GPU.
-.. testoutput::
-   :hide:
-   :options: +ELLIPSIS
-   [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
-   Looping 1000 times took ... seconds
-   Result is ...
-   Used the cpu
-.. code-block:: none
-  $ THEANO_FLAGS=device=cpu python check1.py
-  [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
-  Looping 1000 times took 2.6071999073 seconds
-  Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
-    1.62323285]
-  Used the cpu
-  $ THEANO_FLAGS=device=cuda0 python check1.py
-  Using device cuda0: GeForce GTX 275
-  [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
-  Looping 1000 times took 2.28562092781 seconds
-  Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
-    1.62323285]
-  Used the gpu
-Returning a Handle to Device-Allocated Data
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-By default functions that execute on the GPU still return a standard
-numpy ndarray.  A transfer operation is inserted just before the
-results are returned to ensure a consistent interface with CPU code.
-This allows changing the deivce some code runs on by only replacing
-the value of the ``device`` flag without touching the code.
-If you don't mind a loss of flexibility, you can ask theano to return
-the GPU object directly.  The following code is modifed to do just that.
-.. testcode::
-  from theano import function, config, shared, tensor, gpuarray
-  import numpy
-  import time
-  vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
-  iters = 1000
-  rng = numpy.random.RandomState(22)
-  x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-  f = function([], gpuarray.basic_ops.GpuFromHost(None)(tensor.exp(x)))
-  print(f.maker.fgraph.toposort())
-  t0 = time.time()
-  for i in range(iters):
-      r = f()
-  t1 = time.time()
-  print("Looping %d times took %f seconds" % (iters, t1 - t0))
-  print("Result is %s" % (numpy.asarray(r),))
-  if numpy.any([isinstance(x.op, tensor.Elemwise) and
-                ('Gpu' not in type(x.op).__name__)
-                for x in f.maker.fgraph.toposort()]):
-      print('Used the cpu')
-  else:
-      print('Used the gpu')
-Here the :func:`theano.gpuarray.basic_ops.GpuFromHost(None)` call
-means "copy input to the GPU", with ``None`` the default GPU context when not
-explicitly given. However during the optimization phase,
-since the result will already be on the gpu, it will be removed.  It is
-used here to tell theano that we want the result on the GPU.
-The output is
-.. testoutput::
-   :hide:
-   :options: +ELLIPSIS, +SKIP
-   Using device cuda0: ...
-   [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
-   Looping 1000 times took ... seconds
-   Result is ...
-   Used the gpu
-.. code-block:: none
-  $ THEANO_FLAGS=device=cuda0 python check2.py
-  Using device cuda0: GeForce GTX 275
-  [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
-  Looping 1000 times took 0.455810785294 seconds
-  Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
-    1.62323285]
-  Used the gpu
-While the time per call appears to be much lower than in the two previous
-invocations (and should indeed be lower, since we avoid a transfer),
-the massive speedup we obtained is in part due to the asynchronous nature
-of execution on GPUs, meaning that the work isn't completed yet, just
-'launched'.  We'll talk about that later.
-The object returned is a GpuArray from pygpu.  It mostly acts as a
-numpy ndarray with some exceptions due to its data being on the GPU.
-You can copy it to the host and convert it to a regular ndarray by
-using usual numpy casting such as ``numpy.asarray()``.
-For even more speed, you can play with the ``borrow`` flag.  See
-:ref:`borrowfunction`.
-What Can be Accelerated on the GPU
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The performance characteristics will of course vary from device to
-device, and also as we refine our implementation.
-This backend supports all regular Theano data types (float32, float64,
-int, ...).  However, GPU support varies, and some units can't deal with
-double (float64) or small (less than 32 bits, like int16) data types.
-You will get an error at compile time or runtime if this is the case.
-By default, all inputs will get transferred to the GPU.  You can prevent an
-input from getting transferred by setting its ``tag.target`` attribute to
-``'cpu'``.
-Complex support is untested and most likely completely broken.
-In general, large operations like matrix multiplication, or
-element-wise operations with large inputs, will be significantly
-faster.
-GPU Async Capabilities
-~~~~~~~~~~~~~~~~~~~~~~
-By default, all operations on the GPU are run asynchronously.  This
-means that they are only scheduled to run, and the function returns
-without waiting for them to finish.  This is handled somewhat
-transparently by the underlying libgpuarray.
-A forced synchronization point is introduced when doing memory
-transfers between device and host.
-It is possible to force synchronization for a particular GpuArray by
-calling its ``sync()`` method.  This is useful to get accurate timings
-when doing benchmarks.
 -------------------------------------------
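Note on the ``sync()`` advice in the "GPU Async Capabilities" text above: because GPU work is only launched, not finished, when a call returns, a benchmark loop should wait for pending work before reading the clock. The following is a minimal sketch of that timing pattern, not code from this commit. ``time_calls`` is a hypothetical helper, and its ``sync`` argument is an assumed zero-argument callable standing in for something like a GpuArray's ``sync()`` method. A CPU-only stand-in is used so the sketch runs without a GPU.

```python
import time


def time_calls(fn, iters, sync=None):
    """Time `iters` calls to `fn`, optionally waiting for pending work.

    `sync` is a hypothetical zero-argument callable standing in for
    something like a GpuArray's ``sync()`` method.
    """
    fn()  # warm-up call, excluded from the measurement
    if sync is not None:
        sync()  # let the warm-up finish before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for launched work before stopping the clock
    return time.perf_counter() - t0


# CPU-only stand-in so the sketch runs without a GPU.
calls = []
elapsed = time_calls(lambda: calls.append(1), iters=100)
print("Looping %d times took %f seconds" % (100, elapsed))
```

Without the final ``sync`` call, the measured time on an asynchronous device would mostly reflect how long it takes to queue the work rather than to complete it.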

--- a/doc/tutorial/using_gpu_solution_1.py
+++ b/doc/tutorial/using_gpu_solution_1.py
@@ -11,8 +11,6 @@ import numpy
 import theano
 import theano.tensor as tt
-from theano import sandbox, Out
 theano.config.floatX = 'float32'
 rng = numpy.random
@@ -20,7 +18,7 @@ rng = numpy.random
 N = 400
 feats = 784
 D = (rng.randn(N, feats).astype(theano.config.floatX),
-rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
+    rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
 training_steps = 10000
 # Declare Theano symbolic variables
@@ -41,30 +39,19 @@ cost = tt.cast(xent.mean(), 'float32') + \
    0.01 * (w ** 2).sum()  # The cost to optimize
 gw, gb = tt.grad(cost, [w, b])
-"""
-# Compile expressions to functions
-train = theano.function(
-            inputs=[x, y],
-            outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
-            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
-            name="train")
-predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
-            name="predict")
-"""
 # Compile expressions to functions
 train = theano.function(
            inputs=[],
            outputs=[prediction, xent],
-            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
+            updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
            name="train")
 predict = theano.function(inputs=[], outputs=prediction,
            name="predict")
-if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
+if any([n.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for n in
 train.maker.fgraph.toposort()]):
    print('Used the cpu')
-elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
+elif any([n.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for n in
 train.maker.fgraph.toposort()]):
    print('Used the gpu')
 else:
@@ -101,171 +88,171 @@ prediction on D
 # in the script, followed by a summary for all functions.
 # We'll show here only the summary:
-Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
+Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
 Function profiling
 ==================
-  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
+  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
-  Time in 10002 calls to Function.__call__: 1.590916e+00s
+  Time in 10001 calls to Function.__call__: 1.300452e+00s
-  Time in Function.fn.__call__: 1.492365e+00s (93.805%)
+  Time in Function.fn.__call__: 1.215823e+00s (93.492%)
-  Time in thunks: 1.408159e+00s (88.512%)
+  Time in thunks: 1.157602e+00s (89.015%)
-  Total compile time: 6.309664e+00s
+  Total compile time: 8.922548e-01s
-    Number of Apply nodes: 25
+    Number of Apply nodes: 17
-    Theano Optimizer time: 4.848340e-01s
+    Theano Optimizer time: 6.270301e-01s
-       Theano validate time: 5.454302e-03s
+       Theano validate time: 5.993605e-03s
-    Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s
+    Theano Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
+       Import time 3.543139e-03s
+Time in all call to theano.grad() 1.848292e-02s
+Time since theano import 2.864s
 Class
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
-  59.6%    59.6%       0.839s       4.19e-05s     C    20001       3   theano.tensor.blas_c.CGemv
+  64.5%    64.5%       0.747s       3.73e-05s     C    20001       3   theano.tensor.blas_c.CGemv
-  30.1%    89.7%       0.424s       4.71e-06s     C    90001      10   theano.tensor.elemwise.Elemwise
+  33.1%    97.7%       0.384s       4.79e-06s     C    80001       9   theano.tensor.elemwise.Elemwise
-   5.5%    95.2%       0.078s       7.79e-02s     Py       1       1   theano.tensor.blas.Gemv
+   1.0%    98.6%       0.011s       1.14e-06s     C    10000       1   theano.tensor.elemwise.Sum
-   1.9%    97.1%       0.026s       1.30e-06s     C    20001       3   theano.tensor.basic.Alloc
+   0.7%    99.4%       0.009s       2.85e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
-   1.3%    98.4%       0.018s       1.85e-06s     C    10000       1   theano.tensor.elemwise.Sum
+   0.3%    99.7%       0.004s       3.64e-07s     C    10001       2   theano.tensor.basic.AllocEmpty
-   1.0%    99.4%       0.014s       4.78e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
+   0.3%   100.0%       0.004s       1.78e-07s     C    20001       3   theano.compile.ops.Shape_i
-   0.6%   100.0%       0.008s       4.23e-07s     C    20001       3   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
 Ops
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
-  59.6%    59.6%       0.839s       4.19e-05s     C     20001        3   CGemv{inplace}
+  64.5%    64.5%       0.747s       3.73e-05s     C     20001        3   CGemv{inplace}
-  15.8%    75.4%       0.223s       2.23e-05s     C     10000        1   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
+  18.7%    83.2%       0.217s       2.17e-05s     C     10000        1   Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
-   7.7%    83.1%       0.109s       1.09e-05s     C     10000        1   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
+   8.9%    92.1%       0.103s       1.03e-05s     C     10000        1   Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
-   5.5%    88.7%       0.078s       7.79e-02s     Py       1        1   Gemv{no_inplace}
+   4.3%    96.4%       0.050s       4.98e-06s     C     10000        1   Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
-   4.3%    92.9%       0.060s       6.00e-06s     C     10000        1   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
+   1.0%    97.4%       0.011s       1.14e-06s     C     10000        1   Sum{acc_dtype=float64}
-   1.9%    94.8%       0.026s       1.30e-06s     C     20001        3   Alloc
+   0.5%    97.9%       0.006s       2.83e-07s     C     20001        3   InplaceDimShuffle{x}
-   1.3%    96.1%       0.018s       1.85e-06s     C     10000        1   Sum{acc_dtype=float64}
+   0.4%    98.3%       0.004s       4.22e-07s     C     10000        1   Elemwise{sub,no_inplace}
-   0.7%    96.8%       0.009s       4.73e-07s     C     20001        3   InplaceDimShuffle{x}
+   0.3%    98.6%       0.004s       3.70e-07s     C     10000        1   Elemwise{neg,no_inplace}
-   0.6%    97.4%       0.009s       8.52e-07s     C     10000        1   Elemwise{sub,no_inplace}
+   0.3%    98.9%       0.004s       3.64e-07s     C     10001        2   AllocEmpty{dtype='float32'}
-   0.6%    98.0%       0.008s       4.23e-07s     C     20001        3   Shape_i{0}
+   0.3%    99.2%       0.004s       1.78e-07s     C     20001        3   Shape_i{0}
-   0.5%    98.5%       0.007s       7.06e-07s     C     10000        1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
+   0.2%    99.5%       0.003s       2.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
-   0.5%    98.9%       0.007s       6.57e-07s     C     10000        1   Elemwise{neg,no_inplace}
+   0.2%    99.7%       0.003s       2.65e-07s     C     10000        1   Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
-   0.3%    99.3%       0.005s       4.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
+   0.2%    99.9%       0.002s       1.98e-07s     C     10000        1   Elemwise{Cast{float32}}
-   0.3%    99.5%       0.004s       3.78e-07s     C     10000        1   Elemwise{inv,no_inplace}
+   0.1%   100.0%       0.002s       1.54e-07s     C     10000        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
-   0.2%    99.8%       0.003s       3.44e-07s     C     10000        1   Elemwise{Cast{float32}}
+   0.0%   100.0%       0.000s       4.77e-06s     C        1        1   Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
-   0.2%   100.0%       0.003s       3.01e-07s     C     10000        1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
-   0.0%   100.0%       0.000s       8.11e-06s     C        1        1   Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
 Apply
 ------
 <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
-  31.6%    31.6%       0.445s       4.45e-05s   10000     7   CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
+  34.0%    34.0%       0.394s       3.94e-05s   10000     7   CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
-  27.9%    59.6%       0.393s       3.93e-05s   10000    17   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
+  30.5%    64.5%       0.353s       3.53e-05s   10000    15   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
-  15.8%    75.4%       0.223s       2.23e-05s   10000    14   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
+  18.7%    83.2%       0.217s       2.17e-05s   10000    12   Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
-   7.7%    83.1%       0.109s       1.09e-05s   10000    15   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
+   8.9%    92.1%       0.103s       1.03e-05s   10000    13   Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
-   5.5%    88.7%       0.078s       7.79e-02s      1     0   Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
+   4.3%    96.4%       0.050s       4.98e-06s   10000    11   Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
-   4.3%    92.9%       0.060s       6.00e-06s   10000    13   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
+   1.0%    97.4%       0.011s       1.14e-06s   10000    14   Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
-   1.3%    94.2%       0.018s       1.85e-06s   10000    16   Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
+   0.4%    97.8%       0.004s       4.22e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
-   1.0%    95.2%       0.013s       1.34e-06s   10000     5   Alloc(TensorConstant{0.0}, Shape_i{0}.0)
+   0.3%    98.1%       0.004s       3.76e-07s   10000     0   InplaceDimShuffle{x}(b)
-   0.9%    96.1%       0.013s       1.27e-06s   10000    12   Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
+   0.3%    98.4%       0.004s       3.70e-07s   10000    10   Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
-   0.6%    96.7%       0.009s       8.52e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
+   0.3%    98.7%       0.004s       3.64e-07s   10000     5   AllocEmpty{dtype='float32'}(Shape_i{0}.0)
-   0.5%    97.2%       0.007s       7.06e-07s   10000     9   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
+   0.2%    99.0%       0.003s       2.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
-   0.5%    97.6%       0.007s       6.57e-07s   10000    11   Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
+   0.2%    99.2%       0.003s       2.65e-07s   10000     9   Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
-   0.4%    98.1%       0.006s       6.27e-07s   10000     0   InplaceDimShuffle{x}(b)
+   0.2%    99.4%       0.002s       2.21e-07s   10000     1   Shape_i{0}(x)
-   0.4%    98.5%       0.006s       5.90e-07s   10000     1   Shape_i{0}(x)
+   0.2%    99.6%       0.002s       1.98e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
-   0.3%    98.9%       0.005s       4.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
+   0.2%    99.7%       0.002s       1.90e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
-   0.3%    99.1%       0.004s       3.78e-07s   10000    10   Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
+   0.1%    99.9%       0.002s       1.54e-07s   10000    16   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
-   0.2%    99.4%       0.003s       3.44e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
+   0.1%   100.0%       0.001s       1.34e-07s   10000     3   Shape_i{0}(y)
-   0.2%    99.6%       0.003s       3.19e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
+   0.0%   100.0%       0.000s       3.89e-05s      1     3   CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
-   0.2%    99.8%       0.003s       3.01e-07s   10000    18   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
+   0.0%   100.0%       0.000s       4.77e-06s      1     4   Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
-   0.2%   100.0%       0.003s       2.56e-07s   10000     3   Shape_i{0}(y)
+   0.0%   100.0%       0.000s       1.19e-06s      1     0   InplaceDimShuffle{x}(b)
-   ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)
+   ... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
 # 2.2 Profiling for GPU computations
 # In your terminal, type:
-$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py
+$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
 # You'll see first the output of the script:
 Used the gpu
 target values for D
 prediction on D
-Results were produced using a GeForce GTX TITAN
+Results were produced using a GeForce GTX TITAN X
 # Profiling summary for all functions:
 Function profiling
 ==================
-  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
+  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
-  Time in 10002 calls to Function.__call__: 3.535239e+00s
+  Time in 10001 calls to Function.__call__: 4.181247e+00s
-  Time in Function.fn.__call__: 3.420863e+00s (96.765%)
+  Time in Function.fn.__call__: 4.081113e+00s (97.605%)
-  Time in thunks: 2.865905e+00s (81.067%)
+  Time in thunks: 3.915566e+00s (93.646%)
-  Total compile time: 4.728150e-01s
+  Total compile time: 9.256095e+00s
-    Number of Apply nodes: 36
+    Number of Apply nodes: 21
-    Theano Optimizer time: 4.283385e-01s
+    Theano Optimizer time: 9.996419e-01s
-       Theano validate time: 7.687330e-03s
+       Theano validate time: 6.523132e-03s
-    Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s
+    Theano Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
+       Import time 4.228115e-03s
+Time in all call to theano.grad() 3.286195e-02s
+Time since theano import 15.415s
 Class
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
-  45.7%    45.7%       1.308s       1.64e-05s     C    80001       9   theano.sandbox.cuda.basic_ops.GpuElemwise
+  59.5%    59.5%       2.329s       1.16e-04s     C    20001       3   theano.sandbox.gpuarray.blas.GpuGemv
-  17.2%    62.8%       0.492s       2.46e-05s     C    20002       4   theano.sandbox.cuda.blas.GpuGemv
+  29.8%    89.3%       1.166s       1.30e-05s     C    90001      10   theano.sandbox.gpuarray.elemwise.GpuElemwise
-  15.1%    77.9%       0.433s       2.17e-05s     C    20001       3   theano.sandbox.cuda.basic_ops.GpuAlloc
+   4.1%    93.4%       0.162s       8.10e-06s     C    20001       3   theano.sandbox.gpuarray.basic_ops.HostFromGpu
-   8.2%    86.1%       0.234s       1.17e-05s     C    20002       4   theano.sandbox.cuda.basic_ops.HostFromGpu
+   3.3%    96.7%       0.131s       1.31e-05s     C    10000       1   theano.sandbox.gpuarray.elemwise.GpuCAReduceCuda
-   7.2%    93.3%       0.207s       2.07e-05s     C    10000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
+   1.6%    98.3%       0.061s       6.10e-06s     C    10000       1   theano.sandbox.gpuarray.basic_ops.GpuFromHost
-   4.4%    97.7%       0.127s       1.27e-05s     C    10003       4   theano.sandbox.cuda.basic_ops.GpuFromHost
+   0.8%    99.1%       0.033s       1.09e-06s     C    30001       4   theano.sandbox.gpuarray.elemwise.GpuDimShuffle
-   0.9%    98.6%       0.025s       8.23e-07s     C    30001       4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
+   0.7%    99.8%       0.026s       2.59e-06s     C    10001       2   theano.sandbox.gpuarray.basic_ops.GpuAllocEmpty
-   0.7%    99.3%       0.020s       9.88e-07s     C    20001       3   theano.tensor.elemwise.Elemwise
+   0.2%   100.0%       0.008s       3.95e-07s     C    20001       3   theano.compile.ops.Shape_i
-   0.5%    99.8%       0.014s       7.18e-07s     C    20001       3   theano.compile.ops.Shape_i
-   0.2%   100.0%       0.006s       5.78e-07s     C    10000       1   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
 Ops
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
-  17.2%    17.2%       0.492s       2.46e-05s     C     20001        3   GpuGemv{inplace}
+  59.5%    59.5%       2.329s       1.16e-04s     C     20001        3   GpuGemv{inplace=True}
-   8.2%    25.3%       0.234s       1.17e-05s     C     20002        4   HostFromGpu
+   4.1%    63.6%       0.162s       8.10e-06s     C     20001        3   HostFromGpu(gpuarray)
-   8.0%    33.3%       0.228s       2.28e-05s     C     10001        2   GpuAlloc{memset_0=True}
+   4.0%    67.6%       0.157s       1.57e-05s     C     10000        1   GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
-   7.4%    40.7%       0.211s       2.11e-05s     C     10000        1   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
+   3.8%    71.4%       0.149s       1.49e-05s     C     10000        1   GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
-   7.2%    47.9%       0.207s       2.07e-05s     C     10000        1   GpuCAReduce{add}{1}
+   3.7%    75.1%       0.144s       1.44e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
-   7.1%    55.0%       0.205s       2.05e-05s     C     10000        1   GpuAlloc
+   3.6%    78.7%       0.141s       1.41e-05s     C     10000        1   GpuElemwise{gt,no_inplace}
-   6.9%    62.0%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
+   3.4%    82.1%       0.133s       1.33e-05s     C     10000        1   GpuElemwise{Cast{float32}}[]<gpuarray>
-   6.9%    68.9%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{inv,no_inplace}
+   3.4%    85.5%       0.133s       1.33e-05s     C     10000        1   GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
-   6.2%    75.1%       0.178s       1.78e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
+   3.3%    88.8%       0.131s       1.31e-05s     C     10000        1   GpuCAReduceCuda{add}
-   5.6%    80.6%       0.159s       1.59e-05s     C     10000        1   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
+   2.9%    91.7%       0.112s       1.12e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
-   4.4%    85.1%       0.127s       1.27e-05s     C     10003        4   GpuFromHost
+   2.6%    94.3%       0.102s       1.02e-05s     C     10000        1   GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
-   4.3%    89.4%       0.124s       1.24e-05s     C     10000        1   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
+   2.5%    96.7%       0.096s       9.63e-06s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
-   4.2%    93.6%       0.121s       1.21e-05s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]
+   1.6%    98.3%       0.061s       6.10e-06s     C     10000        1   GpuFromHost<None>
-   4.2%    97.7%       0.119s       1.19e-05s     C     10000        1   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
+   0.7%    99.0%       0.026s       2.59e-06s     C     10001        2   GpuAllocEmpty{dtype='float32', context_name=None}
-   0.5%    98.2%       0.014s       7.18e-07s     C     20001        3   Shape_i{0}
+   0.5%    99.5%       0.021s       1.06e-06s     C     20001        3   InplaceGpuDimShuffle{x}
-   0.5%    98.7%       0.013s       1.33e-06s     C     10001        2   Elemwise{gt,no_inplace}
+   0.3%    99.8%       0.011s       1.14e-06s     C     10000        1   InplaceGpuDimShuffle{1,0}
-   0.3%    99.0%       0.010s       9.81e-07s     C     10000        1   GpuDimShuffle{1,0}
+   0.2%   100.0%       0.008s       3.95e-07s     C     20001        3   Shape_i{0}
-   0.3%    99.3%       0.008s       7.90e-07s     C     10000        1   GpuDimShuffle{0}
+   0.0%   100.0%       0.000s       2.00e-05s     C        1        1   GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
-   0.2%    99.6%       0.007s       6.97e-07s     C     10001        2   GpuDimShuffle{x}
+   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
-   0.2%    99.8%       0.006s       6.50e-07s     C     10000        1   Elemwise{Cast{float32}}
-   ... (remaining 3 Ops account for   0.20%(0.01s) of the runtime)
 Apply
 ------
 <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
-   8.8%     8.8%       0.251s       2.51e-05s   10000    22   GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
+  55.0%    55.0%       2.154s       2.15e-04s   10000     7   GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
-   8.4%    17.2%       0.241s       2.41e-05s   10000     7   GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
+   4.5%    59.5%       0.176s       1.76e-05s   10000    18   GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
-   8.0%    25.1%       0.228s       2.28e-05s   10000     5   GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
+   4.0%    63.5%       0.157s       1.57e-05s   10000    12   GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
-   7.4%    32.5%       0.211s       2.11e-05s   10000    13   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
+   3.8%    67.3%       0.149s       1.49e-05s   10000    15   GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
-   7.2%    39.7%       0.207s       2.07e-05s   10000    21   GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
+   3.7%    71.0%       0.144s       1.44e-05s   10000     4   GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
-   7.1%    46.9%       0.205s       2.05e-05s   10000    17   GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
+   3.6%    74.6%       0.141s       1.41e-05s   10000    16   GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
-   6.9%    53.8%       0.198s       1.98e-05s   10000     4   GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
+   3.4%    78.0%       0.133s       1.33e-05s   10000    10   GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
-   6.9%    60.7%       0.198s       1.98e-05s   10000    12   GpuElemwise{inv,no_inplace}(GpuFromHost.0)
+   3.4%    81.4%       0.133s       1.33e-05s   10000     9   GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
-   6.2%    66.9%       0.178s       1.78e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
+   3.3%    84.7%       0.131s       1.31e-05s   10000    17   GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
-   5.6%    72.5%       0.159s       1.59e-05s   10000    19   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
+   2.9%    87.5%       0.112s       1.12e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
-   4.8%    77.3%       0.138s       1.38e-05s   10000    18   HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
+   2.6%    90.1%       0.102s       1.02e-05s   10000    20   GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
-   4.4%    81.7%       0.126s       1.26e-05s   10000    10   GpuFromHost(Elemwise{Cast{float32}}.0)
+   2.5%    92.6%       0.096s       9.63e-06s   10000    13   GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
-   4.3%    86.0%       0.124s       1.24e-05s   10000     9   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
+   2.3%    94.9%       0.090s       9.04e-06s   10000    19   HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
-   4.2%    90.2%       0.121s       1.21e-05s   10000    15   GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
+   1.8%    96.7%       0.072s       7.16e-06s   10000    14   HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
-   4.2%    94.4%       0.119s       1.19e-05s   10000    23   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
+   1.6%    98.3%       0.061s       6.10e-06s   10000     6   GpuFromHost<None>(Shape_i{0}.0)
-   3.4%    97.7%       0.096s       9.61e-06s   10000    16   HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
+   0.7%    99.0%       0.026s       2.59e-06s   10000     5   GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
-   0.5%    98.2%       0.013s       1.33e-06s   10000    20   Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
+   0.3%    99.3%       0.013s       1.33e-06s   10000     0   InplaceGpuDimShuffle{x}(b)
-   0.3%    98.5%       0.010s       9.81e-07s   10000     2   GpuDimShuffle{1,0}(x)
+   0.3%    99.6%       0.011s       1.14e-06s   10000     2   InplaceGpuDimShuffle{1,0}(x)
-   0.3%    98.8%       0.008s       8.27e-07s   10000     1   Shape_i{0}(x)
+   0.2%    99.8%       0.008s       7.94e-07s   10000     8   InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
-   0.3%    99.1%       0.008s       7.90e-07s   10000    14   GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
+   0.1%    99.9%       0.005s       5.27e-07s   10000     1   Shape_i{0}(x)
-   ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)
+   ... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
 # 3. Conclusions

--- a/theano/misc/check_blas.py
+++ b/theano/misc/check_blas.py
@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000,
    t0 = 0
    t1 = -1
+    f() # Ignore first function call to get representative time.
    if execute:
        sync = (hasattr(theano, "sandbox") and
                hasattr(theano.sandbox, "cuda") and
                theano.sandbox.cuda.cuda_available)
+        sync2 = (hasattr(theano, "gpuarray") and
+                theano.gpuarray.pygpu_activated)
        t0 = time.time()
        for i in range(iters):
            f()
        if sync:
            theano.sandbox.cuda.synchronize()
+        if sync2:
+            c.get_value(borrow=True, return_internal_type=True).sync()
        t1 = time.time()
    return t1 - t0, impl