about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.

There are currently two ways to use a GPU: one that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend that
only supports NVIDIA cards (:ref:`cuda`).

.. warning::

    If you want to use the new GpuArray backend, make sure to have the
    development version of Theano installed. The 0.8.X releases have not
    been optimized to work correctly with the new backend.

.. _gpuarray:

GpuArray Backend
----------------

If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit. Instructions for doing so
are provided at `libgpuarray
<http://deeplearning.net/software/libgpuarray/installation.html>`_.
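Once everything is installed, one way to make Theano select the GPU by
default is through the ``[global]`` section of the ``.theanorc``
configuration file. A minimal sketch (``floatX = float32`` is optional
but commonly set alongside the device):

```
[global]
device = cuda
floatX = float32
```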

While all types of devices are supported if using OpenCL, for the
remainder of this section whatever compute device you are using will be
referred to as the GPU.

.. warning::

    The backend was designed to support OpenCL; however, current support is
    incomplete. Many useful ops do not yet support it because they were
    ported from the old backend with minimal changes.

Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~

To see if your GPU is being used, cut and paste the following program into a
file and run it.

Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
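For example, the flag can be supplied per run through the ``THEANO_FLAGS``
environment variable. A sketch, assuming the program was saved as
``gpu_test.py`` (a hypothetical filename):

```
# Require the default GPU for this run only
THEANO_FLAGS=device=cuda python gpu_test.py

# Or pin the run to a specific device, e.g. the second GPU
THEANO_FLAGS=device=cuda1 python gpu_test.py
```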

.. testcode::

    from theano import function, config, shared, tensor
    import numpy
    import time
...
    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], tensor.exp(x))
    print(f.maker.fgraph.toposort())
    t0 = time.time()
    for i in range(iters):
...
    t1 = time.time()
    print("Looping %d times took %f seconds" % (iters, t1 - t0))
    print("Result is %s" % (r,))
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print('Used the cpu')
    else:
        print('Used the gpu')

The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.
.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
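The NumPy baseline above can be sketched as a plain CPU loop. This
scaled-down version substitutes small assumed values (``10000`` elements,
``10`` iterations) for the elided ``vlen`` and ``iters``, and draws the
array directly instead of reading it back from a shared variable:

```python
import numpy
import time

# Hypothetical stand-in for the shared variable's contents; the real
# script uses rng.rand(vlen) with vlen defined in the elided setup.
rng = numpy.random.RandomState(22)
x = rng.rand(10000).astype('float32')
iters = 10

t0 = time.time()
for i in range(iters):
    # Pure NumPy element-wise exp on the CPU, repeated iters times
    r = numpy.exp(x)
t1 = time.time()
print("NumPy CPU loop: %d iterations took %f seconds" % (iters, t1 - t0))
```

This only measures the raw NumPy computation; the Theano timings
additionally reflect the compiled graph and the chosen ``device`` flag.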

.. testoutput::
   :hide:
...