Commit f7a506ff authored by Maxim Kochurov, committed by Brandon T. Willard

Remove gpuarray references from documentation

Parent cc584d6c
@@ -14,11 +14,8 @@ Acknowledgements
 * The developers of `Theano <https://github.com/Theano/Theano>`_
 * All `Aesara contributors <https://github.com/aesara-devs/aesara/graphs/contributors>`_.
 * All Theano users that have given us feedback.
-* The GPU implementation of tensordot is based on code from Tijmen
-  Tieleman's `gnumpy <http://www.cs.toronto.edu/~tijmen/gnumpy.html>`_
 * Our random number generator implementation on CPU and GPU uses the MRG31k3p algorithm that is described in:
   P. L'Ecuyer and R. Touzin, `Fast Combined Multiple Recursive Generators with Multipliers of the form a = +/- 2^d +/- 2^e <http://www.informs-sim.org/wsc00papers/090.PDF>`_, Proceedings of the 2000 Winter Simulation Conference, Dec. 2000, 683--689.
   We were authorized by Pierre L'Ecuyer to copy/modify his Java implementation in the `SSJ <http://www.iro.umontreal.ca/~simardr/ssj/>`_ software and to relicense it under BSD 3-Clauses in Theano.
-* A better GPU memory allocator :attr:`CNMeM <config.lib.cnmem>` was included in Theano in the previous GPU back-end. It is still in the history, but not in the current version. It has the same license.
@@ -6,12 +6,10 @@ Extending Aesara with a C :class:`Op`
 =====================================
 This tutorial covers how to extend Aesara with an :class:`Op` that offers a C
-implementation. It does not cover :class:`Op`\s that run on a GPU but it does introduce
-many elements and concepts which are relevant for GPU :class:`Op`\s. This tutorial is
-aimed at individuals who already know how to extend Aesara (see tutorial
-:ref:`creating_an_op`) by adding a new :class:`Op` with a Python implementation
-and will only cover the additional knowledge required to also produce :class:`Op`\s
-with C implementations.
+implementation. This tutorial is aimed at individuals who already know how to
+extend Aesara (see tutorial :ref:`creating_an_op`) by adding a new :class:`Op`
+with a Python implementation and will only cover the additional knowledge
+required to also produce :class:`Op`\s with C implementations.
 Providing an Aesara :class:`Op` with a C implementation requires interacting with
 Python's C-API and Numpy's C-API. Thus, the first step of this tutorial is to
@@ -927,7 +925,7 @@ discussed below.
 further below.
 For every input which has a :attr:`dtype` attribute (this means
-Tensors, and equivalent types on GPU), the following macros will be
+Tensors), the following macros will be
 defined unless your `Op` class has an :attr:`Op.check_input` attribute
 defined to False. In these descriptions 'i' refers to the position
 (indexed from 0) in the input array.
@@ -1035,8 +1033,6 @@ When debugging C code, it can be useful to use GDB for code compiled
 by Aesara.
 For this, you must enable this Aesara flag: `cmodule__remove_gxx_opt=True`.
-For the GPU, you must also add the flag `nvcc.flags=-g` (it slows
-down computation on the GPU, but it is enabled by default on the CPU).
 Then you must start Python inside GDB and in it start your Python
 process:
@@ -824,10 +824,10 @@ will not be accepted.
 :class:`NanGuardMode` helps users find where in the graph NaNs appear. But
 sometimes, we want some variables to not be checked. For example, in
-the old GPU back-end, we use a float32 :class:`CudaNdarray` to store the MRG
+the old GPU back-end, we used a float32 :class:`CudaNdarray` to store the MRG
 random number generator state (they are integers). So if :class:`NanGuardMode`
-check it, it will generate false positive. Another case is related to
-:class:`[Gpu]AllocEmpty` or some computation on it (like done by :class:`Scan`).
+checked it, it would generate a false positive. Another case is related to
+:class:`AllocEmpty` or some computations on it (like done by :class:`Scan`).
 You can tell :class:`NanGuardMode` not to check a variable with:
 :attr:`variable.tag.nan_guard_mode_check`. Also, this tag automatically
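The opt-out tag above can be sketched with plain Python objects. The ``Var``/``Tag`` classes and the ``nan_guard`` loop below are illustrative stand-ins, not Aesara's internals; only the ``variable.tag.nan_guard_mode_check`` attribute name comes from the documentation:

```python
import math

def nan_guard(variables):
    """Raise if a variable's value contains NaN, unless its tag opts out."""
    for var in variables:
        # An explicit False on the tag means: skip this variable.
        if getattr(var.tag, "nan_guard_mode_check", True) is False:
            continue
        if any(math.isnan(v) for v in var.value):
            raise AssertionError(f"NaN detected in {var.name}")

class Tag:
    pass

class Var:
    def __init__(self, name, value):
        self.name, self.value, self.tag = name, value, Tag()

# A state buffer that legitimately holds NaN-looking bits (false positive):
rng_state = Var("rng_state", [float("nan")])
rng_state.tag.nan_guard_mode_check = False  # tell the checker to skip it
nan_guard([rng_state])                      # passes: the variable is excluded
```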
@@ -114,7 +114,7 @@ prefix. The complete list can be found in the documentation for
 Allows one to specify a special compiler. This will force this compiler for
 the current compilation block (a particular :class:`Op` or the full
-graph). This is used for the GPU code.
+graph).
 .. method:: c_code_cache_version()
@@ -527,10 +527,9 @@ You can implement :meth:`COp.c_code` for this :class:`Op`. It is registered as f
 In your C code, you should use ``%(iname)s`` and ``%(oname)s`` to represent
 the C variable names of the :class:`DeepCopyOp` input and output
-respectively. See an example for the type ``GpuArrayType`` (GPU
-array) in the file ``aesara/gpuarray/type.py``. The version
-parameter is what is returned by :meth:`DeepCopyOp.c_code_cache_version`. By
-default, it will recompile the C code for each process.
+respectively. The version parameter is what is returned by
+:meth:`DeepCopyOp.c_code_cache_version`. By default, it will recompile the C
+code for each process.
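The ``%(iname)s``/``%(oname)s`` placeholders work through ordinary Python string formatting. In this sketch the template, the ``py_V1``/``py_V3`` variable names, the ``do_deep_copy`` call, and the ``fail`` snippet are all made-up placeholders, not what Aesara actually generates:

```python
# Hypothetical C-code template for an Op with one input and one output.
# The %(...)s placeholders are filled with generated C variable names.
template = (
    "%(oname)s = do_deep_copy(%(iname)s);\n"
    "if (%(oname)s == NULL) { %(fail)s }\n"
)

# Illustrative substitution dictionary (names invented for this sketch).
sub = {"iname": "py_V1", "oname": "py_V3", "fail": "goto fail_label;"}
c_code = template % sub
print(c_code)
```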
 :class:`ViewOp`
 ===============
@@ -829,9 +829,9 @@ Explanations:
 * ``Total compile time: 1.131874e+01s`` gives the total time spent inside `aesara.function`.
 * ``Number of Apply nodes: 50`` means that after optimization, there are 50 apply nodes in the graph.
-* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically / work on GPU /...
+* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically /...
 * ``Aesara validate time: 2.790451e-02s`` means that we spent 2.8e-2s in the *validate* subset of the optimization phase.
-* ``Aesara Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in the *linker* phase of ``aesara.function``.
+* ``Aesara Linker time (includes C code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in the *linker* phase of ``aesara.function``.
 * ``Import time 1.153541e-02s`` is a subset of the linker time where we import the compiled module.
 * ``Time in all calls to aesara.grad() 4.732513e-02s`` tells that we spent a total of 4.7e-2s in all calls to ``aesara.grad``. This is outside of the calls to ``aesara.function``.
@@ -337,8 +337,7 @@ computation is carried out. The way optimizations work in Aesara is by
 identifying and replacing certain patterns in the graph with other specialized
 patterns that produce the same results but are either faster or more
 stable. Optimizations can also detect identical subgraphs and ensure that the
-same values are not computed twice or reformulate parts of the graph to a GPU
-specific version.
+same values are not computed twice.
 For example, one (simple) optimization that Aesara uses is to replace
 the pattern :math:`\frac{xy}{y}` by :math:`x`.
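A toy version of such a pattern rewrite, using nested tuples as a stand-in for graph nodes, can make the idea concrete. This is only a sketch of the concept, not Aesara's rewrite machinery:

```python
# Expressions are either a leaf name or a tuple (op, left, right).
def simplify(expr):
    if isinstance(expr, tuple):
        op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
        # Pattern: div(mul(x, y), y) -> x
        if op == "div" and isinstance(a, tuple) and a[0] == "mul" and a[2] == b:
            return a[1]
        return (op, a, b)
    return expr

graph = ("div", ("mul", "x", "y"), "y")
assert simplify(graph) == "x"
```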
@@ -20,8 +20,7 @@ Implementing an Aesara scalar Op allows that scalar operation to be reused
 by our elemwise operations on tensors. If the scalar operation has C code, the
 elemwise implementation will automatically have C code too. This
 will enable the fusion of elemwise operations using your new scalar
-operation. It can also reuse the GPU elemwise code. It is similar for
-reduction operations.
+operation. It is similar for reduction operations.
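The lifting-and-fusion idea can be sketched in plain Python. The ``elemwise`` helper and the scalar functions below are illustrative stand-ins for Aesara's machinery, which additionally generates and fuses C code:

```python
def elemwise(scalar_fn):
    """Lift a scalar function to operate element-by-element on sequences."""
    def op(*seqs):
        return [scalar_fn(*vals) for vals in zip(*seqs)]
    return op

def scalar_mul(a, b):
    return a * b

def scalar_add1(a):
    return a + 1

# Fusion: compose the scalar functions first and make a single elemwise
# pass, instead of two passes with an intermediate result.
fused = elemwise(lambda a, b: scalar_add1(scalar_mul(a, b)))
assert fused([1, 2, 3], [4, 5, 6]) == [5, 11, 19]
```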
 Be careful about some possible problems in the definition of the
 ``grad`` method, and about dependencies that may not be available. In
@@ -125,11 +124,7 @@ Random distribution
 We have 3 base random number generators. One that wraps NumPy's random
 generator, one that implements MRG31k3p and one that wraps CURAND.
-The fastest, but less developed, is CURAND. It works only on CUDA-enabled
-GPUs. It does not work on the CPU and it has fewer random distributions
-implemented.
-The recommended and 2nd faster is MRG. It works on the GPU and CPU and
+The recommended and second fastest is MRG. It works on the CPU and
 has more implemented distributions.
 The slowest is our wrapper on NumPy's random generator.
@@ -194,12 +194,11 @@ default values.
 :noindex:
 If filter_inplace is defined, it will be called instead of
-filter(). This is to allow reusing the old allocated memory. As
-of this writing this is used only when we transfer new data to a
-shared variable on the gpu.
-``storage`` will be the old value, i.e. the old numpy array,
-CudaNdarray, ...
+filter(). This is to allow reusing the old allocated memory. This was used
+only when new data was transferred to a shared variable on a GPU.
+``storage`` will be the old value (e.g. the old `ndarray`).
 .. method:: is_valid_value(value)
 :noindex:
@@ -6,17 +6,13 @@
 Frequently Asked Questions
 ==========================
-Does Aesara support Python 3?
-------------------------------
-We support both Python 2 >= 2.7 and Python 3 >= 3.4.
 Output slight numerical difference
 ----------------------------------
-Sometimes when you compare the output of Aesara using different
-Aesara flags, Aesara versions, CPU and GPU or with other software like
-NumPy, you will see small numerical differences.
+Sometimes when you compare the output of Aesara using different Aesara flags,
+Aesara versions, CPU and GPU devices, or with other software like NumPy, you
+will see small numerical differences.
 This is normal. Floating point numbers are approximations of real
 numbers. This is why doing a+(b+c) vs (a+b)+c can give small
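The non-associativity mentioned in this hunk is directly observable in any IEEE-754 double-precision arithmetic, independent of Aesara:

```python
# Mathematically equal, but evaluated in different orders:
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
assert left != right
assert abs(left - right) < 1e-15  # the difference is tiny, not a bug
```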
@@ -53,9 +49,9 @@ the flag ``mode=FAST_COMPILE`` which instructs Aesara to skip most
 optimizations and disables the generation of any c/cuda code. This is useful
 for quickly testing a simple idea.
-If c/cuda code is necessary, as when using a GPU, the flag
+If C code is necessary, the flag
 ``optimizer=fast_compile`` can be used instead. It instructs Aesara to
-skip time consuming optimizations but still generate c/cuda code.
+skip time-consuming optimizations but still generate C code.
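As an aside, both compilation-speed settings above can be chosen per run through ``AESARA_FLAGS`` (``myscript.py`` is a placeholder name):

```bash
# Skip most optimizations and all C code generation (quick idea testing):
AESARA_FLAGS='mode=FAST_COMPILE' python myscript.py

# Skip the slow optimizations but still generate C code:
AESARA_FLAGS='optimizer=fast_compile' python myscript.py
```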
 Similarly using the flag ``optimizer_excluding=inplace`` will speed up
 compilation by preventing optimizations that replace operations with a
@@ -95,11 +91,6 @@ garbage collection will keep all intermediate results' memory space to allow to
 reuse them during the next call to the same Aesara function, if they are of the
 correct shape. The shape could change if the shapes of the inputs change.
-.. note::
-    With :attr:`preallocate <config.gpuarray__preallocate>`, this isn't
-    very useful with GPU anymore.
 .. _unsafe_optimization:
 Unsafe optimization
@@ -173,11 +164,6 @@ but requires that all nodes in the graph have a C implementation:
 f = function([x], (x + 1.) * 2, mode=aesara.compile.mode.Mode(linker='c'))
 f(10.)
-New GPU backend using libgpuarray
----------------------------------
-The new aesara GPU backend (:ref:`gpuarray`) uses ``config.gpuarray__preallocate`` for GPU memory allocation.
 Related Projects
 ----------------
@@ -13,7 +13,4 @@ Supported platforms:
 install_windows
 install_centos6
-Once your setup is complete and if you installed the GPU libraries, head to :ref:`testing_the_gpu` to find how to verify
-everything is working properly.
 To update your current installation see :ref:`updating`.
@@ -12,23 +12,14 @@ Stable Installation
 With ``conda``
 ^^^^^^^^^^^^^^
-If you use conda, you can directly install both aesara and pygpu. Libgpuarray
-will be automatically installed as a dependency of pygpu.
+If you use conda, you can directly install aesara.
 .. code-block:: bash
-    conda install aesara pygpu
+    conda install aesara
-.. warning::
-    The Aesara developers do not maintain ``pygpu``, so compatibility isn't
-    guaranteed.
 With ``pip``
 ^^^^^^^^^^^^
-If you use pip, you have to install Aesara and libgpuarray separately.
 aesara
 ::::::
@@ -50,16 +41,6 @@ Install the latest stable version of Aesara with:
 If you encountered any trouble, head to the :ref:`troubleshooting` page.
-libgpuarray
-:::::::::::
-Download it with::
-    git clone https://github.com/Theano/libgpuarray.git
-    cd libgpuarray
-and then follow the `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
 Bleeding-Edge Installation (recommended)
 ----------------------------------------
@@ -80,19 +61,6 @@ Install the latest, bleeding-edge, development version of Aesara with:
 If you encountered any trouble, head to the :ref:`troubleshooting` page.
-libgpuarray
-^^^^^^^^^^^
-Install the latest, development version of libgpuarray following the
-`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
-.. note::
-    Currently, you need ``libgpuarray`` version ``0.7.X`` that is not in conda default channel.
-    But you can install it with our own channel ``mila-udem`` (that only supports Python 2.7, 3.5 and 3.6)::
-        conda install -c mila-udem pygpu
 Developer Installation
 ----------------------
@@ -116,8 +84,3 @@ Install the developer version of Aesara with:
 source directory.
 If you encountered any trouble, head to the :ref:`troubleshooting` page.
-libgpuarray
-^^^^^^^^^^^
-See instructions for bleeding-edge installation about ``libgpuarray``.
@@ -17,21 +17,6 @@ details so that we can add alternative instructions.
 .. include:: requirements.inc
-.. _gpu_macos:
-.. attention::
-    For MacOS you should be able to follow the above instructions to
-    setup CUDA, but be aware of the following caveats:
-    * If you want to compile the CUDA SDK code, you may need to temporarily
-      revert back to Apple's gcc (``sudo port select gcc``) as their Makefiles
-      are not compatible with MacPort's gcc.
-    * If CUDA seems unable to find a CUDA-capable GPU, you may need to manually
-      toggle your GPU on, which can be done with
-      `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
 .. attention::
 Aesara officially supports only clang on OS X. This can be installed
@@ -11,8 +11,6 @@ Ubuntu Installation Instructions
 from GitHub, please make sure you are reading `the latest version of this
 page <http://deeplearning.net/software/aesara_versions/dev/install_ubuntu.html>`_.
-.. _gpu_linux:
 .. |PythonDistRecommended| replace:: The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below)
 .. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
 .. |CompilerName| replace:: ``g++``
@@ -28,14 +26,13 @@ Prerequisites through System Packages (not recommended)
 If you want to acquire the requirements through your system packages
 and install them system wide follow these instructions:
-For Ubuntu 16.04 with cuda 7.5
+For Ubuntu 16.04
 .. code-block:: bash
     sudo apt-get install python-numpy python-scipy python-dev python-pip python-pytest g++ libopenblas-dev git graphviz
     sudo pip install Aesara
-    # cuda 7.5 doesn't support the default g++ version. Install a supported version and make it the default.
     sudo apt-get install g++-4.9
     sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
@@ -30,26 +30,10 @@ Install requirements and optional packages
 * Arguments between <...> are optional.
 * ``m2w64-toolchain`` package provides a fully-compatible version of GCC and is thus highly recommended.
-* ``git`` package installs git source control through conda, which is required for the development versions of Aesara and libgpuarray
+* ``git`` package installs git source control through conda, which is required for the development version of Aesara
-.. _gpu_windows:
-Install and configure the GPU drivers (recommended)
----------------------------------------------------
-.. warning::
-    OpenCL support is still minimal for now.
-Install CUDA drivers
-^^^^^^^^^^^^^^^^^^^^
-Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
-to install the CUDA driver and the CUDA Toolkit.
-You must reboot the computer after the driver installation.
-.. Installation of Aesara and libgpuarray.
+.. Installation of Aesara.
 .. include:: install_generic.inc
     :start-after: .. _install_generic:
@@ -73,7 +57,3 @@ generic guidelines to get a working environment:
 path`` option.
 3. Enable OpenMP support by checking the option ``openmp support
 option``.
-* Install CUDA with the same instructions as above.
-* Install the latest, development version of libgpuarray following the
-  `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
@@ -30,12 +30,12 @@
 By default, return a copy of the data. If ``borrow=True`` (and
 ``return_internal_type=False``), maybe it will return a copy.
-For tensor, it will always return a ndarray by default, so if
-the data is on the GPU, it will return a copy, but if the data
+For tensor, it will always return an `ndarray` by default, so if
+the data is on another device, it will return a copy, but if the data
 is on the CPU, it will return the original data. If you do
 ``borrow=True`` and ``return_internal_type=True``, it will
-always return the original data, not a copy, but this can be a
-GPU object.
+always return the original data, not a copy, but this can be a non-`ndarray`
+type of object.
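The copy-versus-alias contract can be sketched with a plain list standing in for the internal storage. The ``ToyShared`` class below is a hypothetical illustration of the semantics, not Aesara's shared-variable implementation:

```python
class ToyShared:
    def __init__(self, value):
        self._storage = list(value)

    def get_value(self, borrow=False):
        # borrow=False: always a copy -- the caller may mutate it freely.
        # borrow=True: the internal object itself -- no copy, but mutating
        # it silently changes the shared variable's contents.
        return self._storage if borrow else list(self._storage)

s = ToyShared([1, 2, 3])

copy = s.get_value()              # independent copy
copy[0] = 99
assert s.get_value() == [1, 2, 3]  # shared value untouched

alias = s.get_value(borrow=True)  # same object as the internal storage
alias[0] = 99
assert s.get_value() == [99, 2, 3]  # shared value changed through the alias
```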
 .. method:: set_value(self, new_value, borrow=False)
@@ -51,11 +51,11 @@ Environment Variables
 .. code-block:: bash
-    AESARA_FLAGS='floatX=float32,device=cuda0,gpuarray__preallocate=1' python <myscript>.py
+    AESARA_FLAGS='floatX=float32' python <myscript>.py
 If a value is defined several times in ``AESARA_FLAGS``,
 the right-most definition is used, so, for instance, if
-``AESARA_FLAGS='device=cpu,device=cuda0'`` is set, then ``cuda0`` will be
+``AESARA_FLAGS='floatX=float32,floatX=float64'`` is set, then ``float64`` will be
 used.
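The "right-most definition wins" rule amounts to later assignments overwriting earlier ones, as in this sketch (a hypothetical minimal parser; Aesara's real flag parsing also handles sections and nesting):

```python
def parse_flags(flags_string):
    """Parse 'key=value,key=value' pairs; later keys overwrite earlier ones."""
    result = {}
    for assignment in flags_string.split(","):
        if assignment:
            key, _, value = assignment.partition("=")
            result[key] = value  # right-most definition wins
    return result

flags = parse_flags("floatX=float32,floatX=float64")
assert flags["floatX"] == "float64"
```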
 .. envvar:: AESARARC
@@ -72,15 +72,11 @@ Environment Variables
 floatX = float32
 device = cuda0
-[gpuarray]
-preallocate = 1
 Configuration attributes that are available directly in ``config``
-(e.g. ``config.device``, ``config.mode``) should be defined in the
-``[global]`` section.
-Attributes from a subsection of ``config`` (e.g. ``config.gpuarray__preallocate``,
-``config.dnn__conv__algo_fwd``) should be defined in their corresponding
-section (e.g. ``[gpuarray]``, ``[dnn.conv]``).
+(e.g. ``config.mode``) should be defined in the ``[global]`` section.
+Attributes from a subsection of ``config``
+(e.g. ``config.dnn__conv__algo_fwd``) should be defined in their
+corresponding section (e.g. ``[dnn.conv]``).
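Putting the section rule together, a configuration file following it might look like this sketch (the ``algo_fwd = small`` entry is only an example value):

```ini
[global]
floatX = float32
mode = FAST_RUN

[dnn.conv]
algo_fwd = small
```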
 Multiple configuration files can be specified by separating them with ``':'``
 characters (as in ``$PATH``). Multiple configuration files will be merged,
@@ -105,20 +101,7 @@ import ``aesara`` and print the config variable, as in:
 .. attribute:: device
-    String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-    ``'opencl0:0'``, ``'opencl0:1'``, ...
-    Default device for computations. If ``'cuda*'``, change the default to try
-    to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
-    the OpenCL libraries will be used. To let the driver select the device,
-    use ``'cuda'`` or ``'opencl'``. If we are not able to use the GPU,
-    either we fall back on the CPU, or an error is raised, depending
-    on the :attr:`force_device` flag.
-    This flag's value cannot be modified during the program execution.
-    Do not use upper case letters; only lower case, even if NVIDIA uses
-    capital letters.
+    String value: either ``'cpu'``
 .. attribute:: force_device
@@ -126,29 +109,6 @@ import ``aesara`` and print the config variable, as in:
 Default: ``False``
-    If ``True`` and ``device=gpu*``, Aesara raises an error when it cannot
-    use the specified :attr:`device`. If ``True`` and ``device=cpu``,
-    Aesara disables the GPU. If ``False`` and ``device=gpu*``, and when the
-    specified device cannot be used, Aesara emits a warning and falls back to
-    the CPU.
-    This flag's value cannot be modified during the program execution.
-.. attribute:: init_gpu_device
-    String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-    ``'opencl0:0'``, ``'opencl0:1'``, ...
-    Initialize the gpu device to use.
-    When its value is ``'cuda*'`` or ``'opencl*'``, the Aesara
-    flag :attr:`device` must be ``'cpu'``.
-    Unlike :attr:`device`, setting this flag to a specific GPU will not
-    make Aesara attempt to use the device by default. More specifically, it
-    will **not** move computations, nor shared variables, to the specified GPU.
-    This flag can be used to run GPU-specific tests on a particular GPU, instead
-    of the default one.
 This flag's value cannot be modified during the program execution.
 .. attribute:: print_active_device
@@ -157,7 +117,7 @@ import ``aesara`` and print the config variable, as in:
 Default: ``True``
-Print the active device when the GPU device is initialized.
+Print the active device when the device is initialized.
 .. attribute:: floatX
@@ -186,10 +146,7 @@ import ``aesara`` and print the config variable, as in:
 Default: ``'default'``
 If ``more``, sometimes Aesara will select :class:`Op` implementations that
-are more "deterministic", but slower. In particular, on the GPU,
-Aesara will avoid using ``AtomicAdd``. Sometimes Aesara will still use
-non-deterministic implementations, e.g. when there isn't a GPU :class:`Op`
-implementation that is deterministic. See the ``dnn.conv.algo*``
+are more "deterministic", but slower. See the ``dnn.conv.algo*``
 flags for more cases.
 .. attribute:: allow_gc
@@ -207,9 +164,6 @@ import ``aesara`` and print the config variable, as in:
 functions with many fast :class:`Op`\s, but it also increases Aesara's memory
 usage.
-.. note:: If :attr:`config.gpuarray__preallocate` is the default value
-    or not disabled ``(-1)``, this is not useful anymore on the GPU.
 .. attribute:: config.scan__allow_output_prealloc
 Bool value, either ``True`` or ``False``
...@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in: ...@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in:
<https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__ <https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__
library, which is faster than the standard ``libm``. library, which is faster than the standard ``libm``.
.. attribute:: config.gpuarray__preallocate
Float value
Default: 0 (Preallocation of size 0, only cache the allocation)
Controls the preallocation of memory with the gpuarray backend.
This value represents the start size (either in MB or the fraction
of total GPU memory) of the memory pool. If more memory is needed,
Aesara will try to obtain more, but this can cause memory
fragmentation.
A negative value will completely disable the allocation cache.
This can have a severe impact on performance and should not be
used outside of debugging.
* < 0: disabled
* 0 <= N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
* > 1: use this number in megabytes (MB) of memory.
.. note::
This could cause memory fragmentation, so, if you have a memory
error while using the cache, try to allocate more memory at
the start, or disable it.
.. note::
The clipping at 95% can be bypassed by specifying the exact
number of megabytes. If more than 95% is needed, Aesara will
automatically try to get more memory, but this can cause
fragmentation; see the note above.
.. attribute:: config.gpuarray__sched
String value: ``'default'``, ``'multi'``, ``'single'``
Default: ``'default'``
Control the stream mode of contexts.
The sched parameter passed for context creation to ``pygpu``. With
CUDA, using ``"multi"`` means using the parameter
``cudaDeviceScheduleBlockingSync``. This is useful to lower the CPU overhead
when waiting for a GPU.
.. attribute:: config.gpuarray__single_stream
Boolean value
Default: ``True``
Control the stream mode of contexts.
If your computations consist of mostly small arrays, using
single-stream will avoid the synchronization overhead and usually
be faster. For larger arrays it does not make a difference yet.
.. attribute:: config.gpuarray__cache_path
Default: ``config.compiledir``/gpuarray_kernels
Directory to cache pre-compiled kernels for the gpuarray backend.
.. attribute:: linker

String value: ``'c|py'``, ``'py'``, ``'c'``, ``'c|py_nogc'``
@@ -13,8 +13,7 @@ import ``aesara.sparse`` to enable it.

The sparse module provides the same functionality as the tensor
module. The difference lies under the covers because sparse matrices
do not store data in a contiguous array. Note that there are no GPU
implementations for sparse matrices in Aesara. The sparse module has
been used in:

- NLP: Dense linear transformations of sparse vectors.
@@ -29,51 +29,13 @@ The recommended user interface are:

With these new interfaces, Aesara will automatically use the fastest
implementation in many cases. On the CPU, the implementation is a GEMM
based one. On the GPU, there is a GEMM based and :ref:`cuDNN
<libdoc_gpuarray_dnn>` version.
By default on the GPU, if cuDNN is available, it will be used,
otherwise we will fall back to using gemm based version (slower than
cuDNN in most cases and uses more memory). To get an error if cuDNN
cannot be used, you can supply the Aesara flag ``dnn.enable=True``.
Either the cuDNN or the gemm version can be disabled using the Aesara flags
``optimizer_excluding=conv_dnn`` and ``optimizer_excluding=conv_gemm``,
respectively. If both are disabled, it will raise an error.
For the cuDNN version, there are different algorithms with different
memory/speed trade-offs. Manual selection of the right one is very
difficult as it depends on the shapes and hardware. So it can change
for each layer. An auto-tuning mode exists and can be activated by
those flags: ``dnn__conv__algo_fwd=time_once``,
``dnn__conv__algo_bwd_data=time_once`` and
``dnn__conv__algo_bwd_filter=time_once``. Note that they are useful mostly
when the shapes do not change.
This auto-tuning has the inconvenience that the first call is much
slower as it tries and times each implementation it has. So if you
benchmark, it is important that you remove the first call from your
timing.
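The first-call exclusion can be sketched in plain Python (``fake_fn`` and its one-time delay are illustrative stand-ins for a compiled Aesara function, not Aesara API):

```python
import time

def benchmark(fn, n_calls=10):
    """Average runtime of fn, excluding the first (warm-up) call."""
    fn()  # warm-up: this is where compilation / auto-tuning cost lands
    start = time.perf_counter()
    for _ in range(n_calls):
        fn()
    return (time.perf_counter() - start) / n_calls

# Stand-in for a compiled function whose first call is slow
# (e.g. because of the auto-tuning described above).
calls = {"n": 0}
def fake_fn():
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.05)  # simulated one-time tuning cost

avg = benchmark(fake_fn)
print(avg < 0.01)  # the slow first call does not inflate the average
```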
Also, a meta-optimizer has been implemented for the gpu convolution
implementations to automatically choose the fastest implementation
for each specific convolution in your graph. For each instance, it will
compile and benchmark each applicable implementation and choose the
fastest one. It can be enabled using ``optimizer_including=conv_meta``.
The meta-optimizer can also selectively disable the cuDNN and gemm versions
using the Aesara flags ``metaopt__optimizer_excluding=conv_dnn`` and
``metaopt__optimizer_excluding=conv_gemm``, respectively.
.. note::
Aesara had older user interfaces like
aesara.tensor.nnet.conv.conv2d. Do not use them anymore. They
will give you slower code and won't allow easy switching between CPU
and GPU computation. They also support fewer types of convolution.
Implementation Details
======================

@@ -85,10 +47,6 @@ not need to read it. Aesara will select it for you.

- :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`.
old 2d convolution. DO NOT USE ANYMORE.
- :func:`GpuCorrMM <aesara.gpuarray.blas.GpuCorrMM>`
This is a GPU-only 2d correlation implementation taken from
`caffe's CUDA implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_. It does not flip the kernel.
For each element in a batch, it first creates a
`Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters

@@ -100,15 +58,8 @@ not need to read it. Aesara will select it for you.

This is a CPU-only 2d correlation implementation taken from
`caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_.
It does not flip the kernel.
- :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library.
- Implemented operators for neural network 3D / video convolution:
- :func:`GpuCorr3dMM <aesara.gpuarray.blas.GpuCorr3dMM>`
This is a GPU-only 3d correlation relying on a Toeplitz matrix
and gemm implementation (see :func:`GpuCorrMM <aesara.sandbox.cuda.blas.GpuCorrMM>`)
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
``(no of channels * filter width * filter height * filter depth, output width * output height * output depth)``.
- :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>`
This is a CPU-only 3d correlation implementation based on
the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`).

@@ -116,12 +67,6 @@ not need to read it. Aesara will select it for you.

replacement for nnet.conv3d. For convolutions done on CPU,
nnet.conv3d will be replaced by Corr3dMM.
- :func:`dnn_conv3d <aesara.gpuarray.dnn.dnn_conv3d>` GPU-only
3D convolution using NVIDIA's cuDNN library (as :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` but for 3d).
If cuDNN is available, by default, Aesara will replace all nnet.conv3d
operations with dnn_conv.
- :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses the conv2d with data reshaping.
It is faster in some corner cases than conv3d. It flips the kernel.
@@ -14,8 +14,7 @@

.. note::

This interface is the preferred interface. It will be moved
automatically to the GPU.
.. note::
@@ -42,7 +42,6 @@ Optimization o4 o3 o2

========================================================= ============== === === ================= ============= ======
:term:`merge` x x x x x
:term:`constant folding<constant folding>` x x x x x
:term:`GPU transfer` x x x x x
:term:`shape promotion<shape promotion>` x x x
:term:`fill cut<fill cut>` x x x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x x x

@@ -247,32 +246,10 @@ Optimization o4 o3 o2

This optimization compresses subgraphs of computationally cheap
elementwise operations into a single Op that does the whole job in a
single pass over the inputs (like loop fusion). This is a win when
transfer from main memory to the CPU (or from graphics memory to the
GPU) is a bottleneck.
See :class:`FusionOptimizer`
GPU transfer
The current strategy for choosing which expressions to evaluate on the
CPU and which to evaluate on the GPU is a greedy one. There are a
number of Ops ***TODO*** with GPU implementations and whenever we find
a graph copying data from GPU to CPU in order to evaluate an
expression that could have been evaluated on the GPU, we substitute
the GPU version of that Op for the CPU version. Likewise if we are
copying the output of an Op with a GPU implementation to the GPU,
then we substitute the GPU version for the CPU version. In this way, if all goes well,
this procedure will result in a graph with the following form:
1. copy non-shared inputs to GPU
2. carry out most/all computations on the GPU
3. copy output back to CPU
When using a GPU, :func:`shared()` will default to GPU storage for
'float32' ndarray arguments, and these shared variables act as seeds
for the greedy algorithm.
See :func:`aesara.sandbox.cuda.opt.*`.
local_log_softmax
This is a stabilization optimization.
It can happen due to rounding errors that the softmax probability of one value gets to 0.
@@ -9,10 +9,6 @@ Requirements

.. _Python: http://www.python.org/
.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
.. _pycuda: https://mathema.tician.de/software/pycuda/
.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/
.. _warp-ctc: https://github.com/baidu-research/warp-ctc
Python_ >= 3.7
@@ -42,20 +38,6 @@ Requirements

`pydot-ng <https://github.com/pydot/pydot-ng>`_
To handle large pictures for gifs/images.
`NVIDIA CUDA drivers and SDK`_
**Highly recommended** Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
`libgpuarray`_
Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`).
`pycuda`_ and `skcuda`_
Required for some extra operations on the GPU like fft and
solvers. We use them to wrap cufft and cusolver. Quick install
``pip install pycuda scikit-cuda``. For cuda 8, the dev
version of skcuda (will be released as 0.5.2) is needed for
cusolver: ``pip install pycuda; pip install
git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda``.
`warp-ctc`_
Required for :ref:`Aesara CTC implementation
<libdoc_tensor_nnet_ctc>`. It is faster than using an
@@ -84,28 +66,3 @@ Install requirements and optional packages

conda install numpy scipy mkl pytest <sphinx> <pydot-ng>

* Arguments between <...> are optional.
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
1. Install CUDA drivers
* Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
* You must reboot the computer after the driver installation.
* Test that it was loaded correctly after the reboot, executing the
command `nvidia-smi` from the command line.
.. note::
Sanity check: The *bin* subfolder should contain an *nvcc*
program. This folder is called the *cuda root* directory.
2. Fix 'lib' path
* Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a
64-bit OS) to your ``$LD_LIBRARY_PATH`` environment
variable. Example: ``/usr/local/cuda/lib64``
@@ -54,7 +54,7 @@ if __name__ == '__main__':

pythonpath = os.pathsep.join([throot, pythonpath])
sys.path[0:0] = [throot] # We must not use os.environ.

# Make sure we don't use gpu to compile documentation
env_th_flags = os.environ.get('AESARA_FLAGS', '')
os.environ['AESARA_FLAGS'] = 'device=cpu,force_device=True'
@@ -59,12 +59,6 @@ where X is far less than Y and Z (i.e. X << Y < Z).

This scenario arises when an operation requires allocation of a large contiguous
block of memory but no blocks of sufficient size are available.
GPUs do not have virtual memory and as such all allocations must be assigned to
a contiguous memory region. CPUs do not have this limitation because of their
support for virtual memory. Multiple allocations on a GPU can result in memory
fragmentation, which can make it more difficult to find contiguous regions
of memory of sufficient size during subsequent memory allocations.
A known example is related to writing data to shared variables. When updating a
shared variable Aesara will allocate new space if the size of the data does not
match the size of the space already assigned to the variable. This can lead to

@@ -80,9 +74,6 @@ aesara.function returns a float64 when the inputs are float32 and int{32, 64}

It should be noted that using float32 and int{32, 64} together
inside a function would provide float64 as output.
Since the GPU can't compute this kind of output, it would be
preferable not to use those dtypes together.
To help you find where float64 are created, see the
:attr:`warn_float64` Aesara flag.
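The upcast described above follows NumPy's type-promotion rules, which can be checked directly (this sketch uses NumPy only, not Aesara):

```python
import numpy as np

x = np.array([1.0, 2.0], dtype=np.float32)
i = np.array([3, 4], dtype=np.int64)

# float32 combined with int{32, 64} is promoted to float64,
# because float32 cannot represent every 32/64-bit integer exactly.
print((x + i).dtype)                         # float64
print(np.result_type(np.float32, np.int32))  # float64
```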
@@ -120,40 +111,21 @@ All Aesara tests should pass (skipped tests and known failures are normal). If

some test fails on your machine, you are encouraged to tell us what went
wrong in the GitHub issues.
.. warning::
Aesara's test should **NOT** be run with ``device=cuda``
or they will fail. The tests automatically use the gpu, if any, when
needed. If you don't want Aesara to ever use the gpu when running tests,
you can set :attr:`config.device` to ``cpu`` and
:attr:`config.force_device` to ``True``.
.. _slow_or_memory:

Why is my code so slow/uses so much memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are a few things you can easily do to change the trade-off
between speed and memory usage. If nothing is said, this affects the
CPU and GPU memory usage.
Could speed up and lower memory usage:
- :ref:`cuDNN <libdoc_gpuarray_dnn>` default cuDNN convolution uses less
memory than the Aesara version. But some flags allow it to use more
memory. GPU only.
Could raise memory usage but speed up computation:
- :attr:`config.gpuarray__preallocate` = 1 # Preallocates the GPU memory
and then manages it in a smart way. Does not raise the memory
usage much, but if you are at the limit of available GPU memory you might
need to specify a lower value. GPU only.
- :attr:`config.allow_gc` = False
- :attr:`config.optimizer_excluding` =low_memory , GPU only for now.
Could lower the memory usage, but raise computation time:

- :attr:`config.scan__allow_gc` = True # Probably not a significant slowdown on the GPU if the memory cache is not disabled
- :attr:`config.scan__allow_output_prealloc` = False
- Use :func:`batch_normalization()
<aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory
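As a sketch, several of the trade-offs above can be combined in one ``AESARA_FLAGS`` setting (flag names come from the lists above; ``train.py`` stands for your own script):

```shell
# Favor speed over memory usage:
AESARA_FLAGS='allow_gc=False' python train.py

# Favor lower memory usage over speed:
AESARA_FLAGS='scan__allow_gc=True,scan__allow_output_prealloc=False' python train.py
```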
@@ -293,7 +265,7 @@ Aesara/BLAS speed test:

python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py

This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Aesara/NumPy
configuration information. Then, it will print the running time of the same
benchmarks for your installation. Try to find a CPU similar to yours in
the table, and check that the single-threaded timings are roughly the same.
@@ -194,24 +194,25 @@ makes it possible to expose Aesara's internal variables without a copy, then it

proceeds as fast as an in-place update.

When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily.

*Solution*: update to a newer version of Aesara.

* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.

*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.

* It is also worth mentioning that current GPU copying routines
support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
@@ -219,8 +220,6 @@ be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:

*Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
(Further information on the current implementation of the GPU version
of ``set_value()`` can be found here: :ref:`libdoc_gpuarray_type`)
.. _borrowfunction:
@@ -329,26 +329,6 @@ Tips:

of type *float64*.
"Why does my GPU function seem to be slow?"
-------------------------------------------
When you compile an Aesara function and do not get the speedup that you expect over the
CPU performance of the same code, it is oftentimes because some Ops are running
on the CPU instead of the GPU. If that is the case, you can use assert_no_cpu_op to check whether there
is a CPU Op in your computational graph. assert_no_cpu_op can take one of the following three
options:
* ``warn``: Raise a warning
* ``pdb``: Stop with a pdb in the computational graph during the compilation
* ``raise``: Raise an error,
if there is a CPU Op in the computational graph.
It is possible to use this mode by providing the flag in AESARA_FLAGS, such as:
``AESARA_FLAGS="float32,device=gpu,assert_no_cpu_op='raise'" python test.py``
But note that this check will not catch all the CPU Ops; it might miss
some.
.. _faq_monitormode:

"How do I Step through a Compiled Function?"
@@ -242,9 +242,7 @@ achieve a similar result by returning the new expressions, and working with

them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Aesara has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU<using_gpu>`.
It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the

@@ -375,7 +373,6 @@ distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of

draws from a normal distribution. The distributions that are implemented are
defined as :class:`RandomVariable`\s
in :ref:`basic<libdoc_tensor_random_basic>`. They only work on CPU.
See `Other Implementations`_ for GPU version.
Now let's use these objects. If we call ``f()``, we get random uniform numbers.
@@ -502,22 +499,6 @@ Other Random Distributions

There are :ref:`other distributions implemented <libdoc_tensor_random_basic>`.
.. _example_other_random:
Other Implementations
---------------------
There is another implementations based on :ref:`MRG31k3p
<libdoc_rng_mrg>`.
The `RandomStream` only works on the CPU; MRG31k3p works on both the CPU and GPU.
.. note::
To use the MRG version easily, you can just change the import to:
.. code-block:: python
from aesara.sandbox.rng_mrg import MRG_RandomStream as RandomStream
.. _logistic_regression:
@@ -48,8 +48,6 @@ Advanced

.. toctree::

sparse
using_gpu
using_multi_gpu
conv_arithmetic

Advanced configuration and debugging
@@ -17,7 +17,6 @@ Scan

- Advantages of using ``scan`` over *for* loops:

- Number of iterations to be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Aesara function.
- Can lower the overall memory usage by detecting the actual amount of memory needed.
@@ -83,11 +83,8 @@ Consider the logistic regression:

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())

for i in range(training_steps):
@@ -137,7 +134,7 @@ is controlled by the value of the ``mode`` parameter.

Aesara defines the following modes by name:

- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations. So GPU is disabled.
- ``'FAST_RUN'``: Apply all optimizations and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes, but can identify
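As a sketch, the mode can also be selected globally through the flags rather than per function (``script.py`` is a placeholder for your own script):

```shell
# Quick iteration while debugging graph construction:
AESARA_FLAGS='mode=FAST_COMPILE' python script.py

# Full optimization (the default):
AESARA_FLAGS='mode=FAST_RUN' python script.py
```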
@@ -47,11 +47,8 @@ predict = aesara.function(inputs=[x], outputs=prediction,

if any(x.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for x in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(x.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for x in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())

for i in range(training_steps):
.. _using_gpu:
=============
Using the GPU
=============
For an introductory discussion of *Graphical Processing Units* (GPU)
and their use for intensive parallel computation purposes, see `GPGPU
<http://en.wikipedia.org/wiki/GPGPU>`_.
One of Aesara's design goals is to specify computations at an abstract
level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.
Using the GPU in Aesara is as simple as setting the ``device``
configuration flag to ``device=cuda``. You can optionally target a
specific GPU by appending its number, e.g. ``device=cuda2``. It is also
encouraged to set the floating point precision to ``float32`` when
working on the GPU, as that is usually much faster. For example:
``AESARA_FLAGS='device=cuda,floatX=float32'``. You can also set these
options in the ``.aesararc`` file's ``[global]`` section:
.. code-block:: cfg
[global]
device = cuda
floatX = float32
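The ``AESARA_FLAGS`` environment variable holds a comma-separated list of ``name=value`` pairs. As a rough illustration of that format only (this is *not* Aesara's actual flag parser), a minimal sketch:

```python
def parse_flags(flags):
    """Split an AESARA_FLAGS-style string into a {name: value} dict."""
    pairs = (item.split("=", 1) for item in flags.split(",") if item)
    return {name.strip(): value.strip() for name, value in pairs}

# The example string from the text above:
settings = parse_flags("device=cuda,floatX=float32")
```

The same pairs can equivalently be spread over sections of the ``.aesararc`` file, as shown above.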
.. note::
* If your computer has multiple GPUs and you use ``device=cuda``,
the driver selects the one to use (usually cuda0).
* You can use the program ``nvidia-smi`` to change this policy.
* By default, when ``device`` indicates preference for GPU computations,
Aesara will fall back to the CPU if there is a problem with the GPU.
You can use the flag ``force_device=True`` to instead raise an error when
Aesara cannot use the GPU.
.. _gpuarray:
GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit (CUDA or OpenCL). Detailed
instructions to accomplish that are provided at
`libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
To install Nvidia's GPU-programming toolchain (CUDA) and configure
Aesara to use it, see the installation instructions for
:ref:`Linux <gpu_linux>`, :ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
While OpenCL supports all types of compute devices, for the remainder
of this section whatever compute device you are using will be referred
to as the GPU.
.. note::
The GpuArray backend uses ``config.gpuarray__preallocate`` for GPU memory
allocation.
.. warning::
The backend was designed to support OpenCL; however, current support is
incomplete. Many very useful ops still do not support it because they
were ported from the old backend with minimal changes.
.. _testing_the_gpu:
Testing Aesara with GPU
~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program
into a file and run it.
Use the Aesara flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
.. testcode::
import aesara
from aesara import function, config, shared, tensor as at
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], at.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, aesara.tensor.elemwise.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`aesara.shared` function to make sure that the
input *x* is stored on the GPU.
.. testoutput::
:hide:
:options: +ELLIPSIS
[Elemwise{exp,no_inplace}(<TensorType(float64, (None,))>)]
Looping 1000 times took ... seconds
Result is ...
Used the cpu
.. code-block:: none
$ AESARA_FLAGS=device=cpu python gpu_tutorial1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, (None,))>)]
Looping 1000 times took 2.271284 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial1.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.697514 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, functions that execute on the GPU still return a standard
NumPy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows the device on which code runs to be changed simply by
setting the ``device`` flag, without modifying the code itself.
If you don't mind a loss of flexibility, you can ask Aesara to return
the GPU object directly. The following code is modified to do just that.
.. testcode::
import aesara
from aesara import function, config, shared, tensor as at
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], at.exp(x).transfer(None))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, aesara.tensor.elemwise.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
Here ``at.exp(x).transfer(None)`` means "copy ``exp(x)`` to the GPU",
where ``None`` denotes the default GPU context (used when one is not
explicitly given). For information on how to set GPU contexts, see
:ref:`tut_using_multi_gpu`.
The output is
.. testoutput::
:hide:
:options: +ELLIPSIS, +SKIP
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
.. code-block:: none
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
While the time per call appears to be much lower than in the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is partly due to the asynchronous nature
of execution on GPUs: the work isn't completed yet, just
'launched'. We'll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray
with the usual NumPy conversions, such as ``numpy.asarray()``.
For even more speed, you can play with the ``borrow`` flag. See
:ref:`borrowfunction`.
What Can be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will of course vary from device to
device, and also as we refine our implementation:
* In general, matrix multiplication, convolution, and large element-wise
operations can be accelerated a lot (5-50x) when arguments are large enough
to keep 30 processors busy.
* Indexing, dimension-shuffling, and constant-time reshaping are about
as fast on the GPU as on the CPU.
* Summation over rows/columns of tensors can be a little slower on the
GPU than on the CPU.
* Copying of large quantities of data to and from a device is relatively slow,
and often cancels most of the advantage of one or two accelerated functions
on that data. Getting GPU performance largely hinges on making data transfer
to the device pay off.
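The transfer trade-off in the last bullet can be made concrete with a toy cost model (illustrative only; the numbers below are hypothetical, not measurements):

```python
def gpu_worthwhile(cpu_time, speedup, transfer_time):
    """Toy model: the GPU pays off only if compute savings exceed transfer cost."""
    gpu_time = cpu_time / speedup + transfer_time
    return gpu_time < cpu_time

# Large compute, modest transfer: the kernel speedup survives.
big_job = gpu_worthwhile(cpu_time=1.0, speedup=10.0, transfer_time=0.1)

# Tiny compute, same transfer: the transfer cancels the advantage.
small_job = gpu_worthwhile(cpu_time=0.05, speedup=10.0, transfer_time=0.1)
```

In other words, the same 10x kernel speedup can be a net win or a net loss depending entirely on how much work is done per byte transferred.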
The backend supports all regular Aesara data types (float32, float64,
int, ...); however, GPU support varies, and some devices cannot handle
double-precision (float64) or small (less than 32-bit, like int16) data
types. You will get an error at compile time or runtime if this is the case.
By default, all inputs will be transferred to the GPU. You can prevent an
input from being transferred by setting its ``tag.target`` attribute to
``'cpu'``.
Complex support is untested and most likely completely broken.
Tips for Improving Performance on GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Consider adding ``floatX=float32`` (or the type you are using) to your
``.aesararc`` file if you plan to do a lot of GPU work.
* The GPU backend supports *float64* variables, but they are still slower
to compute than *float32*. The more *float32*, the better GPU performance
you will get.
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` (which
follow the type set in ``floatX``) to ``dmatrix``, ``dvector`` and
``dscalar``. The latter enforce double precision (*float64* on most
machines), which slows down GPU computations on current hardware.
* Minimize transfers to the GPU device by using ``shared`` variables
to store frequently-accessed data (see :func:`shared()<shared.shared>`).
When using the GPU, tensor ``shared`` variables are stored on
the GPU by default to eliminate transfer time for GPU ops using those
variables.
* If you aren't happy with the performance you see, try running your
script with ``profile=True`` flag. This should print some timing
information at program termination. Is time being used sensibly? If
an op or Apply is taking more time than its share, then if you know
something about GPU programming, have a look at how it's implemented
in aesara.gpuarray. Check the line similar to *Spent Xs(X%) in cpu
op, Xs(X%) in gpu op and Xs(X%) in transfer op*. This can tell you
if not enough of your graph is on the GPU or if there is too much
memory transfer.
* To check whether all the Ops in the computational graph are
running on the GPU, set the ``assert_no_cpu_op`` flag: ``warn`` logs a
warning, ``raise`` raises an error, and ``pdb`` starts a debugger
breakpoint when a CPU Op is found in the computational graph.
.. _gpu_async:
GPU Async Capabilities
~~~~~~~~~~~~~~~~~~~~~~
By default, all operations on the GPU are run asynchronously. This
means that they are only scheduled to run, and the function returns
immediately. This is handled mostly transparently by the underlying
libgpuarray. A forced synchronization point is introduced when doing
memory transfers between device and host.
It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
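The benchmarking pattern just described can be sketched as follows. Since this sketch must run without a GPU, ``FakeResult`` is a stand-in for a real pygpu ``GpuArray``, whose ``sync()`` method blocks until all queued kernels have finished:

```python
import time

class FakeResult:
    """Stand-in for a GpuArray; only the sync() method is modelled here."""
    def sync(self):
        pass  # a real GpuArray blocks here until queued work completes

def benchmark(fn, iters):
    """Time `iters` calls to fn, forcing a sync before reading the clock."""
    t0 = time.time()
    result = None
    for _ in range(iters):
        result = fn()
    result.sync()  # without this, only kernel *launches* would be timed
    return time.time() - t0

elapsed = benchmark(lambda: FakeResult(), 100)
```

Forgetting the final ``sync()`` is what produced the misleadingly tiny 0.04 s timing in the previous section.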
Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the value of a ``shared`` variable, e.g. to provide new data
to processes, use ``shared_variable.set_value(new_value)``. For a lot
more detail about this, see :ref:`aliasing`.
Exercise
~~~~~~~~
Consider again the logistic regression:
.. testcode::
import numpy
import aesara
import aesara.tensor as at
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(aesara.config.floatX),
rng.randint(size=N,low=0, high=2).astype(aesara.config.floatX))
training_steps = 10000
# Declare Aesara symbolic variables
x = at.matrix("x")
y = at.vector("y")
w = aesara.shared(rng.randn(feats).astype(aesara.config.floatX), name="w")
b = aesara.shared(numpy.asarray(0., dtype=aesara.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# Construct Aesara expression graph
p_1 = 1 / (1 + at.exp(-at.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*at.log(p_1) - (1-y)*at.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = at.grad(cost, [w,b])
# Compile expressions to functions
train = aesara.function(
inputs=[x,y],
outputs=[prediction, xent],
updates=[(w, w-0.01*gw), (b, b-0.01*gb)],
name = "train")
predict = aesara.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train(D[0], D[1])
print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
print("floatX=", aesara.config.floatX)
print("device=", aesara.config.device)
.. testoutput::
:hide:
:options: +ELLIPSIS
Used the cpu
target values for D
...
prediction on D
...
Modify and execute this example to run on GPU with ``floatX=float32``
and time it using the command line ``time python file.py``. (Of
course, you may use some of your answer to the exercise in section
:ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put
your ideas to test.
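As a plain-NumPy reference point for this exercise (a minimal sketch with reduced sizes and a simplified gradient-descent loop; it is not the tutorial's solution and involves no GPU), the same model and update rule look like this:

```python
import numpy as np

# Tiny stand-in sizes so the sketch runs instantly; the tutorial uses
# N=400, feats=784 and 10000 training steps.
rng = np.random.default_rng(22)
N, feats, steps = 40, 8, 200
X = rng.standard_normal((N, feats)).astype(np.float32)
y = rng.integers(0, 2, N).astype(np.float32)
w = np.zeros(feats, dtype=np.float32)
b = np.float32(0.0)

def cost(w, b):
    p_1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # probability of a one
    xent = -y * np.log(p_1) - (1 - y) * np.log(1 - p_1)  # cross-entropy
    return xent.mean() + 0.01 * (w ** 2).sum()        # plus L2 penalty

c0 = cost(w, b)
for _ in range(steps):
    p_1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p_1 - y                       # gradient of the summed xent w.r.t. logits
    w = w - 0.1 * (X.T @ g / N + 0.02 * w)  # 0.02*w is the L2 penalty gradient
    b = b - 0.1 * g.mean()
assert cost(w, b) < c0                # gradient descent lowered the cost
```

Timing this version against the Aesara one (CPU and GPU) gives a feel for where the compiled graph, and later the GPU, actually help.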
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Aesara, which is a meta-programmer, there are:
* **CUDA**: GPU programming API by NVIDIA based on extension to C (CUDA C)
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) are maturing.
* **OpenCL**: multi-vendor version of CUDA
* More general, standardized.
* Fewer libraries, lesser spread.
* **PyCUDA**: Python bindings to the CUDA driver interface, allowing access to
Nvidia's CUDA parallel computation API from Python
* Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (``pycuda.driver.SourceModule``).
GPU memory buffer (``pycuda.gpuarray.GPUArray``).
Helpful documentation.
* Completeness: Binding to all of CUDA's driver API.
* Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
* Speed: PyCUDA's base layer is written in C++.
* Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
Makes it much easier to write correct, leak- and crash-free code.
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory
allocated in it is also freed).
(This is adapted from PyCUDA's `documentation <http://documen.tician.de/pycuda/index.html>`_
and Andreas Kloeckner's `website <http://mathema.tician.de/software/pycuda>`_ on PyCUDA.)
* **PyOpenCL**: PyCUDA for OpenCL
Learning to Program with PyCUDA
-------------------------------
If you already enjoy a good proficiency with the C programming language, you
may easily leverage your knowledge by learning, first, to program a GPU with the
CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA
API with a Python wrapper.
The following resources will assist you in this learning process:
* **CUDA API and CUDA C: Introductory**
* `NVIDIA's slides <http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf>`_
* `Stein's (NYU) slides <http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf>`_
* **CUDA API and CUDA C: Advanced**
* `MIT IAP2009 CUDA <https://sites.google.com/site/cudaiap2009/home>`_
(full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
* `Course U. of Illinois <http://courses.engr.illinois.edu/ece498/al/index.html>`_
(full lectures, Kirk-Hwu textbook)
* `NVIDIA's knowledge base <http://www.nvidia.com/content/cuda/cuda-developer-resources.html>`_
(extensive coverage, levels from introductory to advanced)
* `practical issues <http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s>`_
(on the relationship between grids, blocks and threads; see also linked and related issues on same page)
* `CUDA optimization <http://www.gris.informatik.tu-darmstadt.de/cuda-workshop/slides.html>`_
* **PyCUDA: Introductory**
* `Kloeckner's slides <http://www.gputechconf.com/gtcnew/on-demand-gtc.php?sessionTopic=&searchByKeyword=kloeckner&submit=&select=+&sessionEvent=2&sessionYear=2010&sessionFormat=3>`_
* `Kloeckner's website <http://mathema.tician.de/software/pycuda>`_
* **PyCUDA: Advanced**
* `PyCUDA documentation website <http://documen.tician.de/pycuda/>`_
The following examples give a foretaste of programming a GPU with PyCUDA. Once
you feel competent enough, you may try your hand at the corresponding exercises.
**Example: PyCUDA**
.. code-block:: python
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print(dest)
Exercise
~~~~~~~~
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
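One way to approach the (20, 10) case is to flatten a 2-D thread coordinate into a 1-D offset into the row-major buffer backing the matrix, as CUDA kernels commonly do with ``threadIdx.x + threadIdx.y * blockDim.x``. A sketch of that index arithmetic in plain Python (the block shape is one possible choice, not the only one):

```python
def flat_index(tx, ty, block_dim_x):
    """Map a 2-D thread coordinate (threadIdx.x, threadIdx.y) to a 1-D offset."""
    return ty * block_dim_x + tx

# With block=(20, 10, 1), the 200 threads cover offsets 0..199 exactly once,
# matching the 200 elements of a row-major buffer for a (20, 10)-sized matrix.
offsets = sorted(flat_index(tx, ty, 20) for ty in range(10) for tx in range(20))
```

Inside the kernel, ``dest[i] = a[i] * b[i]`` then works unchanged with ``i`` computed this way.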
.. _pyCUDA_aesara:
**Example: Aesara + PyCUDA**
.. code-block:: python
import numpy, aesara
import aesara.misc.pycuda_init
from pycuda.compiler import SourceModule
import aesara.sandbox.cuda as cuda
from aesara.graph.basic import Apply
from aesara.graph.op import Op
class PyCUDADoubleOp(Op):
__props__ = ()
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2, impl):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [storage_map[v] for v in node.inputs]
outputs = [storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape != inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512, 1, 1), grid=grid)
return thunk
Use this code to test it:
>>> x = aesara.tensor.type.fmatrix()
>>> f = aesara.function([x], PyCUDADoubleOp()(x)) # doctest: +SKIP
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2) # doctest: +SKIP
>>> print(numpy.asarray(f(xv))) # doctest: +SKIP
Exercise
~~~~~~~~
Run the preceding example.
Modify and execute to multiply two matrices: *x* * *y*.
Modify and execute to return two outputs: *x + y* and *x - y*.
(Notice that Aesara's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. to avoid constraining the input to be *C-contiguous*).
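The stride exercise hinges on C-contiguity, which ``gpu_contiguous`` enforces in the example above by copying non-contiguous inputs. NumPy makes the concept easy to inspect (values shown in the comments assume the exact shapes below):

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
assert a.flags['C_CONTIGUOUS']      # freshly built arrays are C-contiguous

t = a.T                             # transpose is a view with swapped strides
assert not t.flags['C_CONTIGUOUS']
assert t.strides == (4, 16)         # float32: 4-byte items, one original row = 16 bytes

c = np.ascontiguousarray(t)         # a copy, conceptually what gpu_contiguous does
assert c.flags['C_CONTIGUOUS']
```

Supporting strides directly in the kernel means indexing with these byte (or element) strides instead of assuming a dense row-major layout, which avoids the copy.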
Note
----
* See :ref:`example_other_random` to know how to handle random numbers
on the GPU.
* The mode ``FAST_COMPILE`` disables C code, and therefore also disables the GPU. You
can use the Aesara flag ``optimizer=fast_compile`` to speed up
compilation and keep the GPU enabled.
#!/usr/bin/env python
# Aesara tutorial
# Solution to Exercise in section 'Using the GPU'
# 1. Raw results
import numpy as np
import aesara
import aesara.tensor as at
aesara.config.floatX = 'float32'
rng = np.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(aesara.config.floatX),
rng.randint(size=N, low=0, high=2).astype(aesara.config.floatX))
training_steps = 10000
# Declare Aesara symbolic variables
x = aesara.shared(D[0], name="x")
y = aesara.shared(D[1], name="y")
w = aesara.shared(rng.randn(feats).astype(aesara.config.floatX), name="w")
b = aesara.shared(np.asarray(0., dtype=aesara.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# print("Initial model:")
# print(w.get_value(), b.get_value())
# Construct Aesara expression graph
p_1 = 1 / (1 + at.exp(-at.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * at.log(p_1) - (1 - y) * at.log(1 - p_1) # Cross-entropy
cost = at.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = at.grad(cost, [w, b])
# Compile expressions to functions
train = aesara.function(
inputs=[],
outputs=[prediction, xent],
updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
name="train")
predict = aesara.function(inputs=[], outputs=prediction,
name="predict")
if any(n.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for n in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(n.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for n in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train()
# print("Final model:")
# print(w.get_value(), b.get_value())
print("target values for D")
print(D[1])
print("prediction on D")
print(predict())
"""
# 2. Profiling
# 2.1 Profiling for CPU computations
# In your terminal, type:
$ AESARA_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D
# Followed by the profiling output. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:
Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 1.300452e+00s
Time in Function.fn.__call__: 1.215823e+00s (93.492%)
Time in thunks: 1.157602e+00s (89.015%)
Total compile time: 8.922548e-01s
Number of Apply nodes: 17
Aesara Optimizer time: 6.270301e-01s
Aesara validate time: 5.993605e-03s
Aesara Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
Import time 3.543139e-03s
Time in all call to aesara.grad() 1.848292e-02s
Time since aesara import 2.864s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 aesara.tensor.blas_c.CGemv
33.1% 97.7% 0.384s 4.79e-06s C 80001 9 aesara.tensor.elemwise.Elemwise
1.0% 98.6% 0.011s 1.14e-06s C 10000 1 aesara.tensor.elemwise.Sum
0.7% 99.4% 0.009s 2.85e-07s C 30001 4 aesara.tensor.elemwise.DimShuffle
0.3% 99.7% 0.004s 3.64e-07s C 10001 2 aesara.tensor.basic.AllocEmpty
0.3% 100.0% 0.004s 1.78e-07s C 20001 3 aesara.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 CGemv{inplace}
18.7% 83.2% 0.217s 2.17e-05s C 10000 1 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
8.9% 92.1% 0.103s 1.03e-05s C 10000 1 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
4.3% 96.4% 0.050s 4.98e-06s C 10000 1 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
1.0% 97.4% 0.011s 1.14e-06s C 10000 1 Sum{acc_dtype=float64}
0.5% 97.9% 0.006s 2.83e-07s C 20001 3 InplaceDimShuffle{x}
0.4% 98.3% 0.004s 4.22e-07s C 10000 1 Elemwise{sub,no_inplace}
0.3% 98.6% 0.004s 3.70e-07s C 10000 1 Elemwise{neg,no_inplace}
0.3% 98.9% 0.004s 3.64e-07s C 10001 2 AllocEmpty{dtype='float32'}
0.3% 99.2% 0.004s 1.78e-07s C 20001 3 Shape_i{0}
0.2% 99.5% 0.003s 2.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.2% 99.7% 0.003s 2.65e-07s C 10000 1 Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.2% 99.9% 0.002s 1.98e-07s C 10000 1 Elemwise{Cast{float32}}
0.1% 100.0% 0.002s 1.54e-07s C 10000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 100.0% 0.000s 4.77e-06s C 1 1 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
34.0% 34.0% 0.394s 3.94e-05s 10000 7 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
30.5% 64.5% 0.353s 3.53e-05s 10000 15 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
18.7% 83.2% 0.217s 2.17e-05s 10000 12 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
8.9% 92.1% 0.103s 1.03e-05s 10000 13 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
4.3% 96.4% 0.050s 4.98e-06s 10000 11 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
1.0% 97.4% 0.011s 1.14e-06s 10000 14 Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
0.4% 97.8% 0.004s 4.22e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.3% 98.1% 0.004s 3.76e-07s 10000 0 InplaceDimShuffle{x}(b)
0.3% 98.4% 0.004s 3.70e-07s 10000 10 Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.3% 98.7% 0.004s 3.64e-07s 10000 5 AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.2% 99.0% 0.003s 2.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.2% 99.2% 0.003s 2.65e-07s 10000 9 Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.2% 99.4% 0.002s 2.21e-07s 10000 1 Shape_i{0}(x)
0.2% 99.6% 0.002s 1.98e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.2% 99.7% 0.002s 1.90e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.1% 99.9% 0.002s 1.54e-07s 10000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.1% 100.0% 0.001s 1.34e-07s 10000 3 Shape_i{0}(y)
0.0% 100.0% 0.000s 3.89e-05s 1 3 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
0.0% 100.0% 0.000s 4.77e-06s 1 4 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.000s 1.19e-06s 1 0 InplaceDimShuffle{x}(b)
... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
# 2.2 Profiling for GPU computations
# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 AESARA_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D
Results were produced using a GeForce GTX TITAN X
# Profiling summary for all functions:
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 4.181247e+00s
Time in Function.fn.__call__: 4.081113e+00s (97.605%)
Time in thunks: 3.915566e+00s (93.646%)
Total compile time: 9.256095e+00s
Number of Apply nodes: 21
Aesara Optimizer time: 9.996419e-01s
Aesara validate time: 6.523132e-03s
Aesara Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
Import time 4.228115e-03s
Time in all call to aesara.grad() 3.286195e-02s
Time since aesara import 15.415s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 aesara.sandbox.gpuarray.blas.GpuGemv
29.8% 89.3% 1.166s 1.30e-05s C 90001 10 aesara.sandbox.gpuarray.elemwise.GpuElemwise
4.1% 93.4% 0.162s 8.10e-06s C 20001 3 aesara.sandbox.gpuarray.basic_ops.HostFromGpu
3.3% 96.7% 0.131s 1.31e-05s C 10000 1 aesara.sandbox.gpuarray.elemwise.GpuCAReduceCuda
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 aesara.sandbox.gpuarray.basic_ops.GpuFromHost
0.8% 99.1% 0.033s 1.09e-06s C 30001 4 aesara.sandbox.gpuarray.elemwise.GpuDimShuffle
0.7% 99.8% 0.026s 2.59e-06s C 10001 2 aesara.sandbox.gpuarray.basic_ops.GpuAllocEmpty
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 aesara.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 GpuGemv{inplace=True}
4.1% 63.6% 0.162s 8.10e-06s C 20001 3 HostFromGpu(gpuarray)
4.0% 67.6% 0.157s 1.57e-05s C 10000 1 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
3.8% 71.4% 0.149s 1.49e-05s C 10000 1 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
3.7% 75.1% 0.144s 1.44e-05s C 10000 1 GpuElemwise{sub,no_inplace}
3.6% 78.7% 0.141s 1.41e-05s C 10000 1 GpuElemwise{gt,no_inplace}
3.4% 82.1% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Cast{float32}}[]<gpuarray>
3.4% 85.5% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
3.3% 88.8% 0.131s 1.31e-05s C 10000 1 GpuCAReduceCuda{add}
2.9% 91.7% 0.112s 1.12e-05s C 10000 1 GpuElemwise{neg,no_inplace}
2.6% 94.3% 0.102s 1.02e-05s C 10000 1 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
2.5% 96.7% 0.096s 9.63e-06s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 GpuFromHost<None>
0.7% 99.0% 0.026s 2.59e-06s C 10001 2 GpuAllocEmpty{dtype='float32', context_name=None}
0.5% 99.5% 0.021s 1.06e-06s C 20001 3 InplaceGpuDimShuffle{x}
0.3% 99.8% 0.011s 1.14e-06s C 10000 1 InplaceGpuDimShuffle{1,0}
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 Shape_i{0}
0.0% 100.0% 0.000s 2.00e-05s C 1 1 GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
55.0% 55.0% 2.154s 2.15e-04s 10000 7 GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
4.5% 59.5% 0.176s 1.76e-05s 10000 18 GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
4.0% 63.5% 0.157s 1.57e-05s 10000 12 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
3.8% 67.3% 0.149s 1.49e-05s 10000 15 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
3.7% 71.0% 0.144s 1.44e-05s 10000 4 GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
3.6% 74.6% 0.141s 1.41e-05s 10000 16 GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
3.4% 78.0% 0.133s 1.33e-05s 10000 10 GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
3.4% 81.4% 0.133s 1.33e-05s 10000 9 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
3.3% 84.7% 0.131s 1.31e-05s 10000 17 GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
2.9% 87.5% 0.112s 1.12e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
2.6% 90.1% 0.102s 1.02e-05s 10000 20 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
2.5% 92.6% 0.096s 9.63e-06s 10000 13 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
2.3% 94.9% 0.090s 9.04e-06s 10000 19 HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
1.8% 96.7% 0.072s 7.16e-06s 10000 14 HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
1.6% 98.3% 0.061s 6.10e-06s 10000 6 GpuFromHost<None>(Shape_i{0}.0)
0.7% 99.0% 0.026s 2.59e-06s 10000 5 GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
0.3% 99.3% 0.013s 1.33e-06s 10000 0 InplaceGpuDimShuffle{x}(b)
0.3% 99.6% 0.011s 1.14e-06s 10000 2 InplaceGpuDimShuffle{1,0}(x)
0.2% 99.8% 0.008s 7.94e-07s 10000 8 InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
0.1% 99.9% 0.005s 5.27e-07s 10000 1 Shape_i{0}(x)
... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
# 3. Conclusions
Examine and compare the 'Ops' summaries for the CPU and GPU runs. The transfer ops 'GpuFromHost' and 'HostFromGpu' usually
consume a large amount of extra time by themselves, but you can minimize their overhead by making as few data transfers between the GPU and CPU as possible.
Notice that each GPU op consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.
"""
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Aesara has a feature that allows the use of multiple GPUs at the same
time in one function. The multiple-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that it works correctly.
In order to keep a reasonably high level of abstraction, you do not
refer to device names directly for multiple-GPU use. Instead, you
refer to what we call context names. These are then mapped to
devices using the Aesara configuration, which allows portability of
models between machines.
.. warning::
The code is rather new and is still considered experimental at this
point. It has been tested and seems to perform correctly in all
cases observed, but make sure to double-check your results before
publishing a paper or anything of the sort.
.. note::
For data-parallelism, you probably are better using `platoon
<https://github.com/mila-udem/platoon>`_.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::
    dev0->cuda0;dev1->cuda1
Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Aesara. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects, such as 'cuda0' or 'opencl0:0'.
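The parsing rules above can be sketched in plain Python. This is only an illustrative re-implementation of the rules as stated; `parse_contexts` is a hypothetical helper, not part of Aesara's actual configuration code:

```python
def parse_contexts(spec):
    """Parse a context map like 'dev0->cuda0;dev1->cuda1' into a dict.

    Illustrative sketch only: mappings are separated by ';', and each
    mapping is a context name, the two characters '->', and a device name.
    """
    mapping = {}
    for entry in spec.split(';'):
        # The context name cannot contain '->', so the first occurrence
        # is the separator between name and device.
        name, sep, device = entry.partition('->')
        if not sep:
            raise ValueError("mapping %r is missing '->'" % entry)
        # Names starting with 'cuda' or 'opencl' are disallowed to avoid
        # confusion with device names.
        if name.startswith(('cuda', 'opencl')):
            raise ValueError("invalid context name %r" % name)
        mapping[name] = device
    return mapping

print(parse_contexts('dev0->cuda0;dev1->cuda1'))
# -> {'dev0': 'cuda0', 'dev1': 'cuda1'}
```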
.. note::
    Since the syntax contains several shell special characters,
    defining this on the command line requires proper quoting, like this:
.. code-block:: shell

    $ AESARA_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Aesara will print the mappings as they are
defined. The output will look like this:
.. code-block:: bash

    $ AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import aesara'
    Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
    Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea: Aesara often
assumes that different context names are on different devices and
will optimize accordingly, so you may get faster performance with a
single name and a single device.
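For example, a model written for the two names 'dev0' and 'dev1' can still run on a single-GPU machine by mapping both names to the same device. This is a configuration fragment derived from the rules above, not output from a real run:

```shell
AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda0"
```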
.. note::
    Multi-GPU operation often requires or assumes that all the GPUs
    involved are equivalent. That is not the case for this
    implementation. Since the user has the task of distributing the
    jobs across the different devices, a model can be built on the
    assumption that one of the GPUs is slower or has less memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
that performs two dot products on two different GPUs.
.. code-block:: python

    import numpy

    import aesara

    v01 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v02 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v11 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')
    v12 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')

    f = aesara.function([], [aesara.tensor.dot(v01, v02),
                             aesara.tensor.dot(v11, v12)])

    f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input that is not on the right device.
However, you may want some explicit control over where and how these
transfers are done. This is done by using the :meth:`transfer` method
that is present on variables. It works for moving data between GPUs
and also between the host and the GPUs. Here is an example.
.. code-block:: python

    import aesara

    v = aesara.tensor.fmatrix()

    # Move to the device associated with 'gpudev'
    gv = v.transfer('gpudev')

    # Move back to the cpu
    cv = gv.transfer('cpu')
Of course, you can mix transfers and operations in any order you
choose. However, you should try to minimize transfer operations
because they introduce overhead that may reduce performance.