Commit f7a506ff authored by Maxim Kochurov, committed by Brandon T. Willard

Remove gpuarray references from documentation

Parent cc584d6c
@@ -14,11 +14,8 @@ Acknowledgements
 * The developers of `Theano <https://github.com/Theano/Theano>`_
 * All `Aesara contributors <https://github.com/aesara-devs/aesara/graphs/contributors>`_.
 * All Theano users that have given us feedback.
-* The GPU implementation of tensordot is based on code from Tijmen
-  Tieleman's `gnumpy <http://www.cs.toronto.edu/~tijmen/gnumpy.html>`_
 * Our random number generator implementation on CPU and GPU uses the MRG31k3p algorithm that is described in:
   P. L'Ecuyer and R. Touzin, `Fast Combined Multiple Recursive Generators with Multipliers of the form a = +/- 2^d +/- 2^e <http://www.informs-sim.org/wsc00papers/090.PDF>`_, Proceedings of the 2000 Winter Simulation Conference, Dec. 2000, 683--689.
   We were authorized by Pierre L'Ecuyer to copy/modify his Java implementation in the `SSJ <http://www.iro.umontreal.ca/~simardr/ssj/>`_ software and to relicense it under BSD 3-Clauses in Theano.
-* A better GPU memory allocator :attr:`CNMeM <config.lib.cnmem>` was included in Theano in the previous GPU back-end. It is still in the history, but not in the current version. It has the same license.
@@ -6,12 +6,10 @@ Extending Aesara with a C :class:`Op`
 =====================================
 This tutorial covers how to extend Aesara with an :class:`Op` that offers a C
-implementation. It does not cover :class:`Op`\s that run on a GPU but it does introduce
-many elements and concepts which are relevant for GPU :class:`Op`\s. This tutorial is
-aimed at individuals who already know how to extend Aesara (see tutorial
-:ref:`creating_an_op`) by adding a new :class:`Op` with a Python implementation
-and will only cover the additional knowledge required to also produce :class:`Op`\s
-with C implementations.
+implementation. This tutorial is aimed at individuals who already know how to
+extend Aesara (see tutorial :ref:`creating_an_op`) by adding a new :class:`Op`
+with a Python implementation and will only cover the additional knowledge
+required to also produce :class:`Op`\s with C implementations.
 Providing an Aesara :class:`Op` with a C implementation requires interacting with
 Python's C-API and Numpy's C-API. Thus, the first step of this tutorial is to
@@ -927,7 +925,7 @@ discussed below.
 further below.
 For every input which has a :attr:`dtype` attribute (this means
-Tensors, and equivalent types on GPU), the following macros will be
+Tensors), the following macros will be
 defined unless your `Op` class has an :attr:`Op.check_input` attribute
 defined to False. In these descriptions 'i' refers to the position
 (indexed from 0) in the input array.
@@ -1035,8 +1033,6 @@ When debugging C code, it can be useful to use GDB for code compiled
 by Aesara.
 For this, you must enable this Aesara flag: `cmodule__remove_gxx_opt=True`.
-For the GPU, you must also add the flag `nvcc.flags=-g` (it slows
-down computation on the GPU, but it is enabled by default on the CPU).
 Then you must start Python inside GDB and in it start your Python
 process:
@@ -824,10 +824,10 @@ will not be accepted.
 :class:`NanGuardMode` helps users find where in the graph NaNs appear. But
 sometimes, we want some variables to not be checked. For example, in
-the old GPU back-end, we use a float32 :class:`CudaNdarray` to store the MRG
+the old GPU back-end, we used a float32 :class:`CudaNdarray` to store the MRG
 random number generator state (they are integers). So if :class:`NanGuardMode`
-check it, it will generate false positive. Another case is related to
-:class:`[Gpu]AllocEmpty` or some computation on it (like done by :class:`Scan`).
+checked it, it would generate a false positive. Another case is related to
+:class:`AllocEmpty` or some computations on it (like done by :class:`Scan`).
 You can tell :class:`NanGuardMode` not to check a variable with:
 :attr:`variable.tag.nan_guard_mode_check`. Also, this tag automatically
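The opt-out tag above can be sketched with plain Python objects. The ``Var``/``Tag`` classes and the ``nan_guard`` loop below are illustrative stand-ins, not Aesara's internals; only the ``variable.tag.nan_guard_mode_check`` attribute name comes from the documentation:

```python
import math

def nan_guard(variables):
    """Raise if a variable's value contains NaN, unless its tag opts out."""
    for var in variables:
        # An explicit False on the tag means: skip this variable.
        if getattr(var.tag, "nan_guard_mode_check", True) is False:
            continue
        if any(math.isnan(v) for v in var.value):
            raise AssertionError(f"NaN detected in {var.name}")

class Tag:
    pass

class Var:
    def __init__(self, name, value):
        self.name, self.value, self.tag = name, value, Tag()

# A state buffer that legitimately holds NaN-looking bits (false positive):
rng_state = Var("rng_state", [float("nan")])
rng_state.tag.nan_guard_mode_check = False  # tell the checker to skip it
nan_guard([rng_state])                      # passes: the variable is excluded
```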
@@ -114,7 +114,7 @@ prefix. The complete list can be found in the documentation for
 Allows one to specify a special compiler. This will force this compiler for
 the current compilation block (a particular :class:`Op` or the full
-graph). This is used for the GPU code.
+graph).
 .. method:: c_code_cache_version()
@@ -527,10 +527,9 @@ You can implement :meth:`COp.c_code` for this :class:`Op`. It is registered as f
 In your C code, you should use ``%(iname)s`` and ``%(oname)s`` to represent
 the C variable names of the :class:`DeepCopyOp` input and output
-respectively. See an example for the type ``GpuArrayType`` (GPU
-array) in the file ``aesara/gpuarray/type.py``. The version
-parameter is what is returned by :meth:`DeepCopyOp.c_code_cache_version`. By
-default, it will recompile the C code for each process.
+respectively. The version parameter is what is returned by
+:meth:`DeepCopyOp.c_code_cache_version`. By default, it will recompile the C
+code for each process.
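The ``%(iname)s``/``%(oname)s`` placeholders work through ordinary Python string formatting. In this sketch the template, the ``py_V1``/``py_V3`` variable names, the ``do_deep_copy`` call, and the ``fail`` snippet are all made-up placeholders, not what Aesara actually generates:

```python
# Hypothetical C-code template for an Op with one input and one output.
# The %(...)s placeholders are filled with generated C variable names.
template = (
    "%(oname)s = do_deep_copy(%(iname)s);\n"
    "if (%(oname)s == NULL) { %(fail)s }\n"
)

# Illustrative substitution dictionary (names invented for this sketch).
sub = {"iname": "py_V1", "oname": "py_V3", "fail": "goto fail_label;"}
c_code = template % sub
print(c_code)
```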
 :class:`ViewOp`
 ===============
@@ -829,9 +829,9 @@ Explanations:
 * ``Total compile time: 1.131874e+01s`` gives the total time spent inside `aesara.function`.
 * ``Number of Apply nodes: 50`` means that after optimization, there are 50 apply nodes in the graph.
-* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically / work on GPU /...
+* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically /...
 * ``Aesara validate time: 2.790451e-02s`` means that we spent 2.8e-2s in the *validate* subset of the optimization phase.
-* ``Aesara Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in the *linker* phase of ``aesara.function``.
+* ``Aesara Linker time (includes C code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in the *linker* phase of ``aesara.function``.
 * ``Import time 1.153541e-02s`` is a subset of the linker time where we import the compiled module.
 * ``Time in all calls to aesara.grad() 4.732513e-02s`` tells that we spent a total of 4.7e-2s in all calls to ``aesara.grad``. This is outside of the calls to ``aesara.function``.
@@ -337,8 +337,7 @@ computation is carried out. The way optimizations work in Aesara is by
 identifying and replacing certain patterns in the graph with other specialized
 patterns that produce the same results but are either faster or more
 stable. Optimizations can also detect identical subgraphs and ensure that the
-same values are not computed twice or reformulate parts of the graph to a GPU
-specific version.
+same values are not computed twice.
 For example, one (simple) optimization that Aesara uses is to replace
 the pattern :math:`\frac{xy}{y}` by :math:`x`.
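A toy version of such a pattern rewrite, using nested tuples as a stand-in for graph nodes, can make the idea concrete. This is only a sketch of the concept, not Aesara's rewrite machinery:

```python
# Expressions are either a leaf name or a tuple (op, left, right).
def simplify(expr):
    if isinstance(expr, tuple):
        op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
        # Pattern: div(mul(x, y), y) -> x
        if op == "div" and isinstance(a, tuple) and a[0] == "mul" and a[2] == b:
            return a[1]
        return (op, a, b)
    return expr

graph = ("div", ("mul", "x", "y"), "y")
assert simplify(graph) == "x"
```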
@@ -20,8 +20,7 @@ Implementing an Aesara scalar Op allows that scalar operation to be reused
 by our elemwise operations on tensors. If the scalar operation has C code, the
 elemwise implementation will automatically have C code too. This
 will enable the fusion of elemwise operations using your new scalar
-operation. It can also reuse the GPU elemwise code. It is similar for
-reduction operations.
+operation. It is similar for reduction operations.
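The lifting-and-fusion idea can be sketched in plain Python. The ``elemwise`` helper and the scalar functions below are illustrative stand-ins for Aesara's machinery, which additionally generates and fuses C code:

```python
def elemwise(scalar_fn):
    """Lift a scalar function to operate element-by-element on sequences."""
    def op(*seqs):
        return [scalar_fn(*vals) for vals in zip(*seqs)]
    return op

def scalar_mul(a, b):
    return a * b

def scalar_add1(a):
    return a + 1

# Fusion: compose the scalar functions first and make a single elemwise
# pass, instead of two passes with an intermediate result.
fused = elemwise(lambda a, b: scalar_add1(scalar_mul(a, b)))
assert fused([1, 2, 3], [4, 5, 6]) == [5, 11, 19]
```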
 Be careful about some possible problems in the definition of the
 ``grad`` method, and about dependencies that may not be available. In
@@ -125,11 +124,7 @@ Random distribution
 We have 3 base random number generators. One that wraps NumPy's random
 generator, one that implements MRG31k3p and one that wraps CURAND.
-The fastest, but less developed, is CURAND. It works only on CUDA-enabled
-GPUs. It does not work on the CPU and it has fewer random distributions
-implemented.
-The recommended and 2nd faster is MRG. It works on the GPU and CPU and
+The recommended and second fastest is MRG. It works on the CPU and
 has more implemented distributions.
 The slowest is our wrapper on NumPy's random generator.
@@ -194,12 +194,11 @@ default values.
 :noindex:
 If filter_inplace is defined, it will be called instead of
-filter(). This is to allow reusing the old allocated memory. As
-of this writing this is used only when we transfer new data to a
-shared variable on the gpu.
-``storage`` will be the old value, i.e. the old numpy array,
-CudaNdarray, ...
+filter(). This is to allow reusing the old allocated memory. This was used
+only when new data was transferred to a shared variable on a GPU.
+``storage`` will be the old value (e.g. the old `ndarray`).
 .. method:: is_valid_value(value)
 :noindex:
@@ -6,17 +6,13 @@
 Frequently Asked Questions
 ==========================
-Does Aesara support Python 3?
-------------------------------
-We support both Python 2 >= 2.7 and Python 3 >= 3.4.
 Output slight numerical difference
 ----------------------------------
-Sometimes when you compare the output of Aesara using different
-Aesara flags, Aesara versions, CPU and GPU or with other software like
-NumPy, you will see small numerical differences.
+Sometimes when you compare the output of Aesara using different Aesara flags,
+Aesara versions, CPU and GPU devices, or with other software like NumPy, you
+will see small numerical differences.
 This is normal. Floating point numbers are approximations of real
 numbers. This is why doing a+(b+c) vs (a+b)+c can give small
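The non-associativity mentioned in this hunk is directly observable in any IEEE-754 double-precision arithmetic, independent of Aesara:

```python
# Mathematically equal, but evaluated in different orders:
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
assert left != right
assert abs(left - right) < 1e-15  # the difference is tiny, not a bug
```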
@@ -53,9 +49,9 @@ the flag ``mode=FAST_COMPILE`` which instructs Aesara to skip most
 optimizations and disables the generation of any c/cuda code. This is useful
 for quickly testing a simple idea.
-If c/cuda code is necessary, as when using a GPU, the flag
+If C code is necessary, the flag
 ``optimizer=fast_compile`` can be used instead. It instructs Aesara to
-skip time consuming optimizations but still generate c/cuda code.
+skip time-consuming optimizations but still generate C code.
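As an aside, both compilation-speed settings above can be chosen per run through ``AESARA_FLAGS`` (``myscript.py`` is a placeholder name):

```bash
# Skip most optimizations and all C code generation (quick idea testing):
AESARA_FLAGS='mode=FAST_COMPILE' python myscript.py

# Skip the slow optimizations but still generate C code:
AESARA_FLAGS='optimizer=fast_compile' python myscript.py
```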
 Similarly using the flag ``optimizer_excluding=inplace`` will speed up
 compilation by preventing optimizations that replace operations with a
@@ -95,11 +91,6 @@ garbage collection will keep all intermediate results' memory space to allow to
 reuse them during the next call to the same Aesara function, if they are of the
 correct shape. The shape could change if the shapes of the inputs change.
-.. note::
-    With :attr:`preallocate <config.gpuarray__preallocate>`, this isn't
-    very useful with GPU anymore.
 .. _unsafe_optimization:
 Unsafe optimization
@@ -173,11 +164,6 @@ but requires that all nodes in the graph have a C implementation:
 f = function([x], (x + 1.) * 2, mode=aesara.compile.mode.Mode(linker='c'))
 f(10.)
-New GPU backend using libgpuarray
----------------------------------
-The new aesara GPU backend (:ref:`gpuarray`) uses ``config.gpuarray__preallocate`` for GPU memory allocation.
 Related Projects
 ----------------
@@ -13,7 +13,4 @@ Supported platforms:
 install_windows
 install_centos6
-Once your setup is complete and if you installed the GPU libraries, head to :ref:`testing_the_gpu` to find how to verify
-everything is working properly.
 To update your current installation see :ref:`updating`.
@@ -12,23 +12,14 @@ Stable Installation
 With ``conda``
 ^^^^^^^^^^^^^^
-If you use conda, you can directly install both aesara and pygpu. Libgpuarray
-will be automatically installed as a dependency of pygpu.
+If you use conda, you can directly install aesara.
 .. code-block:: bash
-    conda install aesara pygpu
+    conda install aesara
-.. warning::
-    The Aesara developers do not maintain ``pygpu``, so compatibility isn't
-    guaranteed.
 With ``pip``
 ^^^^^^^^^^^^
-If you use pip, you have to install Aesara and libgpuarray separately.
 aesara
 ::::::
@@ -50,16 +41,6 @@ Install the latest stable version of Aesara with:
 If you encountered any trouble, head to the :ref:`troubleshooting` page.
-libgpuarray
-:::::::::::
-Download it with::
-    git clone https://github.com/Theano/libgpuarray.git
-    cd libgpuarray
-and then follow the `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
 Bleeding-Edge Installation (recommended)
 ----------------------------------------
@@ -80,19 +61,6 @@ Install the latest, bleeding-edge, development version of Aesara with:
 If you encountered any trouble, head to the :ref:`troubleshooting` page.
-libgpuarray
-^^^^^^^^^^^
-Install the latest, development version of libgpuarray following the
-`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
-.. note::
-    Currently, you need ``libgpuarray`` version ``0.7.X`` that is not in conda default channel.
-    But you can install it with our own channel ``mila-udem`` (that only supports Python 2.7, 3.5 and 3.6)::
-        conda install -c mila-udem pygpu
 Developer Installation
 ----------------------
@@ -116,8 +84,3 @@ Install the developer version of Aesara with:
 source directory.
 If you encountered any trouble, head to the :ref:`troubleshooting` page.
-libgpuarray
-^^^^^^^^^^^
-See instructions for bleeding-edge installation about ``libgpuarray``.
@@ -17,21 +17,6 @@ details so that we can add alternative instructions.
 .. include:: requirements.inc
-.. _gpu_macos:
-.. attention::
-    For MacOS you should be able to follow the above instructions to
-    setup CUDA, but be aware of the following caveats:
-    * If you want to compile the CUDA SDK code, you may need to temporarily
-      revert back to Apple's gcc (``sudo port select gcc``) as their Makefiles
-      are not compatible with MacPort's gcc.
-    * If CUDA seems unable to find a CUDA-capable GPU, you may need to manually
-      toggle your GPU on, which can be done with
-      `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
 .. attention::
 Aesara officially supports only clang on OS X. This can be installed
@@ -11,8 +11,6 @@ Ubuntu Installation Instructions
 from GitHub, please make sure you are reading `the latest version of this
 page <http://deeplearning.net/software/aesara_versions/dev/install_ubuntu.html>`_.
-.. _gpu_linux:
 .. |PythonDistRecommended| replace:: The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below)
 .. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
 .. |CompilerName| replace:: ``g++``
@@ -28,14 +26,13 @@ Prerequisites through System Packages (not recommended)
 If you want to acquire the requirements through your system packages
 and install them system wide follow these instructions:
-For Ubuntu 16.04 with cuda 7.5
+For Ubuntu 16.04
 .. code-block:: bash
     sudo apt-get install python-numpy python-scipy python-dev python-pip python-pytest g++ libopenblas-dev git graphviz
     sudo pip install Aesara
-    # cuda 7.5 doesn't support the default g++ version. Install a supported version and make it the default.
     sudo apt-get install g++-4.9
     sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
@@ -30,26 +30,10 @@ Install requirements and optional packages
 * Arguments between <...> are optional.
 * ``m2w64-toolchain`` package provides a fully-compatible version of GCC and is thus highly recommended.
-* ``git`` package installs git source control through conda, which is required for the development versions of Aesara and libgpuarray
+* ``git`` package installs git source control through conda, which is required for the development version of Aesara
-.. _gpu_windows:
-Install and configure the GPU drivers (recommended)
----------------------------------------------------
-.. warning::
-    OpenCL support is still minimal for now.
-Install CUDA drivers
-^^^^^^^^^^^^^^^^^^^^
-Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
-to install the CUDA driver and the CUDA Toolkit.
-You must reboot the computer after the driver installation.
-.. Installation of Aesara and libgpuarray.
+.. Installation of Aesara.
 .. include:: install_generic.inc
     :start-after: .. _install_generic:
@@ -73,7 +57,3 @@ generic guidelines to get a working environment:
 path`` option.
 3. Enable OpenMP support by checking the option ``openmp support
 option``.
-* Install CUDA with the same instructions as above.
-* Install the latest, development version of libgpuarray following the
-  `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
@@ -30,12 +30,12 @@
 By default, return a copy of the data. If ``borrow=True`` (and
 ``return_internal_type=False``), maybe it will return a copy.
-For tensor, it will always return a ndarray by default, so if
-the data is on the GPU, it will return a copy, but if the data
+For tensor, it will always return an `ndarray` by default, so if
+the data is on another device, it will return a copy, but if the data
 is on the CPU, it will return the original data. If you do
 ``borrow=True`` and ``return_internal_type=True``, it will
-always return the original data, not a copy, but this can be a
-GPU object.
+always return the original data, not a copy, but this can be a non-`ndarray`
+type of object.
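The copy-versus-alias contract can be sketched with a plain list standing in for the internal storage. The ``ToyShared`` class below is a hypothetical illustration of the semantics, not Aesara's shared-variable implementation:

```python
class ToyShared:
    def __init__(self, value):
        self._storage = list(value)

    def get_value(self, borrow=False):
        # borrow=False: always a copy -- the caller may mutate it freely.
        # borrow=True: the internal object itself -- no copy, but mutating
        # it silently changes the shared variable's contents.
        return self._storage if borrow else list(self._storage)

s = ToyShared([1, 2, 3])

copy = s.get_value()              # independent copy
copy[0] = 99
assert s.get_value() == [1, 2, 3]  # shared value untouched

alias = s.get_value(borrow=True)  # same object as the internal storage
alias[0] = 99
assert s.get_value() == [99, 2, 3]  # shared value changed through the alias
```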
 .. method:: set_value(self, new_value, borrow=False)
@@ -51,11 +51,11 @@ Environment Variables
 .. code-block:: bash
-    AESARA_FLAGS='floatX=float32,device=cuda0,gpuarray__preallocate=1' python <myscript>.py
+    AESARA_FLAGS='floatX=float32' python <myscript>.py
 If a value is defined several times in ``AESARA_FLAGS``,
 the right-most definition is used, so, for instance, if
-``AESARA_FLAGS='device=cpu,device=cuda0'`` is set, then ``cuda0`` will be
+``AESARA_FLAGS='floatX=float32,floatX=float64'`` is set, then ``float64`` will be
 used.
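The "right-most definition wins" rule amounts to later assignments overwriting earlier ones, as in this sketch (a hypothetical minimal parser; Aesara's real flag parsing also handles sections and nesting):

```python
def parse_flags(flags_string):
    """Parse 'key=value,key=value' pairs; later keys overwrite earlier ones."""
    result = {}
    for assignment in flags_string.split(","):
        if assignment:
            key, _, value = assignment.partition("=")
            result[key] = value  # right-most definition wins
    return result

flags = parse_flags("floatX=float32,floatX=float64")
assert flags["floatX"] == "float64"
```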
 .. envvar:: AESARARC
@@ -72,15 +72,11 @@ Environment Variables
 floatX = float32
 device = cuda0
-[gpuarray]
-preallocate = 1
 Configuration attributes that are available directly in ``config``
-(e.g. ``config.device``, ``config.mode``) should be defined in the
-``[global]`` section.
-Attributes from a subsection of ``config`` (e.g. ``config.gpuarray__preallocate``,
-``config.dnn__conv__algo_fwd``) should be defined in their corresponding
-section (e.g. ``[gpuarray]``, ``[dnn.conv]``).
+(e.g. ``config.mode``) should be defined in the ``[global]`` section.
+Attributes from a subsection of ``config``
+(e.g. ``config.dnn__conv__algo_fwd``) should be defined in their
+corresponding section (e.g. ``[dnn.conv]``).
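Putting the section rule together, a configuration file following it might look like this sketch (the ``algo_fwd = small`` entry is only an example value):

```ini
[global]
floatX = float32
mode = FAST_RUN

[dnn.conv]
algo_fwd = small
```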
 Multiple configuration files can be specified by separating them with ``':'``
 characters (as in ``$PATH``). Multiple configuration files will be merged,
@@ -105,20 +101,7 @@ import ``aesara`` and print the config variable, as in:
 .. attribute:: device
-    String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-    ``'opencl0:0'``, ``'opencl0:1'``, ...
-    Default device for computations. If ``'cuda*'``, change the default to try
-    to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
-    the OpenCL libraries will be used. To let the driver select the device,
-    use ``'cuda'`` or ``'opencl'``. If we are not able to use the GPU,
-    either we fall back on the CPU, or an error is raised, depending
-    on the :attr:`force_device` flag.
-    This flag's value cannot be modified during the program execution.
-    Do not use upper case letters; only lower case, even if NVIDIA uses
-    capital letters.
+    String value: either ``'cpu'``
 .. attribute:: force_device
@@ -126,29 +109,6 @@ import ``aesara`` and print the config variable, as in:
 Default: ``False``
-    If ``True`` and ``device=gpu*``, Aesara raises an error when it cannot
-    use the specified :attr:`device`. If ``True`` and ``device=cpu``,
-    Aesara disables the GPU. If ``False`` and ``device=gpu*``, and when the
-    specified device cannot be used, Aesara emits a warning and falls back to
-    the CPU.
-    This flag's value cannot be modified during the program execution.
-.. attribute:: init_gpu_device
-    String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-    ``'opencl0:0'``, ``'opencl0:1'``, ...
-    Initialize the gpu device to use.
-    When its value is ``'cuda*'`` or ``'opencl*'``, the Aesara
-    flag :attr:`device` must be ``'cpu'``.
-    Unlike :attr:`device`, setting this flag to a specific GPU will not
-    make Aesara attempt to use the device by default. More specifically, it
-    will **not** move computations, nor shared variables, to the specified GPU.
-    This flag can be used to run GPU-specific tests on a particular GPU, instead
-    of the default one.
 This flag's value cannot be modified during the program execution.
 .. attribute:: print_active_device
@@ -157,7 +117,7 @@ import ``aesara`` and print the config variable, as in:
 Default: ``True``
-Print the active device when the GPU device is initialized.
+Print the active device when the device is initialized.
 .. attribute:: floatX
@@ -186,10 +146,7 @@ import ``aesara`` and print the config variable, as in:
 Default: ``'default'``
 If ``more``, sometimes Aesara will select :class:`Op` implementations that
-are more "deterministic", but slower. In particular, on the GPU,
-Aesara will avoid using ``AtomicAdd``. Sometimes Aesara will still use
-non-deterministic implementations, e.g. when there isn't a GPU :class:`Op`
-implementation that is deterministic. See the ``dnn.conv.algo*``
+are more "deterministic", but slower. See the ``dnn.conv.algo*``
 flags for more cases.
 .. attribute:: allow_gc
@@ -207,9 +164,6 @@ import ``aesara`` and print the config variable, as in:
 functions with many fast :class:`Op`\s, but it also increases Aesara's memory
 usage.
-.. note:: If :attr:`config.gpuarray__preallocate` is the default value
-    or not disabled ``(-1)``, this is not useful anymore on the GPU.
 .. attribute:: config.scan__allow_output_prealloc
 Bool value, either ``True`` or ``False``
...@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in: ...@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in:
<https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__ <https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__
library, which is faster than the standard ``libm``. library, which is faster than the standard ``libm``.
.. attribute:: config.gpuarray__preallocate
Float value
Default: 0 (Preallocation of size 0, only cache the allocation)
Controls the preallocation of memory with the gpuarray backend.
This value represents the start size (either in MB or the fraction
of total GPU memory) of the memory pool. If more memory is needed,
Aesara will try to obtain more, but this can cause memory
fragmentation.
A negative value will completely disable the allocation cache.
This can have a severe impact on performance and should not be
used outside of debugging.
* < 0: disabled
* 0 <= N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
* > 1: use this number in megabytes (MB) of memory.
.. note::
This could cause memory fragmentation, so, if you have a memory
error while using the cache, try to allocate more memory at
the start, or disable it.
.. note::
The clipping at 95% can be bypassed by specifying the exact
number of megabytes. If more than 95% is needed, Aesara will
automatically try to get more memory, but this can cause
fragmentation; see the note above.
.. attribute:: config.gpuarray__sched
String value: ``'default'``, ``'multi'``, ``'single'``
Default: ``'default'``
Control the stream mode of contexts.
The sched parameter passed for context creation to ``pygpu``. With
CUDA, using ``"multi"`` means using the parameter
``cudaDeviceScheduleBlockingSync``. This is useful to lower the CPU overhead
when waiting for a GPU.
.. attribute:: config.gpuarray__single_stream
Boolean value
Default: ``True``
Control the stream mode of contexts.
If your computations consist of mostly small arrays, using
single-stream will avoid the synchronization overhead and usually
be faster. For larger arrays it does not make a difference yet.
.. attribute:: config.gpuarray__cache_path
Default: ``config.compiledir``/gpuarray_kernels
Directory to cache pre-compiled kernels for the gpuarray backend.
.. attribute:: linker

String value: ``'c|py'``, ``'py'``, ``'c'``, ``'c|py_nogc'``
@@ -13,8 +13,7 @@ import ``aesara.sparse`` to enable it.

The sparse module provides the same functionality as the tensor
module. The difference lies under the covers because sparse matrices
do not store data in a contiguous array. Note that there are no GPU
implementations for sparse matrices in Aesara. The sparse module has
been used in:

- NLP: Dense linear transformations of sparse vectors.
@@ -29,51 +29,13 @@ The recommended user interface are:

With these new interfaces, Aesara will automatically use the fastest
implementation in many cases. On the CPU, the implementation is a GEMM
based one. On the GPU, there is a GEMM based and :ref:`cuDNN
<libdoc_gpuarray_dnn>` version.
By default on the GPU, if cuDNN is available, it will be used,
otherwise we will fall back to using gemm based version (slower than
cuDNN in most cases and uses more memory). To get an error if cuDNN
cannot be used, you can supply the Aesara flag ``dnn.enable=True``.
Either the cuDNN or the gemm version can be disabled using the Aesara flags
``optimizer_excluding=conv_dnn`` and ``optimizer_excluding=conv_gemm``,
respectively. If both are disabled, it will raise an error.
For the cuDNN version, there are different algorithms with different
memory/speed trade-offs. Manual selection of the right one is very
difficult as it depends on the shapes and hardware. So it can change
for each layer. An auto-tuning mode exists and can be activated by
those flags: ``dnn__conv__algo_fwd=time_once``,
``dnn__conv__algo_bwd_data=time_once`` and
``dnn__conv__algo_bwd_filter=time_once``. Note that they are useful mostly
when the shapes do not change.
This auto-tuning has the inconvenience that the first call is much
slower as it tries and times each implementation it has. So if you
benchmark, it is important that you remove the first call from your
timing.
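The first-call exclusion can be sketched in plain Python (``fake_fn`` and its one-time delay are illustrative stand-ins for a compiled Aesara function, not Aesara API):

```python
import time

def benchmark(fn, n_calls=10):
    """Average runtime of fn, excluding the first (warm-up) call."""
    fn()  # warm-up: this is where compilation / auto-tuning cost lands
    start = time.perf_counter()
    for _ in range(n_calls):
        fn()
    return (time.perf_counter() - start) / n_calls

# Stand-in for a compiled function whose first call is slow
# (e.g. because of the auto-tuning described above).
calls = {"n": 0}
def fake_fn():
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.05)  # simulated one-time tuning cost

avg = benchmark(fake_fn)
print(avg < 0.01)  # the slow first call does not inflate the average
```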
Also, a meta-optimizer has been implemented for the gpu convolution
implementations to automatically choose the fastest implementation
for each specific convolution in your graph. For each instance, it will
compile and benchmark each applicable implementation and choose the
fastest one. It can be enabled using ``optimizer_including=conv_meta``.
The meta-optimizer can also selectively disable the cuDNN and gemm versions
using the Aesara flags ``metaopt__optimizer_excluding=conv_dnn`` and
``metaopt__optimizer_excluding=conv_gemm``, respectively.
.. note::
Aesara had older user interfaces like
aesara.tensor.nnet.conv.conv2d. Do not use them anymore. They
will give you slower code and won't allow easy switching between CPU
and GPU computation. They also support fewer types of convolution.
Implementation Details
======================

@@ -85,10 +47,6 @@ not need to read it. Aesara will select it for you.

- :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`.
old 2d convolution. DO NOT USE ANYMORE.
- :func:`GpuCorrMM <aesara.gpuarray.blas.GpuCorrMM>`
This is a GPU-only 2d correlation implementation taken from
`caffe's CUDA implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_. It does not flip the kernel.
For each element in a batch, it first creates a
`Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters

@@ -100,15 +58,8 @@ not need to read it. Aesara will select it for you.

This is a CPU-only 2d correlation implementation taken from
`caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_.
It does not flip the kernel.
- :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library.
- Implemented operators for neural network 3D / video convolution:
- :func:`GpuCorr3dMM <aesara.gpuarray.blas.GpuCorr3dMM>`
This is a GPU-only 3d correlation relying on a Toeplitz matrix
and gemm implementation (see :func:`GpuCorrMM <aesara.sandbox.cuda.blas.GpuCorrMM>`)
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
``(no of channels * filter width * filter height * filter depth, output width * output height * output depth)``.
- :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>`
This is a CPU-only 3d correlation implementation based on
the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`).

@@ -116,12 +67,6 @@ not need to read it. Aesara will select it for you.

replacement for nnet.conv3d. For convolutions done on CPU,
nnet.conv3d will be replaced by Corr3dMM.
- :func:`dnn_conv3d <aesara.gpuarray.dnn.dnn_conv3d>` GPU-only
3D convolution using NVIDIA's cuDNN library (as :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` but for 3d).
If cuDNN is available, by default, Aesara will replace all nnet.conv3d
operations with dnn_conv.
- :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses the conv2d with data reshaping.
It is faster in some corner cases than conv3d. It flips the kernel.
@@ -14,8 +14,7 @@

.. note::

This interface is the preferred interface. It will be moved
automatically to the GPU.
.. note::
@@ -42,7 +42,6 @@ Optimization o4 o3 o2

========================================================= ============== === === ================= ============= ======
:term:`merge` x x x x x
:term:`constant folding<constant folding>` x x x x x
:term:`GPU transfer` x x x x x
:term:`shape promotion<shape promotion>` x x x
:term:`fill cut<fill cut>` x x x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x x x

@@ -247,32 +246,10 @@ Optimization o4 o3 o2

This optimization compresses subgraphs of computationally cheap
elementwise operations into a single Op that does the whole job in a
single pass over the inputs (like loop fusion). This is a win when
transfer from main memory to the CPU (or from graphics memory to the
GPU) is a bottleneck.
See :class:`FusionOptimizer`
GPU transfer
The current strategy for choosing which expressions to evaluate on the
CPU and which to evaluate on the GPU is a greedy one. There are a
number of Ops ***TODO*** with GPU implementations and whenever we find
a graph copying data from GPU to CPU in order to evaluate an
expression that could have been evaluated on the GPU, we substitute
the GPU version of that Op for the CPU version. Likewise if we are
copying the output of an Op with a GPU implementation to the GPU,
then we substitute the GPU version for the CPU version. In this way, if all goes well,
this procedure will result in a graph with the following form:
1. copy non-shared inputs to GPU
2. carry out most/all computations on the GPU
3. copy output back to CPU
When using a GPU, :func:`shared()` will default to GPU storage for
'float32' ndarray arguments, and these shared variables act as seeds
for the greedy algorithm.
See :func:`aesara.sandbox.cuda.opt.*`.
local_log_softmax
This is a stabilization optimization.
It can happen due to rounding errors that the softmax probability of one value gets to 0.
@@ -9,10 +9,6 @@ Requirements

.. _Python: http://www.python.org/
.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
.. _pycuda: https://mathema.tician.de/software/pycuda/
.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/
.. _warp-ctc: https://github.com/baidu-research/warp-ctc
Python_ >= 3.7
@@ -42,20 +38,6 @@ Requirements

`pydot-ng <https://github.com/pydot/pydot-ng>`_
To handle large pictures for gifs/images.
`NVIDIA CUDA drivers and SDK`_
**Highly recommended** Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
`libgpuarray`_
Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`).
`pycuda`_ and `skcuda`_
Required for some extra operations on the GPU like fft and
solvers. We use them to wrap cufft and cusolver. Quick install
``pip install pycuda scikit-cuda``. For cuda 8, the dev
version of skcuda (will be released as 0.5.2) is needed for
cusolver: ``pip install pycuda; pip install
git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda``.
`warp-ctc`_
Required for :ref:`Aesara CTC implementation
<libdoc_tensor_nnet_ctc>`. It is faster than using an
@@ -84,28 +66,3 @@ Install requirements and optional packages

conda install numpy scipy mkl pytest <sphinx> <pydot-ng>

* Arguments between <...> are optional.
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
1. Install CUDA drivers
* Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
* You must reboot the computer after the driver installation.
* Test that it was loaded correctly after the reboot, executing the
command `nvidia-smi` from the command line.
.. note::
Sanity check: The *bin* subfolder should contain an *nvcc*
program. This folder is called the *cuda root* directory.
2. Fix 'lib' path
* Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a
64-bit OS) to your ``$LD_LIBRARY_PATH`` environment
variable. Example: ``/usr/local/cuda/lib64``
@@ -54,7 +54,7 @@ if __name__ == '__main__':

pythonpath = os.pathsep.join([throot, pythonpath])
sys.path[0:0] = [throot] # We must not use os.environ.

# Make sure we don't use gpu to compile documentation
env_th_flags = os.environ.get('AESARA_FLAGS', '')
os.environ['AESARA_FLAGS'] = 'device=cpu,force_device=True'
@@ -59,12 +59,6 @@ where X is far less than Y and Z (i.e. X << Y < Z).

This scenario arises when an operation requires allocation of a large contiguous
block of memory but no blocks of sufficient size are available.
GPUs do not have virtual memory and as such all allocations must be assigned to
a contiguous memory region. CPUs do not have this limitation because of their
support for virtual memory. Multiple allocations on a GPU can result in memory
fragmentation, which can make it more difficult to find contiguous regions
of memory of sufficient size during subsequent memory allocations.
A known example is related to writing data to shared variables. When updating a
shared variable Aesara will allocate new space if the size of the data does not
match the size of the space already assigned to the variable. This can lead to

@@ -80,9 +74,6 @@ aesara.function returns a float64 when the inputs are float32 and int{32, 64}

It should be noted that using float32 and int{32, 64} together
inside a function would provide float64 as output.
Since the GPU can't compute this kind of output, it would be
preferable not to use those dtypes together.
To help you find where float64 are created, see the
:attr:`warn_float64` Aesara flag.
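The upcast described above follows NumPy's type-promotion rules, which can be checked directly (this sketch uses NumPy only, not Aesara):

```python
import numpy as np

x = np.array([1.0, 2.0], dtype=np.float32)
i = np.array([3, 4], dtype=np.int64)

# float32 combined with int{32, 64} is promoted to float64,
# because float32 cannot represent every 32/64-bit integer exactly.
print((x + i).dtype)                         # float64
print(np.result_type(np.float32, np.int32))  # float64
```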
@@ -120,40 +111,21 @@ All Aesara tests should pass (skipped tests and known failures are normal). If

some test fails on your machine, you are encouraged to tell us what went
wrong in the GitHub issues.
.. warning::
Aesara's test should **NOT** be run with ``device=cuda``
or they will fail. The tests automatically use the gpu, if any, when
needed. If you don't want Aesara to ever use the gpu when running tests,
you can set :attr:`config.device` to ``cpu`` and
:attr:`config.force_device` to ``True``.
.. _slow_or_memory:

Why is my code so slow/uses so much memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are a few things you can easily do to change the trade-off
between speed and memory usage. If nothing is said, this affects the
CPU and GPU memory usage.
Could speed up and lower memory usage:
- :ref:`cuDNN <libdoc_gpuarray_dnn>` default cuDNN convolution uses less
memory than the Aesara version. But some flags allow it to use more
memory. GPU only.
Could raise memory usage but speed up computation:
- :attr:`config.gpuarray__preallocate` = 1 # Preallocates the GPU memory
and then manages it in a smart way. Does not raise the memory
usage much, but if you are at the limit of available GPU memory you might
need to specify a lower value. GPU only.
- :attr:`config.allow_gc` = False
- :attr:`config.optimizer_excluding` =low_memory , GPU only for now.
Could lower the memory usage, but raise computation time:

- :attr:`config.scan__allow_gc` = True # Probably not a significant slowdown on the GPU if the memory cache is not disabled
- :attr:`config.scan__allow_output_prealloc` = False
- Use :func:`batch_normalization()
<aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory
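As a sketch, several of the trade-offs above can be combined in one ``AESARA_FLAGS`` setting (flag names come from the lists above; ``train.py`` stands for your own script):

```shell
# Favor speed over memory usage:
AESARA_FLAGS='allow_gc=False' python train.py

# Favor lower memory usage over speed:
AESARA_FLAGS='scan__allow_gc=True,scan__allow_output_prealloc=False' python train.py
```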
@@ -293,7 +265,7 @@ Aesara/BLAS speed test:

python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py

This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Aesara/NumPy
configuration information. Then, it will print the running time of the same
benchmarks for your installation. Try to find a CPU similar to yours in
the table, and check that the single-threaded timings are roughly the same.
@@ -194,24 +194,25 @@ makes it possible to expose Aesara's internal variables without a copy, then it

proceeds as fast as an in-place update.

When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily.

*Solution*: update to a newer version of Aesara.

* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.

*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.

* It is also worth mentioning that current GPU copying routines
support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
@@ -219,8 +220,6 @@ be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:

*Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
(Further information on the current implementation of the GPU version
of ``set_value()`` can be found here: :ref:`libdoc_gpuarray_type`)
.. _borrowfunction:
@@ -329,26 +329,6 @@ Tips:

of type *float64*.
"Why does my GPU function seem to be slow?"
-------------------------------------------
When you compile an Aesara function and do not get the speedup that you expect over the
CPU performance of the same code, it is oftentimes because some Ops are running
on the CPU instead of the GPU. If that is the case, you can use assert_no_cpu_op to check whether there
is a CPU Op in your computational graph. assert_no_cpu_op can take one of the following three
options:
* ``warn``: Raise a warning
* ``pdb``: Stop with a pdb in the computational graph during the compilation
* ``raise``: Raise an error,
if there is a CPU Op in the computational graph.
It is possible to use this mode by providing the flag in AESARA_FLAGS, such as:
``AESARA_FLAGS="float32,device=gpu,assert_no_cpu_op='raise'" python test.py``
But note that this check will not catch all the CPU Ops; it might miss
some.
.. _faq_monitormode:

"How do I Step through a Compiled Function?"
@@ -242,9 +242,7 @@ achieve a similar result by returning the new expressions, and working with

them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Aesara has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU<using_gpu>`.
It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the

@@ -375,7 +373,6 @@ distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of

draws from a normal distribution. The distributions that are implemented are
defined as :class:`RandomVariable`\s
in :ref:`basic<libdoc_tensor_random_basic>`. They only work on CPU.
See `Other Implementations`_ for GPU version.
Now let's use these objects. If we call ``f()``, we get random uniform numbers.
@@ -502,22 +499,6 @@ Other Random Distributions

There are :ref:`other distributions implemented <libdoc_tensor_random_basic>`.
.. _example_other_random:
Other Implementations
---------------------
There is another implementations based on :ref:`MRG31k3p
<libdoc_rng_mrg>`.
The `RandomStream` only works on the CPU; MRG31k3p works on both the CPU and GPU.
.. note::
To use the MRG version easily, you can just change the import to:
.. code-block:: python
from aesara.sandbox.rng_mrg import MRG_RandomStream as RandomStream
.. _logistic_regression:
@@ -48,8 +48,6 @@ Advanced

.. toctree::

sparse
using_gpu
using_multi_gpu
conv_arithmetic

Advanced configuration and debugging
@@ -17,7 +17,6 @@ Scan

- Advantages of using ``scan`` over *for* loops:

- Number of iterations to be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Aesara function.
- Can lower the overall memory usage by detecting the actual amount of memory needed.
@@ -83,11 +83,8 @@ Consider the logistic regression:

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())

for i in range(training_steps):
@@ -137,7 +134,7 @@ is controlled by the value of the ``mode`` parameter.

Aesara defines the following modes by name:

- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations. So GPU is disabled.
- ``'FAST_RUN'``: Apply all optimizations and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes, but can identify
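As a sketch, the mode can also be selected globally through the flags rather than per function (``script.py`` is a placeholder for your own script):

```shell
# Quick iteration while debugging graph construction:
AESARA_FLAGS='mode=FAST_COMPILE' python script.py

# Full optimization (the default):
AESARA_FLAGS='mode=FAST_RUN' python script.py
```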
@@ -47,11 +47,8 @@ predict = aesara.function(inputs=[x], outputs=prediction,

if any(x.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for x in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(x.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for x in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())

for i in range(training_steps):
.. _using_gpu:
=============
Using the GPU
=============
For an introductory discussion of *Graphical Processing Units* (GPU)
and their use for intensive parallel computation purposes, see `GPGPU
<http://en.wikipedia.org/wiki/GPGPU>`_.
One of Aesara's design goals is to specify computations at an abstract
level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.
Using the GPU in Aesara is as simple as setting the ``device``
configuration flag to ``device=cuda``. You can optionally target a
specific GPU by appending its number, e.g. ``device=cuda2``. It is also
encouraged to set the floating point precision to ``float32`` when
working on the GPU, as that is usually much faster. For example:
``AESARA_FLAGS='device=cuda,floatX=float32'``. You can also set these
options in the ``.aesararc`` file's ``[global]`` section:
.. code-block:: cfg
[global]
device = cuda
floatX = float32
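The ``AESARA_FLAGS`` environment variable holds a comma-separated list of ``name=value`` pairs. As a rough illustration of that format only (this is *not* Aesara's actual flag parser), a minimal sketch:

```python
def parse_flags(flags):
    """Split an AESARA_FLAGS-style string into a {name: value} dict."""
    pairs = (item.split("=", 1) for item in flags.split(",") if item)
    return {name.strip(): value.strip() for name, value in pairs}

# The example string from the text above:
settings = parse_flags("device=cuda,floatX=float32")
```

The same pairs can equivalently be spread over sections of the ``.aesararc`` file, as shown above.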
.. note::
* If your computer has multiple GPUs and you use ``device=cuda``,
the driver selects the one to use (usually cuda0).
* You can use the program ``nvidia-smi`` to change this policy.
* By default, when ``device`` indicates preference for GPU computations,
Aesara will fall back to the CPU if there is a problem with the GPU.
You can use the flag ``force_device=True`` to instead raise an error when
Aesara cannot use the GPU.
.. _gpuarray:
GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit (CUDA or OpenCL). Detailed
instructions to accomplish that are provided at
`libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
To install Nvidia's GPU-programming toolchain (CUDA) and configure
Aesara to use it, see the installation instructions for
:ref:`Linux <gpu_linux>`, :ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
While OpenCL supports all types of compute devices, for the remainder
of this section whatever compute device you are using will be referred
to as the GPU.
.. note::
The GpuArray backend uses ``config.gpuarray__preallocate`` for GPU memory
allocation.
.. warning::
The backend was designed to support OpenCL; however, current support is
incomplete. Many very useful ops still do not support it because they
were ported from the old backend with minimal changes.
.. _testing_the_gpu:
Testing Aesara with GPU
~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program
into a file and run it.
Use the Aesara flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
.. testcode::
import aesara
from aesara import function, config, shared, tensor as at
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], at.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, aesara.tensor.elemwise.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`aesara.shared` function to make sure that the
input *x* is stored on the GPU.
.. testoutput::
:hide:
:options: +ELLIPSIS
[Elemwise{exp,no_inplace}(<TensorType(float64, (None,))>)]
Looping 1000 times took ... seconds
Result is ...
Used the cpu
.. code-block:: none
$ AESARA_FLAGS=device=cpu python gpu_tutorial1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, (None,))>)]
Looping 1000 times took 2.271284 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial1.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.697514 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, functions that execute on the GPU still return a standard
NumPy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows the device on which code runs to be changed simply by
setting the ``device`` flag, without modifying the code itself.
If you don't mind a loss of flexibility, you can ask Aesara to return
the GPU object directly. The following code is modified to do just that.
.. testcode::
import aesara
from aesara import function, config, shared, tensor as at
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], at.exp(x).transfer(None))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, aesara.tensor.elemwise.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
Here ``at.exp(x).transfer(None)`` means "copy ``exp(x)`` to the GPU",
where ``None`` denotes the default GPU context (used when one is not
explicitly given). For information on how to set GPU contexts, see
:ref:`tut_using_multi_gpu`.
The output is
.. testoutput::
:hide:
:options: +ELLIPSIS, +SKIP
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
.. code-block:: none
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
While the time per call appears to be much lower than in the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is partly due to the asynchronous nature
of execution on GPUs: the work isn't completed yet, just
'launched'. We'll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray
with the usual NumPy conversions, such as ``numpy.asarray()``.
For even more speed, you can play with the ``borrow`` flag. See
:ref:`borrowfunction`.
What Can be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will of course vary from device to
device, and also as we refine our implementation:
* In general, matrix multiplication, convolution, and large element-wise
operations can be accelerated a lot (5-50x) when arguments are large enough
to keep 30 processors busy.
* Indexing, dimension-shuffling, and constant-time reshaping are about
as fast on the GPU as on the CPU.
* Summation over rows/columns of tensors can be a little slower on the
GPU than on the CPU.
* Copying of large quantities of data to and from a device is relatively slow,
and often cancels most of the advantage of one or two accelerated functions
on that data. Getting GPU performance largely hinges on making data transfer
to the device pay off.
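The transfer trade-off in the last bullet can be made concrete with a toy cost model (illustrative only; the numbers below are hypothetical, not measurements):

```python
def gpu_worthwhile(cpu_time, speedup, transfer_time):
    """Toy model: the GPU pays off only if compute savings exceed transfer cost."""
    gpu_time = cpu_time / speedup + transfer_time
    return gpu_time < cpu_time

# Large compute, modest transfer: the kernel speedup survives.
big_job = gpu_worthwhile(cpu_time=1.0, speedup=10.0, transfer_time=0.1)

# Tiny compute, same transfer: the transfer cancels the advantage.
small_job = gpu_worthwhile(cpu_time=0.05, speedup=10.0, transfer_time=0.1)
```

In other words, the same 10x kernel speedup can be a net win or a net loss depending entirely on how much work is done per byte transferred.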
The backend supports all regular Aesara data types (float32, float64,
int, ...); however, GPU support varies, and some devices cannot handle
double-precision (float64) or small (less than 32-bit, like int16) data
types. You will get an error at compile time or runtime if this is the case.
By default, all inputs will be transferred to the GPU. You can prevent an
input from being transferred by setting its ``tag.target`` attribute to
``'cpu'``.
Complex support is untested and most likely completely broken.
Tips for Improving Performance on GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Consider adding ``floatX=float32`` (or the type you are using) to your
``.aesararc`` file if you plan to do a lot of GPU work.
* The GPU backend supports *float64* variables, but they are still slower
to compute than *float32*. The more *float32*, the better GPU performance
you will get.
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` (which
follow the type set in ``floatX``) to ``dmatrix``, ``dvector`` and
``dscalar``. The latter enforce double precision (*float64* on most
machines), which slows down GPU computations on current hardware.
* Minimize transfers to the GPU device by using ``shared`` variables
to store frequently-accessed data (see :func:`shared()<shared.shared>`).
When using the GPU, tensor ``shared`` variables are stored on
the GPU by default to eliminate transfer time for GPU ops using those
variables.
* If you aren't happy with the performance you see, try running your
script with ``profile=True`` flag. This should print some timing
information at program termination. Is time being used sensibly? If
an op or Apply is taking more time than its share, then if you know
something about GPU programming, have a look at how it's implemented
in aesara.gpuarray. Check the line similar to *Spent Xs(X%) in cpu
op, Xs(X%) in gpu op and Xs(X%) in transfer op*. This can tell you
if not enough of your graph is on the GPU or if there is too much
memory transfer.
* To check whether all the Ops in the computational graph are
running on the GPU, set the ``assert_no_cpu_op`` flag: ``warn`` logs a
warning, ``raise`` raises an error, and ``pdb`` starts a debugger
breakpoint when a CPU Op is found in the computational graph.
.. _gpu_async:
GPU Async Capabilities
~~~~~~~~~~~~~~~~~~~~~~
By default, all operations on the GPU are run asynchronously. This
means that they are only scheduled to run, and the function returns
immediately. This is handled mostly transparently by the underlying
libgpuarray. A forced synchronization point is introduced when doing
memory transfers between device and host.
It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
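The benchmarking pattern just described can be sketched as follows. Since this sketch must run without a GPU, ``FakeResult`` is a stand-in for a real pygpu ``GpuArray``, whose ``sync()`` method blocks until all queued kernels have finished:

```python
import time

class FakeResult:
    """Stand-in for a GpuArray; only the sync() method is modelled here."""
    def sync(self):
        pass  # a real GpuArray blocks here until queued work completes

def benchmark(fn, iters):
    """Time `iters` calls to fn, forcing a sync before reading the clock."""
    t0 = time.time()
    result = None
    for _ in range(iters):
        result = fn()
    result.sync()  # without this, only kernel *launches* would be timed
    return time.time() - t0

elapsed = benchmark(lambda: FakeResult(), 100)
```

Forgetting the final ``sync()`` is what produced the misleadingly tiny 0.04 s timing in the previous section.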
Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the value of a ``shared`` variable, e.g. to provide new data
to processes, use ``shared_variable.set_value(new_value)``. For a lot
more detail about this, see :ref:`aliasing`.
Exercise
~~~~~~~~
Consider again the logistic regression:
.. testcode::
import numpy
import aesara
import aesara.tensor as at
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(aesara.config.floatX),
rng.randint(size=N,low=0, high=2).astype(aesara.config.floatX))
training_steps = 10000
# Declare Aesara symbolic variables
x = at.matrix("x")
y = at.vector("y")
w = aesara.shared(rng.randn(feats).astype(aesara.config.floatX), name="w")
b = aesara.shared(numpy.asarray(0., dtype=aesara.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# Construct Aesara expression graph
p_1 = 1 / (1 + at.exp(-at.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*at.log(p_1) - (1-y)*at.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = at.grad(cost, [w,b])
# Compile expressions to functions
train = aesara.function(
inputs=[x,y],
outputs=[prediction, xent],
updates=[(w, w-0.01*gw), (b, b-0.01*gb)],
name = "train")
predict = aesara.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train(D[0], D[1])
print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
print("floatX=", aesara.config.floatX)
print("device=", aesara.config.device)
.. testoutput::
:hide:
:options: +ELLIPSIS
Used the cpu
target values for D
...
prediction on D
...
Modify and execute this example to run on GPU with ``floatX=float32``
and time it using the command line ``time python file.py``. (Of
course, you may use some of your answer to the exercise in section
:ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put
your ideas to test.
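As a plain-NumPy reference point for this exercise (a minimal sketch with reduced sizes and a simplified gradient-descent loop; it is not the tutorial's solution and involves no GPU), the same model and update rule look like this:

```python
import numpy as np

# Tiny stand-in sizes so the sketch runs instantly; the tutorial uses
# N=400, feats=784 and 10000 training steps.
rng = np.random.default_rng(22)
N, feats, steps = 40, 8, 200
X = rng.standard_normal((N, feats)).astype(np.float32)
y = rng.integers(0, 2, N).astype(np.float32)
w = np.zeros(feats, dtype=np.float32)
b = np.float32(0.0)

def cost(w, b):
    p_1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # probability of a one
    xent = -y * np.log(p_1) - (1 - y) * np.log(1 - p_1)  # cross-entropy
    return xent.mean() + 0.01 * (w ** 2).sum()        # plus L2 penalty

c0 = cost(w, b)
for _ in range(steps):
    p_1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p_1 - y                       # gradient of the summed xent w.r.t. logits
    w = w - 0.1 * (X.T @ g / N + 0.02 * w)  # 0.02*w is the L2 penalty gradient
    b = b - 0.1 * g.mean()
assert cost(w, b) < c0                # gradient descent lowered the cost
```

Timing this version against the Aesara one (CPU and GPU) gives a feel for where the compiled graph, and later the GPU, actually help.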
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Aesara, which is a meta-programmer, there are:
* **CUDA**: GPU programming API by NVIDIA based on extension to C (CUDA C)
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) are maturing.
* **OpenCL**: multi-vendor version of CUDA
* More general, standardized.
* Fewer libraries, lesser spread.
* **PyCUDA**: Python bindings to the CUDA driver interface, allowing access to
Nvidia's CUDA parallel computation API from Python
* Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (``pycuda.driver.SourceModule``).
GPU memory buffer (``pycuda.gpuarray.GPUArray``).
Helpful documentation.
* Completeness: Binding to all of CUDA's driver API.
* Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
* Speed: PyCUDA's base layer is written in C++.
* Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
Makes it much easier to write correct, leak- and crash-free code.
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory
allocated in it is also freed).
(This is adapted from PyCUDA's `documentation <http://documen.tician.de/pycuda/index.html>`_
and Andreas Kloeckner's `website <http://mathema.tician.de/software/pycuda>`_ on PyCUDA.)
* **PyOpenCL**: PyCUDA for OpenCL
Learning to Program with PyCUDA
-------------------------------
If you already enjoy a good proficiency with the C programming language, you
may easily leverage your knowledge by learning, first, to program a GPU with the
CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA
API with a Python wrapper.
The following resources will assist you in this learning process:
* **CUDA API and CUDA C: Introductory**
* `NVIDIA's slides <http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf>`_
* `Stein's (NYU) slides <http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf>`_
* **CUDA API and CUDA C: Advanced**
* `MIT IAP2009 CUDA <https://sites.google.com/site/cudaiap2009/home>`_
(full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
* `Course U. of Illinois <http://courses.engr.illinois.edu/ece498/al/index.html>`_
(full lectures, Kirk-Hwu textbook)
* `NVIDIA's knowledge base <http://www.nvidia.com/content/cuda/cuda-developer-resources.html>`_
(extensive coverage, levels from introductory to advanced)
* `practical issues <http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s>`_
(on the relationship between grids, blocks and threads; see also linked and related issues on same page)
* `CUDA optimization <http://www.gris.informatik.tu-darmstadt.de/cuda-workshop/slides.html>`_
* **PyCUDA: Introductory**
* `Kloeckner's slides <http://www.gputechconf.com/gtcnew/on-demand-gtc.php?sessionTopic=&searchByKeyword=kloeckner&submit=&select=+&sessionEvent=2&sessionYear=2010&sessionFormat=3>`_
* `Kloeckner's website <http://mathema.tician.de/software/pycuda>`_
* **PyCUDA: Advanced**
* `PyCUDA documentation website <http://documen.tician.de/pycuda/>`_
The following examples give a foretaste of programming a GPU with PyCUDA. Once
you feel competent enough, you may try your hand at the corresponding exercises.
**Example: PyCUDA**
.. code-block:: python
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print(dest)
Exercise
~~~~~~~~
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
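One way to approach the (20, 10) case is to flatten a 2-D thread coordinate into a 1-D offset into the row-major buffer backing the matrix, as CUDA kernels commonly do with ``threadIdx.x + threadIdx.y * blockDim.x``. A sketch of that index arithmetic in plain Python (the block shape is one possible choice, not the only one):

```python
def flat_index(tx, ty, block_dim_x):
    """Map a 2-D thread coordinate (threadIdx.x, threadIdx.y) to a 1-D offset."""
    return ty * block_dim_x + tx

# With block=(20, 10, 1), the 200 threads cover offsets 0..199 exactly once,
# matching the 200 elements of a row-major buffer for a (20, 10)-sized matrix.
offsets = sorted(flat_index(tx, ty, 20) for ty in range(10) for tx in range(20))
```

Inside the kernel, ``dest[i] = a[i] * b[i]`` then works unchanged with ``i`` computed this way.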
.. _pyCUDA_aesara:
**Example: Aesara + PyCUDA**
.. code-block:: python
import numpy, aesara
import aesara.misc.pycuda_init
from pycuda.compiler import SourceModule
import aesara.sandbox.cuda as cuda
from aesara.graph.basic import Apply
from aesara.graph.op import Op
class PyCUDADoubleOp(Op):
__props__ = ()
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2, impl):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [storage_map[v] for v in node.inputs]
outputs = [storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape != inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512, 1, 1), grid=grid)
return thunk
Use this code to test it:
>>> x = aesara.tensor.type.fmatrix()
>>> f = aesara.function([x], PyCUDADoubleOp()(x)) # doctest: +SKIP
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2) # doctest: +SKIP
>>> print(numpy.asarray(f(xv))) # doctest: +SKIP
Exercise
~~~~~~~~
Run the preceding example.
Modify and execute to multiply two matrices: *x* * *y*.
Modify and execute to return two outputs: *x + y* and *x - y*.
(Notice that Aesara's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. to avoid constraining the input to be *C-contiguous*).
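The stride exercise hinges on C-contiguity, which ``gpu_contiguous`` enforces in the example above by copying non-contiguous inputs. NumPy makes the concept easy to inspect (values shown in the comments assume the exact shapes below):

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
assert a.flags['C_CONTIGUOUS']      # freshly built arrays are C-contiguous

t = a.T                             # transpose is a view with swapped strides
assert not t.flags['C_CONTIGUOUS']
assert t.strides == (4, 16)         # float32: 4-byte items, one original row = 16 bytes

c = np.ascontiguousarray(t)         # a copy, conceptually what gpu_contiguous does
assert c.flags['C_CONTIGUOUS']
```

Supporting strides directly in the kernel means indexing with these byte (or element) strides instead of assuming a dense row-major layout, which avoids the copy.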
Note
----
* See :ref:`example_other_random` to know how to handle random numbers
on the GPU.
* The mode ``FAST_COMPILE`` disables C code, and therefore also disables the GPU. You
can use the Aesara flag ``optimizer=fast_compile`` to speed up
compilation and keep the GPU enabled.
#!/usr/bin/env python
# Aesara tutorial
# Solution to Exercise in section 'Using the GPU'
# 1. Raw results
import numpy as np
import aesara
import aesara.tensor as at
aesara.config.floatX = 'float32'
rng = np.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(aesara.config.floatX),
rng.randint(size=N, low=0, high=2).astype(aesara.config.floatX))
training_steps = 10000
# Declare Aesara symbolic variables
x = aesara.shared(D[0], name="x")
y = aesara.shared(D[1], name="y")
w = aesara.shared(rng.randn(feats).astype(aesara.config.floatX), name="w")
b = aesara.shared(np.asarray(0., dtype=aesara.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# print("Initial model:")
# print(w.get_value(), b.get_value())
# Construct Aesara expression graph
p_1 = 1 / (1 + at.exp(-at.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * at.log(p_1) - (1 - y) * at.log(1 - p_1) # Cross-entropy
cost = at.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = at.grad(cost, [w, b])
# Compile expressions to functions
train = aesara.function(
inputs=[],
outputs=[prediction, xent],
updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
name="train")
predict = aesara.function(inputs=[], outputs=prediction,
name="predict")
if any(n.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for n in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(n.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for n in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train()
# print("Final model:")
# print(w.get_value(), b.get_value())
print("target values for D")
print(D[1])
print("prediction on D")
print(predict())
"""
# 2. Profiling
# 2.1 Profiling for CPU computations
# In your terminal, type:
$ AESARA_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D
# Followed by the profiling output. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:
Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 1.300452e+00s
Time in Function.fn.__call__: 1.215823e+00s (93.492%)
Time in thunks: 1.157602e+00s (89.015%)
Total compile time: 8.922548e-01s
Number of Apply nodes: 17
Aesara Optimizer time: 6.270301e-01s
Aesara validate time: 5.993605e-03s
Aesara Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
Import time 3.543139e-03s
Time in all call to aesara.grad() 1.848292e-02s
Time since aesara import 2.864s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 aesara.tensor.blas_c.CGemv
33.1% 97.7% 0.384s 4.79e-06s C 80001 9 aesara.tensor.elemwise.Elemwise
1.0% 98.6% 0.011s 1.14e-06s C 10000 1 aesara.tensor.elemwise.Sum
0.7% 99.4% 0.009s 2.85e-07s C 30001 4 aesara.tensor.elemwise.DimShuffle
0.3% 99.7% 0.004s 3.64e-07s C 10001 2 aesara.tensor.basic.AllocEmpty
0.3% 100.0% 0.004s 1.78e-07s C 20001 3 aesara.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 CGemv{inplace}
18.7% 83.2% 0.217s 2.17e-05s C 10000 1 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
8.9% 92.1% 0.103s 1.03e-05s C 10000 1 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
4.3% 96.4% 0.050s 4.98e-06s C 10000 1 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
1.0% 97.4% 0.011s 1.14e-06s C 10000 1 Sum{acc_dtype=float64}
0.5% 97.9% 0.006s 2.83e-07s C 20001 3 InplaceDimShuffle{x}
0.4% 98.3% 0.004s 4.22e-07s C 10000 1 Elemwise{sub,no_inplace}
0.3% 98.6% 0.004s 3.70e-07s C 10000 1 Elemwise{neg,no_inplace}
0.3% 98.9% 0.004s 3.64e-07s C 10001 2 AllocEmpty{dtype='float32'}
0.3% 99.2% 0.004s 1.78e-07s C 20001 3 Shape_i{0}
0.2% 99.5% 0.003s 2.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.2% 99.7% 0.003s 2.65e-07s C 10000 1 Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.2% 99.9% 0.002s 1.98e-07s C 10000 1 Elemwise{Cast{float32}}
0.1% 100.0% 0.002s 1.54e-07s C 10000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 100.0% 0.000s 4.77e-06s C 1 1 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
34.0% 34.0% 0.394s 3.94e-05s 10000 7 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
30.5% 64.5% 0.353s 3.53e-05s 10000 15 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
18.7% 83.2% 0.217s 2.17e-05s 10000 12 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
8.9% 92.1% 0.103s 1.03e-05s 10000 13 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
4.3% 96.4% 0.050s 4.98e-06s 10000 11 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
1.0% 97.4% 0.011s 1.14e-06s 10000 14 Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
0.4% 97.8% 0.004s 4.22e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.3% 98.1% 0.004s 3.76e-07s 10000 0 InplaceDimShuffle{x}(b)
0.3% 98.4% 0.004s 3.70e-07s 10000 10 Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.3% 98.7% 0.004s 3.64e-07s 10000 5 AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.2% 99.0% 0.003s 2.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.2% 99.2% 0.003s 2.65e-07s 10000 9 Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.2% 99.4% 0.002s 2.21e-07s 10000 1 Shape_i{0}(x)
0.2% 99.6% 0.002s 1.98e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.2% 99.7% 0.002s 1.90e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.1% 99.9% 0.002s 1.54e-07s 10000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.1% 100.0% 0.001s 1.34e-07s 10000 3 Shape_i{0}(y)
0.0% 100.0% 0.000s 3.89e-05s 1 3 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
0.0% 100.0% 0.000s 4.77e-06s 1 4 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.000s 1.19e-06s 1 0 InplaceDimShuffle{x}(b)
... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
# 2.2 Profiling for GPU computations
# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 AESARA_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D
Results were produced using a GeForce GTX TITAN X
# Profiling summary for all functions:
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 4.181247e+00s
Time in Function.fn.__call__: 4.081113e+00s (97.605%)
Time in thunks: 3.915566e+00s (93.646%)
Total compile time: 9.256095e+00s
Number of Apply nodes: 21
Aesara Optimizer time: 9.996419e-01s
Aesara validate time: 6.523132e-03s
Aesara Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
Import time 4.228115e-03s
Time in all call to aesara.grad() 3.286195e-02s
Time since aesara import 15.415s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 aesara.sandbox.gpuarray.blas.GpuGemv
29.8% 89.3% 1.166s 1.30e-05s C 90001 10 aesara.sandbox.gpuarray.elemwise.GpuElemwise
4.1% 93.4% 0.162s 8.10e-06s C 20001 3 aesara.sandbox.gpuarray.basic_ops.HostFromGpu
3.3% 96.7% 0.131s 1.31e-05s C 10000 1 aesara.sandbox.gpuarray.elemwise.GpuCAReduceCuda
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 aesara.sandbox.gpuarray.basic_ops.GpuFromHost
0.8% 99.1% 0.033s 1.09e-06s C 30001 4 aesara.sandbox.gpuarray.elemwise.GpuDimShuffle
0.7% 99.8% 0.026s 2.59e-06s C 10001 2 aesara.sandbox.gpuarray.basic_ops.GpuAllocEmpty
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 aesara.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 GpuGemv{inplace=True}
4.1% 63.6% 0.162s 8.10e-06s C 20001 3 HostFromGpu(gpuarray)
4.0% 67.6% 0.157s 1.57e-05s C 10000 1 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
3.8% 71.4% 0.149s 1.49e-05s C 10000 1 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
3.7% 75.1% 0.144s 1.44e-05s C 10000 1 GpuElemwise{sub,no_inplace}
3.6% 78.7% 0.141s 1.41e-05s C 10000 1 GpuElemwise{gt,no_inplace}
3.4% 82.1% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Cast{float32}}[]<gpuarray>
3.4% 85.5% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
3.3% 88.8% 0.131s 1.31e-05s C 10000 1 GpuCAReduceCuda{add}
2.9% 91.7% 0.112s 1.12e-05s C 10000 1 GpuElemwise{neg,no_inplace}
2.6% 94.3% 0.102s 1.02e-05s C 10000 1 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
2.5% 96.7% 0.096s 9.63e-06s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 GpuFromHost<None>
0.7% 99.0% 0.026s 2.59e-06s C 10001 2 GpuAllocEmpty{dtype='float32', context_name=None}
0.5% 99.5% 0.021s 1.06e-06s C 20001 3 InplaceGpuDimShuffle{x}
0.3% 99.8% 0.011s 1.14e-06s C 10000 1 InplaceGpuDimShuffle{1,0}
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 Shape_i{0}
0.0% 100.0% 0.000s 2.00e-05s C 1 1 GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
55.0% 55.0% 2.154s 2.15e-04s 10000 7 GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
4.5% 59.5% 0.176s 1.76e-05s 10000 18 GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
4.0% 63.5% 0.157s 1.57e-05s 10000 12 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
3.8% 67.3% 0.149s 1.49e-05s 10000 15 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
3.7% 71.0% 0.144s 1.44e-05s 10000 4 GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
3.6% 74.6% 0.141s 1.41e-05s 10000 16 GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
3.4% 78.0% 0.133s 1.33e-05s 10000 10 GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
3.4% 81.4% 0.133s 1.33e-05s 10000 9 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
3.3% 84.7% 0.131s 1.31e-05s 10000 17 GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
2.9% 87.5% 0.112s 1.12e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
2.6% 90.1% 0.102s 1.02e-05s 10000 20 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
2.5% 92.6% 0.096s 9.63e-06s 10000 13 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
2.3% 94.9% 0.090s 9.04e-06s 10000 19 HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
1.8% 96.7% 0.072s 7.16e-06s 10000 14 HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
1.6% 98.3% 0.061s 6.10e-06s 10000 6 GpuFromHost<None>(Shape_i{0}.0)
0.7% 99.0% 0.026s 2.59e-06s 10000 5 GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
0.3% 99.3% 0.013s 1.33e-06s 10000 0 InplaceGpuDimShuffle{x}(b)
0.3% 99.6% 0.011s 1.14e-06s 10000 2 InplaceGpuDimShuffle{1,0}(x)
0.2% 99.8% 0.008s 7.94e-07s 10000 8 InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
0.1% 99.9% 0.005s 5.27e-07s 10000 1 Shape_i{0}(x)
... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
# 3. Conclusions
Examine and compare the 'Ops' summaries for the CPU and GPU runs. The transfer ops 'GpuFromHost' and 'HostFromGpu' usually
consume a large amount of extra time by themselves, but you can minimize their overhead by making as few data transfers between the GPU and CPU as possible.
Notice that each GPU op consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.
"""
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Aesara has a feature that allows the use of multiple GPUs at the same
time in one function. The multiple-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that it works correctly.
In order to keep a reasonably high level of abstraction, you do not
refer to device names directly for multiple-GPU use. Instead, you
refer to what we call context names. These are then mapped to
devices using the Aesara configuration, which allows portability of
models between machines.
.. warning::
The code is rather new and is still considered experimental at this
point. It has been tested and seems to perform correctly in all
cases observed, but make sure to double-check your results before
publishing a paper or anything of the sort.
.. note::
For data-parallelism, you probably are better using `platoon
<https://github.com/mila-udem/platoon>`_.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::
    dev0->cuda0;dev1->cuda1
Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Aesara. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects, such as 'cuda0' or 'opencl0:0'.
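The parsing rules above can be sketched in plain Python. This is only an illustrative re-implementation of the rules as stated; `parse_contexts` is a hypothetical helper, not part of Aesara's actual configuration code:

```python
def parse_contexts(spec):
    """Parse a context map like 'dev0->cuda0;dev1->cuda1' into a dict.

    Illustrative sketch only: mappings are separated by ';', and each
    mapping is a context name, the two characters '->', and a device name.
    """
    mapping = {}
    for entry in spec.split(';'):
        # The context name cannot contain '->', so the first occurrence
        # is the separator between name and device.
        name, sep, device = entry.partition('->')
        if not sep:
            raise ValueError("mapping %r is missing '->'" % entry)
        # Names starting with 'cuda' or 'opencl' are disallowed to avoid
        # confusion with device names.
        if name.startswith(('cuda', 'opencl')):
            raise ValueError("invalid context name %r" % name)
        mapping[name] = device
    return mapping

print(parse_contexts('dev0->cuda0;dev1->cuda1'))
# -> {'dev0': 'cuda0', 'dev1': 'cuda1'}
```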
.. note::
    Since the syntax contains several shell special characters,
    defining this on the command line requires proper quoting, like this:
.. code-block:: shell

    $ AESARA_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Aesara will print the mappings as they are
defined. The output will look like this:
.. code-block:: bash

    $ AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import aesara'
    Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
    Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea: Aesara often
assumes that different context names are on different devices and
will optimize accordingly, so you may get faster performance with a
single name and a single device.
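For example, a model written for the two names 'dev0' and 'dev1' can still run on a single-GPU machine by mapping both names to the same device. This is a configuration fragment derived from the rules above, not output from a real run:

```shell
AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda0"
```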
.. note::
    Multi-GPU operation often requires or assumes that all the GPUs
    involved are equivalent. That is not the case for this
    implementation. Since the user has the task of distributing the
    jobs across the different devices, a model can be built on the
    assumption that one of the GPUs is slower or has less memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
that performs two dot products on two different GPUs.
.. code-block:: python

    import numpy

    import aesara

    v01 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v02 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v11 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')
    v12 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')

    f = aesara.function([], [aesara.tensor.dot(v01, v02),
                             aesara.tensor.dot(v11, v12)])

    f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input that is not on the right device.
However, you may want some explicit control over where and how these
transfers are done. This is done by using the :meth:`transfer` method
that is present on variables. It works for moving data between GPUs
and also between the host and the GPUs. Here is an example.
.. code-block:: python

    import aesara

    v = aesara.tensor.fmatrix()

    # Move to the device associated with 'gpudev'
    gv = v.transfer('gpudev')

    # Move back to the cpu
    cv = gv.transfer('cpu')
Of course, you can mix transfers and operations in any order you
choose. However, you should try to minimize transfer operations
because they introduce overhead that may reduce performance.