Commit f7a506ff authored by Maxim Kochurov, committed by Brandon T. Willard

Remove gpuarray references from documentation

Parent cc584d6c
@@ -14,11 +14,8 @@ Acknowledgements
 * The developers of `Theano <https://github.com/Theano/Theano>`_
 * All `Aesara contributors <https://github.com/aesara-devs/aesara/graphs/contributors>`_.
 * All Theano users that have given us feedback.
-* The GPU implementation of tensordot is based on code from Tijmen
-  Tieleman's `gnumpy <http://www.cs.toronto.edu/~tijmen/gnumpy.html>`_
 * Our random number generator implementation on CPU and GPU uses the MRG31k3p algorithm that is described in:
   P. L'Ecuyer and R. Touzin, `Fast Combined Multiple Recursive Generators with Multipliers of the form a = +/- 2^d +/- 2^e <http://www.informs-sim.org/wsc00papers/090.PDF>`_, Proceedings of the 2000 Winter Simulation Conference, Dec. 2000, 683--689.
   We were authorized by Pierre L'Ecuyer to copy/modify his Java implementation in the `SSJ <http://www.iro.umontreal.ca/~simardr/ssj/>`_ software and to relicense it under BSD 3-Clauses in Theano.
-* A better GPU memory allocator :attr:`CNMeM <config.lib.cnmem>` was included in Theano in the previous GPU back-end. It is still in the history, but not in the current version. It has the same license.
@@ -6,12 +6,10 @@ Extending Aesara with a C :class:`Op`
 =====================================

-This tutorial covers how to extend Aesara with an :class:`Op` that offers a C
-implementation. It does not cover :class:`Op`\s that run on a GPU but it does introduce
-many elements and concepts which are relevant for GPU :class:`Op`\s. This tutorial is
-aimed at individuals who already know how to extend Aesara (see tutorial
-:ref:`creating_an_op`) by adding a new :class:`Op` with a Python implementation
-and will only cover the additional knowledge required to also produce :class:`Op`\s
-with C implementations.
+This tutorial covers how to extend Aesara with an :class:`Op` that offers a C
+implementation. This tutorial is aimed at individuals who already know how to
+extend Aesara (see tutorial :ref:`creating_an_op`) by adding a new :class:`Op`
+with a Python implementation and will only cover the additional knowledge
+required to also produce :class:`Op`\s with C implementations.

 Providing an Aesara :class:`Op` with a C implementation requires interacting with
 Python's C-API and Numpy's C-API. Thus, the first step of this tutorial is to
@@ -927,7 +925,7 @@ discussed below.
 further below.

 For every input which has a :attr:`dtype` attribute (this means
-Tensors, and equivalent types on GPU), the following macros will be
+Tensors), the following macros will be
 defined unless your `Op` class has an :attr:`Op.check_input` attribute
 defined to False. In these descriptions 'i' refers to the position
 (indexed from 0) in the input array.
@@ -1035,8 +1033,6 @@ When debugging C code, it can be useful to use GDB for code compiled
 by Aesara.

 For this, you must enable the Aesara flag `cmodule__remove_gxx_opt=True`.
-For the GPU, you must add the second flag `nvcc.flags=-g` (it slows
-down computation on the GPU, but it is enabled by default on the CPU).

 Then you must start Python inside GDB and in it start your Python
 process:
......
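As a concrete sketch of the GDB workflow just described (the flag name comes from the section above; ``myscript.py`` is a hypothetical placeholder for your own script):

```shell
# Disable g++ optimization so Aesara's compiled C modules keep usable debug info
# (flag taken from the section above; 'myscript.py' is a placeholder).
export AESARA_FLAGS='cmodule__remove_gxx_opt=True'

# Then start Python under GDB and run your process from there, e.g.:
#   gdb --args python myscript.py
#   (gdb) run

echo "$AESARA_FLAGS"
```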
@@ -824,10 +824,10 @@ will not be accepted.
 :class:`NanGuardMode` helps users find where NaNs appear in the graph. But
 sometimes, we want some variables to not be checked. For example, in
-the old GPU back-end, we use a float32 :class:`CudaNdarray` to store the MRG
+the old GPU back-end, we used a float32 :class:`CudaNdarray` to store the MRG
 random number generator state (they are integers). So if :class:`NanGuardMode`
-check it, it will generate false positive. Another case is related to
+checked it, it would generate a false positive. Another case is related to
-:class:`[Gpu]AllocEmpty` or some computation on it (like done by :class:`Scan`).
+:class:`AllocEmpty` or some computations on it (like done by :class:`Scan`).
 You can tell :class:`NanGuardMode` not to check a variable with:
 :attr:`variable.tag.nan_guard_mode_check`. Also, this tag automatically
......
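The opt-out tag convention can be illustrated with a small pure-Python sketch. This is not Aesara's actual ``NanGuardMode`` implementation, only the skip-if-tagged logic the passage describes; ``Variable`` and ``nan_guard`` here are hypothetical stand-ins:

```python
import math

class Variable:
    """Toy stand-in for an Aesara variable: a value plus a tag namespace."""
    class Tag:
        pass

    def __init__(self, value):
        self.value = value
        self.tag = Variable.Tag()

def nan_guard(variables):
    """Return the variables containing NaN, honoring the opt-out tag."""
    flagged = []
    for var in variables:
        # Variables tagged with nan_guard_mode_check = False are skipped,
        # mirroring the tag convention described above.
        if getattr(var.tag, "nan_guard_mode_check", True) and math.isnan(var.value):
            flagged.append(var)
    return flagged

rng_state = Variable(float("nan"))   # e.g. integer RNG state stored as floats
rng_state.tag.nan_guard_mode_check = False
bad = Variable(float("nan"))
assert nan_guard([rng_state, bad]) == [bad]
```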
@@ -114,7 +114,7 @@ prefix. The complete list can be found in the documentation for
 Allows specifying a particular compiler. This will force this compiler for
 the current compilation block (a particular :class:`Op` or the full
-graph). This is used for the GPU code.
+graph).

 .. method:: c_code_cache_version()
@@ -527,10 +527,9 @@ You can implement :meth:`COp.c_code` for this :class:`Op`. It is registered as f
 In your C code, you should use ``%(iname)s`` and ``%(oname)s`` to represent
 the C variable names of the :class:`DeepCopyOp` input and output
-respectively. See an example for the type ``GpuArrayType`` (GPU
-array) in the file ``aesara/gpuarray/type.py``. The version
-parameter is what is returned by :meth:`DeepCopyOp.c_code_cache_version`. By
-default, it will recompile the C code for each process.
+respectively. The version parameter is what is returned by
+:meth:`DeepCopyOp.c_code_cache_version`. By default, it will recompile the C
+code for each process.

 :class:`ViewOp`
 ===============
......
@@ -829,9 +829,9 @@ Explanations:
 * ``Total compile time: 1.131874e+01s`` gives the total time spent inside `aesara.function`.
 * ``Number of Apply nodes: 50`` means that after optimization, there are 50 apply nodes in the graph.
-* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically / work on GPU /...
+* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically /...
 * ``Aesara validate time: 2.790451e-02s`` means that we spent 2.8e-2s in the *validate* subset of the optimization phase.
-* ``Aesara Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in *linker* phase of ``aesara.function``.
+* ``Aesara Linker time (includes C code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in the *linker* phase of ``aesara.function``.
 * ``Import time 1.153541e-02s`` is a subset of the linker time where we import the compiled module.
 * ``Time in all call to aesara.grad() 4.732513e-02s`` tells us that we spent a total of 4.7e-2s in all calls to ``aesara.grad``. This is outside of the calls to ``aesara.function``.
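The report the bullets above explain comes from Aesara's built-in profiler. Assuming the standard ``profile`` flag (inherited from Theano), a report like this might be produced with:

```shell
# Run a script with profiling enabled; the timing report is printed at exit.
# '<myscript>.py' is a placeholder, as elsewhere in these docs.
AESARA_FLAGS='profile=True' python <myscript>.py
```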
......
@@ -337,8 +337,7 @@ computation is carried out. The way optimizations work in Aesara is by
 identifying and replacing certain patterns in the graph with other specialized
 patterns that produce the same results but are either faster or more
 stable. Optimizations can also detect identical subgraphs and ensure that the
-same values are not computed twice or reformulate parts of the graph to a GPU
-specific version.
+same values are not computed twice.

 For example, one (simple) optimization that Aesara uses is to replace
 the pattern :math:`\frac{xy}{y}` by :math:`x`.
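This kind of pattern rewrite can be sketched on a toy expression tree. This is not Aesara's rewriter, just an illustration of the :math:`\frac{xy}{y} \rightarrow x` rule (which a real system would only apply under assumptions such as :math:`y \neq 0`):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Mul:
    left: object
    right: object

@dataclass(frozen=True)
class Div:
    num: object
    den: object

def simplify(node):
    """One toy rewrite in the spirit described above: (x * y) / y -> x."""
    if isinstance(node, Div) and isinstance(node.num, Mul) and node.num.right == node.den:
        return node.num.left
    return node

x, y = Var("x"), Var("y")
assert simplify(Div(Mul(x, y), y)) == x        # pattern matched: rewritten to x
assert simplify(Div(x, y)) == Div(x, y)        # no match: unchanged
```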
......
@@ -20,8 +20,7 @@ Implementing an Aesara scalar Op allows that scalar operation to be reused
 by our elemwise operations on tensors. If the scalar operation has C code, the
 elemwise implementation will automatically have C code too. This
 will enable the fusion of elemwise operations using your new scalar
-operation. It can also reuse the GPU elemwise code. It is similar for
-reduction operations.
+operation. It is similar for reduction operations.

 Be careful about some possible problems in the definition of the
 ``grad`` method, and about dependencies that may not be available. In
@@ -125,11 +124,7 @@ Random distribution
 We have 3 base random number generators. One that wraps NumPy's random
 generator, one that implements MRG31k3p and one that wraps CURAND.
-The fastest, but less developed, is CURAND. It works only on CUDA-enabled
-GPUs. It does not work on the CPU and it has fewer random distributions
-implemented.
-The recommended and 2nd faster is MRG. It works on the GPU and CPU and
+The recommended and 2nd fastest is MRG. It works on the CPU and
 has more implemented distributions.

 The slowest is our wrapper on NumPy's random generator.
......
@@ -194,12 +194,11 @@ default values.
 :noindex:

 If filter_inplace is defined, it will be called instead of
-filter() This is to allow reusing the old allocated memory. As
-of this writing this is used only when we transfer new data to a
-shared variable on the gpu.
+filter(). This is to allow reusing the old allocated memory. This was used
+only when new data was transferred to a shared variable on a GPU.

-``storage`` will be the old value. i.e. The old numpy array,
-CudaNdarray, ...
+``storage`` will be the old value (e.g. the old `ndarray`).

 .. method:: is_valid_value(value)
 :noindex:
......
@@ -6,17 +6,13 @@
 Frequently Asked Questions
 ==========================

-Does Aesara support Python 3?
-------------------------------
-
-We support both Python 2 >= 2.7 and Python 3 >= 3.4.

 Output slight numerical difference
 ----------------------------------

-Sometimes when you compare the output of Aesara using different
-Aesara flags, Aesara versions, CPU and GPU or with other software like
-NumPy, you will see small numerical differences.
+Sometimes when you compare the output of Aesara using different Aesara flags,
+Aesara versions, CPU and GPU devices, or with other software like NumPy, you
+will see small numerical differences.

 This is normal. Floating point numbers are approximations of real
 numbers. This is why doing a+(b+c) vs (a+b)+c can give small
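The non-associativity of floating-point addition mentioned above can be checked directly in plain Python:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one grouping of the same sum
right = a + (b + c)  # the other grouping

# The two groupings differ in the last bits of the result,
# even though the mathematical sums are identical.
assert left != right
assert abs(left - right) < 1e-15
```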
@@ -53,9 +49,9 @@ the flag ``mode=FAST_COMPILE`` which instructs Aesara to skip most
 optimizations and disables the generation of any c/cuda code. This is useful
 for quickly testing a simple idea.

-If c/cuda code is necessary, as when using a GPU, the flag
+If C code is necessary, the flag
 ``optimizer=fast_compile`` can be used instead. It instructs Aesara to
-skip time consuming optimizations but still generate c/cuda code.
+skip time consuming optimizations but still generate C code.

 Similarly using the flag ``optimizer_excluding=inplace`` will speed up
 compilation by preventing optimizations that replace operations with a
@@ -95,11 +91,6 @@ garbage collection will keep all intermediate results' memory space to allow to
 reuse them during the next call to the same Aesara function, if they are of the
 correct shape. The shape could change if the shapes of the inputs change.

-.. note::
-
-    With :attr:`preallocate <config.gpuarray__preallocate>`, this isn't
-    very useful with GPU anymore.

 .. _unsafe_optimization:

 Unsafe optimization
@@ -173,11 +164,6 @@ but requires that all nodes in the graph have a C implementation:
 f = function([x], (x + 1.) * 2, mode=aesara.compile.mode.Mode(linker='c'))
 f(10.)

-New GPU backend using libgpuarray
----------------------------------
-
-The new aesara GPU backend (:ref:`gpuarray`) uses ``config.gpuarray__preallocate`` for GPU memory allocation.

 Related Projects
 ----------------
......
@@ -13,7 +13,4 @@ Supported platforms:
 install_windows
 install_centos6

-Once your setup is complete and if you installed the GPU libraries, head to :ref:`testing_the_gpu` to find how to verify
-everything is working properly.

 To update your current installation see :ref:`updating`.
@@ -12,23 +12,14 @@ Stable Installation
 With ``conda``
 ^^^^^^^^^^^^^^

-If you use conda, you can directly install both aesara and pygpu. Libgpuarray
-will be automatically installed as a dependency of pygpu.
+If you use conda, you can directly install aesara.

 .. code-block:: bash

-    conda install aesara pygpu
+    conda install aesara

-.. warning::
-
-    The Aesara developers do not maintain ``pygpu``, so compatibility isn't
-    guaranteed.

 With ``pip``
 ^^^^^^^^^^^^

-If you use pip, you have to install Aesara and libgpuarray separately.

 aesara
 ::::::
@@ -50,16 +41,6 @@ Install the latest stable version of Aesara with:
 If you encountered any trouble, head to the :ref:`troubleshooting` page.

-libgpuarray
-:::::::::::
-
-Download it with::
-
-    git clone https://github.com/Theano/libgpuarray.git
-    cd libgpuarray
-
-and then follow the `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.

 Bleeding-Edge Installation (recommended)
 ----------------------------------------
@@ -80,19 +61,6 @@ Install the latest, bleeding-edge, development version of Aesara with:
 If you encountered any trouble, head to the :ref:`troubleshooting` page.

-libgpuarray
-^^^^^^^^^^^
-
-Install the latest, development version of libgpuarray following the
-`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
-
-.. note::
-
-    Currently, you need ``libgpuarray`` version ``0.7.X`` that is not in conda default channel.
-    But you can install it with our own channel ``mila-udem`` (that only supports Python 2.7, 3.5 and 3.6)::
-
-        conda install -c mila-udem pygpu

 Developer Installation
 ----------------------
@@ -116,8 +84,3 @@ Install the developer version of Aesara with:
 source directory.

 If you encountered any trouble, head to the :ref:`troubleshooting` page.

-libgpuarray
-^^^^^^^^^^^
-
-See instructions for bleeding-edge installation about ``libgpuarray``.
@@ -17,21 +17,6 @@ details so that we can add alternative instructions.
 .. include:: requirements.inc

-.. _gpu_macos:
-
-.. attention::
-
-    For MacOS you should be able to follow the above instructions to
-    setup CUDA, but be aware of the following caveats:
-
-    * If you want to compile the CUDA SDK code, you may need to temporarily
-      revert back to Apple's gcc (``sudo port select gcc``) as their Makefiles
-      are not compatible with MacPort's gcc.
-
-    * If CUDA seems unable to find a CUDA-capable GPU, you may need to manually
-      toggle your GPU on, which can be done with
-      `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.

 .. attention::

 Aesara officially supports only clang on OS X. This can be installed
......
@@ -11,8 +11,6 @@ Ubuntu Installation Instructions
 from GitHub, please make sure you are reading `the latest version of this
 page <http://deeplearning.net/software/aesara_versions/dev/install_ubuntu.html>`_.

-.. _gpu_linux:

 .. |PythonDistRecommended| replace:: The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below)
 .. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
 .. |CompilerName| replace:: ``g++``
@@ -28,14 +26,13 @@ Prerequisites through System Packages (not recommended)
 If you want to acquire the requirements through your system packages
 and install them system wide follow these instructions:

-For Ubuntu 16.04 with cuda 7.5
+For Ubuntu 16.04

 .. code-block:: bash

     sudo apt-get install python-numpy python-scipy python-dev python-pip python-pytest g++ libopenblas-dev git graphviz
     sudo pip install Aesara

-    # cuda 7.5 don't support the default g++ version. Install an supported version and make it the default.
     sudo apt-get install g++-4.9
     sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
......
@@ -30,26 +30,10 @@ Install requirements and optional packages
 * Arguments between <...> are optional.
 * ``m2w64-toolchain`` package provides a fully-compatible version of GCC and is thus highly recommended.
-* ``git`` package installs git source control through conda, which is required for the development versions of Aesara and libgpuarray
+* ``git`` package installs git source control through conda, which is required for the development version of Aesara

-.. _gpu_windows:
-
-Install and configure the GPU drivers (recommended)
----------------------------------------------------
-
-.. warning::
-
-    OpenCL support is still minimal for now.
-
-Install CUDA drivers
-^^^^^^^^^^^^^^^^^^^^
-
-Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
-to install the CUDA driver and the CUDA Toolkit.
-
-You must reboot the computer after the driver installation.
-
-.. Installation of Aesara and libgpuarray.
+.. Installation of Aesara.

 .. include:: install_generic.inc
 :start-after: .. _install_generic:
@@ -73,7 +57,3 @@ generic guidelines to get a working environment:
 path`` option.

 3. Enable OpenMP support by checking the option ``openmp support
 option``.

-* Install CUDA with the same instructions as above.
-* Install the latest, development version of libgpuarray following the
-  `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
@@ -30,12 +30,12 @@
 By default, return a copy of the data. If ``borrow=True`` (and
 ``return_internal_type=False``), it may still return a copy.
-For tensor, it will always return a ndarray by default, so if
-the data is on the GPU, it will return a copy, but if the data
+For tensor, it will always return an `ndarray` by default, so if
+the data is on another device, it will return a copy, but if the data
 is on the CPU, it will return the original data. If you do
 ``borrow=True`` and ``return_internal_type=True``, it will
-always return the original data, not a copy, but this can be a
-GPU object.
+always return the original data, not a copy, but this can be a non-`ndarray`
+type of object.

 .. method:: set_value(self, new_value, borrow=False)
......
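The copy-vs-alias contract described above can be sketched in a few lines of pure Python. ``SharedValue`` is a hypothetical toy model, not Aesara's real shared-variable class:

```python
import copy

class SharedValue:
    """Toy model of the get_value(borrow=...) contract described above."""
    def __init__(self, data):
        self._data = data

    def get_value(self, borrow=False):
        # borrow=False always returns a copy; borrow=True may return the
        # internal object directly (here, for plain host data, it does).
        return self._data if borrow else copy.deepcopy(self._data)

s = SharedValue([1, 2, 3])

safe = s.get_value()          # a copy: mutating it cannot corrupt shared state
safe[0] = 99
assert s.get_value() == [1, 2, 3]

aliased = s.get_value(borrow=True)   # the internal object itself
aliased[0] = 99
assert s.get_value() == [99, 2, 3]
```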
@@ -51,11 +51,11 @@ Environment Variables
 .. code-block:: bash

-    AESARA_FLAGS='floatX=float32,device=cuda0,gpuarray__preallocate=1' python <myscript>.py
+    AESARA_FLAGS='floatX=float32' python <myscript>.py

 If a value is defined several times in ``AESARA_FLAGS``,
 the right-most definition is used, so, for instance, if
-``AESARA_FLAGS='device=cpu,device=cuda0'`` is set, then ``cuda0`` will be
+``AESARA_FLAGS='floatX=float32,floatX=float64'`` is set, then ``float64`` will be
 used.
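The right-most-wins rule falls out naturally when such a flag string is folded into a dictionary. A minimal sketch (``parse_flags`` is a hypothetical helper, not Aesara's actual parser):

```python
def parse_flags(flags_string):
    """Parse an AESARA_FLAGS-style string; later assignments overwrite earlier ones."""
    parsed = {}
    for item in flags_string.split(","):
        if item:
            key, _, value = item.partition("=")
            parsed[key.strip()] = value.strip()   # dict insertion: right-most wins
    return parsed

# Matches the example above: the right-most definition of floatX is used.
assert parse_flags("floatX=float32,floatX=float64") == {"floatX": "float64"}
```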
.. envvar:: AESARARC .. envvar:: AESARARC
@@ -72,15 +72,11 @@ Environment Variables
 floatX = float32
 device = cuda0

-[gpuarray]
-preallocate = 1
-
 Configuration attributes that are available directly in ``config``
-(e.g. ``config.device``, ``config.mode``) should be defined in the
-``[global]`` section.
-
-Attributes from a subsection of ``config`` (e.g. ``config.gpuarray__preallocate``,
-``config.dnn__conv__algo_fwd``) should be defined in their corresponding
-section (e.g. ``[gpuarray]``, ``[dnn.conv]``).
+(e.g. ``config.mode``) should be defined in the ``[global]`` section.
+Attributes from a subsection of ``config``
+(e.g. ``config.dnn__conv__algo_fwd``) should be defined in their
+corresponding section (e.g. ``[dnn.conv]``).

 Multiple configuration files can be specified by separating them with ``':'``
 characters (as in ``$PATH``). Multiple configuration files will be merged,
@@ -105,20 +101,7 @@ import ``aesara`` and print the config variable, as in:
 .. attribute:: device

-String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-``'opencl0:0'``, ``'opencl0:1'``, ...
-
-Default device for computations. If ``'cuda*'``, change the default to try
-to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
-the OpenCL libraries will be used. To let the driver select the device,
-use ``'cuda'`` or ``'opencl'``. If we are not able to use the GPU,
-either we fall back on the CPU, or an error is raised, depending
-on the :attr:`force_device` flag.
-
-This flag's value cannot be modified during the program execution.
-
-Do not use upper case letters; only lower case, even if NVIDIA uses
-capital letters.
+String value: either ``'cpu'``
.. attribute:: force_device .. attribute:: force_device
...@@ -126,29 +109,6 @@ import ``aesara`` and print the config variable, as in: ...@@ -126,29 +109,6 @@ import ``aesara`` and print the config variable, as in:
Default: ``False`` Default: ``False``
If ``True`` and ``device=gpu*``, Aesara raises an error when it cannot
use the specified :attr:`device`. If ``True`` and ``device=cpu``,
Aesara disables the GPU. If ``False`` and ``device=gpu*``, and when the
specified device cannot be used, Aesara emits a warning and falls back to
the CPU.
This flag's value cannot be modified during the program execution.
.. attribute:: init_gpu_device
String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
``'opencl0:0'``, ``'opencl0:1'``, ...
Initialize the gpu device to use.
When its value is ``'cuda*'`` or ``'opencl*'``, the Aesara
flag :attr:`device` must be ``'cpu'``.
Unlike :attr:`device`, setting this flag to a specific GPU will not
make Aesara attempt to use the device by default. More specifically, it
will **not** move computations, nor shared variables, to the specified GPU.
This flag can be used to run GPU-specific tests on a particular GPU, instead
of the default one.
This flag's value cannot be modified during the program execution. This flag's value cannot be modified during the program execution.
.. attribute:: print_active_device .. attribute:: print_active_device
@@ -157,7 +117,7 @@ import ``aesara`` and print the config variable, as in:
 Default: ``True``

-Print the active device when the GPU device is initialized.
+Print the active device when the device is initialized.

 .. attribute:: floatX
@@ -186,10 +146,7 @@ import ``aesara`` and print the config variable, as in:
 Default: ``'default'``

 If ``more``, sometimes Aesara will select :class:`Op` implementations that
-are more "deterministic", but slower. In particular, on the GPU,
-Aesara will avoid using ``AtomicAdd``. Sometimes Aesara will still use
-non-deterministic implementations, e.g. when there isn't a GPU :class:`Op`
-implementation that is deterministic. See the ``dnn.conv.algo*``
+are more "deterministic", but slower. See the ``dnn.conv.algo*``
 flags for more cases.

 .. attribute:: allow_gc
@@ -207,9 +164,6 @@ import ``aesara`` and print the config variable, as in:
 functions with many fast :class:`Op`\s, but it also increases Aesara's memory
 usage.

-.. note:: If :attr:`config.gpuarray__preallocate` is the default value
-    or not disabled ``(-1)``, this is not useful anymore on the GPU.

 .. attribute:: config.scan__allow_output_prealloc

 Bool value, either ``True`` or ``False``
@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in:
 <https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__
 library, which is faster than the standard ``libm``.

-.. attribute:: config.gpuarray__preallocate
-
-    Float value
-
-    Default: 0 (Preallocation of size 0, only cache the allocation)
-
-    Controls the preallocation of memory with the gpuarray backend.
-
-    This value represents the start size (either in MB or the fraction
-    of total GPU memory) of the memory pool. If more memory is needed,
-    Aesara will try to obtain more, but this can cause memory
-    fragmentation.
-
-    A negative value will completely disable the allocation cache.
-    This can have a severe impact on performance and should not be
-    used outside of debugging.
-
-    * < 0: disabled
-    * 0 <= N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
-    * > 1: use this number in megabytes (MB) of memory.
-
-    .. note::
-
-        This could cause memory fragmentation, so, if you have a memory
-        error while using the cache, try to allocate more memory at
-        the start, or disable it.
-
-    .. note::
-
-        The clipping at 95% can be bypassed by specifying the exact
-        number of megabytes. If more then 95% are needed, it will try
-        automatically to get more memory. But this can cause
-        fragmentation, see note above.
-
-.. attribute:: config.gpuarray__sched
-
-    String value: ``'default'``, ``'multi'``, ``'single'``
-
-    Default: ``'default'``
-
-    Control the stream mode of contexts.
-
-    The sched parameter passed for context creation to ``pygpu``. With
-    CUDA, using ``"multi"`` means using the parameter
-    ``cudaDeviceScheduleBlockingSync``. This is useful to lower the CPU overhead
-    when waiting for a GPU.
-
-.. attribute:: config.gpuarray__single_stream
-
-    Boolean value
-
-    Default: ``True``
-
-    Control the stream mode of contexts.
-
-    If your computations consist of mostly small arrays, using
-    single-stream will avoid the synchronization overhead and usually
-    be faster. For larger arrays it does not make a difference yet.
-
-.. attribute:: config.gpuarray__cache_path
-
-    Default: ``config.compiledir``/gpuarray_kernels
-
-    Directory to cache pre-compiled kernels for the gpuarray backend.

 .. attribute:: linker

 String value: ``'c|py'``, ``'py'``, ``'c'``, ``'c|py_nogc'``
......
...@@ -13,8 +13,7 @@ import ``aesara.sparse`` to enable it. ...@@ -13,8 +13,7 @@ import ``aesara.sparse`` to enable it.
The sparse module provides the same functionality as the tensor The sparse module provides the same functionality as the tensor
module. The difference lies under the covers because sparse matrices module. The difference lies under the covers because sparse matrices
do not store data in a contiguous array. Note that there are no GPU do not store data in a contiguous array. The sparse module has
implementations for sparse matrices in Aesara. The sparse module has
been used in: been used in:
- NLP: Dense linear transformations of sparse vectors. - NLP: Dense linear transformations of sparse vectors.
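To illustrate why sparse matrices do not live in one contiguous dense block, here is a pure-NumPy mock of the CSR (compressed sparse row) layout and a matrix-vector product over it. This is an educational sketch, not Aesara's actual implementation:

```python
import numpy as np

# CSR stores only the nonzeros of this dense matrix, plus two index arrays:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0]]
data    = np.array([1., 2., 3., 4., 5.])  # nonzero values, row by row
indices = np.array([0, 2, 2, 0, 1])       # column of each nonzero
indptr  = np.array([0, 2, 3, 5])          # row i spans data[indptr[i]:indptr[i+1]]

def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR matrix by a dense vector, touching only the nonzeros."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

print(csr_matvec(data, indices, indptr, np.array([1., 1., 1.])))  # [3. 3. 9.]
```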
......
...@@ -29,51 +29,13 @@ The recommended user interfaces are: The recommended user interfaces are:
With these new interfaces, Aesara will automatically use the fastest With these new interfaces, Aesara will automatically use the fastest
implementation in many cases. On the CPU, the implementation is a GEMM implementation in many cases. On the CPU, the implementation is a GEMM
based one. On the GPU, there is a GEMM based and :ref:`cuDNN based one.
<libdoc_gpuarray_dnn>` version.
By default on the GPU, if cuDNN is available, it will be used,
otherwise we will fall back to using gemm based version (slower than
cuDNN in most cases and uses more memory). To get an error if cuDNN
can not be used, you can supply the Aesara flag ``dnn.enable=True``.
Either cuDNN and the gemm version can be disabled using the Aesara flags
``optimizer_excluding=conv_dnn`` and ``optimizer_excluding=conv_gemm``,
respectively. If both are disabled, it will raise an error.
For the cuDNN version, there are different algorithms with different
memory/speed trade-offs. Manual selection of the right one is very
difficult as it depends on the shapes and hardware. So it can change
for each layer. An auto-tuning mode exists and can be activated by
those flags: ``dnn__conv__algo_fwd=time_once``,
``dnn__conv__algo_bwd_data=time_once`` and
``dnn__conv__algo_bwd_filter=time_once``. Note, they are good mostly
when the shape do not change.
This auto-tuning has the inconvenience that the first call is much This auto-tuning has the inconvenience that the first call is much
slower as it tries and times each implementation it has. So if you slower as it tries and times each implementation it has. So if you
benchmark, it is important that you remove the first call from your benchmark, it is important that you remove the first call from your
timing. timing.
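The warm-up pattern above can be sketched as a small helper; the stand-in workload is hypothetical, but the structure (discard the first call, then time the rest) is exactly what the text recommends:

```python
import time

def benchmark(fn, n_calls=10):
    """Time fn, discarding the first (warm-up) call, which may include
    one-time costs such as compilation or implementation auto-tuning."""
    fn()  # warm-up call: excluded from the timing
    times = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# Usage with any compiled function f:  benchmark(lambda: f(x_value))
```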
Also, a meta-optimizer has been implemented for the gpu convolution
implementations to automatically choose the fastest implementation
for each specific convolution in your graph. For each instance, it will
compile and benchmark each applicable implementation and choose the
fastest one. It can be enabled using ``optimizer_including=conv_meta``.
The meta-optimizer can also selectively disable cudnn and gemm version
using the Aesara flag ``metaopt__optimizer_excluding=conv_dnn`` and
``metaopt__optimizer_excluding=conv_gemm`` respectively.
.. note::
Aesara had older user interface like
aesara.tensor.nnet.conv.conv2d. Do not use them anymore. They
will give you slower code and won't allow easy switch between CPU
and GPU computation. They also support less type of convolution.
Implementation Details Implementation Details
====================== ======================
...@@ -85,10 +47,6 @@ not need to read it. Aesara will select it for you. ...@@ -85,10 +47,6 @@ not need to read it. Aesara will select it for you.
- :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`. - :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`.
old 2d convolution. DO NOT USE ANYMORE. old 2d convolution. DO NOT USE ANYMORE.
- :func:`GpuCorrMM <aesara.gpuarray.blas.GpuCorrMM>`
This is a GPU-only 2d correlation implementation taken from
`caffe's CUDA implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_. It does not flip the kernel.
For each element in a batch, it first creates a For each element in a batch, it first creates a
`Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel. `Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters
...@@ -100,15 +58,8 @@ not need to read it. Aesara will select it for you. ...@@ -100,15 +58,8 @@ not need to read it. Aesara will select it for you.
This is a CPU-only 2d correlation implementation taken from This is a CPU-only 2d correlation implementation taken from
`caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_. `caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_.
It does not flip the kernel. It does not flip the kernel.
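The Toeplitz-then-``gemm`` strategy described above (often called im2col) can be sketched in plain NumPy. This is an illustrative single-channel mock, not Caffe's or Aesara's code:

```python
import numpy as np

def corr2d_im2col(img, kernel):
    """2D correlation (no kernel flip): unfold the image into a
    Toeplitz-like column matrix, then do a single gemm."""
    kh, kw = kernel.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    # Each column holds one kh*kw patch of the image.
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = img[i:i + kh, j:j + kw].ravel()
    # One matrix product replaces the whole sliding-window loop.
    return (kernel.ravel() @ cols).reshape(oh, ow)

img = np.arange(16.0).reshape(4, 4)
print(corr2d_im2col(img, np.ones((2, 2))))  # 2x2 sliding-window sums
```

Note the extra memory cost mentioned above: ``cols`` has shape ``(kh*kw, oh*ow)``, which is much larger than the input for big images.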
- :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library.
- Implemented operators for neural network 3D / video convolution: - Implemented operators for neural network 3D / video convolution:
- :func:`GpuCorr3dMM <aesara.gpuarray.blas.GpuCorr3dMM>`
This is a GPU-only 3d correlation relying on a Toeplitz matrix
and gemm implementation (see :func:`GpuCorrMM <aesara.sandbox.cuda.blas.GpuCorrMM>`)
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
``(no of channels * filter width * filter height * filter depth, output width * output height * output depth)``.
- :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>` - :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>`
This is a CPU-only 3d correlation implementation based on This is a CPU-only 3d correlation implementation based on
the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`). the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`).
...@@ -116,12 +67,6 @@ not need to read it. Aesara will select it for you. ...@@ -116,12 +67,6 @@ not need to read it. Aesara will select it for you.
replacement for nnet.conv3d. For convolutions done on CPU, replacement for nnet.conv3d. For convolutions done on CPU,
nnet.conv3d will be replaced by Corr3dMM. nnet.conv3d will be replaced by Corr3dMM.
- :func:`dnn_conv3d <aesara.gpuarray.dnn.dnn_conv3d>` GPU-only
3D convolution using NVIDIA's cuDNN library (as :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` but for 3d).
If cuDNN is available, by default, Aesara will replace all nnet.conv3d
operations with dnn_conv.
- :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>` - :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses the conv2d with data reshaping. Another conv3d implementation that uses the conv2d with data reshaping.
It is faster in some corner cases than conv3d. It flips the kernel. It is faster in some corner cases than conv3d. It flips the kernel.
......
...@@ -14,8 +14,7 @@ ...@@ -14,8 +14,7 @@
.. note:: .. note::
This interface is the preferred interface. It will be moved This interface is the preferred interface.
automatically to the GPU.
.. note:: .. note::
......
...@@ -42,7 +42,6 @@ Optimization o4 o3 o2 ...@@ -42,7 +42,6 @@ Optimization o4 o3 o2
========================================================= ============== === === ================= ============= ====== ========================================================= ============== === === ================= ============= ======
:term:`merge` x x x x x :term:`merge` x x x x x
:term:`constant folding<constant folding>` x x x x x :term:`constant folding<constant folding>` x x x x x
:term:`GPU transfer` x x x x x
:term:`shape promotion<shape promotion>` x x x :term:`shape promotion<shape promotion>` x x x
:term:`fill cut<fill cut>` x x x :term:`fill cut<fill cut>` x x x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x x x :term:`inc_subtensor srlz.<inc_subtensor serialization>` x x x
...@@ -247,32 +246,10 @@ Optimization o4 o3 o2 ...@@ -247,32 +246,10 @@ Optimization o4 o3 o2
This optimization compresses subgraphs of computationally cheap This optimization compresses subgraphs of computationally cheap
elementwise operations into a single Op that does the whole job in a elementwise operations into a single Op that does the whole job in a
single pass over the inputs (like loop fusion). This is a win when single pass over the inputs (like loop fusion). This is a win when
transfer from main memory to the CPU (or from graphics memory to the transfer from main memory to the CPU is a bottleneck.
GPU) is a bottleneck.
See :class:`FusionOptimizer` See :class:`FusionOptimizer`
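A conceptual NumPy sketch of what fusion buys (illustrative only; the real optimization generates one composite C kernel rather than a Python loop):

```python
import math
import numpy as np

x = np.linspace(-1.0, 1.0, 5)

# Unfused: each elementwise Op is a separate pass over memory and
# materializes a temporary array.
t1 = x * 2.0            # pass 1 (temporary)
t2 = t1 + 1.0           # pass 2 (temporary)
unfused = np.tanh(t2)   # pass 3

# Fused, conceptually: the whole elementwise subgraph is evaluated per
# element in a single pass, with no intermediate arrays.
fused = np.array([math.tanh(v * 2.0 + 1.0) for v in x])

assert np.allclose(unfused, fused)
```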
GPU transfer
The current strategy for choosing which expressions to evaluate on the
CPU and which to evaluate on the GPU is a greedy one. There are a
number of Ops ***TODO*** with GPU implementations and whenever we find
a graph copying data from GPU to CPU in order to evaluate an
expression that could have been evaluated on the GPU, we substitute
the GPU version of that Op for the CPU version. Likewise if we are
copying the output of a Op with a GPU implementation to the GPU,
then we substitute the GPU version for the CPU version. In this way, if all goes well,
this procedure will result in a graph with the following form:
1. copy non-shared inputs to GPU
2. carry out most/all computations on the GPU
3. copy output back to CPU
When using a GPU, :func:`shared()` will default to GPU storage for
'float32' ndarray arguments, and these shared variables act as seeds
for the greedy algorithm.
See :func:`aesara.sandbox.cuda.opt.*`.
local_log_softmax local_log_softmax
This is a stabilization optimization. This is a stabilization optimization.
It can happen due to rounding errors that the softmax probability of one value gets to 0. It can happen due to rounding errors that the softmax probability of one value gets to 0.
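The numerical issue and the stabilized form can be shown directly in NumPy (a sketch of the idea behind the rewrite, not Aesara's implementation):

```python
import numpy as np

def log_softmax_naive(x):
    # exp() overflows for large inputs, and a probability that rounds
    # to exactly 0 makes log(0) = -inf.
    p = np.exp(x) / np.exp(x).sum()
    return np.log(p)

def log_softmax_stable(x):
    # The stabilized form: shift by the max so the largest exponent is 0,
    # and never materialize the probabilities at all.
    z = x - x.max()
    return z - np.log(np.exp(z).sum())
```

For moderate inputs both agree; for ``x = [0, 1000, 2000]`` the naive version produces non-finite values while the stable one stays finite.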
......
...@@ -9,10 +9,6 @@ Requirements ...@@ -9,10 +9,6 @@ Requirements
.. _Python: http://www.python.org/ .. _Python: http://www.python.org/
.. _LaTeX: http://www.latex-project.org/ .. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/ .. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
.. _pycuda: https://mathema.tician.de/software/pycuda/
.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/
.. _warp-ctc: https://github.com/baidu-research/warp-ctc .. _warp-ctc: https://github.com/baidu-research/warp-ctc
Python_ == >= 3.7 Python_ == >= 3.7
...@@ -42,20 +38,6 @@ Requirements ...@@ -42,20 +38,6 @@ Requirements
`pydot-ng <https://github.com/pydot/pydot-ng>`_ `pydot-ng <https://github.com/pydot/pydot-ng>`_
To handle large picture for gif/images. To handle large picture for gif/images.
`NVIDIA CUDA drivers and SDK`_
**Highly recommended** Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
`libgpuarray`_
Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`).
`pycuda`_ and `skcuda`_
Required for some extra operations on the GPU like fft and
solvers. We use them to wrap cufft and cusolver. Quick install
``pip install pycuda scikit-cuda``. For cuda 8, the dev
version of skcuda (will be released as 0.5.2) is needed for
cusolver: ``pip install pycuda; pip install
git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda``.
`warp-ctc`_ `warp-ctc`_
Required for :ref:`Aesara CTC implementation Required for :ref:`Aesara CTC implementation
<libdoc_tensor_nnet_ctc>`. It is faster than using an <libdoc_tensor_nnet_ctc>`. It is faster than using an
...@@ -84,28 +66,3 @@ Install requirements and optional packages ...@@ -84,28 +66,3 @@ Install requirements and optional packages
conda install numpy scipy mkl pytest <sphinx> <pydot-ng> conda install numpy scipy mkl pytest <sphinx> <pydot-ng>
* Arguments between <...> are optional. * Arguments between <...> are optional.
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
1. Install CUDA drivers
* Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
* You must reboot the computer after the driver installation.
* Test that it was loaded correctly after the reboot, executing the
command `nvidia-smi` from the command line.
.. note::
Sanity check: The *bin* subfolder should contain an *nvcc*
program. This folder is called the *cuda root* directory.
2. Fix 'lib' path
* Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a
64-bit OS) to your ``$LD_LIBRARY_PATH`` environment
variable. Example: ``/usr/local/cuda/lib64``
...@@ -54,7 +54,7 @@ if __name__ == '__main__': ...@@ -54,7 +54,7 @@ if __name__ == '__main__':
pythonpath = os.pathsep.join([throot, pythonpath]) pythonpath = os.pathsep.join([throot, pythonpath])
sys.path[0:0] = [throot] # We must not use os.environ. sys.path[0:0] = [throot] # We must not use os.environ.
# Make sure we don't use gpu to compile documentation # Make sure we don't use other devices to compile documentation
env_th_flags = os.environ.get('AESARA_FLAGS', '') env_th_flags = os.environ.get('AESARA_FLAGS', '')
os.environ['AESARA_FLAGS'] = 'device=cpu,force_device=True' os.environ['AESARA_FLAGS'] = 'device=cpu,force_device=True'
......
...@@ -59,12 +59,6 @@ where X is far less than Y and Z (i.e. X << Y < Z). ...@@ -59,12 +59,6 @@ where X is far less than Y and Z (i.e. X << Y < Z).
This scenario arises when an operation requires allocation of a large contiguous This scenario arises when an operation requires allocation of a large contiguous
block of memory but no blocks of sufficient size are available. block of memory but no blocks of sufficient size are available.
GPUs do not have virtual memory and as such all allocations must be assigned to
a continuous memory region. CPUs do not have this limitation because or their
support for virtual memory. Multiple allocations on a GPU can result in memory
fragmentation which can makes it more difficult to find contiguous regions
of memory of sufficient size during subsequent memory allocations.
A known example is related to writing data to shared variables. When updating a A known example is related to writing data to shared variables. When updating a
shared variable Aesara will allocate new space if the size of the data does not shared variable Aesara will allocate new space if the size of the data does not
match the size of the space already assigned to the variable. This can lead to match the size of the space already assigned to the variable. This can lead to
...@@ -80,9 +74,6 @@ aesara.function returns a float64 when the inputs are float32 and int{32, 64} ...@@ -80,9 +74,6 @@ aesara.function returns a float64 when the inputs are float32 and int{32, 64}
Note that combining float32 and int{32, 64} inputs Note that combining float32 and int{32, 64} inputs
inside a function produces float64 output. inside a function produces float64 output.
Since the GPU can't compute this kind of output, it would be
preferable not to use those dtypes together.
To help you find where float64 are created, see the To help you find where float64 are created, see the
:attr:`warn_float64` Aesara flag. :attr:`warn_float64` Aesara flag.
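The promotion behaviour mirrors NumPy's type-promotion rules, so it can be previewed without Aesara:

```python
import numpy as np

x32 = np.arange(3, dtype=np.float32)
i64 = np.arange(3, dtype=np.int64)
i16 = np.arange(3, dtype=np.int16)

# float32 combined with int64 is promoted to float64 ...
assert (x32 * i64).dtype == np.float64

# ... while small integer types keep the result in float32.
assert (x32 * i16).dtype == np.float32
```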
...@@ -120,40 +111,21 @@ All Aesara tests should pass (skipped tests and known failures are normal). If ...@@ -120,40 +111,21 @@ All Aesara tests should pass (skipped tests and known failures are normal). If
some test fails on your machine, you are encouraged to tell us what went some test fails on your machine, you are encouraged to tell us what went
wrong in the GitHub issues. wrong in the GitHub issues.
.. warning::
Aesara's test should **NOT** be run with ``device=cuda``
or they will fail. The tests automatically use the gpu, if any, when
needed. If you don't want Aesara to ever use the gpu when running tests,
you can set :attr:`config.device` to ``cpu`` and
:attr:`config.force_device` to ``True``.
.. _slow_or_memory: .. _slow_or_memory:
Why is my code so slow/uses so much memory Why is my code so slow/uses so much memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are a few things you can easily do to change the trade-off There are a few things you can easily do to change the trade-off
between speed and memory usage. If nothing is said, this affects both between speed and memory usage.
CPU and GPU memory usage.
Could speed up and lower memory usage:
- :ref:`cuDNN <libdoc_gpuarray_dnn>` default cuDNN convolution use less
memory then Aesara version. But some flags allow it to use more
memory. GPU only.
Could raise memory usage but speed up computation: Could raise memory usage but speed up computation:
- :attr:`config.gpuarray__preallocate` = 1 # Preallocates the GPU memory
and then manages it in a smart way. Does not raise much the memory
usage, but if you are at the limit of GPU memory available you might
need to specify a lower value. GPU only.
- :attr:`config.allow_gc` =False - :attr:`config.allow_gc` =False
- :attr:`config.optimizer_excluding` =low_memory , GPU only for now.
Could lower the memory usage, but raise computation time: Could lower the memory usage, but raise computation time:
- :attr:`config.scan__allow_gc` = True # Probably not significant slowdown on the GPU if memory cache is not disabled - :attr:`config.scan__allow_gc` = True
- :attr:`config.scan__allow_output_prealloc` =False - :attr:`config.scan__allow_output_prealloc` =False
- Use :func:`batch_normalization() - Use :func:`batch_normalization()
<aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory <aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory
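The flag combinations above can be applied via ``AESARA_FLAGS`` before Aesara is imported; a sketch (the ``import aesara`` is commented out since this only demonstrates the setup):

```python
import os

# Trade memory for speed, or speed for memory, using the flags listed above.
speed_over_memory = "allow_gc=False"
memory_over_speed = "scan__allow_gc=True,scan__allow_output_prealloc=False"

os.environ["AESARA_FLAGS"] = speed_over_memory
# import aesara  # compiled functions now skip intermediate garbage collection
```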
...@@ -293,7 +265,7 @@ Aesara/BLAS speed test: ...@@ -293,7 +265,7 @@ Aesara/BLAS speed test:
python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py
This will print a table with different versions of BLAS/numbers of This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Aesara/NumPy threads on multiple CPUs. It will also print some Aesara/NumPy
configuration information. Then, it will print the running time of the same configuration information. Then, it will print the running time of the same
benchmarks for your installation. Try to find a CPU similar to yours in benchmarks for your installation. Try to find a CPU similar to yours in
the table, and check that the single-threaded timings are roughly the same. the table, and check that the single-threaded timings are roughly the same.
......
...@@ -194,33 +194,32 @@ makes it possible to expose Aesara's internal variables without a copy, then it ...@@ -194,33 +194,32 @@ makes it possible to expose Aesara's internal variables without a copy, then it
proceeds as fast as an in-place update. proceeds as fast as an in-place update.
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can ..
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth: When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes, * Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily. unnecessarily.
*Solution*: update to a newer version of Aesara. *Solution*: update to a newer version of Aesara.
* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly, * If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient. faster and more memory efficient.
*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source *Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size. data to make sure that every chunk is the same size.
* It is also worth mentioning that, current GPU copying routines * It is also worth mentioning that, current GPU copying routines
support only contiguous memory. So Aesara must make the value you support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host. extra copy of the data on the host.
*Solution*: make sure that the value *Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*. you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
(Further information on the current implementation of the GPU version
of ``set_value()`` can be found here: :ref:`libdoc_gpuarray_type`)
.. _borrowfunction: .. _borrowfunction:
......
...@@ -329,26 +329,6 @@ Tips: ...@@ -329,26 +329,6 @@ Tips:
of type *float64*. of type *float64*.
"Why does my GPU function seem to be slow?"
-------------------------------------------
When you compile an Aesara function, if you do not get the speedup that you expect over the
CPU performance of the same code. It is oftentimes due to the fact that some Ops might be running
on CPU instead GPU. If that is the case, you can use assert_no_cpu_op to check if there
is a CPU Op on your computational graph. assert_no_cpu_op can take the following one of the three
options:
* ``warn``: Raise a warning
* ``pdb``: Stop with a pdb in the computational graph during the compilation
* ``raise``: Raise an error,
if there is a CPU Op in the computational graph.
It is possible to use this mode by providing the flag in AESARA_FLAGS, such as:
``AESARA_FLAGS="float32,device=gpu,assert_no_cpu_op='raise'" python test.py``
But note that this optimization will not catch all the CPU Ops, it might miss some
Ops.
.. _faq_monitormode: .. _faq_monitormode:
"How do I Step through a Compiled Function?" "How do I Step through a Compiled Function?"
......
...@@ -242,9 +242,7 @@ achieve a similar result by returning the new expressions, and working with ...@@ -242,9 +242,7 @@ achieve a similar result by returning the new expressions, and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience, them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Aesara has more control over where and how shared variables are updates).
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU<using_gpu>`.
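The difference between an out-of-place and an in-place update can be seen with plain NumPy (a conceptual stand-in for what an ``updates=`` rule lets Aesara do with a shared variable's buffer):

```python
import numpy as np

w = np.zeros(5)
grad = np.ones(5)

# Out-of-place: `w - 0.1 * grad` allocates a fresh array each step.
before = id(w)
w = w - 0.1 * grad
assert id(w) != before

# In-place: write the result into the existing buffer, no allocation.
w2 = np.zeros(5)
before2 = id(w2)
np.subtract(w2, 0.1 * grad, out=w2)
assert id(w2) == before2
```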
It may happen that you expressed some formula using a shared variable, but It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the you do *not* want to use its value. In this case, you can use the
...@@ -375,7 +373,6 @@ distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of ...@@ -375,7 +373,6 @@ distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of
draws from a normal distribution. The distributions that are implemented are draws from a normal distribution. The distributions that are implemented are
defined as :class:`RandomVariable`\s defined as :class:`RandomVariable`\s
in :ref:`basic<libdoc_tensor_random_basic>`. They only work on CPU. in :ref:`basic<libdoc_tensor_random_basic>`. They only work on CPU.
See `Other Implementations`_ for GPU version.
Now let's use these objects. If we call ``f()``, we get random uniform numbers. Now let's use these objects. If we call ``f()``, we get random uniform numbers.
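The seeded-stream idea can be modelled with NumPy generators (a conceptual stand-in only; Aesara's `RandomStream` API differs, and the seed ``234`` is just an example value):

```python
import numpy as np

# Each seeded generator behaves like a reproducible stream of draws.
srng = np.random.default_rng(seed=234)
rv_u = srng.uniform(size=(2, 2))   # a draw from the "uniform stream"
rv_n = srng.normal(size=(2, 2))    # a draw from the "normal stream"

# Re-seeding replays the stream from the start.
replay = np.random.default_rng(seed=234)
assert np.allclose(replay.uniform(size=(2, 2)), rv_u)
```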
...@@ -502,22 +499,6 @@ Other Random Distributions ...@@ -502,22 +499,6 @@ Other Random Distributions
There are :ref:`other distributions implemented <libdoc_tensor_random_basic>`. There are :ref:`other distributions implemented <libdoc_tensor_random_basic>`.
.. _example_other_random:
Other Implementations
---------------------
There is another implementations based on :ref:`MRG31k3p
<libdoc_rng_mrg>`.
The `RandomStream` only work on the CPU, MRG31k3p work on the CPU and GPU.
.. note::
To use you the MRG version easily, you can just change the import to:
.. code-block:: python
from aesara.sandbox.rng_mrg import MRG_RandomStream as RandomStream
.. _logistic_regression: .. _logistic_regression:
......
...@@ -48,8 +48,6 @@ Advanced ...@@ -48,8 +48,6 @@ Advanced
.. toctree:: .. toctree::
sparse sparse
using_gpu
using_multi_gpu
conv_arithmetic conv_arithmetic
Advanced configuration and debugging Advanced configuration and debugging
......
...@@ -17,7 +17,6 @@ Scan ...@@ -17,7 +17,6 @@ Scan
- Advantages of using ``scan`` over *for* loops: - Advantages of using ``scan`` over *for* loops:
- The number of iterations can be part of the symbolic graph. - The number of iterations can be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps. - Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Aesara function. - Slightly faster than using a *for* loop in Python with a compiled Aesara function.
- Can lower the overall memory usage by detecting the actual amount of memory needed. - Can lower the overall memory usage by detecting the actual amount of memory needed.
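The recurrence that ``scan`` expresses can be modelled by a small Python fold (a sketch of the semantics, not of Aesara's symbolic implementation):

```python
def scan_like(fn, sequence, init):
    """Minimal Python model of scan's recurrence: each step receives the
    current sequence element and the previous step's output."""
    prev, outputs = init, []
    for x in sequence:
        prev = fn(x, prev)
        outputs.append(prev)
    return outputs

# Cumulative product, computed step by step as scan would.
print(scan_like(lambda x, prev: x * prev, [1, 2, 3, 4], 1))  # [1, 2, 6, 24]
```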
......
...@@ -83,11 +83,8 @@ Consider the logistic regression: ...@@ -83,11 +83,8 @@ Consider the logistic regression:
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]): train.maker.fgraph.toposort()]):
print('Used the cpu') print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else: else:
print('ERROR, not able to tell if aesara used the cpu or the gpu') print('ERROR, not able to tell if aesara used the cpu or another device')
print(train.maker.fgraph.toposort()) print(train.maker.fgraph.toposort())
for i in range(training_steps): for i in range(training_steps):
...@@ -137,7 +134,7 @@ is controlled by the value of the ``mode`` parameter. ...@@ -137,7 +134,7 @@ is controlled by the value of the ``mode`` parameter.
Aesara defines the following modes by name: Aesara defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations. So GPU is disabled. - ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations and use C implementations where possible. - ``'FAST_RUN'``: Apply all optimizations and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python - ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes, but can identify implementations. This mode can take much longer than the other modes, but can identify
......
...@@ -47,11 +47,8 @@ predict = aesara.function(inputs=[x], outputs=prediction, ...@@ -47,11 +47,8 @@ predict = aesara.function(inputs=[x], outputs=prediction,
if any(x.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for x in if any(x.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for x in
train.maker.fgraph.toposort()): train.maker.fgraph.toposort()):
print('Used the cpu') print('Used the cpu')
elif any(x.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for x in
train.maker.fgraph.toposort()):
print('Used the gpu')
else: else:
print('ERROR, not able to tell if aesara used the cpu or the gpu') print('ERROR, not able to tell if aesara used the cpu or another device')
print(train.maker.fgraph.toposort()) print(train.maker.fgraph.toposort())
for i in range(training_steps): for i in range(training_steps):
......
.. _tut_using_multi_gpu:

===================
Using multiple GPUs
===================

Aesara has a feature that allows the use of multiple GPUs at the same
time in one function. The multiple-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that works correctly.

In order to keep a reasonably high level of abstraction, you do not
refer to device names directly for multiple-GPU use. You instead
refer to what we call context names. These are then mapped to a
device using the Aesara configuration. This allows portability of
models between machines.

.. warning::

   The code is rather new and is still considered experimental at this
   point. It has been tested and seems to perform correctly in all
   cases observed, but make sure to double-check your results before
   publishing a paper or anything of the sort.

.. note::

   For data-parallelism, you are probably better off using `platoon
   <https://github.com/mila-udem/platoon>`_.

Defining the context map
------------------------

The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::

    dev0->cuda0;dev1->cuda1

Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.

The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Aesara. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects, like 'cuda0' or 'opencl0:0'.
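To make the format rules above concrete, here is a small sketch of a parser for such a context-map string. This is an illustration of the syntax only, written for this tutorial; it is not Aesara's actual parsing code, and the function name `parse_contexts` is invented for the example.

```python
def parse_contexts(spec):
    """Parse a context-map string like 'dev0->cuda0;dev1->cuda1'.

    A toy illustration of the rules described above, not Aesara's
    actual parser.
    """
    mapping = {}
    for entry in spec.split(';'):
        # Each mapping is '<context name>-><device name>'.
        name, sep, device = entry.partition('->')
        if not sep or '->' in device:
            raise ValueError(f"bad mapping: {entry!r}")
        # Context names starting with 'cuda' or 'opencl' are disallowed.
        if name.startswith(('cuda', 'opencl')):
            raise ValueError(f"reserved context name: {name!r}")
        mapping[name] = device
    return mapping

print(parse_contexts('dev0->cuda0;dev1->cuda1'))
# {'dev0': 'cuda0', 'dev1': 'cuda1'}
```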
.. note::

   Since there are a number of shell special characters in the syntax,
   defining this on the command-line will require proper quoting, like
   this:

   .. code-block:: shell

      $ AESARA_FLAGS="contexts=dev0->cuda0"

When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Aesara will print the mappings as they are
defined. This will look like this:

.. code-block:: bash

   $ AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import aesara'
   Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
   Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)

If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea, since Aesara often
assumes that different context names will be on different devices and
will optimize accordingly. So you may get better performance with a
single name and a single device.
.. note::

   It is often the case that multi-GPU operation requires or assumes
   that all the GPUs involved are equivalent. That is not the case
   for this implementation. Since the user has the task of
   distributing jobs across the different devices, a model can be
   built on the assumption that one of the GPUs is slower or has
   less memory.

A simple graph on two GPUs
--------------------------

The following simple program works on two GPUs. It builds a function
which performs two dot products on two different GPUs.

.. code-block:: python

    import numpy
    import aesara

    v01 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v02 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v11 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')
    v12 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')

    f = aesara.function([], [aesara.tensor.dot(v01, v02),
                             aesara.tensor.dot(v11, v12)])
    f()

This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
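The "twice as fast" claim comes from the two dot products being independent, so each device can work on its own product concurrently. A rough, GPU-free illustration of that idea, using plain Python threads and a naive matrix multiply (this is a conceptual sketch only, not Aesara code, and the device names are just dictionary keys):

```python
import threading

def matmul(a, b):
    # Naive matrix multiply; stands in for the dot product that each
    # device would execute in the Aesara example above.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

results = {}

def run_on(device, a, b):
    # Each "device" computes its product independently of the other.
    results[device] = matmul(a, b)

a0 = [[1.0, 2.0], [3.0, 4.0]]
b0 = [[5.0, 6.0], [7.0, 8.0]]

t0 = threading.Thread(target=run_on, args=('dev0', a0, b0))
t1 = threading.Thread(target=run_on, args=('dev1', b0, a0))
t0.start(); t1.start()
t0.join(); t1.join()

print(results['dev0'])
# [[19.0, 22.0], [43.0, 50.0]]
```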
Explicit transfers of data
--------------------------

Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input which is not on the right device.

However, you may want some explicit control over where and how these
transfers are done at some points. This is done by using the new
:meth:`transfer` method that is present on variables. It works for
moving data between GPUs and also between the host and the GPUs. Here
is an example.

.. code-block:: python

    import aesara

    v = aesara.tensor.fmatrix()

    # Move to the device associated with 'gpudev'
    gv = v.transfer('gpudev')

    # Move back to the cpu
    cv = gv.transfer('cpu')

Of course you can mix transfers and operations in any order you
choose. However, you should try to minimize transfer operations
because they introduce overhead that may reduce performance.
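The automatic-insertion rule described at the start of this section can be sketched as follows. This is a toy model written for this tutorial, not Aesara's graph machinery: `insert_transfers`, the `(name, device)` input pairs, and the string-based "graph" are all invented for illustration.

```python
def insert_transfers(op_device, inputs):
    """Toy version of the rule described above: an op runs on a single
    device, so any input living elsewhere gets wrapped in an explicit
    transfer to that device.

    `inputs` is a list of (variable name, current device) pairs.
    """
    graph = []
    for name, device in inputs:
        if device != op_device:
            # Input is on the wrong device: insert a transfer node.
            graph.append(f"transfer({name}, '{op_device}')")
        else:
            # Input is already where the op will run: use it as-is.
            graph.append(name)
    return graph

print(insert_transfers('dev0', [('v01', 'dev0'), ('h', 'cpu')]))
# ['v01', "transfer(h, 'dev0')"]
```

Here `v01` already lives on the op's device and is left untouched, while the host-resident `h` gets an inserted transfer, mirroring what the text above says happens automatically.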