Commit f7a506ff authored by Maxim Kochurov, committed by Brandon T. Willard

Remove gpuarray references from documentation

Parent cc584d6c
......@@ -14,11 +14,8 @@ Acknowledgements
* The developers of `Theano <https://github.com/Theano/Theano>`_
* All `Aesara contributors <https://github.com/aesara-devs/aesara/graphs/contributors>`_.
* All Theano users that have given us feedback.
* The GPU implementation of tensordot is based on code from Tijmen
Tieleman's `gnumpy <http://www.cs.toronto.edu/~tijmen/gnumpy.html>`_
* Our random number generator implementation on CPU and GPU uses the MRG31k3p algorithm that is described in:
P. L'Ecuyer and R. Touzin, `Fast Combined Multiple Recursive Generators with Multipliers of the form a = +/- 2^d +/- 2^e <http://www.informs-sim.org/wsc00papers/090.PDF>`_, Proceedings of the 2000 Winter Simulation Conference, Dec. 2000, 683--689.
We were authorized by Pierre L'Ecuyer to copy/modify his Java implementation in the `SSJ <http://www.iro.umontreal.ca/~simardr/ssj/>`_ software and to relicense it under BSD 3-Clauses in Theano.
* A better GPU memory allocator :attr:`CNMeM <config.lib.cnmem>` was included in Theano in the previous GPU back-end. It is still in the history, but not in the current version. It has the same license.
......@@ -6,12 +6,10 @@ Extending Aesara with a C :class:`Op`
=====================================
This tutorial covers how to extend Aesara with an :class:`Op` that offers a C
implementation. It does not cover :class:`Op`\s that run on a GPU but it does introduce
many elements and concepts which are relevant for GPU :class:`Op`\s. This tutorial is
aimed at individuals who already know how to extend Aesara (see tutorial
:ref:`creating_an_op`) by adding a new :class:`Op` with a Python implementation
and will only cover the additional knowledge required to also produce :class:`Op`\s
with C implementations.
implementation. This tutorial is aimed at individuals who already know how to
extend Aesara (see tutorial :ref:`creating_an_op`) by adding a new :class:`Op`
with a Python implementation and will only cover the additional knowledge
required to also produce :class:`Op`\s with C implementations.
Providing an Aesara :class:`Op` with a C implementation requires interacting with
Python's C-API and NumPy's C-API. Thus, the first step of this tutorial is to
......@@ -927,7 +925,7 @@ discussed below.
further below.
For every input which has a :attr:`dtype` attribute (this means
Tensors, and equivalent types on GPU), the following macros will be
Tensors), the following macros will be
defined unless your `Op` class has an :attr:`Op.check_input` attribute
set to ``False``. In these descriptions 'i' refers to the position
(indexed from 0) in the input array.
......@@ -1035,8 +1033,6 @@ When debugging C code, it can be useful to use GDB for code compiled
by Aesara.
For this, you must set the Aesara flag `cmodule__remove_gxx_opt=True`.
For the GPU, you must add in this second flag `nvcc.flags=-g` (it slow
down computation on the GPU, but it is enabled by default on the CPU).
Then start Python inside GDB and, from within that session, run your Python
process:
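A session might look like the following (a sketch only; ``my_script.py`` is a hypothetical script name):

```console
$ AESARA_FLAGS=cmodule__remove_gxx_opt=True gdb python
(gdb) run my_script.py
...
(gdb) backtrace
```

Since the generated modules are compiled without ``-O3`` optimizations under this flag, the backtrace will map onto the generated C source much more faithfully.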
......
......@@ -824,10 +824,10 @@ will not be accepted.
:class:`NanGuardMode` helps users find where in the graph NaNs appear. But
sometimes, we want some variables to not be checked. For example, in
the old GPU back-end, we use a float32 :class:`CudaNdarray` to store the MRG
the old GPU back-end, we used a float32 :class:`CudaNdarray` to store the MRG
random number generator state (they are integers). So if :class:`NanGuardMode`
check it, it will generate false positive. Another case is related to
:class:`[Gpu]AllocEmpty` or some computation on it (like done by :class:`Scan`).
checked it, it would generate a false positive. Another case is related to
:class:`AllocEmpty` or some computations on it (like done by :class:`Scan`).
You can tell :class:`NanGuardMode` not to check a variable with:
:attr:`variable.tag.nan_guard_mode_check`. Also, this tag automatically
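The tagging idea can be sketched in plain Python (this toy checker is not Aesara's implementation; the ``Var`` class is a hypothetical stand-in for a variable carrying a ``.tag`` namespace):

```python
import math

class Var:
    """Hypothetical stand-in for an Aesara variable with a ``.tag`` namespace."""
    def __init__(self, values, check=True):
        self.values = values
        self.tag = type("Tag", (), {})()
        self.tag.nan_guard_mode_check = check

def nan_guard(variables):
    """Toy NanGuardMode-style check: skip variables tagged with check=False."""
    bad = []
    for v in variables:
        if not getattr(v.tag, "nan_guard_mode_check", True):
            continue  # the user asked us not to check this variable
        if any(math.isnan(x) for x in v.values):
            bad.append(v)
    return bad

rng_state = Var([float("nan")], check=False)  # e.g. RNG state stored as floats
activations = Var([0.5, float("nan")])
assert nan_guard([rng_state, activations]) == [activations]
```

The RNG-state variable above would be a false positive without the tag, which mirrors the :class:`CudaNdarray` example from the old back-end.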
......
......@@ -114,7 +114,7 @@ prefix. The complete list can be found in the documentation for
Allows a special compiler to be specified. This will force that compiler for
the current compilation block (a particular :class:`Op` or the full
graph). This is used for the GPU code.
graph).
.. method:: c_code_cache_version()
......@@ -527,10 +527,9 @@ You can implement :meth:`COp.c_code` for this :class:`Op`. It is registered as f
In your C code, you should use ``%(iname)s`` and ``%(oname)s`` to represent
the C variable names of the :class:`DeepCopyOp` input and output
respectively. See an example for the type ``GpuArrayType`` (GPU
array) in the file ``aesara/gpuarray/type.py``. The version
parameter is what is returned by :meth:`DeepCopyOp.c_code_cache_version`. By
default, it will recompile the C code for each process.
respectively. The version parameter is what is returned by
:meth:`DeepCopyOp.c_code_cache_version`. By default, it will recompile the C
code for each process.
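The substitution mechanism can be sketched in plain Python: the C template is an ordinary %-format string, filled in at compile time (the C body below is illustrative only, not Aesara's actual :class:`DeepCopyOp` code):

```python
# Sketch: Aesara-style C code templates are plain Python %-format strings.
# The C body here is illustrative, not Aesara's actual DeepCopyOp code.
c_template = """
Py_XDECREF(%(oname)s);
%(oname)s = (PyArrayObject*)PyArray_Copy(%(iname)s);
"""

# At compile time, machinery like this fills in the real C variable names.
sub = {"iname": "input0_var", "oname": "output0_var"}
c_code = c_template % sub

assert "PyArray_Copy(input0_var)" in c_code
assert "Py_XDECREF(output0_var)" in c_code
```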
:class:`ViewOp`
===============
......
......@@ -829,9 +829,9 @@ Explanations:
* ``Total compile time: 1.131874e+01s`` gives the total time spent inside `aesara.function`.
* ``Number of Apply nodes: 50`` means that after optimization, there are 50 apply nodes in the graph.
* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically / work on GPU /...
* ``Aesara Optimizer time: 1.152431e+00s`` means that we spent 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more numerically stable /...
* ``Aesara validate time: 2.790451e-02s`` means that we spent 2.8e-2s in the *validate* subset of the optimization phase.
* ``Aesara Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in *linker* phase of ``aesara.function``.
* ``Aesara Linker time (includes C code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in *linker* phase of ``aesara.function``.
* ``Import time 1.153541e-02s`` is a subset of the linker time where we import the compiled module.
* ``Time in all call to aesara.grad() 4.732513e-02s`` tells us that we spent a total of 4.7e-2s in all calls to ``aesara.grad``. This is outside of the calls to ``aesara.function``.
......
......@@ -337,8 +337,7 @@ computation is carried out. The way optimizations work in Aesara is by
identifying and replacing certain patterns in the graph with other specialized
patterns that produce the same results but are either faster or more
stable. Optimizations can also detect identical subgraphs and ensure that the
same values are not computed twice or reformulate parts of the graph to a GPU
specific version.
same values are not computed twice.
For example, one (simple) optimization that Aesara uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
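A toy version of such a rewrite, on expressions encoded as nested tuples (this is not Aesara's graph machinery, and the rewrite implicitly assumes :math:`y \neq 0`):

```python
def simplify(expr):
    """Toy rewriter for the pattern (x*y)/y -> x on nested-tuple expressions."""
    if isinstance(expr, tuple):
        expr = tuple(simplify(e) for e in expr)  # rewrite children first
        op = expr[0]
        if op == "div" and isinstance(expr[1], tuple) and expr[1][0] == "mul":
            _, num, den = expr
            _, a, b = num
            if b == den:      # (a*den)/den -> a
                return a
            if a == den:      # (den*b)/den -> b
                return b
    return expr

assert simplify(("div", ("mul", "x", "y"), "y")) == "x"
```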
......
......@@ -20,8 +20,7 @@ Implementing an Aesara scalar Op allows that scalar operation to be reused
by our elemwise operations on tensors. If the scalar operation has C code, the
elemwise implementation will automatically have C code too. This
will enable the fusion of elemwise operations using your new scalar
operation. It can also reuse the GPU elemwise code. It is similar for
reduction operations.
operation. It is similar for reduction operations.
Be careful about some possible problems in the definition of the
``grad`` method, and about dependencies that may not be available. In
......@@ -125,11 +124,7 @@ Random distribution
We have 3 base random number generators. One that wraps NumPy's random
generator, one that implements MRG31k3p and one that wraps CURAND.
The fastest, but less developed, is CURAND. It works only on CUDA-enabled
GPUs. It does not work on the CPU and it has fewer random distributions
implemented.
The recommended and 2nd faster is MRG. It works on the GPU and CPU and
The recommended generator, and the second fastest, is MRG. It works on the CPU and
has more implemented distributions.
The slowest is our wrapper on NumPy's random generator.
......
......@@ -194,12 +194,11 @@ default values.
:noindex:
If filter_inplace is defined, it will be called instead of
filter() This is to allow reusing the old allocated memory. As
of this writing this is used only when we transfer new data to a
shared variable on the gpu.
filter(). This allows reusing the previously allocated memory. This was used
only when new data was transferred to a shared variable on a GPU.
``storage`` will be the old value (e.g. the old `ndarray`).
``storage`` will be the old value. i.e. The old numpy array,
CudaNdarray, ...
.. method:: is_valid_value(value)
:noindex:
......
......@@ -6,17 +6,13 @@
Frequently Asked Questions
==========================
Does Aesara support Python 3?
------------------------------
We support both Python 2 >= 2.7 and Python 3 >= 3.4.
Output slight numerical difference
----------------------------------
Sometimes when you compare the output of Aesara using different
Aesara flags, Aesara versions, CPU and GPU or with other software like
NumPy, you will see small numerical differences.
Sometimes when you compare the output of Aesara using different Aesara flags,
Aesara versions, CPU and GPU devices, or with other software like NumPy, you
will see small numerical differences.
This is normal. Floating point numbers are approximations of real
numbers. This is why computing ``a+(b+c)`` vs. ``(a+b)+c`` can give small
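The associativity difference is easy to observe directly in Python:

```python
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
assert left != right              # the two sums differ in the last bit
assert abs(left - right) < 1e-15  # ...but only by a tiny amount
```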
......@@ -53,9 +49,9 @@ the flag ``mode=FAST_COMPILE`` which instructs Aesara to skip most
optimizations and disables the generation of any c/cuda code. This is useful
for quickly testing a simple idea.
If c/cuda code is necessary, as when using a GPU, the flag
If C code is necessary, the flag
``optimizer=fast_compile`` can be used instead. It instructs Aesara to
skip time consuming optimizations but still generate c/cuda code.
skip time-consuming optimizations but still generate C code.
Similarly using the flag ``optimizer_excluding=inplace`` will speed up
compilation by preventing optimizations that replace operations with a
......@@ -95,11 +91,6 @@ garbage collection will keep all intermediate results' memory space so they can
be reused during the next call to the same Aesara function, if they are of the
correct shape. The shape could change if the shapes of the inputs change.
.. note::
With :attr:`preallocate <config.gpuarray__preallocate>`, this isn't
very useful with GPU anymore.
.. _unsafe_optimization:
Unsafe optimization
......@@ -173,11 +164,6 @@ but requires that all nodes in the graph have a C implementation:
f = function([x], (x + 1.) * 2, mode=aesara.compile.mode.Mode(linker='c'))
f(10.)
New GPU backend using libgpuarray
---------------------------------
The new aesara GPU backend (:ref:`gpuarray`) uses ``config.gpuarray__preallocate`` for GPU memory allocation.
Related Projects
----------------
......
......@@ -13,7 +13,4 @@ Supported platforms:
install_windows
install_centos6
Once your setup is complete and if you installed the GPU libraries, head to :ref:`testing_the_gpu` to find how to verify
everything is working properly.
To update your current installation see :ref:`updating`.
......@@ -12,23 +12,14 @@ Stable Installation
With ``conda``
^^^^^^^^^^^^^^
If you use conda, you can directly install both aesara and pygpu. Libgpuarray
will be automatically installed as a dependency of pygpu.
If you use conda, you can directly install aesara.
.. code-block:: bash
conda install aesara pygpu
.. warning::
The Aesara developers do not maintain ``pygpu``, so compatibility isn't
guaranteed.
conda install aesara
With ``pip``
^^^^^^^^^^^^
If you use pip, you have to install Aesara and libgpuarray separately.
aesara
::::::
......@@ -50,16 +41,6 @@ Install the latest stable version of Aesara with:
If you encountered any trouble, head to the :ref:`troubleshooting` page.
libgpuarray
:::::::::::
Download it with::
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
and then follow the `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
Bleeding-Edge Installation (recommended)
----------------------------------------
......@@ -80,19 +61,6 @@ Install the latest, bleeding-edge, development version of Aesara with:
If you encountered any trouble, head to the :ref:`troubleshooting` page.
libgpuarray
^^^^^^^^^^^
Install the latest, development version of libgpuarray following the
`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
.. note::
Currently, you need ``libgpuarray`` version ``0.7.X`` that is not in conda default channel.
But you can install it with our own channel ``mila-udem`` (that only supports Python 2.7, 3.5 and 3.6)::
conda install -c mila-udem pygpu
Developer Installation
----------------------
......@@ -116,8 +84,3 @@ Install the developer version of Aesara with:
source directory.
If you encountered any trouble, head to the :ref:`troubleshooting` page.
libgpuarray
^^^^^^^^^^^
See instructions for bleeding-edge installation about ``libgpuarray``.
......@@ -17,21 +17,6 @@ details so that we can add alternative instructions.
.. include:: requirements.inc
.. _gpu_macos:
.. attention::
For MacOS you should be able to follow the above instructions to
setup CUDA, but be aware of the following caveats:
* If you want to compile the CUDA SDK code, you may need to temporarily
revert back to Apple's gcc (``sudo port select gcc``) as their Makefiles
are not compatible with MacPort's gcc.
* If CUDA seems unable to find a CUDA-capable GPU, you may need to manually
toggle your GPU on, which can be done with
`gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
.. attention::
Aesara officially supports only clang on OS X. This can be installed
......
......@@ -11,8 +11,6 @@ Ubuntu Installation Instructions
from GitHub, please make sure you are reading `the latest version of this
page <http://deeplearning.net/software/aesara_versions/dev/install_ubuntu.html>`_.
.. _gpu_linux:
.. |PythonDistRecommended| replace:: The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below)
.. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
.. |CompilerName| replace:: ``g++``
......@@ -28,14 +26,13 @@ Prerequisites through System Packages (not recommended)
If you want to acquire the requirements through your system packages
and install them system-wide, follow these instructions:
For Ubuntu 16.04 with cuda 7.5
For Ubuntu 16.04
.. code-block:: bash
sudo apt-get install python-numpy python-scipy python-dev python-pip python-pytest g++ libopenblas-dev git graphviz
sudo pip install Aesara
# cuda 7.5 don't support the default g++ version. Install an supported version and make it the default.
sudo apt-get install g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
......
......@@ -30,26 +30,10 @@ Install requirements and optional packages
* Arguments between <...> are optional.
* ``m2w64-toolchain`` package provides a fully-compatible version of GCC and is therefore highly recommended.
* ``git`` package installs git source control through conda, which is required for the development versions of Aesara and libgpuarray
* ``git`` package installs git source control through conda, which is required for the development version of Aesara
.. _gpu_windows:
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
Install CUDA drivers
^^^^^^^^^^^^^^^^^^^^
Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
You must reboot the computer after the driver installation.
.. Installation of Aesara and libgpuarray.
.. Installation of Aesara.
.. include:: install_generic.inc
:start-after: .. _install_generic:
......@@ -73,7 +57,3 @@ generic guidelines to get a working environment:
path`` option.
3. Enable OpenMP support by checking the option ``openmp support
option``.
* Install CUDA with the same instructions as above.
* Install the latest, development version of libgpuarray following the
`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
......@@ -30,12 +30,12 @@
By default, return a copy of the data. If ``borrow=True`` (and
``return_internal_type=False``), maybe it will return a copy.
For tensor, it will always return a ndarray by default, so if
the data is on the GPU, it will return a copy, but if the data
For tensor, it will always return an `ndarray` by default, so if
the data is on another device, it will return a copy, but if the data
is on the CPU, it will return the original data. If you do
``borrow=True`` and ``return_internal_type=True``, it will
always return the original data, not a copy, but this can be a
GPU object.
always return the original data, not a copy, but this can be a non-`ndarray`
type of object.
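The ``borrow`` contract can be sketched with a toy class (this is not Aesara's shared-variable implementation; it only mimics the behavior described above):

```python
import copy

class ToyShared:
    """Toy sketch (not Aesara's API) of the ``borrow`` semantics above."""
    def __init__(self, data):
        self._data = data  # stand-in for the internal container

    def get_value(self, borrow=False, return_internal_type=False):
        if borrow and return_internal_type:
            return self._data              # always the original internal object
        if borrow:
            return self._data              # may skip the copy (this toy always does)
        return copy.deepcopy(self._data)   # default: always a safe copy

sv = ToyShared([1.0, 2.0])
assert sv.get_value() is not sv._data                              # a copy
assert sv.get_value(borrow=True, return_internal_type=True) is sv._data
```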
.. method:: set_value(self, new_value, borrow=False)
......
......@@ -51,11 +51,11 @@ Environment Variables
.. code-block:: bash
AESARA_FLAGS='floatX=float32,device=cuda0,gpuarray__preallocate=1' python <myscript>.py
AESARA_FLAGS='floatX=float32' python <myscript>.py
If a value is defined several times in ``AESARA_FLAGS``,
the right-most definition is used, so, for instance, if
``AESARA_FLAGS='device=cpu,device=cuda0'`` is set, then ``cuda0`` will be
``AESARA_FLAGS='floatX=float32,floatX=float64'`` is set, then ``float64`` will be
used.
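The right-most-wins rule amounts to later assignments overwriting earlier ones, as in this sketch of the flag parsing:

```python
def parse_flags(flags):
    """Sketch of right-most-wins parsing for comma-separated KEY=VALUE flags."""
    parsed = {}
    for item in flags.split(","):
        if item:
            key, _, value = item.partition("=")
            parsed[key] = value  # later assignments overwrite earlier ones
    return parsed

assert parse_flags("floatX=float32,floatX=float64") == {"floatX": "float64"}
```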
.. envvar:: AESARARC
......@@ -72,15 +72,11 @@ Environment Variables
floatX = float32
device = cuda0
[gpuarray]
preallocate = 1
Configuration attributes that are available directly in ``config``
(e.g. ``config.device``, ``config.mode``) should be defined in the
``[global]`` section.
Attributes from a subsection of ``config`` (e.g. ``config.gpuarray__preallocate``,
``config.dnn__conv__algo_fwd``) should be defined in their corresponding
section (e.g. ``[gpuarray]``, ``[dnn.conv]``).
(e.g. ``config.mode``) should be defined in the ``[global]`` section.
Attributes from a subsection of ``config``
(e.g. ``config.dnn__conv__algo_fwd``) should be defined in their
corresponding section (e.g. ``[dnn.conv]``).
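Since the file is INI-style, Python's ``configparser`` can illustrate how sections map to configuration attributes (the values below are only examples):

```python
import configparser

aesararc = """
[global]
floatX = float32

[dnn.conv]
algo_fwd = time_once
"""

cfg = configparser.ConfigParser()
cfg.read_string(aesararc)
assert cfg["global"]["floatX"] == "float32"      # -> config.floatX
assert cfg["dnn.conv"]["algo_fwd"] == "time_once"  # -> config.dnn__conv__algo_fwd
```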
Multiple configuration files can be specified by separating them with ``':'``
characters (as in ``$PATH``). Multiple configuration files will be merged,
......@@ -105,20 +101,7 @@ import ``aesara`` and print the config variable, as in:
.. attribute:: device
String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
``'opencl0:0'``, ``'opencl0:1'``, ...
Default device for computations. If ``'cuda*``, change the default to try
to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
the OpenCL libraries will be used. To let the driver select the device,
use ``'cuda'`` or ``'opencl'``. If we are not able to use the GPU,
either we fall back on the CPU, or an error is raised, depending
on the :attr:`force_device` flag.
This flag's value cannot be modified during the program execution.
Do not use upper case letters; only lower case, even if NVIDIA uses
capital letters.
String value: ``'cpu'``
.. attribute:: force_device
......@@ -126,29 +109,6 @@ import ``aesara`` and print the config variable, as in:
Default: ``False``
If ``True`` and ``device=gpu*``, Aesara raises an error when it cannot
use the specified :attr:`device`. If ``True`` and ``device=cpu``,
Aesara disables the GPU. If ``False`` and ``device=gpu*``, and when the
specified device cannot be used, Aesara emits a warning and falls back to
the CPU.
This flag's value cannot be modified during the program execution.
.. attribute:: init_gpu_device
String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
``'opencl0:0'``, ``'opencl0:1'``, ...
Initialize the gpu device to use.
When its value is ``'cuda*'`` or ``'opencl*'``, the Aesara
flag :attr:`device` must be ``'cpu'``.
Unlike :attr:`device`, setting this flag to a specific GPU will not
make Aesara attempt to use the device by default. More specifically, it
will **not** move computations, nor shared variables, to the specified GPU.
This flag can be used to run GPU-specific tests on a particular GPU, instead
of the default one.
This flag's value cannot be modified during the program execution.
.. attribute:: print_active_device
......@@ -157,7 +117,7 @@ import ``aesara`` and print the config variable, as in:
Default: ``True``
Print the active device when the GPU device is initialized.
Print the active device when the device is initialized.
.. attribute:: floatX
......@@ -186,10 +146,7 @@ import ``aesara`` and print the config variable, as in:
Default: ``'default'``
If ``more``, sometimes Aesara will select :class:`Op` implementations that
are more "deterministic", but slower. In particular, on the GPU,
Aesara will avoid using ``AtomicAdd``. Sometimes Aesara will still use
non-deterministic implementations, e.g. when there isn't a GPU :class:`Op`
implementation that is deterministic. See the ``dnn.conv.algo*``
are more "deterministic", but slower. See the ``dnn.conv.algo*``
flags for more cases.
.. attribute:: allow_gc
......@@ -207,9 +164,6 @@ import ``aesara`` and print the config variable, as in:
functions with many fast :class:`Op`\s, but it also increases Aesara's memory
usage.
.. note:: If :attr:`config.gpuarray__preallocate` is the default value
or not disabled ``(-1)``, this is not useful anymore on the GPU.
.. attribute:: config.scan__allow_output_prealloc
Bool value, either ``True`` or ``False``
......@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in:
<https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__
library, which is faster than the standard ``libm``.
.. attribute:: config.gpuarray__preallocate
Float value
Default: 0 (Preallocation of size 0, only cache the allocation)
Controls the preallocation of memory with the gpuarray backend.
This value represents the start size (either in MB or the fraction
of total GPU memory) of the memory pool. If more memory is needed,
Aesara will try to obtain more, but this can cause memory
fragmentation.
A negative value will completely disable the allocation cache.
This can have a severe impact on performance and should not be
used outside of debugging.
* < 0: disabled
* 0 <= N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
* > 1: use this number in megabytes (MB) of memory.
.. note::
This could cause memory fragmentation, so, if you have a memory
error while using the cache, try to allocate more memory at
the start, or disable it.
.. note::
The clipping at 95% can be bypassed by specifying the exact
number of megabytes. If more then 95% are needed, it will try
automatically to get more memory. But this can cause
fragmentation, see note above.
.. attribute:: config.gpuarray__sched
String value: ``'default'``, ``'multi'``, ``'single'``
Default: ``'default'``
Control the stream mode of contexts.
The sched parameter passed for context creation to ``pygpu``. With
CUDA, using ``"multi"`` means using the parameter
``cudaDeviceScheduleBlockingSync``. This is useful to lower the CPU overhead
when waiting for a GPU.
.. attribute:: config.gpuarray__single_stream
Boolean value
Default: ``True``
Control the stream mode of contexts.
If your computations consist of mostly small arrays, using
single-stream will avoid the synchronization overhead and usually
be faster. For larger arrays it does not make a difference yet.
.. attribute:: config.gpuarray__cache_path
Default: ``config.compiledir``/gpuarray_kernels
Directory to cache pre-compiled kernels for the gpuarray backend.
.. attribute:: linker
String value: ``'c|py'``, ``'py'``, ``'c'``, ``'c|py_nogc'``
......
......@@ -13,8 +13,7 @@ import ``aesara.sparse`` to enable it.
The sparse module provides the same functionality as the tensor
module. The difference lies under the covers because sparse matrices
do not store data in a contiguous array. Note that there are no GPU
implementations for sparse matrices in Aesara. The sparse module has
do not store data in a contiguous array. The sparse module has
been used in:
- NLP: Dense linear transformations of sparse vectors.
......
......@@ -29,51 +29,13 @@ The recommended user interfaces are:
With these new interfaces, Aesara will automatically use the fastest
implementation in many cases. On the CPU, the implementation is a GEMM
based one. On the GPU, there is a GEMM based and :ref:`cuDNN
<libdoc_gpuarray_dnn>` version.
By default on the GPU, if cuDNN is available, it will be used,
otherwise we will fall back to using gemm based version (slower than
cuDNN in most cases and uses more memory). To get an error if cuDNN
can not be used, you can supply the Aesara flag ``dnn.enable=True``.
Either cuDNN and the gemm version can be disabled using the Aesara flags
``optimizer_excluding=conv_dnn`` and ``optimizer_excluding=conv_gemm``,
respectively. If both are disabled, it will raise an error.
For the cuDNN version, there are different algorithms with different
memory/speed trade-offs. Manual selection of the right one is very
difficult as it depends on the shapes and hardware. So it can change
for each layer. An auto-tuning mode exists and can be activated by
those flags: ``dnn__conv__algo_fwd=time_once``,
``dnn__conv__algo_bwd_data=time_once`` and
``dnn__conv__algo_bwd_filter=time_once``. Note, they are good mostly
when the shape do not change.
based one.
This auto-tuning has the inconvenience that the first call is much
slower as it tries and times each implementation it has. So if you
benchmark, it is important that you remove the first call from your
timing.
Also, a meta-optimizer has been implemented for the gpu convolution
implementations to automatically choose the fastest implementation
for each specific convolution in your graph. For each instance, it will
compile and benchmark each applicable implementation and choose the
fastest one. It can be enabled using ``optimizer_including=conv_meta``.
The meta-optimizer can also selectively disable cudnn and gemm version
using the Aesara flag ``metaopt__optimizer_excluding=conv_dnn`` and
``metaopt__optimizer_excluding=conv_gemm`` respectively.
.. note::
Aesara had older user interface like
aesara.tensor.nnet.conv.conv2d. Do not use them anymore. They
will give you slower code and won't allow easy switch between CPU
and GPU computation. They also support less type of convolution.
Implementation Details
======================
......@@ -85,10 +47,6 @@ not need to read it. Aesara will select it for you.
- :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`.
old 2d convolution. DO NOT USE ANYMORE.
- :func:`GpuCorrMM <aesara.gpuarray.blas.GpuCorrMM>`
This is a GPU-only 2d correlation implementation taken from
`caffe's CUDA implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_. It does not flip the kernel.
For each element in a batch, it first creates a
`Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters
......@@ -100,15 +58,8 @@ not need to read it. Aesara will select it for you.
This is a CPU-only 2d correlation implementation taken from
`caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_.
It does not flip the kernel.
- :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library.
- Implemented operators for neural network 3D / video convolution:
- :func:`GpuCorr3dMM <aesara.gpuarray.blas.GpuCorr3dMM>`
This is a GPU-only 3d correlation relying on a Toeplitz matrix
and gemm implementation (see :func:`GpuCorrMM <aesara.sandbox.cuda.blas.GpuCorrMM>`)
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
``(no of channels * filter width * filter height * filter depth, output width * output height * output depth)``.
- :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>`
This is a CPU-only 3d correlation implementation based on
the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`).
......@@ -116,12 +67,6 @@ not need to read it. Aesara will select it for you.
replacement for nnet.conv3d. For convolutions done on CPU,
nnet.conv3d will be replaced by Corr3dMM.
- :func:`dnn_conv3d <aesara.gpuarray.dnn.dnn_conv3d>` GPU-only
3D convolution using NVIDIA's cuDNN library (as :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` but for 3d).
If cuDNN is available, by default, Aesara will replace all nnet.conv3d
operations with dnn_conv.
- :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses the conv2d with data reshaping.
It is faster in some corner cases than conv3d. It flips the kernel.
......
......@@ -14,8 +14,7 @@
.. note::
This interface is the preferred interface. It will be moved
automatically to the GPU.
This interface is the preferred interface.
.. note::
......
......@@ -42,7 +42,6 @@ Optimization o4 o3 o2
========================================================= ============== === === ================= ============= ======
:term:`merge` x x x x x
:term:`constant folding<constant folding>` x x x x x
:term:`GPU transfer` x x x x x
:term:`shape promotion<shape promotion>` x x x
:term:`fill cut<fill cut>` x x x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x x x
......@@ -247,32 +246,10 @@ Optimization o4 o3 o2
This optimization compresses subgraphs of computationally cheap
elementwise operations into a single Op that does the whole job in a
single pass over the inputs (like loop fusion). This is a win when
transfer from main memory to the CPU (or from graphics memory to the
GPU) is a bottleneck.
transfer from main memory to the CPU is a bottleneck.
See :class:`FusionOptimizer`
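The effect of fusion can be sketched in plain Python: two separate elementwise passes over the data versus one fused pass with no intermediate buffer:

```python
# Sketch of elementwise fusion: two traversals vs. one fused traversal.
def unfused(xs):
    tmp = [x + 1.0 for x in xs]     # first pass, materializes a temporary
    return [t * 2.0 for t in tmp]   # second pass over the temporary

def fused(xs):
    return [(x + 1.0) * 2.0 for x in xs]  # one pass, no temporary

data = [0.0, 1.0, 2.0]
assert unfused(data) == fused(data) == [2.0, 4.0, 6.0]
```

The fused version touches each element once, which is exactly the win when memory transfer, rather than arithmetic, is the bottleneck.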
GPU transfer
The current strategy for choosing which expressions to evaluate on the
CPU and which to evaluate on the GPU is a greedy one. There are a
number of Ops ***TODO*** with GPU implementations and whenever we find
a graph copying data from GPU to CPU in order to evaluate an
expression that could have been evaluated on the GPU, we substitute
the GPU version of that Op for the CPU version. Likewise if we are
copying the output of a Op with a GPU implementation to the GPU,
then we substitute the GPU version for the CPU version. In this way, if all goes well,
this procedure will result in a graph with the following form:
1. copy non-shared inputs to GPU
2. carry out most/all computations on the GPU
3. copy output back to CPU
When using a GPU, :func:`shared()` will default to GPU storage for
'float32' ndarray arguments, and these shared variables act as seeds
for the greedy algorithm.
See :func:`aesara.sandbox.cuda.opt.*`.
local_log_softmax
This is a stabilization optimization.
Due to rounding errors, the softmax probability of one value can become exactly 0.
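A minimal sketch of why the rewrite helps, in plain Python rather than Aesara graphs: the naive ``log(softmax(x))`` fails once a probability underflows to 0, while the rewritten form stays finite:

```python
import math

def log_softmax_naive(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [math.log(e / total) for e in exps]  # fails once a probability is 0

def log_softmax_stable(xs):
    # Rewritten form: x - logsumexp(x), shifted by the max for stability.
    mx = max(xs)
    lse = mx + math.log(sum(math.exp(x - mx) for x in xs))
    return [x - lse for x in xs]

xs = [0.0, -1000.0]          # exp(-1000) underflows to exactly 0.0
try:
    log_softmax_naive(xs)
    underflowed = False
except ValueError:           # math.log(0.0): the probability reached 0
    underflowed = True
assert underflowed
assert log_softmax_stable(xs) == [0.0, -1000.0]
```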
......
......@@ -9,10 +9,6 @@ Requirements
.. _Python: http://www.python.org/
.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
.. _pycuda: https://mathema.tician.de/software/pycuda/
.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/
.. _warp-ctc: https://github.com/baidu-research/warp-ctc
Python_ >= 3.7
......@@ -42,20 +38,6 @@ Requirements
`pydot-ng <https://github.com/pydot/pydot-ng>`_
To handle large picture for gif/images.
`NVIDIA CUDA drivers and SDK`_
**Highly recommended** Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
`libgpuarray`_
Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`).
`pycuda`_ and `skcuda`_
Required for some extra operations on the GPU like fft and
solvers. We use them to wrap cufft and cusolver. Quick install
``pip install pycuda scikit-cuda``. For cuda 8, the dev
version of skcuda (will be released as 0.5.2) is needed for
cusolver: ``pip install pycuda; pip install
git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda``.
`warp-ctc`_
Required for :ref:`Aesara CTC implementation
<libdoc_tensor_nnet_ctc>`. It is faster than using an
......@@ -84,28 +66,3 @@ Install requirements and optional packages
conda install numpy scipy mkl pytest <sphinx> <pydot-ng>
* Arguments between <...> are optional.
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
1. Install CUDA drivers
* Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
* You must reboot the computer after the driver installation.
* Test that it was loaded correctly after the reboot by executing the
command `nvidia-smi` from the command line.
.. note::
Sanity check: The *bin* subfolder should contain an *nvcc*
program. This folder is called the *cuda root* directory.
2. Fix 'lib' path
* Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a
64-bit OS) to your ``$LD_LIBRARY_PATH`` environment
variable. Example: ``/usr/local/cuda/lib64``
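For example, on a typical 64-bit Linux install with the toolkit under ``/usr/local/cuda`` (a hypothetical default location; replace it with your actual *cuda root*), the shell setup might look like:

```shell
# Assumed install prefix; adjust /usr/local/cuda to your cuda root.
export LD_LIBRARY_PATH="/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Make the setting persistent for future shells.
echo 'export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
```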
......@@ -54,7 +54,7 @@ if __name__ == '__main__':
pythonpath = os.pathsep.join([throot, pythonpath])
sys.path[0:0] = [throot] # We must not use os.environ.
# Make sure we don't use gpu to compile documentation
# Make sure we don't use other devices to compile documentation
env_th_flags = os.environ.get('AESARA_FLAGS', '')
os.environ['AESARA_FLAGS'] = 'device=cpu,force_device=True'
......
......@@ -59,12 +59,6 @@ where X is far less than Y and Z (i.e. X << Y < Z).
This scenario arises when an operation requires allocation of a large contiguous
block of memory but no blocks of sufficient size are available.
GPUs do not have virtual memory, so all allocations must be assigned to
a contiguous memory region. CPUs do not have this limitation because of their
support for virtual memory. Multiple allocations on a GPU can result in memory
fragmentation, which can make it more difficult to find contiguous regions
of memory of sufficient size during subsequent memory allocations.
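The allocation pattern that causes trouble can be illustrated with plain NumPy host arrays (a sketch of the idea only; a device allocator behaves analogously but is not shown here):

```python
import numpy as np

buf = np.zeros((4, 4), dtype="float32")
same_block = id(buf)

# In-place assignment reuses the block that is already allocated.
buf[...] = 1.0
reused = id(buf) == same_block   # True: no new allocation happened

# Rebinding the name allocates a fresh block of a different size; on a
# device without virtual memory, repeated churn like this fragments the
# memory pool until no contiguous region is large enough.
buf = np.zeros((8, 8), dtype="float32")
```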
A known example is related to writing data to shared variables. When updating a
shared variable Aesara will allocate new space if the size of the data does not
match the size of the space already assigned to the variable. This can lead to
......@@ -80,9 +74,6 @@ aesara.function returns a float64 when the inputs are float32 and int{32, 64}
It should be noted that using float32 and int{32, 64} together
inside a function will produce float64 as output.
Since the GPU can't compute this kind of output, it would be
preferable not to use those dtypes together.
To help you find where float64 are created, see the
:attr:`warn_float64` Aesara flag.
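Aesara follows NumPy's casting rules here, so the promotion is easy to check directly in NumPy (an illustration of the rule, not of Aesara's internals):

```python
import numpy as np

a = np.ones(3, dtype="float32")
b = np.ones(3, dtype="int64")

# float32 cannot represent every int64 exactly, so the result is upcast.
print((a + b).dtype)                          # float64

# A small enough integer type keeps the result in float32.
print((a + np.ones(3, dtype="int16")).dtype)  # float32
```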
......@@ -120,40 +111,21 @@ All Aesara tests should pass (skipped tests and known failures are normal). If
some test fails on your machine, you are encouraged to tell us what went
wrong in the GitHub issues.
.. warning::
Aesara's test should **NOT** be run with ``device=cuda``
or they will fail. The tests automatically use the gpu, if any, when
needed. If you don't want Aesara to ever use the gpu when running tests,
you can set :attr:`config.device` to ``cpu`` and
:attr:`config.force_device` to ``True``.
.. _slow_or_memory:
Why is my code so slow/uses so much memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are a few things you can easily do to change the trade-off
between speed and memory usage. If nothing is said, this affects both
CPU and GPU memory usage.
Could speed up and lower memory usage:
- :ref:`cuDNN <libdoc_gpuarray_dnn>` default cuDNN convolution use less
memory then Aesara version. But some flags allow it to use more
memory. GPU only.
between speed and memory usage.
Could raise memory usage but speed up computation:
- :attr:`config.gpuarray__preallocate` = 1 # Preallocates the GPU memory
and then manages it in a smart way. Does not raise the memory
usage much, but if you are at the limit of available GPU memory you
might need to specify a lower value. GPU only.
- :attr:`config.allow_gc` =False
- :attr:`config.optimizer_excluding` =low_memory , GPU only for now.
Could lower the memory usage, but raise computation time:
- :attr:`config.scan__allow_gc` = True # Probably not significant slowdown on the GPU if memory cache is not disabled
- :attr:`config.scan__allow_gc` = True
- :attr:`config.scan__allow_output_prealloc` =False
- Use :func:`batch_normalization()
<aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory
......@@ -293,7 +265,7 @@ Aesara/BLAS speed test:
python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py
This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Aesara/NumPy
threads on multiple CPUs. It will also print some Aesara/NumPy
configuration information. Then, it will print the running time of the same
benchmarks for your installation. Try to find a CPU similar to yours in
the table, and check that the single-threaded timings are roughly the same.
......
......@@ -194,33 +194,32 @@ makes it possible to expose Aesara's internal variables without a copy, then it
proceeds as fast as an in-place update.
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
..
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily.
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily.
*Solution*: update to a newer version of Aesara.
*Solution*: update to a newer version of Aesara.
* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
* It is also worth mentioning that, current GPU copying routines
support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
* It is also worth mentioning that, current GPU copying routines
support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
*Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
*Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
(Further information on the current implementation of the GPU version
of ``set_value()`` can be found here: :ref:`libdoc_gpuarray_type`)
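Whether a value is already C-contiguous is cheap to check and to fix up front, for example (a generic NumPy sketch, independent of the GPU back-end):

```python
import numpy as np

a = np.ones((512, 512), dtype="float32")
col_major = np.asfortranarray(a)          # column-major copy

print(col_major.flags["C_CONTIGUOUS"])    # False

# Convert once, ahead of time, instead of forcing an implicit copy
# every time the value is assigned to the shared variable.
fixed = np.ascontiguousarray(col_major)
print(fixed.flags["C_CONTIGUOUS"])        # True
```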
.. _borrowfunction:
......
......@@ -329,26 +329,6 @@ Tips:
of type *float64*.
"Why does my GPU function seem to be slow?"
-------------------------------------------
When you compile an Aesara function and do not get the speedup you expect
over the CPU performance of the same code, it is oftentimes because some Ops
are running on the CPU instead of the GPU. If that is the case, you can use
assert_no_cpu_op to check whether there is a CPU Op in your computational
graph. assert_no_cpu_op can take one of the following three
options:
* ``warn``: Raise a warning
* ``pdb``: Stop with a pdb in the computational graph during the compilation
* ``raise``: Raise an error,
if there is a CPU Op in the computational graph.
It is possible to use this mode by providing the flag in AESARA_FLAGS, such as:
``AESARA_FLAGS="float32,device=gpu,assert_no_cpu_op='raise'" python test.py``
But note that this check will not catch all CPU Ops; it might miss
some.
.. _faq_monitormode:
"How do I Step through a Compiled Function?"
......
......@@ -242,9 +242,7 @@ achieve a similar result by returning the new expressions, and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Aesara has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU<using_gpu>`.
updates).
It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the
......@@ -375,7 +373,6 @@ distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of
draws from a normal distribution. The distributions that are implemented are
defined as :class:`RandomVariable`\s
in :ref:`basic<libdoc_tensor_random_basic>`. They only work on CPU.
See `Other Implementations`_ for GPU version.
Now let's use these objects. If we call ``f()``, we get random uniform numbers.
......@@ -502,22 +499,6 @@ Other Random Distributions
There are :ref:`other distributions implemented <libdoc_tensor_random_basic>`.
.. _example_other_random:
Other Implementations
---------------------
There is another implementation based on :ref:`MRG31k3p
<libdoc_rng_mrg>`.
The `RandomStream` only works on the CPU; MRG31k3p works on both the CPU and GPU.
.. note::
To use the MRG version easily, you can just change the import to:
.. code-block:: python
from aesara.sandbox.rng_mrg import MRG_RandomStream as RandomStream
.. _logistic_regression:
......
......@@ -48,8 +48,6 @@ Advanced
.. toctree::
sparse
using_gpu
using_multi_gpu
conv_arithmetic
Advanced configuration and debugging
......
......@@ -17,7 +17,6 @@ Scan
- Advantages of using ``scan`` over *for* loops:
- Number of iterations to be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Aesara function.
- Can lower the overall memory usage by detecting the actual amount of memory needed.
......
......@@ -83,11 +83,8 @@ Consider the logistic regression:
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print('ERROR, not able to tell if aesara used the cpu or another device')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
......@@ -137,7 +134,7 @@ is controlled by the value of the ``mode`` parameter.
Aesara defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations. So GPU is disabled.
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes, but can identify
......
......@@ -47,11 +47,8 @@ predict = aesara.function(inputs=[x], outputs=prediction,
if any(x.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for x in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(x.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for x in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print('ERROR, not able to tell if aesara used the cpu or another device')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
......
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Aesara has a feature that allows the use of multiple GPUs at the same
time in one function. The multi-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that works correctly.
In order to keep a reasonably high level of abstraction you do not
refer to device names directly for multiple-gpu use. You instead
refer to what we call context names. These are then mapped to a
device using the aesara configuration. This allows portability of
models between machines.
.. warning::
The code is rather new and is still considered experimental at this
point. It has been tested and seems to perform correctly in all
cases observed, but make sure to double-check your results before
publishing a paper or anything of the sort.
.. note::
For data-parallelism, you probably are better using `platoon
<https://github.com/mila-udem/platoon>`_.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::
dev0->cuda0;dev1->cuda1
Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Aesara. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects like 'cuda0' or 'opencl0:0'.
.. note::
Since there are a bunch of shell special characters in the syntax,
defining this on the command-line will require proper quoting, like this:
.. code-block:: shell
$ AESARA_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Aesara will print the mappings as they are
defined. This will look like this:
.. code-block:: bash
$ AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import aesara'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea, since Aesara often
assumes that different context names will be on different devices and
will optimize accordingly. So you may get faster performance with a
single name and a single device.
.. note::
It is often the case that multi-gpu operation requires or assumes
that all the GPUs involved are equivalent. This is not the case
for this implementation. Since the user has the task of
distributing the jobs across the different devices, a model can be
built on the assumption that one of the GPUs is slower or has
less memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
which performs two dot products on two different GPUs.
.. code-block:: python
import numpy
import aesara
v01 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev0')
v02 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev0')
v11 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev1')
v12 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
target='dev1')
f = aesara.function([], [aesara.tensor.dot(v01, v02),
aesara.tensor.dot(v11, v12)])
f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the devices are different.
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input which is not on the right device.
However you may want some explicit control over where and how these
transfers are done at some points. This is done by using the new
:meth:`transfer` method that is present on variables. It works for
moving data between GPUs and also between the host and the GPUs. Here
is an example.
.. code-block:: python
import aesara
v = aesara.tensor.fmatrix()
# Move to the device associated with 'gpudev'
gv = v.transfer('gpudev')
# Move back to the cpu
cv = gv.transfer('cpu')
Of course you can mix transfers and operations in any order you
choose. However you should try to minimize transfer operations
because they will introduce overhead that may reduce performance.