Commit f7a506ff authored by Maxim Kochurov, committed by Brandon T. Willard

Remove gpuarray references from documentation

Parent cc584d6c
......@@ -14,11 +14,8 @@ Acknowledgements
* The developers of `Theano <https://github.com/Theano/Theano>`_
* All `Aesara contributors <https://github.com/aesara-devs/aesara/graphs/contributors>`_.
* All Theano users that have given us feedback.
* The GPU implementation of tensordot is based on code from Tijmen
Tieleman's `gnumpy <http://www.cs.toronto.edu/~tijmen/gnumpy.html>`_
* Our random number generator implementation on CPU and GPU uses the MRG31k3p algorithm that is described in:
P. L'Ecuyer and R. Touzin, `Fast Combined Multiple Recursive Generators with Multipliers of the form a = +/- 2^d +/- 2^e <http://www.informs-sim.org/wsc00papers/090.PDF>`_, Proceedings of the 2000 Winter Simulation Conference, Dec. 2000, 683--689.
We were authorized by Pierre L'Ecuyer to copy/modify his Java implementation in the `SSJ <http://www.iro.umontreal.ca/~simardr/ssj/>`_ software and to relicense it under BSD 3-Clauses in Theano.
* A better GPU memory allocator :attr:`CNMeM <config.lib.cnmem>` was included in Theano in the previous GPU back-end. It is still in the history, but not in the current version. It has the same license.
......@@ -6,12 +6,10 @@ Extending Aesara with a C :Class:`Op`
=====================================
This tutorial covers how to extend Aesara with an :class:`Op` that offers a C
implementation. It does not cover :class:`Op`\s that run on a GPU but it does introduce
many elements and concepts which are relevant for GPU :class:`Op`\s. This tutorial is
aimed at individuals who already know how to extend Aesara (see tutorial
:ref:`creating_an_op`) by adding a new :class:`Op` with a Python implementation
and will only cover the additional knowledge required to also produce :class:`Op`\s
with C implementations.
implementation. This tutorial is aimed at individuals who already know how to
extend Aesara (see tutorial :ref:`creating_an_op`) by adding a new :class:`Op`
with a Python implementation and will only cover the additional knowledge
required to also produce :class:`Op`\s with C implementations.
Providing an Aesara :class:`Op` with a C implementation requires interacting with
Python's C-API and NumPy's C-API. Thus, the first step of this tutorial is to
......@@ -927,7 +925,7 @@ discussed below.
further below.
For every input which has a :attr:`dtype` attribute (this means
Tensors, and equivalent types on GPU), the following macros will be
Tensors), the following macros will be
defined unless your `Op` class has an :attr:`Op.check_input` attribute
set to ``False``. In these descriptions, 'i' refers to the position
(indexed from 0) in the input array.
......@@ -1035,8 +1033,6 @@ When debugging C code, it can be useful to use GDB for code compiled
by Aesara.
For this, you must set the Aesara flag `cmodule__remove_gxx_opt=True`.
For the GPU, you must also add the flag `nvcc.flags=-g` (it slows
down computation on the GPU, but it is enabled by default on the CPU).
Then start Python inside GDB and, from within it, start your Python
process:
......
......@@ -824,10 +824,10 @@ will not be accepted.
:class:`NanGuardMode` helps users find where NaNs appear in the graph. But
sometimes we want some variables not to be checked. For example, in
the old GPU back-end, we use a float32 :class:`CudaNdarray` to store the MRG
the old GPU back-end, we used a float32 :class:`CudaNdarray` to store the MRG
random number generator state (they are integers). So if :class:`NanGuardMode`
check it, it will generate false positive. Another case is related to
:class:`[Gpu]AllocEmpty` or some computation on it (like done by :class:`Scan`).
checked it, it would generate a false positive. Another case is related to
:class:`AllocEmpty` or some computations on it (like done by :class:`Scan`).
You can tell :class:`NanGuardMode` not to check a variable with
:attr:`variable.tag.nan_guard_mode_check`. Also, this tag automatically
......
......@@ -114,7 +114,7 @@ prefix. The complete list can be found in the documentation for
Allows specifying a particular compiler. This will force that compiler for
the current compilation block (a particular :class:`Op` or the full
graph). This is used for the GPU code.
graph).
.. method:: c_code_cache_version()
......@@ -527,10 +527,9 @@ You can implement :meth:`COp.c_code` for this :class:`Op`. It is registered as f
In your C code, you should use ``%(iname)s`` and ``%(oname)s`` to represent
the C variable names of the :class:`DeepCopyOp` input and output
respectively. See an example for the type ``GpuArrayType`` (GPU
array) in the file ``aesara/gpuarray/type.py``. The version
parameter is what is returned by :meth:`DeepCopyOp.c_code_cache_version`. By
default, it will recompile the C code for each process.
respectively. The version parameter is what is returned by
:meth:`DeepCopyOp.c_code_cache_version`. By default, it will recompile the C
code for each process.
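The ``%(iname)s``/``%(oname)s`` substitution described above is plain Python %-formatting over the C template. A minimal sketch (the names ``input0``/``output0`` are illustrative, not the identifiers Aesara actually generates):

```python
# Sketch of the %(iname)s / %(oname)s substitution: the C code is a Python
# template string filled in with generated variable names. "input0" and
# "output0" are made-up stand-ins for Aesara's real generated names.
template = "%(oname)s = deep_copy(%(iname)s);"
code = template % {"iname": "input0", "oname": "output0"}
print(code)  # output0 = deep_copy(input0);
```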
:class:`ViewOp`
===============
......
......@@ -829,9 +829,9 @@ Explanations:
* ``Total compile time: 1.131874e+01s`` gives the total time spent inside `aesara.function`.
* ``Number of Apply nodes: 50`` means that after optimization, there are 50 apply nodes in the graph.
* ``Aesara Optimizer time: 1.152431e+00s`` means that we spend 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more stable numerically / work on GPU /...
* ``Aesara Optimizer time: 1.152431e+00s`` means that we spent 1.15s in the ``aesara.function`` phase where we optimize (modify) the graph to make it faster / more numerically stable /...
* ``Aesara validate time: 2.790451e-02s`` means that we spent 2.8e-2s in the *validate* subset of the optimization phase.
* ``Aesara Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in *linker* phase of ``aesara.function``.
* ``Aesara Linker time (includes C code generation/compiling): 7.893991e-02s`` means that we spent 7.9e-2s in *linker* phase of ``aesara.function``.
* ``Import time 1.153541e-02s`` is a subset of the linker time where we import the compiled module.
* ``Time in all call to aesara.grad() 4.732513e-02s`` tells that we spent a total of 4.7e-2s in all calls to ``aesara.grad``. This is outside of the calls to ``aesara.function``.
......
......@@ -337,8 +337,7 @@ computation is carried out. The way optimizations work in Aesara is by
identifying and replacing certain patterns in the graph with other specialized
patterns that produce the same results but are either faster or more
stable. Optimizations can also detect identical subgraphs and ensure that the
same values are not computed twice or reformulate parts of the graph to a GPU
specific version.
same values are not computed twice.
For example, one (simple) optimization that Aesara uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
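As a sketch of how such a rewrite can work, here is a toy simplifier over nested tuples. The expression encoding is invented for illustration; Aesara's real graph objects and rewriter API are much richer:

```python
# Toy algebraic simplifier for the (x*y)/y -> x rewrite described above.
# Expressions are strings (variables) or ("op", left, right) tuples; this
# encoding is illustrative, not Aesara's graph representation.
def simplify(expr):
    if isinstance(expr, str):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)
    if op == "/" and isinstance(a, tuple) and a[0] == "*" and a[2] == b:
        return a[1]  # (x * y) / y  ->  x
    return (op, a, b)

print(simplify(("/", ("*", "x", "y"), "y")))  # x
```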
......
......@@ -20,8 +20,7 @@ Implementing an Aesara scalar Op allows that scalar operation to be reused
by our elemwise operations on tensors. If the scalar operation has C code, the
elemwise implementation will automatically have C code too. This
will enable the fusion of elemwise operations using your new scalar
operation. It can also reuse the GPU elemwise code. It is similar for
reduction operations.
operation. It is similar for reduction operations.
Be careful about some possible problems in the definition of the
``grad`` method, and about dependencies that may not be available. In
......@@ -125,11 +124,7 @@ Random distribution
We have 3 base random number generators. One that wraps NumPy's random
generator, one that implements MRG31k3p and one that wraps CURAND.
The fastest, but less developed, is CURAND. It works only on CUDA-enabled
GPUs. It does not work on the CPU and it has fewer random distributions
implemented.
The recommended and 2nd faster is MRG. It works on the GPU and CPU and
The recommended generator, and the second fastest, is MRG. It works on the CPU and
has more implemented distributions.
The slowest is our wrapper on NumPy's random generator.
......
......@@ -194,12 +194,11 @@ default values.
:noindex:
If filter_inplace is defined, it will be called instead of
filter() This is to allow reusing the old allocated memory. As
of this writing this is used only when we transfer new data to a
shared variable on the gpu.
filter(). This allows reusing the old allocated memory. This was used
only when new data was transferred to a shared variable on a GPU.
``storage`` will be the old value (e.g. the old `ndarray`).
``storage`` will be the old value. i.e. The old numpy array,
CudaNdarray, ...
.. method:: is_valid_value(value)
:noindex:
......
......@@ -6,17 +6,13 @@
Frequently Asked Questions
==========================
Does Aesara support Python 3?
------------------------------
We support both Python 2 >= 2.7 and Python 3 >= 3.4.
Output slight numerical difference
----------------------------------
Sometimes when you compare the output of Aesara using different
Aesara flags, Aesara versions, CPU and GPU or with other software like
NumPy, you will see small numerical differences.
Sometimes when you compare the output of Aesara using different Aesara flags,
Aesara versions, CPU and GPU devices, or with other software like NumPy, you
will see small numerical differences.
This is normal. Floating point numbers are approximations of real
numbers. This is why doing a+(b+c) vs (a+b)+c can give small
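A concrete instance of the associativity point above, runnable with plain Python floats (IEEE-754 doubles):

```python
# Floating-point addition is not associative: regrouping the same three
# values changes the result by roughly one unit in the last place, the
# kind of small difference described above.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)      # False
print(abs(left - right))  # a tiny difference, on the order of 1e-16
```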
......@@ -53,9 +49,9 @@ the flag ``mode=FAST_COMPILE`` which instructs Aesara to skip most
optimizations and disables the generation of any C code. This is useful
for quickly testing a simple idea.
If c/cuda code is necessary, as when using a GPU, the flag
If C code is necessary, the flag
``optimizer=fast_compile`` can be used instead. It instructs Aesara to
skip time consuming optimizations but still generate c/cuda code.
skip time-consuming optimizations but still generate C code.
Similarly using the flag ``optimizer_excluding=inplace`` will speed up
compilation by preventing optimizations that replace operations with a
......@@ -95,11 +91,6 @@ garbage collection will keep all intermediate results' memory space to allow
reusing them during the next call to the same Aesara function, if they are of the
correct shape. The shape could change if the shapes of the inputs change.
.. note::
With :attr:`preallocate <config.gpuarray__preallocate>`, this isn't
very useful with GPU anymore.
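The reuse-by-shape behavior described above can be sketched with a toy allocator that caches one buffer per shape (a stand-in for the idea only, not Aesara's actual memory manager):

```python
# Toy shape-keyed buffer cache illustrating why keeping intermediate
# results lets them be reused across calls when shapes match.
# Not Aesara's real allocator.
from math import prod

_cache = {}

def get_buffer(shape):
    buf = _cache.get(shape)
    if buf is None:                      # first call with this shape: allocate
        buf = bytearray(8 * prod(shape))
        _cache[shape] = buf
    return buf                           # later calls: reuse the same memory

first = get_buffer((3, 4))
second = get_buffer((3, 4))
print(first is second)  # True: same buffer reused
```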
.. _unsafe_optimization:
Unsafe optimization
......@@ -173,11 +164,6 @@ but requires that all nodes in the graph have a C implementation:
f = function([x], (x + 1.) * 2, mode=aesara.compile.mode.Mode(linker='c'))
f(10.)
New GPU backend using libgpuarray
---------------------------------
The new aesara GPU backend (:ref:`gpuarray`) uses ``config.gpuarray__preallocate`` for GPU memory allocation.
Related Projects
----------------
......
......@@ -13,7 +13,4 @@ Supported platforms:
install_windows
install_centos6
Once your setup is complete and if you installed the GPU libraries, head to :ref:`testing_the_gpu` to find how to verify
everything is working properly.
To update your current installation see :ref:`updating`.
......@@ -12,23 +12,14 @@ Stable Installation
With ``conda``
^^^^^^^^^^^^^^
If you use conda, you can directly install both aesara and pygpu. Libgpuarray
will be automatically installed as a dependency of pygpu.
If you use conda, you can directly install aesara.
.. code-block:: bash
conda install aesara pygpu
.. warning::
The Aesara developers do not maintain ``pygpu``, so compatibility isn't
guaranteed.
conda install aesara
With ``pip``
^^^^^^^^^^^^
If you use pip, you have to install Aesara and libgpuarray separately.
aesara
::::::
......@@ -50,16 +41,6 @@ Install the latest stable version of Aesara with:
If you encountered any trouble, head to the :ref:`troubleshooting` page.
libgpuarray
:::::::::::
Download it with::
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
and then follow the `Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
Bleeding-Edge Installation (recommended)
----------------------------------------
......@@ -80,19 +61,6 @@ Install the latest, bleeding-edge, development version of Aesara with:
If you encountered any trouble, head to the :ref:`troubleshooting` page.
libgpuarray
^^^^^^^^^^^
Install the latest, development version of libgpuarray following the
`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
.. note::
Currently, you need ``libgpuarray`` version ``0.7.X`` that is not in conda default channel.
But you can install it with our own channel ``mila-udem`` (that only supports Python 2.7, 3.5 and 3.6)::
conda install -c mila-udem pygpu
Developer Installation
----------------------
......@@ -116,8 +84,3 @@ Install the developer version of Aesara with:
source directory.
If you encountered any trouble, head to the :ref:`troubleshooting` page.
libgpuarray
^^^^^^^^^^^
See instructions for bleeding-edge installation about ``libgpuarray``.
......@@ -17,21 +17,6 @@ details so that we can add alternative instructions.
.. include:: requirements.inc
.. _gpu_macos:
.. attention::
For MacOS you should be able to follow the above instructions to
setup CUDA, but be aware of the following caveats:
* If you want to compile the CUDA SDK code, you may need to temporarily
revert back to Apple's gcc (``sudo port select gcc``) as their Makefiles
are not compatible with MacPort's gcc.
* If CUDA seems unable to find a CUDA-capable GPU, you may need to manually
toggle your GPU on, which can be done with
`gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
.. attention::
Aesara officially supports only clang on OS X. This can be installed
......
......@@ -11,8 +11,6 @@ Ubuntu Installation Instructions
from GitHub, please make sure you are reading `the latest version of this
page <http://deeplearning.net/software/aesara_versions/dev/install_ubuntu.html>`_.
.. _gpu_linux:
.. |PythonDistRecommended| replace:: The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below)
.. |PlatformCompiler| replace:: ``python-dev``, ``g++`` >= 4.2
.. |CompilerName| replace:: ``g++``
......@@ -28,14 +26,13 @@ Prerequisites through System Packages (not recommended)
If you want to acquire the requirements through your system packages
and install them system-wide, follow these instructions:
For Ubuntu 16.04 with cuda 7.5
For Ubuntu 16.04
.. code-block:: bash
sudo apt-get install python-numpy python-scipy python-dev python-pip python-pytest g++ libopenblas-dev git graphviz
sudo pip install Aesara
# cuda 7.5 don't support the default g++ version. Install an supported version and make it the default.
sudo apt-get install g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
......
......@@ -30,26 +30,10 @@ Install requirements and optional packages
* Arguments between <...> are optional.
* ``m2w64-toolchain`` package provides a fully-compatible version of GCC and is therefore highly recommended.
* ``git`` package installs git source control through conda, which is required for the development versions of Aesara and libgpuarray
* ``git`` package installs git source control through conda, which is required for the development version of Aesara
.. _gpu_windows:
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
Install CUDA drivers
^^^^^^^^^^^^^^^^^^^^
Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
You must reboot the computer after the driver installation.
.. Installation of Aesara and libgpuarray.
.. Installation of Aesara.
.. include:: install_generic.inc
:start-after: .. _install_generic:
......@@ -73,7 +57,3 @@ generic guidelines to get a working environment:
path`` option.
3. Enable OpenMP support by checking the option ``openmp support
option``.
* Install CUDA with the same instructions as above.
* Install the latest, development version of libgpuarray following the
`Step-by-step instructions <http://deeplearning.net/software/libgpuarray/installation.html#step-by-step-install>`__.
......@@ -30,12 +30,12 @@
By default, return a copy of the data. If ``borrow=True`` (and
``return_internal_type=False``), it may still return a copy.
For tensor, it will always return a ndarray by default, so if
the data is on the GPU, it will return a copy, but if the data
For tensor, it will always return an `ndarray` by default, so if
the data is on another device, it will return a copy, but if the data
is on the CPU, it will return the original data. If you do
``borrow=True`` and ``return_internal_type=True``, it will
always return the original data, not a copy, but this can be a
GPU object.
always return the original data, not a copy, but this can be a non-`ndarray`
type of object.
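A toy model of the borrow semantics above. ``Container`` is a hypothetical stand-in, not Aesara's shared-variable implementation:

```python
# Minimal sketch of get_value's borrow behavior: borrow=True hands back the
# internal object, borrow=False returns a copy. Container is a made-up
# stand-in for a shared variable.
import copy

class Container:
    def __init__(self, data):
        self._data = data
    def get_value(self, borrow=False):
        return self._data if borrow else copy.deepcopy(self._data)

c = Container([1.0, 2.0])
print(c.get_value(borrow=True) is c._data)  # True: the internal object
print(c.get_value() is c._data)             # False: a fresh copy
```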
.. method:: set_value(self, new_value, borrow=False)
......
......@@ -51,11 +51,11 @@ Environment Variables
.. code-block:: bash
AESARA_FLAGS='floatX=float32,device=cuda0,gpuarray__preallocate=1' python <myscript>.py
AESARA_FLAGS='floatX=float32' python <myscript>.py
If a value is defined several times in ``AESARA_FLAGS``,
the right-most definition is used, so, for instance, if
``AESARA_FLAGS='device=cpu,device=cuda0'`` is set, then ``cuda0`` will be
``AESARA_FLAGS='floatX=float32,floatX=float64'`` is set, then ``float64`` will be
used.
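The right-most-wins rule can be sketched with a tiny parser (illustrative only; Aesara's actual flag handling lives in its config module):

```python
# Illustrative parser for the "right-most definition wins" rule described
# above: later key=value pairs simply overwrite earlier ones in a dict.
def parse_flags(flags):
    out = {}
    for item in flags.split(","):
        if item:
            key, _, value = item.partition("=")
            out[key] = value  # a repeated key keeps only the last value
    return out

print(parse_flags("floatX=float32,floatX=float64"))  # {'floatX': 'float64'}
```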
.. envvar:: AESARARC
......@@ -72,15 +72,11 @@ Environment Variables
floatX = float32
device = cuda0
[gpuarray]
preallocate = 1
Configuration attributes that are available directly in ``config``
(e.g. ``config.device``, ``config.mode``) should be defined in the
``[global]`` section.
Attributes from a subsection of ``config`` (e.g. ``config.gpuarray__preallocate``,
``config.dnn__conv__algo_fwd``) should be defined in their corresponding
section (e.g. ``[gpuarray]``, ``[dnn.conv]``).
(e.g. ``config.mode``) should be defined in the ``[global]`` section.
Attributes from a subsection of ``config``
(e.g. ``config.dnn__conv__algo_fwd``) should be defined in their
corresponding section (e.g. ``[dnn.conv]``).
Multiple configuration files can be specified by separating them with ``':'``
characters (as in ``$PATH``). Multiple configuration files will be merged,
......@@ -105,20 +101,7 @@ import ``aesara`` and print the config variable, as in:
.. attribute:: device
String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
``'opencl0:0'``, ``'opencl0:1'``, ...
Default device for computations. If ``'cuda*``, change the default to try
to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
the OpenCL libraries will be used. To let the driver select the device,
use ``'cuda'`` or ``'opencl'``. If we are not able to use the GPU,
either we fall back on the CPU, or an error is raised, depending
on the :attr:`force_device` flag.
This flag's value cannot be modified during the program execution.
Do not use upper case letters; only lower case, even if NVIDIA uses
capital letters.
String value: ``'cpu'``
.. attribute:: force_device
......@@ -126,29 +109,6 @@ import ``aesara`` and print the config variable, as in:
Default: ``False``
If ``True`` and ``device=gpu*``, Aesara raises an error when it cannot
use the specified :attr:`device`. If ``True`` and ``device=cpu``,
Aesara disables the GPU. If ``False`` and ``device=gpu*``, and when the
specified device cannot be used, Aesara emits a warning and falls back to
the CPU.
This flag's value cannot be modified during the program execution.
.. attribute:: init_gpu_device
String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
``'opencl0:0'``, ``'opencl0:1'``, ...
Initialize the gpu device to use.
When its value is ``'cuda*'`` or ``'opencl*'``, the Aesara
flag :attr:`device` must be ``'cpu'``.
Unlike :attr:`device`, setting this flag to a specific GPU will not
make Aesara attempt to use the device by default. More specifically, it
will **not** move computations, nor shared variables, to the specified GPU.
This flag can be used to run GPU-specific tests on a particular GPU, instead
of the default one.
This flag's value cannot be modified during the program execution.
.. attribute:: print_active_device
......@@ -157,7 +117,7 @@ import ``aesara`` and print the config variable, as in:
Default: ``True``
Print the active device when the GPU device is initialized.
Print the active device when the device is initialized.
.. attribute:: floatX
......@@ -186,10 +146,7 @@ import ``aesara`` and print the config variable, as in:
Default: ``'default'``
If ``more``, sometimes Aesara will select :class:`Op` implementations that
are more "deterministic", but slower. In particular, on the GPU,
Aesara will avoid using ``AtomicAdd``. Sometimes Aesara will still use
non-deterministic implementations, e.g. when there isn't a GPU :class:`Op`
implementation that is deterministic. See the ``dnn.conv.algo*``
are more "deterministic", but slower. See the ``dnn.conv.algo*``
flags for more cases.
.. attribute:: allow_gc
......@@ -207,9 +164,6 @@ import ``aesara`` and print the config variable, as in:
functions with many fast :class:`Op`\s, but it also increases Aesara's memory
usage.
.. note:: If :attr:`config.gpuarray__preallocate` is the default value
or not disabled ``(-1)``, this is not useful anymore on the GPU.
.. attribute:: config.scan__allow_output_prealloc
Bool value, either ``True`` or ``False``
......@@ -429,74 +383,6 @@ import ``aesara`` and print the config variable, as in:
<https://developer.amd.com/amd-cpu-libraries/amd-math-library-libm/>`__
library, which is faster than the standard ``libm``.
.. attribute:: config.gpuarray__preallocate
Float value
Default: 0 (Preallocation of size 0, only cache the allocation)
Controls the preallocation of memory with the gpuarray backend.
This value represents the start size (either in MB or the fraction
of total GPU memory) of the memory pool. If more memory is needed,
Aesara will try to obtain more, but this can cause memory
fragmentation.
A negative value will completely disable the allocation cache.
This can have a severe impact on performance and should not be
used outside of debugging.
* < 0: disabled
* 0 <= N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
* > 1: use this number in megabytes (MB) of memory.
.. note::
This could cause memory fragmentation, so, if you have a memory
error while using the cache, try to allocate more memory at
the start, or disable it.
.. note::
The clipping at 95% can be bypassed by specifying the exact
number of megabytes. If more then 95% are needed, it will try
automatically to get more memory. But this can cause
fragmentation, see note above.
.. attribute:: config.gpuarray__sched
String value: ``'default'``, ``'multi'``, ``'single'``
Default: ``'default'``
Control the stream mode of contexts.
The sched parameter passed for context creation to ``pygpu``. With
CUDA, using ``"multi"`` means using the parameter
``cudaDeviceScheduleBlockingSync``. This is useful to lower the CPU overhead
when waiting for a GPU.
.. attribute:: config.gpuarray__single_stream
Boolean value
Default: ``True``
Control the stream mode of contexts.
If your computations consist of mostly small arrays, using
single-stream will avoid the synchronization overhead and usually
be faster. For larger arrays it does not make a difference yet.
.. attribute:: config.gpuarray__cache_path
Default: ``config.compiledir``/gpuarray_kernels
Directory to cache pre-compiled kernels for the gpuarray backend.
.. attribute:: linker
String value: ``'c|py'``, ``'py'``, ``'c'``, ``'c|py_nogc'``
......
......@@ -13,8 +13,7 @@ import ``aesara.sparse`` to enable it.
The sparse module provides the same functionality as the tensor
module. The difference lies under the covers because sparse matrices
do not store data in a contiguous array. Note that there are no GPU
implementations for sparse matrices in Aesara. The sparse module has
do not store data in a contiguous array. The sparse module has
been used in:
- NLP: Dense linear transformations of sparse vectors.
......
......@@ -29,51 +29,13 @@ The recommended user interface are:
With these new interfaces, Aesara will automatically use the fastest
implementation in many cases. On the CPU, the implementation is a GEMM
based one. On the GPU, there is a GEMM based and :ref:`cuDNN
<libdoc_gpuarray_dnn>` version.
By default on the GPU, if cuDNN is available, it will be used,
otherwise we will fall back to using gemm based version (slower than
cuDNN in most cases and uses more memory). To get an error if cuDNN
can not be used, you can supply the Aesara flag ``dnn.enable=True``.
Either cuDNN and the gemm version can be disabled using the Aesara flags
``optimizer_excluding=conv_dnn`` and ``optimizer_excluding=conv_gemm``,
respectively. If both are disabled, it will raise an error.
For the cuDNN version, there are different algorithms with different
memory/speed trade-offs. Manual selection of the right one is very
difficult as it depends on the shapes and hardware. So it can change
for each layer. An auto-tuning mode exists and can be activated by
those flags: ``dnn__conv__algo_fwd=time_once``,
``dnn__conv__algo_bwd_data=time_once`` and
``dnn__conv__algo_bwd_filter=time_once``. Note, they are good mostly
when the shape do not change.
based one.
This auto-tuning has the inconvenience that the first call is much
slower as it tries and times each implementation it has. So if you
benchmark, it is important that you remove the first call from your
timing.
Also, a meta-optimizer has been implemented for the gpu convolution
implementations to automatically choose the fastest implementation
for each specific convolution in your graph. For each instance, it will
compile and benchmark each applicable implementation and choose the
fastest one. It can be enabled using ``optimizer_including=conv_meta``.
The meta-optimizer can also selectively disable cudnn and gemm version
using the Aesara flag ``metaopt__optimizer_excluding=conv_dnn`` and
``metaopt__optimizer_excluding=conv_gemm`` respectively.
.. note::
Aesara had older user interface like
aesara.tensor.nnet.conv.conv2d. Do not use them anymore. They
will give you slower code and won't allow easy switch between CPU
and GPU computation. They also support less type of convolution.
Implementation Details
======================
......@@ -85,10 +47,6 @@ not need to read it. Aesara will select it for you.
- :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`.
old 2d convolution. DO NOT USE ANYMORE.
- :func:`GpuCorrMM <aesara.gpuarray.blas.GpuCorrMM>`
This is a GPU-only 2d correlation implementation taken from
`caffe's CUDA implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_. It does not flip the kernel.
For each element in a batch, it first creates a
`Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters
......@@ -100,15 +58,8 @@ not need to read it. Aesara will select it for you.
This is a CPU-only 2d correlation implementation taken from
`caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_.
It does not flip the kernel.
- :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library.
- Implemented operators for neural network 3D / video convolution:
- :func:`GpuCorr3dMM <aesara.gpuarray.blas.GpuCorr3dMM>`
This is a GPU-only 3d correlation relying on a Toeplitz matrix
and gemm implementation (see :func:`GpuCorrMM <aesara.sandbox.cuda.blas.GpuCorrMM>`)
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
``(no of channels * filter width * filter height * filter depth, output width * output height * output depth)``.
- :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>`
This is a CPU-only 3d correlation implementation based on
the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`).
......@@ -116,12 +67,6 @@ not need to read it. Aesara will select it for you.
replacement for nnet.conv3d. For convolutions done on CPU,
nnet.conv3d will be replaced by Corr3dMM.
- :func:`dnn_conv3d <aesara.gpuarray.dnn.dnn_conv3d>` GPU-only
3D convolution using NVIDIA's cuDNN library (as :func:`dnn_conv <aesara.gpuarray.dnn.dnn_conv>` but for 3d).
If cuDNN is available, by default, Aesara will replace all nnet.conv3d
operations with dnn_conv.
- :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses the conv2d with data reshaping.
It is faster in some corner cases than conv3d. It flips the kernel.
......
......@@ -14,8 +14,7 @@
.. note::
This interface is the preferred interface. It will be moved
automatically to the GPU.
This interface is the preferred interface.
.. note::
......
......@@ -42,7 +42,6 @@ Optimization o4 o3 o2
========================================================= ============== === === ================= ============= ======
:term:`merge` x x x x x
:term:`constant folding<constant folding>` x x x x x
:term:`GPU transfer` x x x x x
:term:`shape promotion<shape promotion>` x x x
:term:`fill cut<fill cut>` x x x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x x x
......@@ -247,32 +246,10 @@ Optimization o4 o3 o2
This optimization compresses subgraphs of computationally cheap
elementwise operations into a single Op that does the whole job in a
single pass over the inputs (like loop fusion). This is a win when
transfer from main memory to the CPU (or from graphics memory to the
GPU) is a bottleneck.
transfer from main memory to the CPU is a bottleneck.
See :class:`FusionOptimizer`
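A pure-Python stand-in for what fusion buys (the real fused Op is generated C code; this only shows the one-pass versus two-pass memory traffic):

```python
# Toy illustration of elementwise fusion: computing exp(x) + 1 in one pass
# instead of writing exp(x) to a temporary and reading it back.
import math

def unfused(xs):
    tmp = [math.exp(v) for v in xs]         # pass 1: materialize a temporary
    return [t + 1.0 for t in tmp]           # pass 2: read it back

def fused(xs):
    return [math.exp(v) + 1.0 for v in xs]  # one pass, no temporary

xs = [0.0, 1.0, 2.0]
print(unfused(xs) == fused(xs))  # True: same result, less memory traffic
```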
GPU transfer
The current strategy for choosing which expressions to evaluate on the
CPU and which to evaluate on the GPU is a greedy one. There are a
number of Ops ***TODO*** with GPU implementations and whenever we find
a graph copying data from GPU to CPU in order to evaluate an
expression that could have been evaluated on the GPU, we substitute
the GPU version of that Op for the CPU version. Likewise if we are
copying the output of a Op with a GPU implementation to the GPU,
then we substitute the GPU version for the CPU version. In this way, if all goes well,
this procedure will result in a graph with the following form:
1. copy non-shared inputs to GPU
2. carry out most/all computations on the GPU
3. copy output back to CPU
When using a GPU, :func:`shared()` will default to GPU storage for
'float32' ndarray arguments, and these shared variables act as seeds
for the greedy algorithm.
See :func:`aesara.sandbox.cuda.opt.*`.
local_log_softmax
This is a stabilization optimization.
Due to rounding errors, the softmax probability of one value can reach 0.
......
......@@ -9,10 +9,6 @@ Requirements
.. _Python: http://www.python.org/
.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
.. _NVIDIA CUDA drivers and SDK: http://developer.nvidia.com/object/gpucomputing.html
.. _libgpuarray: http://deeplearning.net/software/libgpuarray/installation.html
.. _pycuda: https://mathema.tician.de/software/pycuda/
.. _skcuda: http://scikit-cuda.readthedocs.io/en/latest/
.. _warp-ctc: https://github.com/baidu-research/warp-ctc
Python_ >= 3.7
......@@ -42,20 +38,6 @@ Requirements
`pydot-ng <https://github.com/pydot/pydot-ng>`_
To handle large pictures for gif/images.
`NVIDIA CUDA drivers and SDK`_
**Highly recommended** Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
`libgpuarray`_
Required for GPU/CPU code generation on CUDA and OpenCL devices (see: :ref:`gpuarray`).
`pycuda`_ and `skcuda`_
Required for some extra operations on the GPU like fft and
solvers. We use them to wrap cufft and cusolver. Quick install
``pip install pycuda scikit-cuda``. For cuda 8, the dev
version of skcuda (will be released as 0.5.2) is needed for
cusolver: ``pip install pycuda; pip install
git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda``.
`warp-ctc`_
Required for :ref:`Aesara CTC implementation
<libdoc_tensor_nnet_ctc>`. It is faster than using an
......@@ -84,28 +66,3 @@ Install requirements and optional packages
conda install numpy scipy mkl pytest <sphinx> <pydot-ng>
* Arguments between <...> are optional.
Install and configure the GPU drivers (recommended)
---------------------------------------------------
.. warning::
OpenCL support is still minimal for now.
1. Install CUDA drivers
* Follow `this link <https://developer.nvidia.com/cuda-downloads>`__
to install the CUDA driver and the CUDA Toolkit.
* You must reboot the computer after the driver installation.
* Test that it was loaded correctly after the reboot, executing the
command `nvidia-smi` from the command line.
.. note::
Sanity check: The *bin* subfolder should contain an *nvcc*
program. This folder is called the *cuda root* directory.
2. Fix 'lib' path
* Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a
64-bit OS) to your ``$LD_LIBRARY_PATH`` environment
variable. Example: ``/usr/local/cuda/lib64``
......@@ -54,7 +54,7 @@ if __name__ == '__main__':
pythonpath = os.pathsep.join([throot, pythonpath])
sys.path[0:0] = [throot] # We must not use os.environ.
# Make sure we don't use gpu to compile documentation
# Make sure we don't use other devices to compile documentation
env_th_flags = os.environ.get('AESARA_FLAGS', '')
os.environ['AESARA_FLAGS'] = 'device=cpu,force_device=True'
......
......@@ -59,12 +59,6 @@ where X is far less than Y and Z (i.e. X << Y < Z).
This scenario arises when an operation requires allocation of a large contiguous
block of memory but no blocks of sufficient size are available.
GPUs do not have virtual memory and as such all allocations must be assigned to
a contiguous memory region. CPUs do not have this limitation because of their
support for virtual memory. Multiple allocations on a GPU can result in memory
fragmentation, which can make it more difficult to find contiguous regions
of memory of sufficient size during subsequent memory allocations.
A known example is related to writing data to shared variables. When updating a
shared variable Aesara will allocate new space if the size of the data does not
match the size of the space already assigned to the variable. This can lead to
......@@ -80,9 +74,6 @@ aesara.function returns a float64 when the inputs are float32 and int{32, 64}
Note that using float32 and int{32, 64} together
inside a function will produce a float64 output.
Since the GPU can't compute this kind of output, it would be
preferable not to use those dtypes together.
To help you find where float64 are created, see the
:attr:`warn_float64` Aesara flag.
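The promotion rule at work here can be checked in plain NumPy, whose casting behavior matches this case (a sketch of the effect, not Aesara-specific code):

```python
import numpy as np

a = np.ones(3, dtype=np.float32)
b = np.ones(3, dtype=np.int64)

# float32 combined with int64 is upcast to float64
assert (a + b).dtype == np.float64

# Casting the integer input to float32 first keeps the result float32
assert (a + b.astype(np.float32)).dtype == np.float32
```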
......@@ -120,40 +111,21 @@ All Aesara tests should pass (skipped tests and known failures are normal). If
some test fails on your machine, you are encouraged to tell us what went
wrong in the GitHub issues.
.. warning::
Aesara's test should **NOT** be run with ``device=cuda``
or they will fail. The tests automatically use the gpu, if any, when
needed. If you don't want Aesara to ever use the gpu when running tests,
you can set :attr:`config.device` to ``cpu`` and
:attr:`config.force_device` to ``True``.
.. _slow_or_memory:
Why is my code so slow/uses so much memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are a few things you can easily do to change the trade-off
between speed and memory usage. If nothing is said, this affects both
CPU and GPU memory usage.
Could speed up and lower memory usage:
- :ref:`cuDNN <libdoc_gpuarray_dnn>` The default cuDNN convolution uses less
memory than the Aesara version. But some flags allow it to use more
memory. GPU only.
between speed and memory usage.
Could raise memory usage but speed up computation:
- :attr:`config.gpuarray__preallocate` = 1 # Preallocates the GPU memory
and then manages it in a smart way. Does not raise the memory
usage much, but if you are at the limit of available GPU memory you might
need to specify a lower value. GPU only.
- :attr:`config.allow_gc` =False
- :attr:`config.optimizer_excluding` =low_memory , GPU only for now.
Could lower the memory usage, but raise computation time:
- :attr:`config.scan__allow_gc` = True # Probably not significant slowdown on the GPU if memory cache is not disabled
- :attr:`config.scan__allow_gc` = True
- :attr:`config.scan__allow_output_prealloc` =False
- Use :func:`batch_normalization()
<aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory
......@@ -293,7 +265,7 @@ Aesara/BLAS speed test:
python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py
This will print a table with different versions of BLAS/numbers of
threads on multiple CPUs and GPUs. It will also print some Aesara/NumPy
threads on multiple CPUs. It will also print some Aesara/NumPy
configuration information. Then, it will print the running time of the same
benchmarks for your installation. Try to find a CPU similar to yours in
the table, and check that the single-threaded timings are roughly the same.
......
......@@ -194,33 +194,32 @@ makes it possible to expose Aesara's internal variables without a copy, then it
proceeds as fast as an in-place update.
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
..
When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily.
* Prior to Aesara 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily.
*Solution*: update to a newer version of Aesara.
*Solution*: update to a newer version of Aesara.
* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
*Solution*: upgrade to a recent version of Aesara (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
* It is also worth mentioning that current GPU copying routines
support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
* It is also worth mentioning that current GPU copying routines
support only contiguous memory. So Aesara must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
*Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
*Solution*: make sure that the value
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
(Further information on the current implementation of the GPU version
of ``set_value()`` can be found here: :ref:`libdoc_gpuarray_type`)
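What *C-contiguous* means, and how to produce such a value on the host before assigning it, can be sketched with NumPy:

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
view = a.T                       # a transposed view is not C-contiguous
assert not view.flags['C_CONTIGUOUS']

c = np.ascontiguousarray(view)   # makes an explicit C-contiguous copy
assert c.flags['C_CONTIGUOUS']
assert np.array_equal(c, view)   # same values, different memory layout
```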
.. _borrowfunction:
......
......@@ -329,26 +329,6 @@ Tips:
of type *float64*.
"Why does my GPU function seem to be slow?"
-------------------------------------------
If, when you compile an Aesara function, you do not get the speedup you expect over the
CPU performance of the same code, it is often because some Ops are running
on the CPU instead of the GPU. If that is the case, you can use ``assert_no_cpu_op`` to check whether there
is a CPU Op in your computational graph. ``assert_no_cpu_op`` can take one of the following three
options:
* ``warn``: Raise a warning
* ``pdb``: Stop with a pdb in the computational graph during the compilation
* ``raise``: Raise an error,
if there is a CPU Op in the computational graph.
It is possible to use this mode by providing the flag in AESARA_FLAGS, such as:
``AESARA_FLAGS="float32,device=gpu,assert_no_cpu_op='raise'" python test.py``
Note that this check will not catch all CPU Ops; it might miss
some.
.. _faq_monitormode:
"How do I Step through a Compiled Function?"
......
......@@ -242,9 +242,7 @@ achieve a similar result by returning the new expressions, and working with
them in NumPy as usual. The updates mechanism can be a syntactic convenience,
but it is mainly there for efficiency. Updates to shared variables can
sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix
updates). Also, Aesara has more control over where and how shared variables are
allocated, which is one of the important elements of getting good performance
on the :ref:`GPU<using_gpu>`.
updates).
It may happen that you expressed some formula using a shared variable, but
you do *not* want to use its value. In this case, you can use the
......@@ -375,7 +373,6 @@ distribution. Likewise, ``rv_n`` represents a random stream of 2x2 matrices of
draws from a normal distribution. The distributions that are implemented are
defined as :class:`RandomVariable`\s
in :ref:`basic<libdoc_tensor_random_basic>`. They only work on CPU.
See `Other Implementations`_ for GPU version.
Now let's use these objects. If we call ``f()``, we get random uniform numbers.
......@@ -502,22 +499,6 @@ Other Random Distributions
There are :ref:`other distributions implemented <libdoc_tensor_random_basic>`.
.. _example_other_random:
Other Implementations
---------------------
There is another implementation, based on :ref:`MRG31k3p
<libdoc_rng_mrg>`.
`RandomStream` only works on the CPU; MRG31k3p works on both the CPU and GPU.
.. note::
To use the MRG version easily, just change the import to:
.. code-block:: python
from aesara.sandbox.rng_mrg import MRG_RandomStream as RandomStream
.. _logistic_regression:
......
......@@ -48,8 +48,6 @@ Advanced
.. toctree::
sparse
using_gpu
using_multi_gpu
conv_arithmetic
Advanced configuration and debugging
......
......@@ -17,7 +17,6 @@ Scan
- Advantages of using ``scan`` over *for* loops:
- The number of iterations can be part of the symbolic graph.
- Minimizes GPU transfers (if GPU is involved).
- Computes gradients through sequential steps.
- Slightly faster than using a *for* loop in Python with a compiled Aesara function.
- Can lower the overall memory usage by detecting the actual amount of memory needed.
......
......@@ -83,11 +83,8 @@ Consider the logistic regression:
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print('ERROR, not able to tell if aesara used the cpu or another device')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
......@@ -137,7 +134,7 @@ is controlled by the value of the ``mode`` parameter.
Aesara defines the following modes by name:
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations. So GPU is disabled.
- ``'FAST_COMPILE'``: Apply just a few graph optimizations and only use Python implementations.
- ``'FAST_RUN'``: Apply all optimizations and use C implementations where possible.
- ``'DebugMode'``: Verify the correctness of all optimizations, and compare C and Python
implementations. This mode can take much longer than the other modes, but can identify
......
......@@ -47,11 +47,8 @@ predict = aesara.function(inputs=[x], outputs=prediction,
if any(x.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for x in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(x.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for x in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print('ERROR, not able to tell if aesara used the cpu or another device')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
......
.. _using_gpu:
=============
Using the GPU
=============
For an introductory discussion of *Graphical Processing Units* (GPU)
and their use for intensive parallel computation purposes, see `GPGPU
<http://en.wikipedia.org/wiki/GPGPU>`_.
One of Aesara's design goals is to specify computations at an abstract
level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.
Using the GPU in Aesara is as simple as setting the ``device``
configuration flag to ``device=cuda``. You can optionally target a
specific GPU by specifying its number, e.g.
``device=cuda2``. It is also encouraged to set the floating
point precision to float32 when working on the GPU as that is usually
much faster. For example:
``AESARA_FLAGS='device=cuda,floatX=float32'``. You can also set these
options in the .aesararc file's ``[global]`` section:
.. code-block:: cfg
[global]
device = cuda
floatX = float32
.. note::
* If your computer has multiple GPUs and you use ``device=cuda``,
the driver selects the one to use (usually cuda0).
* You can use the program ``nvidia-smi`` to change this policy.
* By default, when ``device`` indicates preference for GPU computations,
Aesara will fall back to the CPU if there is a problem with the GPU.
You can use the flag ``force_device=True`` to instead raise an error when
Aesara cannot use the GPU.
.. _gpuarray:
GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit (CUDA or OpenCL). Detailed
instructions to accomplish that are provided at
`libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
To install Nvidia's GPU-programming toolchain (CUDA) and configure
Aesara to use it, see the installation instructions for
:ref:`Linux <gpu_linux>`, :ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
While all types of devices are supported if using OpenCL, for the
remainder of this section, whatever compute device you are using will
be referred to as GPU.
.. note::
GpuArray backend uses ``config.gpuarray__preallocate`` for GPU memory
allocation.
.. warning::
The backend was designed to support OpenCL, however current support is
incomplete. A lot of very useful ops still do not support it because they
were ported from the old backend with minimal change.
.. _testing_the_gpu:
Testing Aesara with GPU
~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program
into a file and run it.
Use the Aesara flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
.. testcode::
import aesara
from aesara import function, config, shared, tensor as at
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], at.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, aesara.tensor.elemwise.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`aesara.shared` function to make sure that the
input *x* is stored on the GPU.
.. testoutput::
:hide:
:options: +ELLIPSIS
[Elemwise{exp,no_inplace}(<TensorType(float64, (None,))>)]
Looping 1000 times took ... seconds
Result is ...
Used the cpu
.. code-block:: none
$ AESARA_FLAGS=device=cpu python gpu_tutorial1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, (None,))>)]
Looping 1000 times took 2.271284 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial1.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.697514 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the device some code runs on by only replacing
the value of the ``device`` flag without touching the code.
If you don't mind a loss of flexibility, you can ask Aesara to return
the GPU object directly. The following code is modified to do just that.
.. testcode::
import aesara
from aesara import function, config, shared, tensor as at
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], at.exp(x).transfer(None))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, aesara.tensor.elemwise.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
Here ``at.exp(x).transfer(None)`` means "copy ``exp(x)`` to the GPU",
with ``None`` the default GPU context when not explicitly given.
For information on how to set GPU contexts, see :ref:`tut_using_multi_gpu`.
The output is
.. testoutput::
:hide:
:options: +ELLIPSIS, +SKIP
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
.. code-block:: none
$ AESARA_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
While the time per call appears to be much lower than the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is in part due to the asynchronous nature
of execution on GPUs, meaning that the work isn't completed yet, just
'launched'. We'll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using usual numpy casting such as ``numpy.asarray()``.
For even more speed, you can play with the ``borrow`` flag. See
:ref:`borrowfunction`.
What Can be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will of course vary from device to
device, and also as we refine our implementation:
* In general, matrix multiplication, convolution, and large element-wise
operations can be accelerated a lot (5-50x) when arguments are large enough
to keep 30 processors busy.
* Indexing, dimension-shuffling and constant-time reshaping will be
equally fast on GPU as on CPU.
* Summation over rows/columns of tensors can be a little slower on the
GPU than on the CPU.
* Copying of large quantities of data to and from a device is relatively slow,
and often cancels most of the advantage of one or two accelerated functions
on that data. Getting GPU performance largely hinges on making data transfer
to the device pay off.
The backend supports all regular Aesara data types (float32, float64,
int, ...); however, GPU support varies, and some units can't deal with
double (float64) or small (less than 32 bits, like int16) data types.
You will get an error at compile time or runtime if this is the case.
By default all inputs will get transferred to GPU. You can prevent an
input from getting transferred by setting its ``tag.target`` attribute to
'cpu'.
Complex support is untested and most likely completely broken.
Tips for Improving Performance on GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Consider adding ``floatX=float32`` (or the type you are using) to your
``.aesararc`` file if you plan to do a lot of GPU work.
* The GPU backend supports *float64* variables, but they are still slower
to compute than *float32*. The more *float32*, the better GPU performance
you will get.
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` (which
follow the type set in ``floatX``) to ``dmatrix``, ``dvector`` and
``dscalar``. The latter enforce double precision (*float64* on most
machines), which slows down GPU computations on current hardware.
* Minimize transfers to the GPU device by using ``shared`` variables
to store frequently-accessed data (see :func:`shared()<shared.shared>`).
When using the GPU, tensor ``shared`` variables are stored on
the GPU by default to eliminate transfer time for GPU ops using those
variables.
* If you aren't happy with the performance you see, try running your
script with ``profile=True`` flag. This should print some timing
information at program termination. Is time being used sensibly? If
an op or Apply is taking more time than its share, then if you know
something about GPU programming, have a look at how it's implemented
in aesara.gpuarray. Check the line similar to *Spent Xs(X%) in cpu
op, Xs(X%) in gpu op and Xs(X%) in transfer op*. This can tell you
if not enough of your graph is on the GPU or if there is too much
memory transfer.
* To investigate whether all the Ops in the computational graph are
running on the GPU, you can check your code by setting the
`assert_no_cpu_op` flag to `warn` (emit a warning), `raise` (raise an
error), or `pdb` (put a breakpoint in the computational graph) if
there is a CPU Op.
.. _gpu_async:
GPU Async Capabilities
~~~~~~~~~~~~~~~~~~~~~~
By default, all operations on the GPU are run asynchronously. This
means that they are only scheduled to run and the function returns.
This is handled somewhat transparently by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory
transfers between device and host.
It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the value of a ``shared`` variable, e.g. to provide new data
to processes, use ``shared_variable.set_value(new_value)``. For a lot
more detail about this, see :ref:`aliasing`.
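The in-place idea behind ``set_value`` with a matching shape can be illustrated with plain NumPy (an analogy, not Aesara's actual implementation): assigning through ``arr[...]`` fills the existing buffer instead of allocating a new one.

```python
import numpy as np

w = np.zeros((3, 3), dtype=np.float32)
addr_before = w.__array_interface__['data'][0]

w[...] = np.ones((3, 3), dtype=np.float32)  # in-place fill: same buffer
addr_after = w.__array_interface__['data'][0]

assert addr_before == addr_after  # no reallocation happened
assert w.sum() == 9.0
```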
Exercise
~~~~~~~~
Consider again the logistic regression:
.. testcode::
import numpy
import aesara
import aesara.tensor as at
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(aesara.config.floatX),
rng.randint(size=N,low=0, high=2).astype(aesara.config.floatX))
training_steps = 10000
# Declare Aesara symbolic variables
x = at.matrix("x")
y = at.vector("y")
w = aesara.shared(rng.randn(feats).astype(aesara.config.floatX), name="w")
b = aesara.shared(numpy.asarray(0., dtype=aesara.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# Construct Aesara expression graph
p_1 = 1 / (1 + at.exp(-at.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*at.log(p_1) - (1-y)*at.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = at.grad(cost, [w,b])
# Compile expressions to functions
train = aesara.function(
inputs=[x,y],
outputs=[prediction, xent],
updates=[(w, w-0.01*gw), (b, b-0.01*gb)],
name = "train")
predict = aesara.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train(D[0], D[1])
print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
print("floatX=", aesara.config.floatX)
print("device=", aesara.config.device)
.. testoutput::
:hide:
:options: +ELLIPSIS
Used the cpu
target values for D
...
prediction on D
...
Modify and execute this example to run on GPU with ``floatX=float32``
and time it using the command line ``time python file.py``. (Of
course, you may use some of your answer to the exercise in section
:ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put
your ideas to test.
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Leaving aside Aesara, which is a meta-programmer, there are:
* **CUDA**: GPU programming API by NVIDIA based on extension to C (CUDA C)
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) are maturing.
* **OpenCL**: multi-vendor version of CUDA
* More general, standardized.
* Fewer libraries, lesser spread.
* **PyCUDA**: Python bindings to the CUDA driver interface, allowing access to Nvidia's CUDA parallel
computation API from Python
* Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (``pycuda.driver.SourceModule``).
GPU memory buffer (``pycuda.gpuarray.GPUArray``).
Helpful documentation.
* Completeness: Binding to all of CUDA's driver API.
* Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
* Speed: PyCUDA's base layer is written in C++.
* Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').
Makes it much easier to write correct, leak- and crash-free code.
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory
allocated in it is also freed).
(This is adapted from PyCUDA's `documentation <http://documen.tician.de/pycuda/index.html>`_
and Andreas Kloeckner's `website <http://mathema.tician.de/software/pycuda>`_ on PyCUDA.)
* **PyOpenCL**: PyCUDA for OpenCL
Learning to Program with PyCUDA
-------------------------------
If you already enjoy a good proficiency with the C programming language, you
may easily leverage your knowledge by learning, first, to program a GPU with the
CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA
API with a Python wrapper.
The following resources will assist you in this learning process:
* **CUDA API and CUDA C: Introductory**
* `NVIDIA's slides <http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf>`_
* `Stein's (NYU) slides <http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf>`_
* **CUDA API and CUDA C: Advanced**
* `MIT IAP2009 CUDA <https://sites.google.com/site/cudaiap2009/home>`_
(full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
* `Course U. of Illinois <http://courses.engr.illinois.edu/ece498/al/index.html>`_
(full lectures, Kirk-Hwu textbook)
* `NVIDIA's knowledge base <http://www.nvidia.com/content/cuda/cuda-developer-resources.html>`_
(extensive coverage, levels from introductory to advanced)
* `practical issues <http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s>`_
(on the relationship between grids, blocks and threads; see also linked and related issues on same page)
* `CUDA optimization <http://www.gris.informatik.tu-darmstadt.de/cuda-workshop/slides.html>`_
* **PyCUDA: Introductory**
* `Kloeckner's slides <http://www.gputechconf.com/gtcnew/on-demand-gtc.php?sessionTopic=&searchByKeyword=kloeckner&submit=&select=+&sessionEvent=2&sessionYear=2010&sessionFormat=3>`_
* `Kloeckner' website <http://mathema.tician.de/software/pycuda>`_
* **PYCUDA: Advanced**
* `PyCUDA documentation website <http://documen.tician.de/pycuda/>`_
The following examples give a foretaste of programming a GPU with PyCUDA. Once
you feel competent enough, you may try yourself on the corresponding exercises.
**Example: PyCUDA**
.. code-block:: python
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print(dest)
Exercise
~~~~~~~~
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
.. _pyCUDA_aesara:
**Example: Aesara + PyCUDA**
.. code-block:: python
import numpy, aesara
import aesara.misc.pycuda_init
from pycuda.compiler import SourceModule
import aesara.sandbox.cuda as cuda
from aesara.graph.basic import Apply
from aesara.graph.op import Op
class PyCUDADoubleOp(Op):
__props__ = ()
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2, impl):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [storage_map[v] for v in node.inputs]
outputs = [storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape != inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512, 1, 1), grid=grid)
return thunk
Use this code to test it:

>>> x = aesara.tensor.type.fmatrix()
>>> f = aesara.function([x], PyCUDADoubleOp()(x))  # doctest: +SKIP
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv * 2)  # doctest: +SKIP
>>> print(numpy.asarray(f(xv)))  # doctest: +SKIP
Exercise
~~~~~~~~
Run the preceding example.
Modify and execute to multiply two matrices: *x* * *y*.
Modify and execute to return two outputs: *x + y* and *x - y*.
(Notice that Aesara's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. to avoid constraining the input to be *C-contiguous*).
Note
----

* See :ref:`example_other_random` to learn how to handle random numbers
  on the GPU.
* The mode ``FAST_COMPILE`` disables C code, and therefore also disables the
  GPU. You can use the Aesara flag ``optimizer='fast_compile'`` to speed up
  compilation while keeping the GPU enabled.
#!/usr/bin/env python
# Aesara tutorial
# Solution to Exercise in section 'Using the GPU'
# 1. Raw results
import numpy as np
import aesara
import aesara.tensor as at
aesara.config.floatX = 'float32'
rng = np.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(aesara.config.floatX),
rng.randint(size=N, low=0, high=2).astype(aesara.config.floatX))
training_steps = 10000
# Declare Aesara symbolic variables
x = aesara.shared(D[0], name="x")
y = aesara.shared(D[1], name="y")
w = aesara.shared(rng.randn(feats).astype(aesara.config.floatX), name="w")
b = aesara.shared(np.asarray(0., dtype=aesara.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# print("Initial model:")
# print(w.get_value(), b.get_value())
# Construct Aesara expression graph
p_1 = 1 / (1 + at.exp(-at.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * at.log(p_1) - (1 - y) * at.log(1 - p_1) # Cross-entropy
cost = at.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = at.grad(cost, [w, b])
# Compile expressions to functions
train = aesara.function(
inputs=[],
outputs=[prediction, xent],
updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
name="train")
predict = aesara.function(inputs=[], outputs=prediction,
name="predict")
if any(n.op.__class__.__name__ in ('Gemv', 'CGemv', 'Gemm', 'CGemm') for n in
train.maker.fgraph.toposort()):
print('Used the cpu')
elif any(n.op.__class__.__name__ in ('GpuGemm', 'GpuGemv') for n in
train.maker.fgraph.toposort()):
print('Used the gpu')
else:
print('ERROR, not able to tell if aesara used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train()
# print("Final model:")
# print(w.get_value(), b.get_value())
print("target values for D")
print(D[1])
print("prediction on D")
print(predict())
"""
# 2. Profiling
# 2.1 Profiling for CPU computations
# In your terminal, type:
$ AESARA_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D
# Followed by the output of profiling.. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:
Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 1.300452e+00s
Time in Function.fn.__call__: 1.215823e+00s (93.492%)
Time in thunks: 1.157602e+00s (89.015%)
Total compile time: 8.922548e-01s
Number of Apply nodes: 17
Aesara Optimizer time: 6.270301e-01s
Aesara validate time: 5.993605e-03s
Aesara Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
Import time 3.543139e-03s
Time in all call to aesara.grad() 1.848292e-02s
Time since aesara import 2.864s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 aesara.tensor.blas_c.CGemv
33.1% 97.7% 0.384s 4.79e-06s C 80001 9 aesara.tensor.elemwise.Elemwise
1.0% 98.6% 0.011s 1.14e-06s C 10000 1 aesara.tensor.elemwise.Sum
0.7% 99.4% 0.009s 2.85e-07s C 30001 4 aesara.tensor.elemwise.DimShuffle
0.3% 99.7% 0.004s 3.64e-07s C 10001 2 aesara.tensor.basic.AllocEmpty
0.3% 100.0% 0.004s 1.78e-07s C 20001 3 aesara.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 CGemv{inplace}
18.7% 83.2% 0.217s 2.17e-05s C 10000 1 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
8.9% 92.1% 0.103s 1.03e-05s C 10000 1 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
4.3% 96.4% 0.050s 4.98e-06s C 10000 1 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
1.0% 97.4% 0.011s 1.14e-06s C 10000 1 Sum{acc_dtype=float64}
0.5% 97.9% 0.006s 2.83e-07s C 20001 3 InplaceDimShuffle{x}
0.4% 98.3% 0.004s 4.22e-07s C 10000 1 Elemwise{sub,no_inplace}
0.3% 98.6% 0.004s 3.70e-07s C 10000 1 Elemwise{neg,no_inplace}
0.3% 98.9% 0.004s 3.64e-07s C 10001 2 AllocEmpty{dtype='float32'}
0.3% 99.2% 0.004s 1.78e-07s C 20001 3 Shape_i{0}
0.2% 99.5% 0.003s 2.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.2% 99.7% 0.003s 2.65e-07s C 10000 1 Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.2% 99.9% 0.002s 1.98e-07s C 10000 1 Elemwise{Cast{float32}}
0.1% 100.0% 0.002s 1.54e-07s C 10000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 100.0% 0.000s 4.77e-06s C 1 1 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
34.0% 34.0% 0.394s 3.94e-05s 10000 7 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
30.5% 64.5% 0.353s 3.53e-05s 10000 15 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
18.7% 83.2% 0.217s 2.17e-05s 10000 12 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
8.9% 92.1% 0.103s 1.03e-05s 10000 13 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
4.3% 96.4% 0.050s 4.98e-06s 10000 11 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
1.0% 97.4% 0.011s 1.14e-06s 10000 14 Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
0.4% 97.8% 0.004s 4.22e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.3% 98.1% 0.004s 3.76e-07s 10000 0 InplaceDimShuffle{x}(b)
0.3% 98.4% 0.004s 3.70e-07s 10000 10 Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.3% 98.7% 0.004s 3.64e-07s 10000 5 AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.2% 99.0% 0.003s 2.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.2% 99.2% 0.003s 2.65e-07s 10000 9 Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.2% 99.4% 0.002s 2.21e-07s 10000 1 Shape_i{0}(x)
0.2% 99.6% 0.002s 1.98e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.2% 99.7% 0.002s 1.90e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.1% 99.9% 0.002s 1.54e-07s 10000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.1% 100.0% 0.001s 1.34e-07s 10000 3 Shape_i{0}(y)
0.0% 100.0% 0.000s 3.89e-05s 1 3 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
0.0% 100.0% 0.000s 4.77e-06s 1 4 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.000s 1.19e-06s 1 0 InplaceDimShuffle{x}(b)
... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
# 2.2 Profiling for GPU computations
# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 AESARA_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D
Results were produced using a GeForce GTX TITAN X
# Profiling summary for all functions:
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 4.181247e+00s
Time in Function.fn.__call__: 4.081113e+00s (97.605%)
Time in thunks: 3.915566e+00s (93.646%)
Total compile time: 9.256095e+00s
Number of Apply nodes: 21
Aesara Optimizer time: 9.996419e-01s
Aesara validate time: 6.523132e-03s
Aesara Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
Import time 4.228115e-03s
Time in all call to aesara.grad() 3.286195e-02s
Time since aesara import 15.415s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 aesara.sandbox.gpuarray.blas.GpuGemv
29.8% 89.3% 1.166s 1.30e-05s C 90001 10 aesara.sandbox.gpuarray.elemwise.GpuElemwise
4.1% 93.4% 0.162s 8.10e-06s C 20001 3 aesara.sandbox.gpuarray.basic_ops.HostFromGpu
3.3% 96.7% 0.131s 1.31e-05s C 10000 1 aesara.sandbox.gpuarray.elemwise.GpuCAReduceCuda
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 aesara.sandbox.gpuarray.basic_ops.GpuFromHost
0.8% 99.1% 0.033s 1.09e-06s C 30001 4 aesara.sandbox.gpuarray.elemwise.GpuDimShuffle
0.7% 99.8% 0.026s 2.59e-06s C 10001 2 aesara.sandbox.gpuarray.basic_ops.GpuAllocEmpty
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 aesara.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 GpuGemv{inplace=True}
4.1% 63.6% 0.162s 8.10e-06s C 20001 3 HostFromGpu(gpuarray)
4.0% 67.6% 0.157s 1.57e-05s C 10000 1 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
3.8% 71.4% 0.149s 1.49e-05s C 10000 1 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
3.7% 75.1% 0.144s 1.44e-05s C 10000 1 GpuElemwise{sub,no_inplace}
3.6% 78.7% 0.141s 1.41e-05s C 10000 1 GpuElemwise{gt,no_inplace}
3.4% 82.1% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Cast{float32}}[]<gpuarray>
3.4% 85.5% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
3.3% 88.8% 0.131s 1.31e-05s C 10000 1 GpuCAReduceCuda{add}
2.9% 91.7% 0.112s 1.12e-05s C 10000 1 GpuElemwise{neg,no_inplace}
2.6% 94.3% 0.102s 1.02e-05s C 10000 1 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
2.5% 96.7% 0.096s 9.63e-06s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 GpuFromHost<None>
0.7% 99.0% 0.026s 2.59e-06s C 10001 2 GpuAllocEmpty{dtype='float32', context_name=None}
0.5% 99.5% 0.021s 1.06e-06s C 20001 3 InplaceGpuDimShuffle{x}
0.3% 99.8% 0.011s 1.14e-06s C 10000 1 InplaceGpuDimShuffle{1,0}
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 Shape_i{0}
0.0% 100.0% 0.000s 2.00e-05s C 1 1 GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
55.0% 55.0% 2.154s 2.15e-04s 10000 7 GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
4.5% 59.5% 0.176s 1.76e-05s 10000 18 GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
4.0% 63.5% 0.157s 1.57e-05s 10000 12 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
3.8% 67.3% 0.149s 1.49e-05s 10000 15 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
3.7% 71.0% 0.144s 1.44e-05s 10000 4 GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
3.6% 74.6% 0.141s 1.41e-05s 10000 16 GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
3.4% 78.0% 0.133s 1.33e-05s 10000 10 GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
3.4% 81.4% 0.133s 1.33e-05s 10000 9 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
3.3% 84.7% 0.131s 1.31e-05s 10000 17 GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
2.9% 87.5% 0.112s 1.12e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
2.6% 90.1% 0.102s 1.02e-05s 10000 20 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
2.5% 92.6% 0.096s 9.63e-06s 10000 13 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
2.3% 94.9% 0.090s 9.04e-06s 10000 19 HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
1.8% 96.7% 0.072s 7.16e-06s 10000 14 HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
1.6% 98.3% 0.061s 6.10e-06s 10000 6 GpuFromHost<None>(Shape_i{0}.0)
0.7% 99.0% 0.026s 2.59e-06s 10000 5 GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
0.3% 99.3% 0.013s 1.33e-06s 10000 0 InplaceGpuDimShuffle{x}(b)
0.3% 99.6% 0.011s 1.14e-06s 10000 2 InplaceGpuDimShuffle{1,0}(x)
0.2% 99.8% 0.008s 7.94e-07s 10000 8 InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
0.1% 99.9% 0.005s 5.27e-07s 10000 1 Shape_i{0}(x)
... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
# 3. Conclusions
Examine and compare the 'Ops' summaries for the CPU and the GPU. The transfer
ops 'GpuFromHost' and 'HostFromGpu' usually consume a large amount of extra time
by themselves; by making as few data transfers as possible between the GPU and the CPU, you can minimize their overhead.
Notice that each GPU op consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.
"""
.. _tut_using_multi_gpu:
===================
Using multiple GPUs
===================
Aesara has a feature that allows the use of multiple GPUs at the same
time in one function. The multi-GPU feature requires the use of
the :ref:`gpuarray` backend, so make sure that works correctly.
In order to keep a reasonably high level of abstraction, you do not
refer to device names directly for multi-GPU use. Instead, you
refer to what we call context names. These are then mapped to
devices using the Aesara configuration, which allows models to be
ported between machines.
.. warning::

    The code is rather new and is still considered experimental at this
    point. It has been tested and seems to perform correctly in all
    cases observed, but make sure to double-check your results before
    publishing a paper or anything of the sort.

.. note::

    For data-parallelism, you are probably better off using `platoon
    <https://github.com/mila-udem/platoon>`_.
Defining the context map
------------------------
The mapping from context names to devices is done through the
:attr:`config.contexts` option. The format looks like this::

    dev0->cuda0;dev1->cuda1
Let's break it down. First there is a list of mappings. Each of
these mappings is separated by a semicolon ';'. There can be any
number of such mappings, but in the example above we have two of them:
`dev0->cuda0` and `dev1->cuda1`.
The mappings themselves are composed of a context name followed by the
two characters '->' and the device name. The context name is a simple
string which does not have any special meaning for Aesara. For
parsing reasons, the context name cannot contain the sequence '->' or
';'. To avoid confusion, context names that begin with 'cuda' or
'opencl' are disallowed. The device name is a device in the form that
gpuarray expects, like 'cuda0' or 'opencl0:0'.
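The grammar above is simple enough to decompose by hand. The following is an illustrative sketch of that decomposition in plain Python — it is not Aesara's actual parser, and the error messages are invented for the example:

```python
def parse_contexts(spec):
    """Split a 'name->device;name->device' string into a dict (sketch only)."""
    mapping = {}
    for entry in spec.split(';'):
        # partition splits on the first '->' only.
        name, arrow, device = entry.partition('->')
        if not arrow or not name or not device:
            raise ValueError("malformed mapping: %r" % entry)
        # Context names starting with 'cuda' or 'opencl' are disallowed.
        if name.startswith(('cuda', 'opencl')):
            raise ValueError("context name %r shadows a device name" % name)
        mapping[name] = device
    return mapping

assert parse_contexts("dev0->cuda0;dev1->cuda1") == {
    "dev0": "cuda0", "dev1": "cuda1"}
```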
.. note::
Since there are a bunch of shell special characters in the syntax,
defining this on the command-line will require proper quoting, like this:
.. code-block:: shell

    $ AESARA_FLAGS="contexts=dev0->cuda0"
When you define a context map, if :attr:`config.print_active_device`
is `True` (the default), Aesara will print the mappings as they are
defined. The output looks like this:

.. code-block:: bash

    $ AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import aesara'
    Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
    Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)
If you don't have enough GPUs for a certain model, you can assign the
same device to more than one name. You can also assign extra names
that a model doesn't need to some other devices. However, a
proliferation of names is not always a good idea: Aesara often
assumes that different context names will be on different devices and
will optimize accordingly, so you may get better performance from a
single name and a single device.
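For example, on a machine with a single GPU, a hypothetical invocation mapping both names to the same device could look like this (``my_model.py`` is a placeholder for your own script; the model still runs, just without the parallel speed-up):

```shell
$ AESARA_FLAGS="contexts=dev0->cuda0;dev1->cuda0" python my_model.py
```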
.. note::

    It is often the case that multi-GPU operation requires or assumes
    that all the GPUs involved are equivalent. This is not the case
    for this implementation. Since the user has the task of
    distributing the jobs across the different devices, a model can be
    built on the assumption that one of the GPUs is slower or has
    less memory.
A simple graph on two GPUs
--------------------------
The following simple program works on two GPUs. It builds a function
which performs two dot products on two different GPUs.
.. code-block:: python

    import numpy
    import aesara

    v01 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v02 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev0')
    v11 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')
    v12 = aesara.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev1')

    f = aesara.function([], [aesara.tensor.dot(v01, v02),
                             aesara.tensor.dot(v11, v12)])

    f()
This model requires a context map with assignments for 'dev0' and
'dev1'. It should run twice as fast when the two names map to different devices.
Explicit transfers of data
--------------------------
Since operations themselves cannot work on more than one device, they
will pick a device to work on based on their inputs and automatically
insert transfers for any input which is not on the right device.
However, you may sometimes want explicit control over where and how these
transfers are done. This is done using the :meth:`transfer` method
that is present on variables. It works for moving data between GPUs
as well as between the host and the GPUs. Here is an example.
.. code-block:: python

    import aesara

    v = aesara.tensor.fmatrix()

    # Move to the device associated with 'gpudev'
    gv = v.transfer('gpudev')

    # Move back to the cpu
    cv = gv.transfer('cpu')
Of course, you can mix transfers and operations in any order you
choose. However, you should try to minimize transfer operations
because they introduce overhead that may reduce performance.