Commit bd544674, authored by slefrancois

Update doc with instructions for using new gpu backend

Parent 319382b5
...@@ -37,3 +37,4 @@ Theano.suo ...@@ -37,3 +37,4 @@ Theano.suo
.ipynb_checkpoints .ipynb_checkpoints
.pydevproject .pydevproject
.ropeproject .ropeproject
core
\ No newline at end of file
...@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this: ...@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
Testing GPU Ops Testing GPU Ops
^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
Ops to be executed on the GPU should inherit from the When using the old GPU backend, Ops to be executed on the GPU should inherit
``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
Theano to distinguish them. Currently, we use this to test if the Theano to distinguish them. Currently, we use this to test if the
NVIDIA driver works correctly with our sum reduction code on the GPU. NVIDIA driver works correctly with our sum reduction code on the GPU.
......
...@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add ...@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add
If you want GPU-related tests to run on a specific GPU device, and not If you want GPU-related tests to run on a specific GPU device, and not
the default one, you should use :attr:`~config.init_gpu_device`. the default one, you should use :attr:`~config.init_gpu_device`.
For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=gpu1``. For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=cuda1``.
See :ref:`libdoc_config` for more information on how to change these See :ref:`libdoc_config` for more information on how to change these
configuration options. configuration options.
...@@ -508,25 +508,25 @@ Any one of them is enough. ...@@ -508,25 +508,25 @@ Any one of them is enough.
:ref:`Ubuntu instructions <install_ubuntu_gpu>`. :ref:`Ubuntu instructions <install_ubuntu_gpu>`.
Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
computer, and set the default floating point computations to float32. computer, and set the default floating point computations to float32.
For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'``. For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
You can also set these options in the .theanorc file's ``[global]`` section: You can also set these options in the .theanorc file's ``[global]`` section:
.. code-block:: cfg .. code-block:: cfg
[global] [global]
device = gpu device = cuda
floatX = float32 floatX = float32
Note that: Note that:
* If your computer has multiple GPUs and you use 'device=gpu', the driver * If your computer has multiple GPUs and you use 'device=cuda', the driver
selects the one to use (usually gpu0). selects the one to use (usually gpu0).
* You can use the program nvidia-smi to change this policy. * You can use the program nvidia-smi to change this policy.
* You can choose one specific GPU by specifying 'device=gpuX', with X the * You can choose one specific GPU by specifying 'device=cudaX', with X the
corresponding GPU index (0, 1, 2, ...) corresponding GPU index (0, 1, 2, ...)
* By default, when ``device`` indicates preference for GPU computations, * By default, when ``device`` indicates preference for GPU computations,
Theano will fall back to the CPU if there is a problem with the GPU. Theano will fall back to the CPU if there is a problem with the GPU.
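As a rough illustration of the flag syntax described above, the following sketch assembles a ``THEANO_FLAGS`` value from a dictionary and places it in the environment. The helper name ``build_theano_flags`` is hypothetical and not part of Theano; only the ``key=value,key=value`` format and the ``device`` and ``floatX`` options are taken from the text. The variable has to be set before ``theano`` is imported, since the flags are read at import time.

```python
import os


def build_theano_flags(options):
    """Join option names and values into a THEANO_FLAGS string.

    Hypothetical helper for illustration only; not part of Theano.
    Keys are sorted so the result does not depend on dict ordering.
    """
    return ",".join("%s=%s" % (key, value)
                    for key, value in sorted(options.items()))


# Select the second GPU with the new backend and compute in float32.
flags = build_theano_flags({"device": "cuda1", "floatX": "float32"})
os.environ["THEANO_FLAGS"] = flags
print(flags)  # device=cuda1,floatX=float32
```

Setting the variable from Python is only a convenience; exporting it in the shell, or using ``.theanorc``, has the same effect.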
...@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats: ...@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats:
toggle your GPU on, which can be done with toggle your GPU on, which can be done with
`gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__. `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
Once your setup is complete, head to :ref:`using_gpu` to find how to verify Once your setup is complete, head to :ref:`using_gpu` to find how to verify
everything is working properly. everything is working properly.
......
...@@ -303,7 +303,7 @@ Test GPU configuration ...@@ -303,7 +303,7 @@ Test GPU configuration
.. code-block:: bash .. code-block:: bash
THEANO_FLAGS=floatX=float32,device=gpu python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py THEANO_FLAGS=floatX=float32,device=cuda python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
.. note:: .. note::
......
...@@ -445,6 +445,8 @@ routine for matrix multiplication) ...@@ -445,6 +445,8 @@ routine for matrix multiplication)
Configure Theano for GPU use Configure Theano for GPU use
############################ ############################
Install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_ if you have not already done so.
Theano can be configured with a ``.theanorc`` text file (or Theano can be configured with a ``.theanorc`` text file (or
``.theanorc.txt``, whichever is easier for you to create under ``.theanorc.txt``, whichever is easier for you to create under
Windows). It should be placed in the directory pointed to by the Windows). It should be placed in the directory pointed to by the
...@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file: ...@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file:
.. code-block:: cfg .. code-block:: cfg
[global] [global]
device = gpu device = cuda
floatX = float32 floatX = float32
[nvcc] [nvcc]
......
...@@ -32,6 +32,7 @@ Optimization FAST_RUN FAST_COMPILE ...@@ -32,6 +32,7 @@ Optimization FAST_RUN FAST_COMPILE
========================================================= ========= ============ ============= ========================================================= ========= ============ =============
:term:`merge` x x :term:`merge` x x
:term:`constant folding<constant folding>` x x :term:`constant folding<constant folding>` x x
:term:`GPU transfer` x x
:term:`shape promotion<shape promotion>` x :term:`shape promotion<shape promotion>` x
:term:`fill cut<fill cut>` x :term:`fill cut<fill cut>` x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x :term:`inc_subtensor srlz.<inc_subtensor serialization>` x
...@@ -52,7 +53,6 @@ Optimization FAST_RUN FAST_COMPILE ...@@ -52,7 +53,6 @@ Optimization FAST_RUN FAST_COMPILE
:term:`inplace_elemwise` x :term:`inplace_elemwise` x
:term:`inplace_random` x :term:`inplace_random` x
:term:`elemwise fusion` x :term:`elemwise fusion` x
:term:`GPU transfer` x
:term:`local_log_softmax` x x :term:`local_log_softmax` x x
:term:`local_remove_all_assert` :term:`local_remove_all_assert`
========================================================= ========= ============ ============= ========================================================= ========= ============ =============
......
...@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to ...@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
hints that give more flexibility to the compilation and optimization of the hints that give more flexibility to the compilation and optimization of the
graph. graph.
For GPU graphs, this borrowing can have a major speed impact. See the following code:
.. code-block:: python
from theano import function, config, shared, sandbox, tensor, Out
import numpy
import time
vlen = 10 * 30 * 768 # 10 x # cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
f2 = function([],
Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
borrow=True))
t0 = time.time()
for i in range(iters):
r = f1()
t1 = time.time()
no_borrow = t1 - t0
t0 = time.time()
for i in range(iters):
r = f2()
t1 = time.time()
print(
"Looping %s times took %s seconds without borrow "
"and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f1.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
Which produces this output:
.. code-block:: none
$ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
Using gpu device 0: GeForce GTX 275
Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
Used the gpu
*Take home message:* *Take home message:*
When an input *x* to a function is not needed after the function When an input *x* to a function is not needed after the function
...@@ -317,4 +271,3 @@ requirement. When a return value *y* is large (in terms of memory ...@@ -317,4 +271,3 @@ requirement. When a return value *y* is large (in terms of memory
footprint), and you only need to read from it once, right away when footprint), and you only need to read from it once, right away when
it's returned, then consider marking it with an ``Out(y, it's returned, then consider marking it with an ``Out(y,
borrow=True)``. borrow=True)``.
...@@ -15,28 +15,41 @@ about how to carry out those computations. One of the ways we take ...@@ -15,28 +15,41 @@ about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a advantage of this flexibility is in carrying out calculations on a
graphics card. graphics card.
There are two ways currently to use a gpu, one of which only supports NVIDIA cards (:ref:`cuda`) and the other, in development, that should support any OpenCL device as well as NVIDIA cards (:ref:`gpuarray`). There are two ways currently to use a gpu, one that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend which
only supports NVIDIA cards (:ref:`cuda`).
.. _cuda: .. _gpuarray:
CUDA backend GpuArray Backend
------------ ----------------
If you have not done so already, you will need to install Nvidia's If you have not done so already, you will need to install libgpuarray
GPU-programming toolchain (CUDA) and configure Theano to use it. as well as at least one computing toolkit. Instructions for doing so
We provide installation instructions for :ref:`Linux <gpu_linux>`, are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
While all types of devices are supported if using OpenCL, for the
remainder of this section, whatever compute device you are using will
be referred to as GPU.
.. warning::
The backend was designed to support OpenCL, however current support is
incomplete. A lot of very useful ops still do not support it because they
were ported from the old backend with minimal change.
Testing Theano with GPU Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program into a To see if your GPU is being used, cut and paste the following program
file and run it. into a file and run it.
Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
.. testcode:: .. testcode::
from theano import function, config, shared, sandbox from theano import function, config, shared, tensor
import theano.tensor as T
import numpy import numpy
import time import time
...@@ -45,7 +58,7 @@ file and run it. ...@@ -45,7 +58,7 @@ file and run it.
rng = numpy.random.RandomState(22) rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX)) x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x)) f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort()) print(f.maker.fgraph.toposort())
t0 = time.time() t0 = time.time()
for i in range(iters): for i in range(iters):
...@@ -53,20 +66,16 @@ file and run it. ...@@ -53,20 +66,16 @@ file and run it.
t1 = time.time() t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0)) print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,)) print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]): if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu') print('Used the cpu')
else: else:
print('Used the gpu') print('Used the gpu')
The program just computes the ``exp()`` of a bunch of random numbers. The program just computes ``exp()`` of a bunch of random numbers. Note
Note that we use the ``shared`` function to that we use the :func:`theano.shared` function to make sure that the
make sure that the input *x* is stored on the graphics device. input *x* is stored on the GPU.
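The check at the end of the program tests both the op's class and its class name. Here is a minimal sketch of why the second test helps, using stand-in classes rather than real Theano ops. ``Elemwise``, ``GpuElemwise``, ``HostFromGpu`` and ``Node`` below are hypothetical stubs, and the idea that a GPU elementwise op can also pass the ``isinstance`` test is an assumption made for the illustration.

```python
class Elemwise(object):
    """Stand-in for tensor.Elemwise (stub, not the Theano class)."""


class GpuElemwise(Elemwise):
    """Stand-in for a GPU op that also derives from Elemwise (assumption)."""


class HostFromGpu(object):
    """Stand-in for the device-to-host transfer op."""


class Node(object):
    """Stand-in for an apply node; only the ``op`` attribute is used."""

    def __init__(self, op):
        self.op = op


def used_cpu(nodes):
    # Same test as in the program above: an Elemwise op whose class name
    # does not contain 'Gpu' is taken to be a CPU op.
    return any(isinstance(n.op, Elemwise) and "Gpu" not in type(n.op).__name__
               for n in nodes)


cpu_graph = [Node(Elemwise())]
gpu_graph = [Node(GpuElemwise()), Node(HostFromGpu())]
print(used_cpu(cpu_graph))  # True
print(used_cpu(gpu_graph))  # False
```

With the ``isinstance`` test alone, the second graph would be reported as running on the CPU in this sketch, because the stub GPU op inherits from ``Elemwise``.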
.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
.. testoutput:: .. testoutput::
:hide: :hide:
...@@ -79,40 +88,36 @@ same floating-point numbers as the CPU. As a benchmark, a loop that calls ``nump ...@@ -79,40 +88,36 @@ same floating-point numbers as the CPU. As a benchmark, a loop that calls ``nump
.. code-block:: none .. code-block:: none
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py $ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)] [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 3.06635117531 seconds Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761 Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323284] 1.62323285]
Used the cpu Used the cpu
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py $ THEANO_FLAGS=device=cuda0 python check1.py
Using gpu device 0: GeForce GTX 580 Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)] [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323296] 1.62323285]
Used the gpu Used the gpu
Note that GPU operations in Theano require for now ``floatX`` to be *float32* (see also below).
Returning a Handle to Device-Allocated Data Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The speedup is not greater in the preceding example because the function is By default functions that execute on the GPU still return a standard
returning its result as a NumPy ndarray which has already been copied from the numpy ndarray. A transfer operation is inserted just before the
device to the host for your convenience. This is what makes it so easy to swap in ``device=gpu``, but results are returned to ensure a consistent interface with CPU code.
if you don't mind less portability, you might gain a bigger speedup by changing This allows changing the device some code runs on by only replacing
the graph to express a computation with a GPU-stored result. The ``gpu_from_host`` the value of the ``device`` flag without touching the code.
op means "copy the input from the host to the GPU" and it is optimized away
after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``. If you don't mind a loss of flexibility, you can ask theano to return
the GPU object directly. The following code is modified to do just that.
.. testcode:: .. testcode::
from theano import function, config, shared, sandbox from theano import function, config, shared, tensor, gpuarray
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy import numpy
import time import time
...@@ -120,139 +125,146 @@ after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``. ...@@ -120,139 +125,146 @@ after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
iters = 1000 iters = 1000
rng = numpy.random.RandomState(22) rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), 'float32')) x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x))) f = function([], gpuarray.basic_ops.GpuFromHost(None)(tensor.exp(x)))
print(f.maker.fgraph.toposort()) print(f.maker.fgraph.toposort())
t0 = time.time() t0 = time.time()
for i in range(iters): for i in range(iters):
r = f() r = f()
t1 = time.time() t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0)) print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,)) print("Result is %s" % (numpy.asarray(r),))
print("Numpy result is %s" % (numpy.asarray(r),)) if numpy.any([isinstance(x.op, tensor.Elemwise) and
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]): ('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu') print('Used the cpu')
else: else:
print('Used the gpu') print('Used the gpu')
The output from this program is Here the :func:`theano.gpuarray.basic_ops.GpuFromHost(None)` call
means "copy input to the GPU", with ``None`` the default GPU context when not
explicitly given. However, during the optimization phase, this op will be
removed, since the result will already be on the GPU. It is
used here to tell Theano that we want the result on the GPU.
The output is
.. testoutput:: .. testoutput::
:hide: :hide:
:options: +ELLIPSIS, +SKIP :options: +ELLIPSIS, +SKIP
Using gpu device 0: GeForce GTX 580 Using device cuda0: ...
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)] [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took ... seconds Looping 1000 times took ... seconds
Result is <CudaNdarray object at 0x...> Result is ...
Numpy result is ...
Used the gpu Used the gpu
.. code-block:: none .. code-block:: none
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py $ THEANO_FLAGS=device=cuda0 python check2.py
Using gpu device 0: GeForce GTX 580 Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)] [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.34898686409 seconds Looping 1000 times took 0.455810785294 seconds
Result is <CudaNdarray object at 0x6a7a5f0> Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323285]
1.62323296]
Used the gpu Used the gpu
Here we've shaved off about 50% of the run-time by simply not copying
the resulting array back to the host. The object returned by each
function call is now not a NumPy array but a "CudaNdarray" which can
be converted to a NumPy ndarray by the normal NumPy casting mechanism
using something like ``numpy.asarray()``.
For even more speed you can play with the ``borrow`` flag. See While the time per call appears to be much lower than the two previous
invocations (and should indeed be lower, since we avoid a transfer)
the massive speedup we obtained is in part due to the asynchronous nature
of execution on GPUs, meaning that the work isn't completed yet, just
'launched'. We'll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using usual numpy casting such as ``numpy.asarray()``.
For even more speed, you can play with the ``borrow`` flag. See
:ref:`borrowfunction`. :ref:`borrowfunction`.
What Can Be Accelerated on the GPU What Can be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will change as we continue to optimize our The performance characteristics will of course vary from device to
implementations, and vary from device to device, but to give a rough idea of device, and also as we refine our implementation:
what to expect right now:
* In general, matrix multiplication, convolution, and large element-wise
* Only computations operations can be accelerated a lot (5-50x) when arguments are large enough
with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but to keep 30 processors busy.
*float64* computations are still relatively slow (Jan 2010). * Indexing, dimension-shuffling and constant-time reshaping will be equally fast
* Matrix on GPU as on CPU.
multiplication, convolution, and large element-wise operations can be * Summation over rows/columns of tensors can be a little slower on the
accelerated a lot (5-50x) when arguments are large enough to keep 30 GPU than on the CPU.
processors busy. * Copying of large quantities of data to and from a device is relatively slow,
* Indexing, and often cancels most of the advantage of one or two accelerated functions
dimension-shuffling and constant-time reshaping will be equally fast on GPU on that data. Getting GPU performance largely hinges on making data transfer
as on CPU. to the device pay off.
* Summation
over rows/columns of tensors can be a little slower on the GPU than on the CPU. The backend supports all regular theano data types (float32, float64,
* Copying int, ...), however GPU support varies and some units can't deal with
of large quantities of data to and from a device is relatively slow, and double (float64) or small (less than 32 bits like int16) data types.
often cancels most of the advantage of one or two accelerated functions on You will get an error at compile time or runtime if this is the case.
that data. Getting GPU performance largely hinges on making data transfer to
the device pay off. By default all inputs will get transferred to GPU. You can prevent an
input from getting transferred by setting its tag.target attribute to
'cpu'.
Complex support is untested and most likely completely broken.
Tips for Improving Performance on GPU Tips for Improving Performance on GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Consider * Consider adding ``floatX=float32`` (or the type you are using) to your
adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of ``.theanorc`` file if you plan to do a lot of GPU work.
GPU work.
* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async` * Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`
* Prefer * Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and ``dmatrix``, ``dvector`` and ``dscalar`` because the former will give
``dscalar`` because the former will give you *float32* variables when you *float32* variables when ``floatX=float32``.
``floatX=float32``. * Ensure that your output variables have a *float32* dtype and not *float64*.
* Ensure The more *float32* variables are in your graph, the more work the GPU can do for
that your output variables have a *float32* dtype and not *float64*. The
more *float32* variables are in your graph, the more work the GPU can do for
you. you.
* Minimize * Minimize transfers to the GPU device by using ``shared`` *float32* variables
tranfers to the GPU device by using ``shared`` *float32* variables to store to store frequently-accessed data (see :func:`shared()<shared.shared>`).
frequently-accessed data (see :func:`shared()<shared.shared>`). When using When using the GPU, *float32* tensor ``shared`` variables are stored on
the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to the GPU by default to eliminate transfer time for GPU ops using those variables.
eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with * If you aren't happy with the performance you see, try running your script with
``profile=True`` flag. This should print some timing information at program ``profile=True`` flag. This should print some timing information at program
termination. Is time being used sensibly? If an op or Apply is termination. Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.sandbox.cuda. programming, have a look at how it's implemented in theano.sandbox.cuda.
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*. Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
This can tell you if not enough of your graph is on the GPU or if there Xs(X%) in transfer op*. This can tell you if not enough of your graph is
is too much memory transfer. on the GPU or if there is too much memory transfer.
* Use nvcc options. nvcc supports those options to speed up some * Use nvcc options. nvcc supports those options to speed up some computations:
computations: `-ftz=true` to `flush denormals values to `-ftz=true` to `flush denormals values to zeros.
zeros. <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_, <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
`--prec-div=false` and `--prec-sqrt=false` options to speed up `--prec-div=false` and `--prec-sqrt=false` options to speed up
division and square root operation by being less precise. You can division and square root operation by being less precise. You can
enable all of them with the `nvcc.flags=--use_fast_math` Theano enable all of them with the `nvcc.flags=--use_fast_math` Theano
flag or you can enable them individually as in this example: flag or you can enable them individually as in this example:
`nvcc.flags=-ftz=true --prec-div=false`. `nvcc.flags=-ftz=true --prec-div=false`.
* To investigate whether if all the Ops in the computational graph are running on GPU. * To investigate whether all the Ops in the computational graph are
It is possible to debug or check your code by providing a value to `assert_no_cpu_op` running on GPU, it is possible to debug or check your code by providing
flag, i.e. `warn`, for warning `raise` for raising an error or `pdb` for putting a breakpoint a value to `assert_no_cpu_op` flag, i.e. `warn`, for warning, `raise` for
in the computational graph if there is a CPU Op. raising an error or `pdb` for putting a breakpoint in the computational
graph if there is a CPU Op.
.. _gpu_async:
GPU Async capabilities GPU Async Capabilities
~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
Ever since Theano 0.6 we started to use the asynchronous capability of By default, all operations on the GPU are run asynchronously. This
GPUs. This allows us to be faster but with the possibility that some means that they are only scheduled to run and the function returns.
errors may be raised later than when they should occur. This can cause This is handled somewhat transparently by the underlying libgpuarray.
difficulties when profiling Theano apply nodes. There is a NVIDIA
driver feature to help with these issues. If you set the environment A forced synchronization point is introduced when doing memory
variable CUDA_LAUNCH_BLOCKING=1 then all kernel calls will be transfers between device and host.
automatically synchronized. This reduces performance but provides good
profiling and appropriately placed error messages. It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
This feature interacts with Theano garbage collection of intermediate when doing benchmarks.
results. To get the most of this feature, you need to disable the gc
as it inserts synchronization points in the graph. Set the Theano flag
``allow_gc=False`` to get even faster speed! This will raise the memory
usage.
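Because calls return before the GPU work has finished, a benchmark should wait for the last result before stopping the clock. A minimal sketch of that pattern follows. ``timed_loop`` and ``FakeGpuResult`` are hypothetical names used only for illustration; the only thing assumed about a real result is the ``sync()`` method mentioned above.

```python
import time


def timed_loop(fn, iters):
    """Call fn() iters times and return (elapsed_seconds, last_result).

    If the last result has a sync() method (as a GpuArray does), it is
    called before the clock is stopped so pending work is included.
    """
    t0 = time.time()
    result = None
    for _ in range(iters):
        result = fn()
    sync = getattr(result, "sync", None)
    if callable(sync):
        sync()
    return time.time() - t0, result


class FakeGpuResult(object):
    """Stand-in for a GPU array; records whether sync() was called."""

    def __init__(self):
        self.synced = False

    def sync(self):
        self.synced = True


elapsed, last = timed_loop(FakeGpuResult, 1000)
print(last.synced)  # True
```

With a compiled Theano function, ``timed_loop(f, iters)`` would be used the same way; a result that is a plain NumPy array has no ``sync()`` method and is simply returned as is.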
Changing the Value of Shared Variables Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -261,9 +273,8 @@ To change the value of a ``shared`` variable, e.g. to provide new data to proces ...@@ -261,9 +273,8 @@ To change the value of a ``shared`` variable, e.g. to provide new data to proces
use ``shared_variable.set_value(new_value)``. For a lot more detail about this, use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`. see :ref:`aliasing`.
Exercise Exercise
++++++++ ~~~~~~~~
Consider again the logistic regression: Consider again the logistic regression:
...@@ -343,15 +354,29 @@ Where does it come from? (Use ``profile=True`` flag.) ...@@ -343,15 +354,29 @@ Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to test. What can be done to further increase the speed of the GPU version? Put your ideas to test.
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
.. _cuda:
CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for :ref:`Linux <gpu_linux>`,
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
The old CUDA backend can be activated using the flags ``device=gpu`` or
``device=gpu{0,1,...}``
.. Note:: .. Note::
* Only 32 bit floats are currently supported (development is in progress). * Only 32 bit floats are supported.
* ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space. * ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process. * There is a limit of one GPU per process.
* Use the Theano flag ``device=gpu`` to require use of the GPU device.
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code. * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* ``Cast`` inputs before storing them into a ``shared`` variable. * ``Cast`` inputs before storing them into a ``shared`` variable.
...@@ -361,211 +386,6 @@ What can be done to further increase the speed of the GPU version? Put your idea ...@@ -361,211 +386,6 @@ What can be done to further increase the speed of the GPU version? Put your idea
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*). * Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed. * Notice that a new casting mechanism is being developed.
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
.. _gpuarray:
GpuArray Backend
----------------
If you have not done so already, you will need to install libgpuarray
as well as at least one computing toolkit. Instructions for doing so
are provided at `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
While all types of devices are supported if using OpenCL, for the
remainder of this section, whatever compute device you are using will
be referred to as GPU.
.. warning::
While it is fully our intention to support OpenCL, as of May 2014
this support is still in its infancy. A lot of very useful ops
still do not support it because they were ported from the old
backend with minimal change.
Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program
into a file and run it.
.. testcode::

    from theano import function, config, shared, tensor
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], tensor.exp(x))
    print(f.maker.fgraph.toposort())
    t0 = time.time()
    for i in range(iters):
        r = f()
    t1 = time.time()
    print("Looping %d times took %f seconds" % (iters, t1 - t0))
    print("Result is %s" % (r,))
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print('Used the cpu')
    else:
        print('Used the gpu')

The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.

.. testoutput::
    :hide:
    :options: +ELLIPSIS

    [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
    Looping 1000 times took ... seconds
    Result is ...
    Used the cpu

.. code-block:: none

    $ THEANO_FLAGS=device=cpu python check1.py
    [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
    Looping 1000 times took 2.6071999073 seconds
    Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
    Used the cpu

    $ THEANO_FLAGS=device=cuda0 python check1.py
    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
    Looping 1000 times took 2.28562092781 seconds
    Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
    Used the gpu
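
The CPU/GPU check at the end of the script can be approximated on op class
names alone, which makes it easy to try without a GPU. This is only a
sketch: ``used_gpu`` is a hypothetical helper, and matching on the class
name stands in for the ``isinstance`` test used in the script.

.. code-block:: python

    def used_gpu(op_class_names):
        """Return False if any Elemwise op without 'Gpu' in its name is present."""
        # Mirrors the script's rule: an Elemwise op whose class name lacks
        # 'Gpu' means that part of the graph ran on the CPU.
        return not any(name.endswith("Elemwise") and "Gpu" not in name
                       for name in op_class_names)


    # Class names taken from the two runs shown above.
    cpu_run = ["Elemwise"]
    gpu_run = ["GpuElemwise", "HostFromGpu"]
    print(used_gpu(cpu_run), used_gpu(gpu_run))

With a compiled function ``f``, the names could be collected with
``[type(n.op).__name__ for n in f.maker.fgraph.toposort()]``.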
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the device some code runs on by only replacing
the value of the ``device`` flag without touching the code.
If you don't mind a loss of flexibility, you can ask Theano to return
the GPU object directly. The following code is modified to do just that.

.. testcode::

    from theano import function, config, shared, tensor, gpuarray
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], gpuarray.basic_ops.GpuFromHost(None)(tensor.exp(x)))
    print(f.maker.fgraph.toposort())
    t0 = time.time()
    for i in range(iters):
        r = f()
    t1 = time.time()
    print("Looping %d times took %f seconds" % (iters, t1 - t0))
    print("Result is %s" % (numpy.asarray(r),))
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f.maker.fgraph.toposort()]):
        print('Used the cpu')
    else:
        print('Used the gpu')

Here the :func:`theano.gpuarray.basic_ops.GpuFromHost(None)` call
means "copy the input to the GPU", where ``None`` selects the default GPU
context because none is given explicitly. Since the result of
``tensor.exp(x)`` will already be on the GPU, this transfer is removed
during the optimization phase. It is used here only to tell Theano that
we want the result on the GPU.
The output is

.. testoutput::
    :hide:
    :options: +ELLIPSIS, +SKIP

    Using device cuda0: ...
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
    Looping 1000 times took ... seconds
    Result is ...
    Used the gpu

.. code-block:: none

    $ THEANO_FLAGS=device=cuda0 python check2.py
    Using device cuda0: GeForce GTX 275
    [GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
    Looping 1000 times took 0.455810785294 seconds
    Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
    Used the gpu

While the time per call appears to be much lower than in the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is in part due to the asynchronous nature
of execution on GPUs, meaning that the work isn't completed yet, just
'launched'. We'll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using usual numpy casting such as ``numpy.asarray()``.
For even more speed, you can play with the ``borrow`` flag. See
:ref:`borrowfunction`.
What Can be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will of course vary from device to
device, and also as we refine our implementation.
This backend supports all regular Theano data types (float32, float64,
int, ...). However, GPU support varies, and some units can't deal with
double (float64) or small (less than 32 bits, like int16) data types.
You will get an error at compile time or runtime if this is the case.
By default all inputs will get transferred to GPU. You can prevent an
input from getting transferred by setting its ``tag.target`` attribute to
``'cpu'``.
Complex support is untested and most likely completely broken.
In general, large operations like matrix multiplication, or
element-wise operations with large inputs, will be significantly
faster.
GPU Async Capabilities
~~~~~~~~~~~~~~~~~~~~~~
By default, all operations on the GPU are run asynchronously. This
means that they are only scheduled to run and the function returns.
This is handled somewhat transparently by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory
transfers between device and host.
It is possible to force synchronization for a particular GpuArray by
calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
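
For benchmarks, the synchronization step can be folded into a small timing
helper. The sketch below is plain Python: ``timed_calls`` and its ``sync``
parameter are hypothetical names, and the caller is assumed to pass a
function that blocks until the device has finished, for example one that
calls ``sync()`` on the returned GpuArray.

.. code-block:: python

    import time


    def timed_calls(fn, iters, sync=None):
        """Time ``iters`` calls of ``fn``, synchronizing once before stopping the clock."""
        result = None
        t0 = time.time()
        for _ in range(iters):
            result = fn()
        if sync is not None:
            # Wait for pending asynchronous work so that it is counted.
            sync(result)
        t1 = time.time()
        return t1 - t0, result


    # Stand-in workload so the sketch runs without a GPU.
    elapsed, value = timed_calls(lambda: sum(range(1000)), iters=100)
    print("Looping 100 times took %f seconds" % elapsed)

With a function ``f`` that returns a GpuArray, a call such as
``timed_calls(f, iters, sync=lambda r: r.sync())`` would include the
launched but unfinished GPU work in the measurement.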
------------------------------------------- -------------------------------------------
......
...@@ -11,8 +11,6 @@ import numpy ...@@ -11,8 +11,6 @@ import numpy
import theano import theano
import theano.tensor as tt import theano.tensor as tt
from theano import sandbox, Out
theano.config.floatX = 'float32' theano.config.floatX = 'float32'
rng = numpy.random rng = numpy.random
...@@ -20,7 +18,7 @@ rng = numpy.random ...@@ -20,7 +18,7 @@ rng = numpy.random
N = 400 N = 400
feats = 784 feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX), D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX)) rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000 training_steps = 10000
# Declare Theano symbolic variables # Declare Theano symbolic variables
...@@ -41,30 +39,19 @@ cost = tt.cast(xent.mean(), 'float32') + \ ...@@ -41,30 +39,19 @@ cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize 0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b]) gw, gb = tt.grad(cost, [w, b])
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
name="predict")
"""
# Compile expressions to functions # Compile expressions to functions
train = theano.function( train = theano.function(
inputs=[], inputs=[],
outputs=[prediction, xent], outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb}, updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
name="train") name="train")
predict = theano.function(inputs=[], outputs=prediction, predict = theano.function(inputs=[], outputs=prediction,
name="predict") name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in if any([n.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for n in
train.maker.fgraph.toposort()]): train.maker.fgraph.toposort()]):
print('Used the cpu') print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in elif any([n.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for n in
train.maker.fgraph.toposort()]): train.maker.fgraph.toposort()]):
print('Used the gpu') print('Used the gpu')
else: else:
...@@ -101,171 +88,171 @@ prediction on D ...@@ -101,171 +88,171 @@ prediction on D
# in the script, followed by a summary for all functions. # in the script, followed by a summary for all functions.
# We'll show here only the summary: # We'll show here only the summary:
Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Function profiling Function profiling
================== ==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile. Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10002 calls to Function.__call__: 1.590916e+00s Time in 10001 calls to Function.__call__: 1.300452e+00s
Time in Function.fn.__call__: 1.492365e+00s (93.805%) Time in Function.fn.__call__: 1.215823e+00s (93.492%)
Time in thunks: 1.408159e+00s (88.512%) Time in thunks: 1.157602e+00s (89.015%)
Total compile time: 6.309664e+00s Total compile time: 8.922548e-01s
Number of Apply nodes: 25 Number of Apply nodes: 17
Theano Optimizer time: 4.848340e-01s Theano Optimizer time: 6.270301e-01s
Theano validate time: 5.454302e-03s Theano validate time: 5.993605e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s Theano Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
Import time 3.543139e-03s
Time in all call to theano.grad() 1.848292e-02s
Time since theano import 2.864s
Class Class
--- ---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 theano.tensor.blas_c.CGemv 64.5% 64.5% 0.747s 3.73e-05s C 20001 3 theano.tensor.blas_c.CGemv
30.1% 89.7% 0.424s 4.71e-06s C 90001 10 theano.tensor.elemwise.Elemwise 33.1% 97.7% 0.384s 4.79e-06s C 80001 9 theano.tensor.elemwise.Elemwise
5.5% 95.2% 0.078s 7.79e-02s Py 1 1 theano.tensor.blas.Gemv 1.0% 98.6% 0.011s 1.14e-06s C 10000 1 theano.tensor.elemwise.Sum
1.9% 97.1% 0.026s 1.30e-06s C 20001 3 theano.tensor.basic.Alloc 0.7% 99.4% 0.009s 2.85e-07s C 30001 4 theano.tensor.elemwise.DimShuffle
1.3% 98.4% 0.018s 1.85e-06s C 10000 1 theano.tensor.elemwise.Sum 0.3% 99.7% 0.004s 3.64e-07s C 10001 2 theano.tensor.basic.AllocEmpty
1.0% 99.4% 0.014s 4.78e-07s C 30001 4 theano.tensor.elemwise.DimShuffle 0.3% 100.0% 0.004s 1.78e-07s C 20001 3 theano.compile.ops.Shape_i
0.6% 100.0% 0.008s 4.23e-07s C 20001 3 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops Ops
--- ---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 CGemv{inplace} 64.5% 64.5% 0.747s 3.73e-05s C 20001 3 CGemv{inplace}
15.8% 75.4% 0.223s 2.23e-05s C 10000 1 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)] 18.7% 83.2% 0.217s 2.17e-05s C 10000 1 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
7.7% 83.1% 0.109s 1.09e-05s C 10000 1 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)] 8.9% 92.1% 0.103s 1.03e-05s C 10000 1 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
5.5% 88.7% 0.078s 7.79e-02s Py 1 1 Gemv{no_inplace} 4.3% 96.4% 0.050s 4.98e-06s C 10000 1 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
4.3% 92.9% 0.060s 6.00e-06s C 10000 1 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}} 1.0% 97.4% 0.011s 1.14e-06s C 10000 1 Sum{acc_dtype=float64}
1.9% 94.8% 0.026s 1.30e-06s C 20001 3 Alloc 0.5% 97.9% 0.006s 2.83e-07s C 20001 3 InplaceDimShuffle{x}
1.3% 96.1% 0.018s 1.85e-06s C 10000 1 Sum{acc_dtype=float64} 0.4% 98.3% 0.004s 4.22e-07s C 10000 1 Elemwise{sub,no_inplace}
0.7% 96.8% 0.009s 4.73e-07s C 20001 3 InplaceDimShuffle{x} 0.3% 98.6% 0.004s 3.70e-07s C 10000 1 Elemwise{neg,no_inplace}
0.6% 97.4% 0.009s 8.52e-07s C 10000 1 Elemwise{sub,no_inplace} 0.3% 98.9% 0.004s 3.64e-07s C 10001 2 AllocEmpty{dtype='float32'}
0.6% 98.0% 0.008s 4.23e-07s C 20001 3 Shape_i{0} 0.3% 99.2% 0.004s 1.78e-07s C 20001 3 Shape_i{0}
0.5% 98.5% 0.007s 7.06e-07s C 10000 1 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)] 0.2% 99.5% 0.003s 2.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.5% 98.9% 0.007s 6.57e-07s C 10000 1 Elemwise{neg,no_inplace} 0.2% 99.7% 0.003s 2.65e-07s C 10000 1 Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.3% 99.3% 0.005s 4.88e-07s C 10000 1 InplaceDimShuffle{1,0} 0.2% 99.9% 0.002s 1.98e-07s C 10000 1 Elemwise{Cast{float32}}
0.3% 99.5% 0.004s 3.78e-07s C 10000 1 Elemwise{inv,no_inplace} 0.1% 100.0% 0.002s 1.54e-07s C 10000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.2% 99.8% 0.003s 3.44e-07s C 10000 1 Elemwise{Cast{float32}} 0.0% 100.0% 0.000s 4.77e-06s C 1 1 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
0.2% 100.0% 0.003s 3.01e-07s C 10000 1 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.0% 100.0% 0.000s 8.11e-06s C 1 1 Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply Apply
------ ------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
31.6% 31.6% 0.445s 4.45e-05s 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0}) 34.0% 34.0% 0.394s 3.94e-05s 10000 7 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
27.9% 59.6% 0.393s 3.93e-05s 10000 17 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417}) 30.5% 64.5% 0.353s 3.53e-05s 10000 15 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
15.8% 75.4% 0.223s 2.23e-05s 10000 14 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0) 18.7% 83.2% 0.217s 2.17e-05s 10000 12 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
7.7% 83.1% 0.109s 1.09e-05s 10000 15 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0) 8.9% 92.1% 0.103s 1.03e-05s 10000 13 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
5.5% 88.7% 0.078s 7.79e-02s 1 0 Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0}) 4.3% 96.4% 0.050s 4.98e-06s 10000 11 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
4.3% 92.9% 0.060s 6.00e-06s 10000 13 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5}) 1.0% 97.4% 0.011s 1.14e-06s 10000 14 Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
1.3% 94.2% 0.018s 1.85e-06s 10000 16 Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0) 0.4% 97.8% 0.004s 4.22e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
1.0% 95.2% 0.013s 1.34e-06s 10000 5 Alloc(TensorConstant{0.0}, Shape_i{0}.0) 0.3% 98.1% 0.004s 3.76e-07s 10000 0 InplaceDimShuffle{x}(b)
0.9% 96.1% 0.013s 1.27e-06s 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0) 0.3% 98.4% 0.004s 3.70e-07s 10000 10 Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.6% 96.7% 0.009s 8.52e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y) 0.3% 98.7% 0.004s 3.64e-07s 10000 5 AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.5% 97.2% 0.007s 7.06e-07s 10000 9 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0) 0.2% 99.0% 0.003s 2.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.5% 97.6% 0.007s 6.57e-07s 10000 11 Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0) 0.2% 99.2% 0.003s 2.65e-07s 10000 9 Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.4% 98.1% 0.006s 6.27e-07s 10000 0 InplaceDimShuffle{x}(b) 0.2% 99.4% 0.002s 2.21e-07s 10000 1 Shape_i{0}(x)
0.4% 98.5% 0.006s 5.90e-07s 10000 1 Shape_i{0}(x) 0.2% 99.6% 0.002s 1.98e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.3% 98.9% 0.005s 4.88e-07s 10000 2 InplaceDimShuffle{1,0}(x) 0.2% 99.7% 0.002s 1.90e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.3% 99.1% 0.004s 3.78e-07s 10000 10 Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0) 0.1% 99.9% 0.002s 1.54e-07s 10000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.2% 99.4% 0.003s 3.44e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0) 0.1% 100.0% 0.001s 1.34e-07s 10000 3 Shape_i{0}(y)
0.2% 99.6% 0.003s 3.19e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0) 0.0% 100.0% 0.000s 3.89e-05s 1 3 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
0.2% 99.8% 0.003s 3.01e-07s 10000 18 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0) 0.0% 100.0% 0.000s 4.77e-06s 1 4 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.2% 100.0% 0.003s 2.56e-07s 10000 3 Shape_i{0}(y) 0.0% 100.0% 0.000s 1.19e-06s 1 0 InplaceDimShuffle{x}(b)
... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime) ... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
# 2.2 Profiling for GPU computations # 2.2 Profiling for GPU computations
# In your terminal, type: # In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py $ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
# You'll see first the output of the script: # You'll see first the output of the script:
Used the gpu Used the gpu
target values for D target values for D
prediction on D prediction on D
Results were produced using a GeForce GTX TITAN Results were produced using a GeForce GTX TITAN X
# Profiling summary for all functions: # Profiling summary for all functions:
Function profiling Function profiling
================== ==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile. Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10002 calls to Function.__call__: 3.535239e+00s Time in 10001 calls to Function.__call__: 4.181247e+00s
Time in Function.fn.__call__: 3.420863e+00s (96.765%) Time in Function.fn.__call__: 4.081113e+00s (97.605%)
Time in thunks: 2.865905e+00s (81.067%) Time in thunks: 3.915566e+00s (93.646%)
Total compile time: 4.728150e-01s Total compile time: 9.256095e+00s
Number of Apply nodes: 36 Number of Apply nodes: 21
Theano Optimizer time: 4.283385e-01s Theano Optimizer time: 9.996419e-01s
Theano validate time: 7.687330e-03s Theano validate time: 6.523132e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s Theano Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
Import time 4.228115e-03s
Time in all call to theano.grad() 3.286195e-02s
Time since theano import 15.415s
Class Class
--- ---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.7% 45.7% 1.308s 1.64e-05s C 80001 9 theano.sandbox.cuda.basic_ops.GpuElemwise 59.5% 59.5% 2.329s 1.16e-04s C 20001 3 theano.sandbox.gpuarray.blas.GpuGemv
17.2% 62.8% 0.492s 2.46e-05s C 20002 4 theano.sandbox.cuda.blas.GpuGemv 29.8% 89.3% 1.166s 1.30e-05s C 90001 10 theano.sandbox.gpuarray.elemwise.GpuElemwise
15.1% 77.9% 0.433s 2.17e-05s C 20001 3 theano.sandbox.cuda.basic_ops.GpuAlloc 4.1% 93.4% 0.162s 8.10e-06s C 20001 3 theano.sandbox.gpuarray.basic_ops.HostFromGpu
8.2% 86.1% 0.234s 1.17e-05s C 20002 4 theano.sandbox.cuda.basic_ops.HostFromGpu 3.3% 96.7% 0.131s 1.31e-05s C 10000 1 theano.sandbox.gpuarray.elemwise.GpuCAReduceCuda
7.2% 93.3% 0.207s 2.07e-05s C 10000 1 theano.sandbox.cuda.basic_ops.GpuCAReduce 1.6% 98.3% 0.061s 6.10e-06s C 10000 1 theano.sandbox.gpuarray.basic_ops.GpuFromHost
4.4% 97.7% 0.127s 1.27e-05s C 10003 4 theano.sandbox.cuda.basic_ops.GpuFromHost 0.8% 99.1% 0.033s 1.09e-06s C 30001 4 theano.sandbox.gpuarray.elemwise.GpuDimShuffle
0.9% 98.6% 0.025s 8.23e-07s C 30001 4 theano.sandbox.cuda.basic_ops.GpuDimShuffle 0.7% 99.8% 0.026s 2.59e-06s C 10001 2 theano.sandbox.gpuarray.basic_ops.GpuAllocEmpty
0.7% 99.3% 0.020s 9.88e-07s C 20001 3 theano.tensor.elemwise.Elemwise 0.2% 100.0% 0.008s 3.95e-07s C 20001 3 theano.compile.ops.Shape_i
0.5% 99.8% 0.014s 7.18e-07s C 20001 3 theano.compile.ops.Shape_i
0.2% 100.0% 0.006s 5.78e-07s C 10000 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops Ops
--- ---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.2% 17.2% 0.492s 2.46e-05s C 20001 3 GpuGemv{inplace} 59.5% 59.5% 2.329s 1.16e-04s C 20001 3 GpuGemv{inplace=True}
8.2% 25.3% 0.234s 1.17e-05s C 20002 4 HostFromGpu 4.1% 63.6% 0.162s 8.10e-06s C 20001 3 HostFromGpu(gpuarray)
8.0% 33.3% 0.228s 2.28e-05s C 10001 2 GpuAlloc{memset_0=True} 4.0% 67.6% 0.157s 1.57e-05s C 10000 1 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
7.4% 40.7% 0.211s 2.11e-05s C 10000 1 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace} 3.8% 71.4% 0.149s 1.49e-05s C 10000 1 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
7.2% 47.9% 0.207s 2.07e-05s C 10000 1 GpuCAReduce{add}{1} 3.7% 75.1% 0.144s 1.44e-05s C 10000 1 GpuElemwise{sub,no_inplace}
7.1% 55.0% 0.205s 2.05e-05s C 10000 1 GpuAlloc 3.6% 78.7% 0.141s 1.41e-05s C 10000 1 GpuElemwise{gt,no_inplace}
6.9% 62.0% 0.198s 1.98e-05s C 10000 1 GpuElemwise{sub,no_inplace} 3.4% 82.1% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Cast{float32}}[]<gpuarray>
6.9% 68.9% 0.198s 1.98e-05s C 10000 1 GpuElemwise{inv,no_inplace} 3.4% 85.5% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
6.2% 75.1% 0.178s 1.78e-05s C 10000 1 GpuElemwise{neg,no_inplace} 3.3% 88.8% 0.131s 1.31e-05s C 10000 1 GpuCAReduceCuda{add}
5.6% 80.6% 0.159s 1.59e-05s C 10000 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)] 2.9% 91.7% 0.112s 1.12e-05s C 10000 1 GpuElemwise{neg,no_inplace}
4.4% 85.1% 0.127s 1.27e-05s C 10003 4 GpuFromHost 2.6% 94.3% 0.102s 1.02e-05s C 10000 1 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
4.3% 89.4% 0.124s 1.24e-05s C 10000 1 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)] 2.5% 96.7% 0.096s 9.63e-06s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
4.2% 93.6% 0.121s 1.21e-05s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)] 1.6% 98.3% 0.061s 6.10e-06s C 10000 1 GpuFromHost<None>
4.2% 97.7% 0.119s 1.19e-05s C 10000 1 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)] 0.7% 99.0% 0.026s 2.59e-06s C 10001 2 GpuAllocEmpty{dtype='float32', context_name=None}
0.5% 98.2% 0.014s 7.18e-07s C 20001 3 Shape_i{0} 0.5% 99.5% 0.021s 1.06e-06s C 20001 3 InplaceGpuDimShuffle{x}
0.5% 98.7% 0.013s 1.33e-06s C 10001 2 Elemwise{gt,no_inplace} 0.3% 99.8% 0.011s 1.14e-06s C 10000 1 InplaceGpuDimShuffle{1,0}
0.3% 99.0% 0.010s 9.81e-07s C 10000 1 GpuDimShuffle{1,0} 0.2% 100.0% 0.008s 3.95e-07s C 20001 3 Shape_i{0}
0.3% 99.3% 0.008s 7.90e-07s C 10000 1 GpuDimShuffle{0} 0.0% 100.0% 0.000s 2.00e-05s C 1 1 GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
0.2% 99.6% 0.007s 6.97e-07s C 10001 2 GpuDimShuffle{x} ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
0.2% 99.8% 0.006s 6.50e-07s C 10000 1 Elemwise{Cast{float32}}
... (remaining 3 Ops account for 0.20%(0.01s) of the runtime)
Apply Apply
------ ------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
8.8% 8.8% 0.251s 2.51e-05s 10000 22 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417}) 55.0% 55.0% 2.154s 2.15e-04s 10000 7 GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
8.4% 17.2% 0.241s 2.41e-05s 10000 7 GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0}) 4.5% 59.5% 0.176s 1.76e-05s 10000 18 GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
8.0% 25.1% 0.228s 2.28e-05s 10000 5 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0) 4.0% 63.5% 0.157s 1.57e-05s 10000 12 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
7.4% 32.5% 0.211s 2.11e-05s 10000 13 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0) 3.8% 67.3% 0.149s 1.49e-05s 10000 15 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
7.2% 39.7% 0.207s 2.07e-05s 10000 21 GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0) 3.7% 71.0% 0.144s 1.44e-05s 10000 4 GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
7.1% 46.9% 0.205s 2.05e-05s 10000 17 GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0) 3.6% 74.6% 0.141s 1.41e-05s 10000 16 GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
6.9% 53.8% 0.198s 1.98e-05s 10000 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y) 3.4% 78.0% 0.133s 1.33e-05s 10000 10 GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
6.9% 60.7% 0.198s 1.98e-05s 10000 12 GpuElemwise{inv,no_inplace}(GpuFromHost.0) 3.4% 81.4% 0.133s 1.33e-05s 10000 9 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
6.2% 66.9% 0.178s 1.78e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0) 3.3% 84.7% 0.131s 1.31e-05s 10000 17 GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
5.6% 72.5% 0.159s 1.59e-05s 10000 19 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0) 2.9% 87.5% 0.112s 1.12e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
4.8% 77.3% 0.138s 1.38e-05s 10000 18 HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0) 2.6% 90.1% 0.102s 1.02e-05s 10000 20 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
4.4% 81.7% 0.126s 1.26e-05s 10000 10 GpuFromHost(Elemwise{Cast{float32}}.0) 2.5% 92.6% 0.096s 9.63e-06s 10000 13 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
4.3% 86.0% 0.124s 1.24e-05s 10000 9 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0) 2.3% 94.9% 0.090s 9.04e-06s 10000 19 HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
4.2% 90.2% 0.121s 1.21e-05s 10000 15 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0) 1.8% 96.7% 0.072s 7.16e-06s 10000 14 HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
4.2% 94.4% 0.119s 1.19e-05s 10000 23 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0) 1.6% 98.3% 0.061s 6.10e-06s 10000 6 GpuFromHost<None>(Shape_i{0}.0)
3.4% 97.7% 0.096s 9.61e-06s 10000 16 HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0) 0.7% 99.0% 0.026s 2.59e-06s 10000 5 GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
0.5% 98.2% 0.013s 1.33e-06s 10000 20 Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5}) 0.3% 99.3% 0.013s 1.33e-06s 10000 0 InplaceGpuDimShuffle{x}(b)
0.3% 98.5% 0.010s 9.81e-07s 10000 2 GpuDimShuffle{1,0}(x) 0.3% 99.6% 0.011s 1.14e-06s 10000 2 InplaceGpuDimShuffle{1,0}(x)
0.3% 98.8% 0.008s 8.27e-07s 10000 1 Shape_i{0}(x) 0.2% 99.8% 0.008s 7.94e-07s 10000 8 InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
0.3% 99.1% 0.008s 7.90e-07s 10000 14 GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0) 0.1% 99.9% 0.005s 5.27e-07s 10000 1 Shape_i{0}(x)
... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime) ... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
# 3. Conclusions # 3. Conclusions
......
...@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000, ...@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000,
t0 = 0 t0 = 0
t1 = -1 t1 = -1
f() # Ignore first function call to get representative time.
if execute: if execute:
sync = (hasattr(theano, "sandbox") and sync = (hasattr(theano, "sandbox") and
hasattr(theano.sandbox, "cuda") and hasattr(theano.sandbox, "cuda") and
theano.sandbox.cuda.cuda_available) theano.sandbox.cuda.cuda_available)
sync2 = (hasattr(theano, "gpuarray") and
theano.gpuarray.pygpu_activated)
t0 = time.time() t0 = time.time()
for i in range(iters): for i in range(iters):
f() f()
if sync: if sync:
theano.sandbox.cuda.synchronize() theano.sandbox.cuda.synchronize()
if sync2:
c.get_value(borrow=True, return_internal_type=True).sync()
t1 = time.time() t1 = time.time()
return t1 - t0, impl return t1 - t0, impl
......