提交 bd544674 authored 作者: slefrancois's avatar slefrancois

Update doc with instructions for using new gpu backend

上级 319382b5
......@@ -37,3 +37,4 @@ Theano.suo
.ipynb_checkpoints
.pydevproject
.ropeproject
core
\ No newline at end of file
......@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
Testing GPU Ops
^^^^^^^^^^^^^^^
Ops to be executed on the GPU should inherit from the
``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
When using the old GPU backend, Ops to be executed on the GPU should inherit
from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
Theano to distinguish them. Currently, we use this to test if the
NVIDIA driver works correctly with our sum reduction code on the GPU.
......
......@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add
If you want GPU-related tests to run on a specific GPU device, and not
the default one, you should use :attr:`~config.init_gpu_device`.
For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=gpu1``.
For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=cuda1``.
See :ref:`libdoc_config` for more information on how to change these
configuration options.
......@@ -508,25 +508,25 @@ Any one of them is enough.
:ref:`Ubuntu instructions <install_ubuntu_gpu>`.
Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
computer, and set the default floating point computations to float32.
For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'``.
For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
You can also set these options in the .theanorc file's ``[global]`` section:
.. code-block:: cfg
[global]
device = gpu
device = cuda
floatX = float32
Note that:
* If your computer has multiple GPUs and you use 'device=gpu', the driver
* If your computer has multiple GPUs and you use 'device=cuda', the driver
selects the one to use (usually gpu0).
* You can use the program nvida-smi to change this policy.
* You can choose one specific GPU by specifying 'device=gpuX', with X the
* You can choose one specific GPU by specifying 'device=cudaX', with X the
the corresponding GPU index (0, 1, 2, ...)
* By default, when ``device`` indicates preference for GPU computations,
Theano will fall back to the CPU if there is a problem with the GPU.
......@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats:
toggle your GPU on, which can be done with
`gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
Once your setup is complete, head to :ref:`using_gpu` to find how to verify
everything is working properly.
......
......@@ -43,7 +43,7 @@ For Ubuntu 11.10 through 14.04:
sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git
sudo pip install Theano
On 14.04, this will install Python 2 by default. If you want to use Python 3:
.. code-block:: bash
......@@ -104,30 +104,30 @@ For Ubuntu 11.04:
The development version of Theano supports Python 3.3 and
probably supports Python 3.2, but we do not test on it.
Bleeding Edge Installs
----------------------
If you would like, instead, to install the bleeding edge Theano (from github)
such that you can edit and contribute to Theano, replace the `pip install Theano`
If you would like, instead, to install the bleeding edge Theano (from github)
such that you can edit and contribute to Theano, replace the `pip install Theano`
command with:
.. code-block:: bash
git clone git://github.com/Theano/Theano.git
cd Theano
cd Theano
python setup.py develop --user
cd ..
VirtualEnv
----------
If you would like to install Theano in a VirtualEnv, you will want to pass the
`--system-site-packages` flag when creating the VirtualEnv so that it will pick up
If you would like to install Theano in a VirtualEnv, you will want to pass the
`--system-site-packages` flag when creating the VirtualEnv so that it will pick up
the system-provided `Numpy` and `SciPy`.
.. code-block:: bash
virtualenv --system-site-packages -p python2.7 theano-env
source theano-env/bin/activate
pip install Theano
......@@ -208,7 +208,7 @@ Updating Bleeding Edge Installs
Change to the Theano directory and run:
.. code-block:: bash
git pull
......@@ -303,7 +303,7 @@ Test GPU configuration
.. code-block:: bash
THEANO_FLAGS=floatX=float32,device=gpu python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
THEANO_FLAGS=floatX=float32,device=cuda python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
.. note::
......
......@@ -423,16 +423,16 @@ Create a test file containing:
print("NP time: %f[s], theano time: %f[s] (times should be close when run on CPU!)" %(
np_end-np_start, t_end-t_start))
print("Result difference: %f" % (np.abs(AB-tAB).max(), ))
.. testoutput::
:hide:
:options: +ELLIPSIS
NP time: ...[s], theano time: ...[s] (times should be close when run on CPU!)
Result difference: ...
.. code-block:: none
NP time: 1.480863[s], theano time: 1.475381[s] (times should be close when run on CPU!)
Result difference: 0.000000
......@@ -445,6 +445,8 @@ routine for matrix multiplication)
Configure Theano for GPU use
############################
Install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_ if you have not already done so.
Theano can be configured with a ``.theanorc`` text file (or
``.theanorc.txt``, whichever is easier for you to create under
Windows). It should be placed in the directory pointed to by the
......@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file:
.. code-block:: cfg
[global]
device = gpu
device = cuda
floatX = float32
[nvcc]
......@@ -498,7 +500,7 @@ within an MSYS shell if you installed Nose manually as described above.
Compiling a faster BLAS
~~~~~~~~~~~~~~~~~~~~~~~
If you installed Python through WinPython or EPD, Theano will automatically
If you installed Python through WinPython or EPD, Theano will automatically
link with the MKL library, so you should not need to compile your own BLAS.
.. note::
......
......@@ -32,6 +32,7 @@ Optimization FAST_RUN FAST_COMPILE
========================================================= ========= ============ =============
:term:`merge` x x
:term:`constant folding<constant folding>` x x
:term:`GPU transfer` x x
:term:`shape promotion<shape promotion>` x
:term:`fill cut<fill cut>` x
:term:`inc_subtensor srlz.<inc_subtensor serialization>` x
......@@ -52,7 +53,6 @@ Optimization FAST_RUN FAST_COMPILE
:term:`inplace_elemwise` x
:term:`inplace_random` x
:term:`elemwise fusion` x
:term:`GPU transfer` x
:term:`local_log_softmax` x x
:term:`local_remove_all_assert`
========================================================= ========= ============ =============
......
......@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
hints that give more flexibility to the compilation and optimization of the
graph.
For GPU graphs, this borrowing can have a major speed impact. See the following code:
.. code-block:: python
from theano import function, config, shared, sandbox, tensor, Out
import numpy
import time
vlen = 10 * 30 * 768 # 10 x # cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
f2 = function([],
Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
borrow=True))
t0 = time.time()
for i in range(iters):
r = f1()
t1 = time.time()
no_borrow = t1 - t0
t0 = time.time()
for i in range(iters):
r = f2()
t1 = time.time()
print(
"Looping %s times took %s seconds without borrow "
"and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f1.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
Which produces this output:
.. code-block:: none
$ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
Using gpu device 0: GeForce GTX 275
Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
Used the gpu
*Take home message:*
When an input *x* to a function is not needed after the function
......@@ -317,4 +271,3 @@ requirement. When a return value *y* is large (in terms of memory
footprint), and you only need to read from it once, right away when
it's returned, then consider marking it with an ``Out(y,
borrow=True)``.
......@@ -15,355 +15,9 @@ about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.
There are two ways currently to use a gpu, one of which only supports NVIDIA cards (:ref:`cuda`) and the other, in development, that should support any OpenCL device as well as NVIDIA cards (:ref:`gpuarray`).
.. _cuda:
CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for :ref:`Linux <gpu_linux>`,
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~
To see if your GPU is being used, cut and paste the following program into a
file and run it.
.. testcode::
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to
make sure that the input *x* is stored on the graphics device.
.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
.. testoutput::
:hide:
:options: +ELLIPSIS
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took ... seconds
Result is ...
Used the cpu
.. code-block:: none
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
Note that GPU operations in Theano require for now ``floatX`` to be *float32* (see also below).
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in ``device=gpu``, but
if you don't mind less portability, you might gain a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The ``gpu_from_host``
op means "copy the input from the host to the GPU" and it is optimized away
after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
.. testcode::
from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), 'float32'))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
print("Numpy result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
The output from this program is
.. testoutput::
:hide:
:options: +ELLIPSIS, +SKIP
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took ... seconds
Result is <CudaNdarray object at 0x...>
Numpy result is ...
Used the gpu
.. code-block:: none
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
Here we've shaved off about 50% of the run-time by simply not copying
the resulting array back to the host. The object returned by each
function call is now not a NumPy array but a "CudaNdarray" which can
be converted to a NumPy ndarray by the normal NumPy casting mechanism
using something like ``numpy.asarray()``.
For even more speed you can play with the ``borrow`` flag. See
:ref:`borrowfunction`.
What Can Be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
what to expect right now:
* Only computations
with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but
*float64* computations are still relatively slow (Jan 2010).
* Matrix
multiplication, convolution, and large element-wise operations can be
accelerated a lot (5-50x) when arguments are large enough to keep 30
processors busy.
* Indexing,
dimension-shuffling and constant-time reshaping will be equally fast on GPU
as on CPU.
* Summation
over rows/columns of tensors can be a little slower on the GPU than on the CPU.
* Copying
of large quantities of data to and from a device is relatively slow, and
often cancels most of the advantage of one or two accelerated functions on
that data. Getting GPU performance largely hinges on making data transfer to
the device pay off.
Tips for Improving Performance on GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Consider
adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
GPU work.
* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`
* Prefer
constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
``dscalar`` because the former will give you *float32* variables when
``floatX=float32``.
* Ensure
that your output variables have a *float32* dtype and not *float64*. The
more *float32* variables are in your graph, the more work the GPU can do for
you.
* Minimize
tranfers to the GPU device by using ``shared`` *float32* variables to store
frequently-accessed data (see :func:`shared()<shared.shared>`). When using
the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with
``profile=True`` flag. This should print some timing information at program
termination. Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.sandbox.cuda.
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
This can tell you if not enough of your graph is on the GPU or if there
is too much memory transfer.
* Use nvcc options. nvcc supports those options to speed up some
computations: `-ftz=true` to `flush denormals values to
zeros. <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
`--prec-div=false` and `--prec-sqrt=false` options to speed up
division and square root operation by being less precise. You can
enable all of them with the `nvcc.flags=--use_fast_math` Theano
flag or you can enable them individually as in this example:
`nvcc.flags=-ftz=true --prec-div=false`.
* To investigate whether if all the Ops in the computational graph are running on GPU.
It is possible to debug or check your code by providing a value to `assert_no_cpu_op`
flag, i.e. `warn`, for warning `raise` for raising an error or `pdb` for putting a breakpoint
in the computational graph if there is a CPU Op.
.. _gpu_async:
GPU Async capabilities
~~~~~~~~~~~~~~~~~~~~~~
Ever since Theano 0.6 we started to use the asynchronous capability of
GPUs. This allows us to be faster but with the possibility that some
errors may be raised later than when they should occur. This can cause
difficulties when profiling Theano apply nodes. There is a NVIDIA
driver feature to help with these issues. If you set the environment
variable CUDA_LAUNCH_BLOCKING=1 then all kernel calls will be
automatically synchronized. This reduces performance but provides good
profiling and appropriately placed error messages.
This feature interacts with Theano garbage collection of intermediate
results. To get the most of this feature, you need to disable the gc
as it inserts synchronization points in the graph. Set the Theano flag
``allow_gc=False`` to get even faster speed! This will raise the memory
usage.
Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the value of a ``shared`` variable, e.g. to provide new data to processes,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
Exercise
++++++++
Consider again the logistic regression:
.. testcode::
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])
# Compile expressions to functions
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates=[(w, w-0.01*gw), (b, b-0.01*gb)],
name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if theano used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train(D[0], D[1])
print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
.. testoutput::
:hide:
:options: + ELLIPSIS
Used the cpu
target values for D
...
prediction on D
...
Modify and execute this example to run on GPU with ``floatX=float32`` and
time it using the command line ``time python file.py``. (Of course, you may use some of your answer
to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to test.
.. Note::
* Only 32 bit floats are currently supported (development is in progress).
* ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process.
* Use the Theano flag ``device=gpu`` to require use of the GPU device.
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* ``Cast`` inputs before storing them into a ``shared`` variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
There are two ways currently to use a gpu, on that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend which
only supports NVIDIA cards (:ref:`cuda`).
.. _gpuarray:
......@@ -380,10 +34,9 @@ be referred to as GPU.
.. warning::
While it is fully our intention to support OpenCL, as of May 2014
this support is still in its infancy. A lot of very useful ops
still do not support it because they were ported from the old
backend with minimal change.
The backend was designed to support OpenCL, however current support is
incomplete. A lot of very useful ops still do not support it because they
were ported from the old backend with minimal change.
Testing Theano with GPU
~~~~~~~~~~~~~~~~~~~~~~~
......@@ -391,6 +44,9 @@ Testing Theano with GPU
To see if your GPU is being used, cut and paste the following program
into a file and run it.
Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
``device=cuda{0,1,...}`` to specify which GPU to use.
.. testcode::
from theano import function, config, shared, tensor
......@@ -453,11 +109,11 @@ Returning a Handle to Device-Allocated Data
By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the deivce some code runs on by only replacing
This allows changing the device some code runs on by only replacing
the value of the ``device`` flag without touching the code.
If you don't mind a loss of flexibility, you can ask theano to return
the GPU object directly. The following code is modifed to do just that.
the GPU object directly. The following code is modified to do just that.
.. testcode::
......@@ -532,23 +188,68 @@ What Can be Accelerated on the GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The performance characteristics will of course vary from device to
device, and also as we refine our implementation.
This backend supports all regular theano data types (float32, float64,
int, ...) however GPU support varies and some units can't deal with
device, and also as we refine our implementation:
* In general, matrix multiplication, convolution, and large element-wise
operations can be accelerated a lot (5-50x) when arguments are large enough
to keep 30 processors busy.
* Indexing, dimension-shuffling and constant-time reshaping will be equally fast
on GPU as on CPU.
* Summation over rows/columns of tensors can be a little slower on the
GPU than on the CPU.
* Copying of large quantities of data to and from a device is relatively slow,
and often cancels most of the advantage of one or two accelerated functions
on that data. Getting GPU performance largely hinges on making data transfer
to the device pay off.
The backend supports all regular theano data types (float32, float64,
int, ...), however GPU support varies and some units can't deal with
double (float64) or small (less than 32 bits like int16) data types.
You will get an error at compile time or runtime if this is the case.
By default all inputs will get transferred to GPU. You can prevent an
By default all inputs will get transferred to GPU. You can prevent an
input from getting transferred by setting its tag.target attribute to
'cpu'.
Complex support is untested and most likely completely broken.
In general, large operations like matrix multiplication, or
element-wise operations with large inputs, will be significatly
faster.
Tips for Improving Performance on GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Consider adding ``floatX=float32`` (or the type you are using) to your
``.theanorc`` file if you plan to do a lot of GPU work.
* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
``dmatrix``, ``dvector`` and ``dscalar`` because the former will give
you *float32* variables when ``floatX=float32``.
* Ensure that your output variables have a *float32* dtype and not *float64*.
The more *float32* variables are in your graph, the more work the GPU can do for
you.
* Minimize transfers to the GPU device by using ``shared`` *float32* variables
to store frequently-accessed data (see :func:`shared()<shared.shared>`).
When using the GPU, *float32* tensor ``shared`` variables are stored on
the GPU by default to eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with
``profile=True`` flag. This should print some timing information at program
termination. Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.sandbox.cuda.
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
Xs(X%) in transfer op*. This can tell you if not enough of your graph is
on the GPU or if there is too much memory transfer.
* Use nvcc options. nvcc supports those options to speed up some computations:
`-ftz=true` to `flush denormals values to zeros.
<https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
`--prec-div=false` and `--prec-sqrt=false` options to speed up
division and square root operation by being less precise. You can
enable all of them with the `nvcc.flags=--use_fast_math` Theano
flag or you can enable them individually as in this example:
`nvcc.flags=-ftz=true --prec-div=false`.
* To investigate whether all the Ops in the computational graph are
running on GPU, it is possible to debug or check your code by providing
a value to `assert_no_cpu_op` flag, i.e. `warn`, for warning, `raise` for
raising an error or `pdb` for putting a breakpoint in the computational
graph if there is a CPU Op.
GPU Async Capabilities
~~~~~~~~~~~~~~~~~~~~~~
......@@ -565,6 +266,125 @@ calling its ``sync()`` method. This is useful to get accurate timings
when doing benchmarks.
Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the value of a ``shared`` variable, e.g. to provide new data to processes,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
Exercise
~~~~~~~~
Consider again the logistic regression:
.. testcode::
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])
# Compile expressions to functions
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates=[(w, w-0.01*gw), (b, b-0.01*gb)],
name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if theano used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train(D[0], D[1])
print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
.. testoutput::
:hide:
:options: + ELLIPSIS
Used the cpu
target values for D
...
prediction on D
...
Modify and execute this example to run on GPU with ``floatX=float32`` and
time it using the command line ``time python file.py``. (Of course, you may use some of your answer
to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to test.
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
.. _cuda:
CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for :ref:`Linux <gpu_linux>`,
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
The old CUDA backend can be activated using the flags ``device=gpu`` or
``device=gpu{0,1,...}``
.. Note::
* Only 32 bit floats are supported.
* ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process.
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* ``Cast`` inputs before storing them into a ``shared`` variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
-------------------------------------------
......
......@@ -11,8 +11,6 @@ import numpy
import theano
import theano.tensor as tt
from theano import sandbox, Out
theano.config.floatX = 'float32'
rng = numpy.random
......@@ -20,7 +18,7 @@ rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
......@@ -38,33 +36,22 @@ p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1) # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b])
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
name="predict")
"""
# Compile expressions to functions
train = theano.function(
inputs=[],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
name="train")
predict = theano.function(inputs=[], outputs=prediction,
name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
if any([n.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for n in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
elif any([n.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for n in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
......@@ -101,171 +88,171 @@ prediction on D
# in the script, followed by a summary for all functions.
# We'll show here only the summary:
Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Function profiling
==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Time in 10002 calls to Function.__call__: 1.590916e+00s
Time in Function.fn.__call__: 1.492365e+00s (93.805%)
Time in thunks: 1.408159e+00s (88.512%)
Total compile time: 6.309664e+00s
Number of Apply nodes: 25
Theano Optimizer time: 4.848340e-01s
Theano validate time: 5.454302e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 1.300452e+00s
Time in Function.fn.__call__: 1.215823e+00s (93.492%)
Time in thunks: 1.157602e+00s (89.015%)
Total compile time: 8.922548e-01s
Number of Apply nodes: 17
Theano Optimizer time: 6.270301e-01s
Theano validate time: 5.993605e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
Import time 3.543139e-03s
Time in all call to theano.grad() 1.848292e-02s
Time since theano import 2.864s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 theano.tensor.blas_c.CGemv
30.1% 89.7% 0.424s 4.71e-06s C 90001 10 theano.tensor.elemwise.Elemwise
5.5% 95.2% 0.078s 7.79e-02s Py 1 1 theano.tensor.blas.Gemv
1.9% 97.1% 0.026s 1.30e-06s C 20001 3 theano.tensor.basic.Alloc
1.3% 98.4% 0.018s 1.85e-06s C 10000 1 theano.tensor.elemwise.Sum
1.0% 99.4% 0.014s 4.78e-07s C 30001 4 theano.tensor.elemwise.DimShuffle
0.6% 100.0% 0.008s 4.23e-07s C 20001 3 theano.compile.ops.Shape_i
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 theano.tensor.blas_c.CGemv
33.1% 97.7% 0.384s 4.79e-06s C 80001 9 theano.tensor.elemwise.Elemwise
1.0% 98.6% 0.011s 1.14e-06s C 10000 1 theano.tensor.elemwise.Sum
0.7% 99.4% 0.009s 2.85e-07s C 30001 4 theano.tensor.elemwise.DimShuffle
0.3% 99.7% 0.004s 3.64e-07s C 10001 2 theano.tensor.basic.AllocEmpty
0.3% 100.0% 0.004s 1.78e-07s C 20001 3 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 CGemv{inplace}
15.8% 75.4% 0.223s 2.23e-05s C 10000 1 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
7.7% 83.1% 0.109s 1.09e-05s C 10000 1 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
5.5% 88.7% 0.078s 7.79e-02s Py 1 1 Gemv{no_inplace}
4.3% 92.9% 0.060s 6.00e-06s C 10000 1 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
1.9% 94.8% 0.026s 1.30e-06s C 20001 3 Alloc
1.3% 96.1% 0.018s 1.85e-06s C 10000 1 Sum{acc_dtype=float64}
0.7% 96.8% 0.009s 4.73e-07s C 20001 3 InplaceDimShuffle{x}
0.6% 97.4% 0.009s 8.52e-07s C 10000 1 Elemwise{sub,no_inplace}
0.6% 98.0% 0.008s 4.23e-07s C 20001 3 Shape_i{0}
0.5% 98.5% 0.007s 7.06e-07s C 10000 1 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
0.5% 98.9% 0.007s 6.57e-07s C 10000 1 Elemwise{neg,no_inplace}
0.3% 99.3% 0.005s 4.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.3% 99.5% 0.004s 3.78e-07s C 10000 1 Elemwise{inv,no_inplace}
0.2% 99.8% 0.003s 3.44e-07s C 10000 1 Elemwise{Cast{float32}}
0.2% 100.0% 0.003s 3.01e-07s C 10000 1 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.0% 100.0% 0.000s 8.11e-06s C 1 1 Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
64.5% 64.5% 0.747s 3.73e-05s C 20001 3 CGemv{inplace}
18.7% 83.2% 0.217s 2.17e-05s C 10000 1 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
8.9% 92.1% 0.103s 1.03e-05s C 10000 1 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
4.3% 96.4% 0.050s 4.98e-06s C 10000 1 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
1.0% 97.4% 0.011s 1.14e-06s C 10000 1 Sum{acc_dtype=float64}
0.5% 97.9% 0.006s 2.83e-07s C 20001 3 InplaceDimShuffle{x}
0.4% 98.3% 0.004s 4.22e-07s C 10000 1 Elemwise{sub,no_inplace}
0.3% 98.6% 0.004s 3.70e-07s C 10000 1 Elemwise{neg,no_inplace}
0.3% 98.9% 0.004s 3.64e-07s C 10001 2 AllocEmpty{dtype='float32'}
0.3% 99.2% 0.004s 1.78e-07s C 20001 3 Shape_i{0}
0.2% 99.5% 0.003s 2.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.2% 99.7% 0.003s 2.65e-07s C 10000 1 Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.2% 99.9% 0.002s 1.98e-07s C 10000 1 Elemwise{Cast{float32}}
0.1% 100.0% 0.002s 1.54e-07s C 10000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 100.0% 0.000s 4.77e-06s C 1 1 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
31.6% 31.6% 0.445s 4.45e-05s 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
27.9% 59.6% 0.393s 3.93e-05s 10000 17 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
15.8% 75.4% 0.223s 2.23e-05s 10000 14 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
7.7% 83.1% 0.109s 1.09e-05s 10000 15 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
5.5% 88.7% 0.078s 7.79e-02s 1 0 Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
4.3% 92.9% 0.060s 6.00e-06s 10000 13 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
1.3% 94.2% 0.018s 1.85e-06s 10000 16 Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
1.0% 95.2% 0.013s 1.34e-06s 10000 5 Alloc(TensorConstant{0.0}, Shape_i{0}.0)
0.9% 96.1% 0.013s 1.27e-06s 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.6% 96.7% 0.009s 8.52e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.5% 97.2% 0.007s 7.06e-07s 10000 9 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.5% 97.6% 0.007s 6.57e-07s 10000 11 Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
0.4% 98.1% 0.006s 6.27e-07s 10000 0 InplaceDimShuffle{x}(b)
0.4% 98.5% 0.006s 5.90e-07s 10000 1 Shape_i{0}(x)
0.3% 98.9% 0.005s 4.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.3% 99.1% 0.004s 3.78e-07s 10000 10 Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
0.2% 99.4% 0.003s 3.44e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.2% 99.6% 0.003s 3.19e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.2% 99.8% 0.003s 3.01e-07s 10000 18 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.2% 100.0% 0.003s 2.56e-07s 10000 3 Shape_i{0}(y)
... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)
34.0% 34.0% 0.394s 3.94e-05s 10000 7 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
30.5% 64.5% 0.353s 3.53e-05s 10000 15 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
18.7% 83.2% 0.217s 2.17e-05s 10000 12 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
8.9% 92.1% 0.103s 1.03e-05s 10000 13 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
4.3% 96.4% 0.050s 4.98e-06s 10000 11 Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
1.0% 97.4% 0.011s 1.14e-06s 10000 14 Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
0.4% 97.8% 0.004s 4.22e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.3% 98.1% 0.004s 3.76e-07s 10000 0 InplaceDimShuffle{x}(b)
0.3% 98.4% 0.004s 3.70e-07s 10000 10 Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.3% 98.7% 0.004s 3.64e-07s 10000 5 AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.2% 99.0% 0.003s 2.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.2% 99.2% 0.003s 2.65e-07s 10000 9 Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.2% 99.4% 0.002s 2.21e-07s 10000 1 Shape_i{0}(x)
0.2% 99.6% 0.002s 1.98e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.2% 99.7% 0.002s 1.90e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.1% 99.9% 0.002s 1.54e-07s 10000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.1% 100.0% 0.001s 1.34e-07s 10000 3 Shape_i{0}(y)
0.0% 100.0% 0.000s 3.89e-05s 1 3 CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
0.0% 100.0% 0.000s 4.77e-06s 1 4 Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.000s 1.19e-06s 1 0 InplaceDimShuffle{x}(b)
... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
# 2.2 Profiling for GPU computations
# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D
Results were produced using a GeForce GTX TITAN
Results were produced using a GeForce GTX TITAN X
# Profiling summary for all functions:
Function profiling
==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Time in 10002 calls to Function.__call__: 3.535239e+00s
Time in Function.fn.__call__: 3.420863e+00s (96.765%)
Time in thunks: 2.865905e+00s (81.067%)
Total compile time: 4.728150e-01s
Number of Apply nodes: 36
Theano Optimizer time: 4.283385e-01s
Theano validate time: 7.687330e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 4.181247e+00s
Time in Function.fn.__call__: 4.081113e+00s (97.605%)
Time in thunks: 3.915566e+00s (93.646%)
Total compile time: 9.256095e+00s
Number of Apply nodes: 21
Theano Optimizer time: 9.996419e-01s
Theano validate time: 6.523132e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
Import time 4.228115e-03s
Time in all call to theano.grad() 3.286195e-02s
Time since theano import 15.415s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.7% 45.7% 1.308s 1.64e-05s C 80001 9 theano.sandbox.cuda.basic_ops.GpuElemwise
17.2% 62.8% 0.492s 2.46e-05s C 20002 4 theano.sandbox.cuda.blas.GpuGemv
15.1% 77.9% 0.433s 2.17e-05s C 20001 3 theano.sandbox.cuda.basic_ops.GpuAlloc
8.2% 86.1% 0.234s 1.17e-05s C 20002 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.2% 93.3% 0.207s 2.07e-05s C 10000 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
4.4% 97.7% 0.127s 1.27e-05s C 10003 4 theano.sandbox.cuda.basic_ops.GpuFromHost
0.9% 98.6% 0.025s 8.23e-07s C 30001 4 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.7% 99.3% 0.020s 9.88e-07s C 20001 3 theano.tensor.elemwise.Elemwise
0.5% 99.8% 0.014s 7.18e-07s C 20001 3 theano.compile.ops.Shape_i
0.2% 100.0% 0.006s 5.78e-07s C 10000 1 theano.tensor.elemwise.DimShuffle
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 theano.sandbox.gpuarray.blas.GpuGemv
29.8% 89.3% 1.166s 1.30e-05s C 90001 10 theano.sandbox.gpuarray.elemwise.GpuElemwise
4.1% 93.4% 0.162s 8.10e-06s C 20001 3 theano.sandbox.gpuarray.basic_ops.HostFromGpu
3.3% 96.7% 0.131s 1.31e-05s C 10000 1 theano.sandbox.gpuarray.elemwise.GpuCAReduceCuda
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 theano.sandbox.gpuarray.basic_ops.GpuFromHost
0.8% 99.1% 0.033s 1.09e-06s C 30001 4 theano.sandbox.gpuarray.elemwise.GpuDimShuffle
0.7% 99.8% 0.026s 2.59e-06s C 10001 2 theano.sandbox.gpuarray.basic_ops.GpuAllocEmpty
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.2% 17.2% 0.492s 2.46e-05s C 20001 3 GpuGemv{inplace}
8.2% 25.3% 0.234s 1.17e-05s C 20002 4 HostFromGpu
8.0% 33.3% 0.228s 2.28e-05s C 10001 2 GpuAlloc{memset_0=True}
7.4% 40.7% 0.211s 2.11e-05s C 10000 1 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
7.2% 47.9% 0.207s 2.07e-05s C 10000 1 GpuCAReduce{add}{1}
7.1% 55.0% 0.205s 2.05e-05s C 10000 1 GpuAlloc
6.9% 62.0% 0.198s 1.98e-05s C 10000 1 GpuElemwise{sub,no_inplace}
6.9% 68.9% 0.198s 1.98e-05s C 10000 1 GpuElemwise{inv,no_inplace}
6.2% 75.1% 0.178s 1.78e-05s C 10000 1 GpuElemwise{neg,no_inplace}
5.6% 80.6% 0.159s 1.59e-05s C 10000 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
4.4% 85.1% 0.127s 1.27e-05s C 10003 4 GpuFromHost
4.3% 89.4% 0.124s 1.24e-05s C 10000 1 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
4.2% 93.6% 0.121s 1.21e-05s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.2% 97.7% 0.119s 1.19e-05s C 10000 1 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.5% 98.2% 0.014s 7.18e-07s C 20001 3 Shape_i{0}
0.5% 98.7% 0.013s 1.33e-06s C 10001 2 Elemwise{gt,no_inplace}
0.3% 99.0% 0.010s 9.81e-07s C 10000 1 GpuDimShuffle{1,0}
0.3% 99.3% 0.008s 7.90e-07s C 10000 1 GpuDimShuffle{0}
0.2% 99.6% 0.007s 6.97e-07s C 10001 2 GpuDimShuffle{x}
0.2% 99.8% 0.006s 6.50e-07s C 10000 1 Elemwise{Cast{float32}}
... (remaining 3 Ops account for 0.20%(0.01s) of the runtime)
59.5% 59.5% 2.329s 1.16e-04s C 20001 3 GpuGemv{inplace=True}
4.1% 63.6% 0.162s 8.10e-06s C 20001 3 HostFromGpu(gpuarray)
4.0% 67.6% 0.157s 1.57e-05s C 10000 1 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
3.8% 71.4% 0.149s 1.49e-05s C 10000 1 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
3.7% 75.1% 0.144s 1.44e-05s C 10000 1 GpuElemwise{sub,no_inplace}
3.6% 78.7% 0.141s 1.41e-05s C 10000 1 GpuElemwise{gt,no_inplace}
3.4% 82.1% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Cast{float32}}[]<gpuarray>
3.4% 85.5% 0.133s 1.33e-05s C 10000 1 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
3.3% 88.8% 0.131s 1.31e-05s C 10000 1 GpuCAReduceCuda{add}
2.9% 91.7% 0.112s 1.12e-05s C 10000 1 GpuElemwise{neg,no_inplace}
2.6% 94.3% 0.102s 1.02e-05s C 10000 1 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
2.5% 96.7% 0.096s 9.63e-06s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
1.6% 98.3% 0.061s 6.10e-06s C 10000 1 GpuFromHost<None>
0.7% 99.0% 0.026s 2.59e-06s C 10001 2 GpuAllocEmpty{dtype='float32', context_name=None}
0.5% 99.5% 0.021s 1.06e-06s C 20001 3 InplaceGpuDimShuffle{x}
0.3% 99.8% 0.011s 1.14e-06s C 10000 1 InplaceGpuDimShuffle{1,0}
0.2% 100.0% 0.008s 3.95e-07s C 20001 3 Shape_i{0}
0.0% 100.0% 0.000s 2.00e-05s C 1 1 GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
8.8% 8.8% 0.251s 2.51e-05s 10000 22 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
8.4% 17.2% 0.241s 2.41e-05s 10000 7 GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
8.0% 25.1% 0.228s 2.28e-05s 10000 5 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
7.4% 32.5% 0.211s 2.11e-05s 10000 13 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
7.2% 39.7% 0.207s 2.07e-05s 10000 21 GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
7.1% 46.9% 0.205s 2.05e-05s 10000 17 GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
6.9% 53.8% 0.198s 1.98e-05s 10000 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
6.9% 60.7% 0.198s 1.98e-05s 10000 12 GpuElemwise{inv,no_inplace}(GpuFromHost.0)
6.2% 66.9% 0.178s 1.78e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
5.6% 72.5% 0.159s 1.59e-05s 10000 19 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
4.8% 77.3% 0.138s 1.38e-05s 10000 18 HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
4.4% 81.7% 0.126s 1.26e-05s 10000 10 GpuFromHost(Elemwise{Cast{float32}}.0)
4.3% 86.0% 0.124s 1.24e-05s 10000 9 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
4.2% 90.2% 0.121s 1.21e-05s 10000 15 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
4.2% 94.4% 0.119s 1.19e-05s 10000 23 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
3.4% 97.7% 0.096s 9.61e-06s 10000 16 HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
0.5% 98.2% 0.013s 1.33e-06s 10000 20 Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
0.3% 98.5% 0.010s 9.81e-07s 10000 2 GpuDimShuffle{1,0}(x)
0.3% 98.8% 0.008s 8.27e-07s 10000 1 Shape_i{0}(x)
0.3% 99.1% 0.008s 7.90e-07s 10000 14 GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)
55.0% 55.0% 2.154s 2.15e-04s 10000 7 GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
4.5% 59.5% 0.176s 1.76e-05s 10000 18 GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
4.0% 63.5% 0.157s 1.57e-05s 10000 12 GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
3.8% 67.3% 0.149s 1.49e-05s 10000 15 GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
3.7% 71.0% 0.144s 1.44e-05s 10000 4 GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
3.6% 74.6% 0.141s 1.41e-05s 10000 16 GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
3.4% 78.0% 0.133s 1.33e-05s 10000 10 GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
3.4% 81.4% 0.133s 1.33e-05s 10000 9 GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
3.3% 84.7% 0.131s 1.31e-05s 10000 17 GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
2.9% 87.5% 0.112s 1.12e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
2.6% 90.1% 0.102s 1.02e-05s 10000 20 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
2.5% 92.6% 0.096s 9.63e-06s 10000 13 GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
2.3% 94.9% 0.090s 9.04e-06s 10000 19 HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
1.8% 96.7% 0.072s 7.16e-06s 10000 14 HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
1.6% 98.3% 0.061s 6.10e-06s 10000 6 GpuFromHost<None>(Shape_i{0}.0)
0.7% 99.0% 0.026s 2.59e-06s 10000 5 GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
0.3% 99.3% 0.013s 1.33e-06s 10000 0 InplaceGpuDimShuffle{x}(b)
0.3% 99.6% 0.011s 1.14e-06s 10000 2 InplaceGpuDimShuffle{1,0}(x)
0.2% 99.8% 0.008s 7.94e-07s 10000 8 InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
0.1% 99.9% 0.005s 5.27e-07s 10000 1 Shape_i{0}(x)
... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
# 3. Conclusions
......
......@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000,
t0 = 0
t1 = -1
f() # Ignore first function call to get representative time.
if execute:
sync = (hasattr(theano, "sandbox") and
hasattr(theano.sandbox, "cuda") and
theano.sandbox.cuda.cuda_available)
sync2 = (hasattr(theano, "gpuarray") and
theano.gpuarray.pygpu_activated)
t0 = time.time()
for i in range(iters):
f()
if sync:
theano.sandbox.cuda.synchronize()
if sync2:
c.get_value(borrow=True, return_internal_type=True).sync()
t1 = time.time()
return t1 - t0, impl
......
Markdown 格式
0%
您添加了 0 到此讨论。请谨慎行事。
请先完成此评论的编辑!
注册 或者 后发表评论