Commit 3bffa49b authored by Eric Larsen, committed by Frederic


Correct Theano's tutorial: overhaul logic of exposition and integrate exercises and useful info found in CIFAR SC2011 accordingly
Parent a68ec1de
......@@ -13,7 +13,7 @@
Guide
=====
The config module contains many ``attributes`` that modify Theano's behavior. Many of these
attributes are consulted during the import of the ``theano`` module and many are assumed to be
read-only.
......
.. _adding:
====================
Baby steps - Algebra
====================
Adding two scalars
==================
......@@ -57,8 +56,6 @@ instruction. Behind the scenes, ``f`` was being compiled into C code.
type of both ``x`` and ``y`` is ``theano.tensor.ivector``.
-------------------------------------------
**Step 1**
>>> x = T.dscalar('x')
......@@ -91,8 +88,6 @@ given name. If you provide no argument, the symbol will be unnamed. Names
are not required, but they can help debugging.
-------------------------------------------
**Step 2**
The second step is to combine ``x`` and ``y`` into their sum ``z``:
......@@ -106,7 +101,6 @@ function to pretty-print out the computation associated to ``z``.
>>> print pp(z)
(x + y)
-------------------------------------------
**Step 3**
......@@ -174,3 +168,19 @@ with numpy arrays may be found :ref:`here <libdoc_tensor_creation>`.
You, the user---not the system architecture---have to choose whether your
program will use 32- or 64-bit integers (``i`` prefix vs. the ``l`` prefix)
and floats (``f`` prefix vs. the ``d`` prefix).
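These prefixes correspond to NumPy dtypes, which back Theano's tensor types. As a quick sanity check of the bit widths (a plain NumPy sketch, not Theano code):

```python
import numpy as np

# Theano's type-name prefixes correspond to NumPy dtypes:
# 'i' -> int32, 'l' -> int64, 'f' -> float32, 'd' -> float64.
for prefix, dtype in [('i', np.int32), ('l', np.int64),
                      ('f', np.float32), ('d', np.float64)]:
    print(prefix, np.dtype(dtype).itemsize * 8, 'bits')
```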
-------------------------------------------
**Exercise**
.. code-block:: python

    import theano
    a = theano.tensor.vector()     # declare variable
    out = a + a**10                # build symbolic expression
    f = theano.function([a], out)  # compile function
    print f([0,1,2])               # prints `array([0,2,1026])`
Modify and execute this code to compute this expression: a**2 + b**2 + 2*a*b.
-------------------------------------------
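A plain NumPy sketch of what the modified program should compute (the exercise itself asks for the symbolic Theano version with two declared vectors ``a`` and ``b``); note the algebraic identity a**2 + b**2 + 2*a*b == (a + b)**2, which holds elementwise:

```python
import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])
out = a**2 + b**2 + 2*a*b            # the target expression, elementwise
assert np.allclose(out, (a + b)**2)  # sanity check: it's just (a + b)**2
print(out)  # [25. 49. 81.]
```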
......@@ -4,7 +4,9 @@
Conditions
==========
IfElse vs switch
================
- Build the condition over symbolic variables.
- The IfElse Op takes a `boolean` condition and two variables as inputs.
......@@ -15,6 +17,7 @@ Conditions
**Example**
.. code-block:: python

    from theano import tensor as T
......@@ -49,7 +52,7 @@ Conditions
    f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
In this example, the IfElse Op spends less time (about half as much) than Switch
since it computes only one variable instead of both.
.. code-block:: python
......
......@@ -94,8 +94,6 @@ was reformatted for readability):
[ 1., 4.]])]
Setting a default value for an argument
=======================================
......@@ -368,3 +366,58 @@ Others Random Distributions
---------------------------
There are :ref:`other distributions implemented <libdoc_tensor_raw_random>`.
.. _logistic_regression:
A Real example: Logistic Regression
===================================
The preceding elements are put to work in this more realistic example. It will be used repeatedly.
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability that target = 1
    prediction = p_1 > 0.5                     # The prediction thresholded
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w**2).sum()   # The cost to minimize
    gw, gb = T.grad(cost, [w, b])              # Compute the gradient of the cost

    # Compile
    train = theano.function(
              inputs=[x, y],
              outputs=[prediction, xent],
              updates={w: w - 0.1*gw, b: b - 0.1*gb})
    predict = theano.function(inputs=[x], outputs=prediction)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
......@@ -310,8 +310,10 @@ You can also add this at the end of the test file:
t.setUp()
t.test_double_rop()
-------------------------------------------
**Exercise**
- Run the code in the file double_op.py.
- Modify and execute to compute: x * y
......
......@@ -7,23 +7,23 @@ PyCUDA/CUDAMat/Gnumpy compatibility
PyCUDA
======
Currently PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called CudaNdarray and supports
strides. It supports only the float32 dtype. PyCUDA's implementation
is called GPUArray and doesn't support strides. However, it can deal with all NumPy and CUDA dtypes.
We are currently working on having the same base object that will
mimic NumPy. Until this is ready, here is some information on how to
use both objects in the same script.
Transfer
--------
You can use the `theano.misc.pycuda_utils` module to convert GPUArray to and
from CudaNdarray. The functions `to_cudandarray(x, copyif=False)` and
`to_gpuarray(x)` return a new object that shares the same memory space
as the original; otherwise they raise a ValueError. Because GPUArray doesn't
support strides, a strided CudaNdarray must be copied to
obtain a non-strided version. The resulting GPUArray won't share the same
memory region. If you want this behavior, set `copyif=True` in
......@@ -32,29 +32,102 @@ memory region. If you want this behavior, set `copyif=True` in
Compiling with PyCUDA
---------------------
You can use PyCUDA to compile CUDA functions that work directly on
CudaNdarray. Here is an example from the file
`theano/misc/tests/test_pycuda_theano_simple.py`:
.. code-block:: python

    import sys

    import numpy

    import theano
    import theano.sandbox.cuda as cuda_ndarray
    import theano.misc.pycuda_init
    import pycuda
    import pycuda.driver as drv
    import pycuda.gpuarray

    def test_pycuda_theano():
        """Simple example with pycuda function and Theano CudaNdarray object."""
        from pycuda.compiler import SourceModule
        mod = SourceModule("""
        __global__ void multiply_them(float *dest, float *a, float *b)
        {
          const int i = threadIdx.x;
          dest[i] = a[i] * b[i];
        }
        """)

        multiply_them = mod.get_function("multiply_them")

        a = numpy.random.randn(100).astype(numpy.float32)
        b = numpy.random.randn(100).astype(numpy.float32)

        # Test with Theano objects: launch one thread per element.
        ga = cuda_ndarray.CudaNdarray(a)
        gb = cuda_ndarray.CudaNdarray(b)
        dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
        multiply_them(dest, ga, gb,
                      block=(100, 1, 1), grid=(1, 1))
        assert (numpy.asarray(dest) == a * b).all()
Theano op using PyCUDA function
-------------------------------
You can use a GPU function compiled with PyCUDA in a Theano op. Here is an example:
.. code-block:: python

    import numpy

    import theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda

    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)

        def __hash__(self):
            return hash(type(self))

        def __str__(self):
            return self.__class__.__name__

        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
                cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])

        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
            __global__ void my_fct(float * i0, float * o0, int size) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                if (i < size) {
                    o0[i] = i0[i] * 2;
                }
            }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [storage_map[v] for v in node.inputs]
            outputs = [storage_map[v] for v in node.outputs]

            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape != inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512, 1, 1), grid=grid)
            return thunk
CUDAMat
=======
There are functions for conversion between CUDAMat and Theano CudaNdarray objects.
They obey the same principles as PyCUDA's functions and can be found in
`theano.misc.cudamat_utils.py`.
WARNING: There is a strange problem associated with stride/shape with those converters.
To work, the test needs a transpose and reshape...
Gnumpy
======
There are conversion functions between gnumpy garray objects and Theano CudaNdarray.
They are also similar to PyCUDA's and can be found in `theano.misc.gnumpy_utils.py`.
......@@ -264,3 +264,19 @@ or, making use of the *R-operator*:
>>> f([4,4],[2,2])
array([ 4., 4.])
Final notes
===========
* ``T.grad`` works symbolically: it takes and returns Theano variables.
* It can be compared to a macro, since it can be applied multiple times.
* It handles scalar costs only.
* However, a simple recipe allows one to efficiently compute vector x Jacobian and vector x Hessian products.
* Work is in progress on the missing optimizations needed to efficiently compute the full
  Jacobian and Hessian matrices, as well as Jacobian times vector products.
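The vector x Jacobian recipe mentioned above can be checked numerically with plain NumPy (a sketch, not Theano's symbolic machinery): for an elementwise function the Jacobian is diagonal, so the product can be formed without ever building the matrix.

```python
import numpy as np

x = np.array([1., 2., 3.])
v = np.array([0.5, -1., 2.])

# For y = x**2 elementwise, the Jacobian is J = diag(2*x).
J = np.diag(2 * x)

full = v @ J          # explicit vector x Jacobian product (builds J)
cheap = v * (2 * x)   # the same product without materializing J
assert np.allclose(full, cheap)
print(cheap)  # [ 1. -4. 12.]
```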
......@@ -27,10 +27,11 @@ you out.
numpy
adding
examples
symbolic_graphs
printing_drawing
gradients
modes
loading_and_saving
aliasing
conditions
loop
......
......@@ -4,4 +4,85 @@
Loop
====
Scan
====
- A general form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of scan.
- You 'scan' a function along some input sequence, producing an output at each time-step.
- The function can see the **previous K time-steps** of your output.
- ``sum()`` could be computed by scanning the z + x(i) function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over for loops:

  - The number of iterations can be part of the symbolic graph.
  - It minimizes GPU transfers when a GPU is involved.
  - Gradients can be computed through the sequential steps.
  - It is slightly faster than using a for loop in Python with a compiled Theano function.
  - It can lower overall memory usage by detecting the actual amount of memory needed.
The full documentation can be found in the library: :ref:`Scan <lib_scan>`.
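The sum-as-scan idea above can be sketched in plain Python (hedged: this only mimics what ``scan`` does, without Theano's symbolic graph): apply the recurrence z = z + x(i) along the sequence, starting from z = 0, and keep each step's output.

```python
def scan_like(fn, sequence, initial):
    """Apply fn(prev, elem) along the sequence, keeping every step's output."""
    outputs = []
    state = initial
    for elem in sequence:
        state = fn(state, elem)
        outputs.append(state)
    return outputs

# sum() as a scan: recurrence z = z + x(i), with initial state z = 0
partial_sums = scan_like(lambda z, x: z + x, [1, 2, 3, 4], 0)
print(partial_sums)      # [1, 3, 6, 10]
print(partial_sums[-1])  # 10, the sum of the sequence
```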
**Scan Example: Computing pow(A,k)**
.. code-block:: python

    import theano
    import theano.tensor as T

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A**1 through A**k.  Keep only the last
    # value.  Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [ 0.  1.  4.  9. 16. 25. 36. 49. 64. 81.]
**Scan Example: Calculating a Polynomial**
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    coefficients = T.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = T.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                         coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)
    polynomial = components.sum()
    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
-------------------------------------------
**Exercise**
- Run both examples.
- Modify and execute the polynomial example to have the reduction done by scan.
-------------------------------------------
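A plain Python sketch of the reduction that the exercise asks for (hedged: the actual exercise is to express this with ``theano.scan``; here the same accumulation is written as the recurrence acc = acc + coeff * x**power):

```python
coefficients = [1.0, 0.0, 2.0]
x = 3.0

# Accumulate coeff * x**power step by step instead of summing at the end;
# this is the recurrence a scan-based reduction would carry.
acc = 0.0
for power, coeff in enumerate(coefficients):
    acc = acc + coeff * x**power
print(acc)  # 19.0  (1 + 0*3 + 2*3**2)
```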
.. _using_modes:
==========================================
Configuration settings and Compiling modes
==========================================
Configuration
=============
The config module contains many ``attributes`` that modify Theano's behavior. Many of these
attributes are consulted during the import of the ``theano`` module and many are assumed to be
read-only.
*As a rule, the attributes in this module should not be modified by user code.*
Theano's code comes with default values for these attributes, but you can
override them from your .theanorc file, and override those values in turn by
the :envvar:`THEANO_FLAGS` environment variable.
The order of precedence is:
1. an assignment to theano.config.<property>
2. an assignment in :envvar:`THEANO_FLAGS`
3. an assignment in the .theanorc file (or the file indicated in :envvar:`THEANORC`)
You can print out the current/effective configuration at any time by printing
theano.config. For example, to see a list of all active configuration
variables, type this from the command-line:
.. code-block:: bash

    python -c 'import theano; print theano.config' | less
-------------------------------------------
**Exercise**
Consider once again the logistic regression:
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction that is done: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w**2).sum()   # The cost to optimize
    gw, gb = T.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
              inputs=[x, y],
              outputs=[prediction, xent],
              updates={w: w - 0.01*gw, b: b - 0.01*gb},
              name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])
Modify and execute this example to run on CPU (the default) with floatX=float32 and
time with ``time python file.py``.
You will need to use ``theano.config.floatX`` and ``ndarray.astype(...)``.
.. Note::

    * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
    * Cast inputs before storing them into a shared variable.
    * Circumvent the automatic cast of int32 with float32 to float64:

      * Insert manual casts in your code or use [u]int{8,16}.
      * Insert manual casts around the mean operator (this involves division by the length, which is an int64).
      * A new casting mechanism is being developed.
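The int32-with-float32 pitfall in the note can be seen directly with NumPy's promotion rules (a plain NumPy sketch; Theano inherits the same convention):

```python
import numpy as np

i32 = np.arange(3, dtype=np.int32)
f32 = np.ones(3, dtype=np.float32)

# Mixing int32 with float32 silently promotes the result to float64 ...
print((i32 * f32).dtype)  # float64

# ... while a manual cast, or a small integer type, keeps float32.
print((i32.astype(np.float32) * f32).dtype)  # float32
print((i32.astype(np.int16) * f32).dtype)    # float32
```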
-------------------------------------------
Mode
====
......
......@@ -5,6 +5,10 @@
Graph Structures
================
Theano Graphs
=============
Debugging or profiling code written in Theano is not that simple if you
do not know what goes on under the hood. This chapter is meant to
introduce you to a required minimum of the inner workings of Theano,
......@@ -136,3 +140,23 @@ twice or reformulate parts of the graph to a GPU specific version.
For example, one (simple) optimization that Theano uses is to replace
the pattern :math:`\frac{xy}{y}` by :math:`x`.
**Example**
Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a**10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print f([0,1,2]) # prints `array([0,2,1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming involves a paradigm shift: people need to use it to understand it.
......@@ -191,7 +191,7 @@ mistake by failing to account for the resulting memory aliasing.
What can be accelerated on the GPU?
-----------------------------------
The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
......@@ -217,7 +217,7 @@ what to expect right now:
Tips for improving performance on GPU
-------------------------------------
* Consider
adding ``floatX = float32`` to your .theanorc file if you plan to do a lot of
......@@ -251,3 +251,261 @@ Changing the value of shared variables
To change the value of a shared variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
-------------------------------------------
**Exercise**
Consider the logistic regression:
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # Probability of having a one
    prediction = p_1 > 0.5                     # The prediction that is done: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w**2).sum()   # The cost to optimize
    gw, gb = T.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
              inputs=[x, y],
              outputs=[prediction, xent],
              updates={w: w - 0.01*gw, b: b - 0.01*gb},
              name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([n.op.__class__.__name__ == 'Gemv' for n in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([n.op.__class__.__name__ == 'GpuGemm' for n in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]
    print "prediction on D"
    print predict(D[0])
* Modify and execute this example to run on GPU with floatX=float32 and
  time it with ``time python file.py``.
* Is there an increase in speed from CPU to GPU?
* Where does it come from? (Use ProfileMode)
* What can be done to further increase the speed of the GPU version?
.. Note::

    * Only 32-bit floats are supported (support for other dtypes is being worked on).
    * Only 1 GPU per process.
    * Use the Theano flag ``device=gpu`` to request the GPU device.
    * Use ``device=gpu{0, 1, ...}`` to specify which GPU to use, if you have more than one.
    * Shared variables with float32 dtype are by default moved to the GPU memory space.
    * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
    * Cast inputs before storing them into a shared variable.
    * Circumvent the automatic cast of int32 with float32 to float64:

      * Insert manual casts in your code or use [u]int{8,16}.
      * Insert manual casts around the mean operator (this involves a division by the length, which is an int64).
      * A new casting mechanism is being developed.
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
Theano is a meta-programmer, so it doesn't really count.
* CUDA: C extension by NVIDIA

  * Vendor-specific.
  * Numeric libraries (BLAS, RNG, FFT) are maturing.

* OpenCL: multi-vendor version of CUDA

  * More general, standardized.
  * Fewer libraries, less adoption.
* PyCUDA: Python bindings to the CUDA driver interface

  * Python interface to CUDA: access Nvidia's CUDA parallel computation API from Python.
  * Convenience: makes it easy to do GPU meta-programming from within Python
    (i.e. abstractions to compile low-level CUDA code from Python:
    ``pycuda.driver.SourceModule``). Helpful documentation.
  * Completeness: binding to all of CUDA's driver API.
  * Automatic error checking: all CUDA errors are automatically translated into Python exceptions.
  * Speed: PyCUDA's base layer is written in C++.

* Memory management of GPU objects:

  * GPU memory buffer: ``pycuda.gpuarray.GPUArray``.
  * Object cleanup is tied to the lifetime of the objects (RAII, Resource Acquisition Is
    Initialization), which makes it much easier to write correct, leak- and crash-free code.
  * PyCUDA knows about dependencies (e.g. it won't detach from a context
    before all memory allocated in it is also freed).
* PyOpenCL: PyCUDA for OpenCL
Example: PyCUDA
---------------
.. code-block:: python

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy

    from pycuda.compiler import SourceModule
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)

    dest = numpy.zeros_like(a)
    multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

    assert numpy.allclose(dest, a * b)
    print dest
-------------------------------------------
**Exercise**
- Run the preceding example.
- Modify and execute it to work for a matrix of 20 x 10.
-------------------------------------------
.. _pyCUDA_theano:
Example: Theano + PyCUDA
------------------------
.. code-block:: python

    import numpy

    import theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda

    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)

        def __hash__(self):
            return hash(type(self))

        def __str__(self):
            return self.__class__.__name__

        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
                cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])

        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
            __global__ void my_fct(float * i0, float * o0, int size) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                if (i < size) {
                    o0[i] = i0[i] * 2;
                }
            }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [storage_map[v] for v in node.inputs]
            outputs = [storage_map[v] for v in node.outputs]

            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape != inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512, 1, 1), grid=grid)
            return thunk
Test it:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv=numpy.ones((4,5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
-------------------------------------------
**Exercise**
- Run the preceding example.
- Modify and execute the example to multiply two matrices: x * y.
- Modify and execute the example to return two outputs: x + y and x - y.

  - Our current elemwise fusion generates computation with only one output.

- Modify and execute the example to support strides (i.e. to avoid constraining the input to be C contiguous).
-------------------------------------------