Commit 4fd68401 (testgroup / pytensor), authored Aug 02, 2011 by James Bergstra

cifarSC2011 editing by Fred and James

Parent: 215328da
Showing 3 changed files with 273 additions and 243 deletions
doc/cifarSC2011/advanced_theano.txt (+141, -143)
doc/cifarSC2011/introduction.txt (+20, -7)
doc/cifarSC2011/theano.txt (+112, -93)
doc/cifarSC2011/advanced_theano.txt
.. _advanced_theano:
***************
Advanced Theano
***************
Conditions
----------
**IfElse**
- Build condition over symbolic variables.
- IfElse Op takes a boolean condition and two variables to compute as input.
- While the Switch Op evaluates both 'output' variables, the IfElse Op is lazy and only
  evaluates one variable, depending on the condition.
**IfElse Example: Comparison with Switch**
.. code-block:: python

    from theano import tensor as T
    from theano.lazycond import ifelse
    import theano, time, numpy

    a, b = T.scalars('a', 'b')
    x, y = T.matrices('x', 'y')

    z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
    z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

    f_switch = theano.function([a, b, x, y], z_switch,
                               mode=theano.Mode(linker='vm'))
    f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    val1 = 0.
    val2 = 1.
    big_mat1 = numpy.ones((10000, 1000))
    big_mat2 = numpy.ones((10000, 1000))

    n_times = 10

    tic = time.clock()
    for i in xrange(n_times):
        f_switch(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating both values %f sec' % (time.clock() - tic)

    tic = time.clock()
    for i in xrange(n_times):
        f_lazyifelse(val1, val2, big_mat1, big_mat2)
    print 'time spent evaluating one value %f sec' % (time.clock() - tic)
The IfElse Op spends less time (about half) than Switch, since it computes only
one variable instead of both.

>>> python ifelse_switch.py
time spent evaluating both values 0.6700 sec
time spent evaluating one value 0.3500 sec

Note that the IfElse condition is a boolean while the Switch condition is a tensor, so
Switch is more general.

It is actually important to use ``linker='vm'`` or ``linker='cvm'``;
otherwise IfElse will compute both variables and take the same computation
time as the Switch Op. The linker is not currently set to 'cvm' by default, but
it will be in the near future.
Loops
-----
**Scan**

- General form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan
- You 'scan' a function along some input sequence, producing an output at each time-step
- The function can see the **previous K time-steps** of your function
- ``sum()`` could be computed by scanning the ``z + x(i)`` function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- Advantages of using ``scan`` over for loops:

  - The number of iterations can be part of the symbolic graph
  - Minimizes GPU transfers if a GPU is involved
  - Computes gradients through sequential steps
  - Slightly faster than using a for loop in Python with a compiled Theano function
  - Can lower the overall memory usage by detecting the actual amount of memory needed
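The contract described above can be mimicked in plain Python; the sketch below (pure Python/NumPy, not Theano's actual implementation) shows how a reduction such as ``sum()`` is exactly the "scan ``z + x(i)`` over a list, starting from ``z=0``" pattern:

```python
import numpy

def py_scan(fn, sequence, outputs_info):
    """Minimal scan-like loop: apply fn to each element and the previous result."""
    result = outputs_info
    history = []
    for x in sequence:
        result = fn(x, result)
        history.append(result)
    return history

xs = numpy.arange(1, 5)                              # [1, 2, 3, 4]
partial_sums = py_scan(lambda x, z: z + x, xs, 0)    # [1, 3, 6, 10]
```

The last element of the history equals ``sum(xs)``; Theano's ``scan`` similarly returns all time-steps and lets you keep only the last.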
**Scan Example: Computing pow(A, k)**

.. code-block:: python

    import theano
    import theano.tensor as T

    k = T.iscalar("k")
    A = T.vector("A")

    def inner_fct(prior_result, A):
        return prior_result * A

    # Symbolic description of the result
    result, updates = theano.scan(fn=inner_fct,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A, n_steps=k)

    # Scan has provided us with A**1 through A**k.  Keep only the last
    # value.  Scan notices this and does not waste memory saving them.
    final_result = result[-1]

    power = theano.function(inputs=[A, k], outputs=final_result,
                            updates=updates)

    print power(range(10), 2)
    # [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
**Scan Example: Calculating a Polynomial**

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    coefficients = T.vector("coefficients")
    x = T.scalar("x")
    max_coefficients_supported = 10000

    # Generate the components of the polynomial
    full_range = T.arange(max_coefficients_supported)
    components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                          coeff * (free_var ** power),
                                      outputs_info=None,
                                      sequences=[coefficients, full_range],
                                      non_sequences=x)
    polynomial = components.sum()
    calculate_polynomial = theano.function(inputs=[coefficients, x],
                                           outputs=polynomial)

    test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
    print calculate_polynomial(test_coeff, 3)
    # 19.0
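The expected value is easy to verify with plain NumPy, as an independent check of the example above:

```python
import numpy

coeffs = numpy.asarray([1, 0, 2], dtype=numpy.float32)
x = 3.0

# Evaluate sum_i coeffs[i] * x**i directly.
value = (coeffs * x ** numpy.arange(len(coeffs))).sum()
# 1*1 + 0*3 + 2*9 = 19.0
```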
Exercise 4
-----------
- Run both examples
- Modify and execute the polynomial example to have the reduction done by scan
Compilation pipeline
--------------------
...
...
@@ -113,7 +252,7 @@ Theano output:
- Try the Theano flag floatX=float32
"""
Exercise 5
-----------
- In the last exercises, do you see a speed up with the GPU?
...
...
@@ -206,145 +345,6 @@ Debugging
- Few optimizations
- Run Python code (better error messages and can be debugged interactively in the Python debugger)
Known limitations
-----------------
...
...
@@ -364,5 +364,3 @@ Known limitations
- Disabling a few optimizations can speed up compilation
- Usually too many nodes indicates a problem with the graph
- Lazy evaluation in a branch (We will try to merge this summer)
doc/cifarSC2011/introduction.txt
...
...
@@ -41,6 +41,8 @@ Background Questionaire
* Using OpenCL / PyOpenCL ?
* Using cudamat / gnumpy ?
* Other?
* Who has used Cython?
...
...
@@ -98,17 +100,21 @@ Python in one slide
print b[1:3]             # slicing syntax

class Foo(object):       # Defining a class
    a = 1
    def __init__(self):
        self.a = 5
    def hello(self):
        return self.a

f = Foo()                # Creating a class instance
print f.hello()          # Calling methods of objects
# -> 5

class Bar(Foo):          # Defining a subclass
    def __init__(self, a):
        self.a = a

print Bar(99).hello()    # Creating an instance of Bar
# -> 99
Numpy in one slide
------------------
...
...
@@ -308,7 +314,14 @@ Project status
* Unofficial RPMs for Mandriva
* Downloads (January 2011 - June 8 2011):

  * Pypi 780
  * MLOSS: 483
  * Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
...
...
doc/cifarSC2011/theano.txt
...
...
@@ -8,46 +8,46 @@ Theano
Pointers
--------

* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html

Description
-----------

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Doesn't work on Windows for now
  * On GPU, data-intensive calculations are typically between 6.5x and 44x faster. We've seen speedups up to 140x

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * Including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Some sparse operations (CPU only)
* The project was started by James Bergstra and Olivier Breuleux
* For the past 1-2 years, I have replaced Olivier as lead contributor
Simple example
--------------
...
...
@@ -59,15 +59,13 @@ Simple example
>>> print f([0,1,2]) # prints `array([0,2,1026])`
======================================================  ====================================================
Unoptimized graph                                       Optimized graph
======================================================  ====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png  .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
======================================================  ====================================================

Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
-----------
...
...
@@ -91,10 +89,10 @@ Real example
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. code-block:: python
...
...
@@ -142,6 +140,19 @@ Real example
**Optimizations:**
Where are those optimizations applied?
* ``log(1+exp(x))``
* ``1 / (1 + T.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilisation)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion
.. code-block:: python

    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
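The ``log(1+exp(x))`` stabilisation listed above matters numerically; a plain NumPy illustration (independent of Theano) of why the naive form fails for large inputs:

```python
import numpy

x = numpy.array([1000.0])

# Naive softplus overflows: exp(1000) is inf, so log(1 + exp(x)) is inf.
naive = numpy.log(1 + numpy.exp(x))

# A stabilized form stays finite; for large x, log(1+exp(x)) is essentially x.
stable = numpy.logaddexp(0.0, x)
```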
...
...
@@ -159,22 +170,14 @@ Real example
updates={w:w-0.1*gw, b:b-0.1*gb})
Theano flags
------------
Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``
Exercise 2
...
...
@@ -261,57 +264,69 @@ Modify and execute the example to run on CPU with floatX=float32
GPU
---
* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
* Shared variables with float32 dtype are by default moved to the GPU memory space
* Use the Theano flag ``floatX=float32``
* Be sure to use ``floatX`` (``theano.config.floatX``) in your code
* Cast inputs before putting them into a shared variable
* Cast "problem": int32 combined with float32 gives float64
* A new casting mechanism is being developed
* Insert manual casts in your code or use [u]int{8,16}
* Insert manual casts around the mean operator (which involves a division by the length, which is an int64!)
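The cast "problem" is easy to reproduce in plain NumPy, whose promotion rules produce the same float64 upcast; a minimal sketch:

```python
import numpy

a = numpy.arange(3, dtype=numpy.int32)
b = numpy.ones(3, dtype=numpy.float32)

# int32 combined with float32 is promoted to float64,
# which silently leaves the float32-only GPU path.
c = a + b

# An explicit cast keeps everything in float32.
d = a.astype(numpy.float32) + b
```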
Exercise 3
-----------
* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
* Time with: ``time python file.py``
Symbolic variables
------------------
* Number of dimensions

  * T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4

* Dtype

  * T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  * T.vector defaults to the floatX dtype
  * floatX: configurable dtype that can be float32 or float64.

* Custom variables

  * All are shortcuts to: ``T.tensor(dtype, broadcastable=[False]*nd)``
  * Other dtypes: uint[8,16,32,64], floatX

Creating symbolic variables: Broadcastability

* Remember what I said about broadcasting?
* How to add a row to all rows of a matrix?
* How to add a column to all columns of a matrix?

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are: **T.row** and **T.col**
* For all others: ``T.tensor(dtype, broadcastable=([False or True])*nd)``
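The two broadcasting questions above have a direct NumPy answer, which T.row and T.col mirror symbolically; a minimal NumPy sketch:

```python
import numpy

m = numpy.zeros((3, 4))

# Add a row to all rows: shape (1, 4) broadcasts along the first axis.
row = numpy.arange(4).reshape(1, 4)
with_row = m + row          # shape (3, 4)

# Add a column to all columns: shape (3, 1) broadcasts along the second axis.
col = numpy.arange(3).reshape(3, 1)
with_col = m + col          # shape (3, 4)
```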
Differentiation details
...
...
@@ -319,11 +334,15 @@ Differentiation details
>>> gw,gb = T.grad(cost, [w,b])
* T.grad works symbolically: it takes and returns a Theano variable
* T.grad can be compared to a macro: it can be applied multiple times
* T.grad takes scalar costs only
* A simple recipe allows computing vector x Jacobian and vector x Hessian products efficiently
* We are working on the missing optimizations to be able to compute the full Jacobian and Hessian, and Jacobian x vector products, efficiently
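Symbolic gradients such as those from T.grad are commonly sanity-checked against numerical ones; a minimal central-difference checker in plain NumPy (a generic sketch, independent of Theano):

```python
import numpy

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at point x."""
    x = numpy.asarray(x, dtype=float)
    g = numpy.zeros_like(x)
    for i in range(x.size):
        step = numpy.zeros_like(x)
        step.flat[i] = eps
        g.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

# Check against the analytic gradient of cost(w) = sum(w**2), which is 2*w.
w = numpy.array([1.0, -2.0, 3.0])
g = numeric_grad(lambda v: (v ** 2).sum(), w)
```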
...
...
@@ -332,20 +351,20 @@ Benchmarks
Example:
* Multi-layer perceptron
* Convolutional Neural Networks
* Misc Elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
* numexpr: similar to Theano, a 'virtual machine' for elemwise expressions
**Multi-Layer Perceptron**:
60x784 matrix times 784x500 matrix, tanh, times 500x10 matrix, elemwise, then all in reverse for backpropagation
.. image:: ../hpcs2011_tutorial/pics/mlp.png
**Convolutional Network**:
...
...
@@ -353,12 +372,12 @@ Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filter, elementwise
tanh, matrix multiply, softmax elementwise, then in reverse
.. image:: ../hpcs2011_tutorial/pics/conv.png
**Elemwise**
* All on CPU
* Solid blue: Theano
* Dashed Red: numexpr (without MKL)

.. image:: ../hpcs2011_tutorial/pics/multiple_graph.png