testgroup / pytensor · Commit 489b14c4
Authored July 31, 2011 by James Bergstra
cifarSC - changed indentation levels in rst
Parent: ffb29872
Showing 1 changed file with 118 additions and 78 deletions: doc/cifarSC2011/introduction.txt (+118 / -78)
@@ -9,29 +9,29 @@ Introduction
Background Questionnaire
------------------------

* Who has used Theano before?

  * What did you do with it?

* Who has used Python? numpy? scipy? matplotlib?
* Who has used IPython?

  * Who has used it as a distributed computing engine?

* Who has done C/C++ programming?
* Who has organized computation around a particular physical memory layout?
* Who has used a multidimensional array of >2 dimensions?
* Who has written a Python module in C before?

  * Who has written a program to *generate* Python modules in C?

* Who has used a templating engine?
* Who has programmed a GPU before?

  * Using OpenGL / shaders?
@@ -43,7 +43,7 @@ Background Questionaire
* Other?
* Who has used Cython?
Python in one slide
@@ -51,15 +51,15 @@ Python in one slide
Features:
* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management
* builtin types: int, float, str, list, dict, tuple, object
Syntax sample:
@@ -71,22 +71,22 @@ Syntax sample:
def foo(b, c=3):                    # function w default param c
    return a + b + c                # note scoping, indentation

b_squared = [b_i**2 for b_i in b]   # list comprehension
print b[1:3]                        # slicing syntax
Numpy in one slide
------------------
* Python floats are full-fledged objects on the heap

  * Not suitable for high-performance computing!

* Numpy provides an N-dimensional numeric array in Python

  * Perfect for high-performance computing.

* Numpy provides:

  * elementwise computations
@@ -94,7 +94,7 @@ Numpy in one slide
  * pseudorandom numbers from many distributions

* Scipy provides lots more, including:

  * more linear algebra
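The bullets above can be made concrete with a few lines of numpy (a minimal sketch; the array values are arbitrary):

```python
import numpy as np

x = np.arange(6, dtype='float64').reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
y = np.ones((2, 3))

print(x + y)            # elementwise computation
print(np.dot(x, y.T))   # linear algebra: matrix product
print(x.sum(axis=0))    # reduction over the first axis -> [3. 5. 7.]
```

Every one of these calls runs as compiled C under the hood, which is what makes numpy suitable for high-performance computing where plain Python floats are not.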
@@ -161,11 +161,11 @@ Training an MNIST-ready classification neural network in pure numpy might look l
What's missing?
---------------
* Non-lazy evaluation (required by Python) hurts performance
* Numpy is bound to the CPU
* Numpy lacks symbolic or automatic differentiation
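To see why the missing autodiff hurts, here is a hedged sketch of what a pure-numpy softmax-regression training step looks like (random toy data, hypothetical sizes): every gradient must be derived on paper and coded by hand.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100, 784)                 # toy MNIST-sized inputs
y = rng.randint(0, 10, size=100)        # toy integer labels

w = np.zeros((784, 10))
b = np.zeros(10)
lr = 0.1
losses = []

for step in range(10):
    # Forward pass: softmax regression.
    logits = x.dot(w) + b
    logits -= logits.max(axis=1, keepdims=True)      # numeric stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    losses.append(-np.log(p[np.arange(len(y)), y]).mean())

    # Backward pass: the gradient of the loss is written out by hand.
    g = p.copy()
    g[np.arange(len(y)), y] -= 1.0
    g /= len(y)
    w -= lr * x.T.dot(g)
    b -= lr * g.sum(axis=0)
```

Each hand-written backward line must be re-derived whenever the model changes, and nothing checks it against the forward pass.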
Here's how the algorithm above looks in Theano; it runs 15 times faster if
you have a GPU (I'm skipping some dtype details which we'll come back to):
@@ -210,41 +210,51 @@ you have GPU (I'm skipping some dtype-details which we'll come back to):
Theano in one slide
-------------------
* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations

  * No function call -> global optimization
  * Strongly typed -> compiles to machine instructions
  * Array oriented -> parallelizable across cores

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance.

  * FFTW, MKL, ATLAS, Scipy, Cython, CUDA
  * Slower fallbacks always available

* It used to have no/poor support for internal looping and conditional
  expressions, but these are now quite usable.
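The "expression substitution" idea can be illustrated with a toy rewriter (an illustration only, not Theano's actual API): expressions are graphs, and rules like ``x*1 -> x`` or ``x+0 -> x`` are applied before any computation runs.

```python
# Toy expression graphs as nested tuples: (op, left, right).
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr                      # leaf: variable name or constant
    op, a, b = expr
    a, b = simplify(a), simplify(b)      # rewrite children first
    if op == '*' and b == 1:
        return a                         # x * 1 -> x
    if op == '*' and a == 1:
        return b                         # 1 * x -> x
    if op == '+' and b == 0:
        return a                         # x + 0 -> x
    if op == '+' and a == 0:
        return b                         # 0 + x -> x
    return (op, a, b)

print(simplify(('+', ('*', 'x', 1), 0)))   # -> x
```

Theano's real optimizer works on typed operation graphs and has many more rewrites (including ones that swap in BLAS or GPU implementations), but the pattern is the same: substitute before executing.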
Project status
--------------
* Mature: Theano has been developed and used since January 2008 (3.5 yrs old)
* Driven over 40 research papers in the last few years
* Core technology for a funded Silicon-Valley startup
* Good user documentation
* Active mailing list with participants from outside our lab
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for two years
* Used for research at Google and Yahoo.
* Unofficial RPMs for Mandriva
* Downloads (on June 8 2011, since last January): PyPI: 780, MLOSS: 483, Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
@@ -252,18 +262,23 @@ Why scripting for GPUs ?
They *complement each other*:
* GPUs are everything that scripting/high-level languages are not

  * Highly parallel
  * Very architecture-sensitive
  * Built for maximum FP/memory throughput
  * So hard to program that meta-programming is easier.

* CPU: largely restricted to control

  * Optimized for sequential code and low latency (rather than high throughput)
  * Tasks (1000/sec)
  * Scripting is fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
@@ -271,28 +286,41 @@ Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
How Fast are GPUs?
------------------
* Theory

  * Intel Core i7 980 XE (107 Gf/s float64) 6 cores
  * NVIDIA C2050 (515 Gf/s float64, 1 Tf/s float32) 480 cores
  * NVIDIA GTX580 (1.5 Tf/s float32) 512 cores
  * GPUs are faster, cheaper, more power-efficient

* Practice

  * Depends on algorithm and implementation!
  * Reported speed improvements over CPU in lit. vary *widely* (0.01x to 1000x)
  * Matrix-matrix multiply speedup: usually about 10-20x.
  * Convolution speedup: usually about 15x.
  * Elemwise speedup: slower or up to 100x (depending on operation and layout)
  * Sum: can be faster or slower depending on layout.

* Benchmarking is delicate work...

  * How to control quality of implementation?
  * How much time was spent optimizing CPU vs GPU code?
  * Theano goes up to 100x faster on GPU because it uses only one CPU core
  * Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
  * If you see speedup > 100x, the benchmark is probably not fair.
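The "quality of implementation" caveat is easy to demonstrate without any GPU at all: the same elementwise operation written as an interpreted loop versus a vectorized numpy call (a sketch; actual ratios depend on the machine).

```python
import timeit
import numpy as np

x = np.random.rand(100000)

def loop_square(a):
    # Elementwise square, one interpreted Python iteration per element.
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * a[i]
    return out

t_loop = timeit.timeit(lambda: loop_square(x), number=10)
t_vec = timeit.timeit(lambda: x * x, number=10)
print(t_loop / t_vec)   # typically well over 100x on ordinary CPUs
```

A benchmark that compares a naive baseline against a tuned implementation measures implementation effort, not hardware, which is exactly why published CPU-vs-GPU speedups vary so widely.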
Software for Directly Programming a GPU
@@ -300,15 +328,27 @@ Software for Directly Programming a GPU
Theano is a meta-programmer, so it doesn't really count.
* CUDA: C extension by NVIDIA

  * Vendor-specific
  * Numeric libraries (BLAS, RNG, FFT) maturing.

* OpenCL: multi-vendor version of CUDA

  * More general, standardized
  * Fewer libraries, less adoption.

* PyCUDA: Python bindings to the CUDA driver interface

  * Python interface to CUDA
  * Memory management of GPU objects
  * Compilation of code for the low-level driver
  * Makes it easy to do GPU meta-programming from within Python

* PyOpenCL: PyCUDA for OpenCL