Commit b2362664 authored by carriepl

Merge pull request #3964 from memimo/3929

Cleanup ProfileMode deprecation in docs
@@ -123,6 +123,7 @@ Loops

.. testcode::

    import numpy
    import theano
    import theano.tensor as T

@@ -179,96 +180,200 @@ Inplace optimization
Profiling
---------

- To enable profiling, use the Theano flag ``profile=True``
- To also enable memory profiling, use the flags ``profile=True,profile_memory=True``
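For example, assuming your training script is called ``train.py`` (the script name is just a placeholder):

```shell
# Time profile only:
THEANO_FLAGS="profile=True" python train.py

# Time and memory profile:
THEANO_FLAGS="profile=True,profile_memory=True" python train.py
```

The profile summary is printed when the process exits.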
Theano output for running the train function of the logistic regression
example from :doc:`here <../tutorial/examples>` for one epoch:
.. code-block:: python

"""
Function profiling
==================
Message: train.py:47
Time in 1 calls to Function.__call__: 5.981922e-03s
Time in Function.fn.__call__: 5.180120e-03s (86.596%)
Time in thunks: 4.213095e-03s (70.430%)
Total compile time: 3.739440e-01s
Number of Apply nodes: 21
Theano Optimizer time: 3.258998e-01s
Theano validate time: 5.632162e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 3.185582e-02s
Import time 3.157377e-03s
Time in all call to theano.grad() 2.997899e-02s
Time since theano import 3.616s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
50.6% 50.6% 0.002s 1.07e-03s Py 2 2 theano.tensor.basic.Dot
27.2% 77.8% 0.001s 5.74e-04s C 2 2 theano.sandbox.cuda.basic_ops.HostFromGpu
18.1% 95.9% 0.001s 3.81e-04s C 2 2 theano.sandbox.cuda.basic_ops.GpuFromHost
2.6% 98.6% 0.000s 1.23e-05s C 9 9 theano.tensor.elemwise.Elemwise
0.8% 99.3% 0.000s 3.29e-05s C 1 1 theano.sandbox.cuda.basic_ops.GpuElemwise
0.3% 99.6% 0.000s 5.60e-06s C 2 2 theano.tensor.elemwise.DimShuffle
0.2% 99.8% 0.000s 6.91e-06s C 1 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.1% 99.9% 0.000s 5.01e-06s C 1 1 theano.compile.ops.Shape_i
0.1% 100.0% 0.000s 5.01e-06s C 1 1 theano.tensor.elemwise.Sum
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
50.6% 50.6% 0.002s 1.07e-03s Py 2 2 dot
27.2% 77.8% 0.001s 5.74e-04s C 2 2 HostFromGpu
18.1% 95.9% 0.001s 3.81e-04s C 2 2 GpuFromHost
1.0% 97.0% 0.000s 4.39e-05s C 1 1 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}
0.8% 97.7% 0.000s 3.29e-05s C 1 1 GpuElemwise{Sub}[(0, 1)]
0.4% 98.1% 0.000s 1.50e-05s C 1 1 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]
0.3% 98.4% 0.000s 5.60e-06s C 2 2 InplaceDimShuffle{x}
0.3% 98.6% 0.000s 1.10e-05s C 1 1 Elemwise{ScalarSigmoid}[(0, 0)]
0.2% 98.8% 0.000s 9.06e-06s C 1 1 Elemwise{Composite{(i0 - (i1 * (i2 + (i3 * i0))))}}[(0, 0)]
0.2% 99.0% 0.000s 7.15e-06s C 1 1 Elemwise{gt,no_inplace}
0.2% 99.2% 0.000s 6.91e-06s C 1 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.2% 99.3% 0.000s 6.91e-06s C 1 1 GpuDimShuffle{1,0}
0.2% 99.5% 0.000s 6.91e-06s C 1 1 Elemwise{neg,no_inplace}
0.1% 99.6% 0.000s 5.96e-06s C 1 1 Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.1% 99.8% 0.000s 5.01e-06s C 1 1 Elemwise{Cast{float64}}
0.1% 99.9% 0.000s 5.01e-06s C 1 1 Shape_i{0}
0.1% 100.0% 0.000s 5.01e-06s C 1 1 Sum{acc_dtype=float64}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
26.8% 26.8% 0.001s 1.13e-03s 1 1 dot(x, w)
input 0: dtype=float32, shape=(400, 784), strides=c
input 1: dtype=float64, shape=(784,), strides=c
output 0: dtype=float64, shape=(400,), strides=c
26.5% 53.4% 0.001s 1.12e-03s 1 10 HostFromGpu(GpuDimShuffle{1,0}.0)
input 0: dtype=float32, shape=(784, 400), strides=(1, 784)
output 0: dtype=float32, shape=(784, 400), strides=c
23.8% 77.1% 0.001s 1.00e-03s 1 18 dot(x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
input 0: dtype=float32, shape=(784, 400), strides=c
input 1: dtype=float64, shape=(400,), strides=c
output 0: dtype=float64, shape=(784,), strides=c
9.6% 86.7% 0.000s 4.04e-04s 1 3 GpuFromHost(y)
input 0: dtype=float32, shape=(400,), strides=c
output 0: dtype=float32, shape=(400,), strides=(1,)
8.5% 95.2% 0.000s 3.58e-04s 1 2 GpuFromHost(x)
input 0: dtype=float32, shape=(400, 784), strides=c
output 0: dtype=float32, shape=(400, 784), strides=(784, 1)
1.0% 96.3% 0.000s 4.39e-05s 1 13 Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}(y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, HostFromGpu.0, Elemwise{neg,no_inplace}.0)
input 0: dtype=float32, shape=(400,), strides=c
input 1: dtype=float64, shape=(400,), strides=c
input 2: dtype=float64, shape=(1,), strides=c
input 3: dtype=float32, shape=(400,), strides=c
input 4: dtype=float64, shape=(400,), strides=c
output 0: dtype=float64, shape=(400,), strides=c
0.8% 97.1% 0.000s 3.29e-05s 1 7 GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[ 1.]}, GpuFromHost.0)
input 0: dtype=float32, shape=(1,), strides=c
input 1: dtype=float32, shape=(400,), strides=(1,)
output 0: dtype=float32, shape=(400,), strides=c
0.7% 97.7% 0.000s 2.91e-05s 1 11 HostFromGpu(GpuElemwise{Sub}[(0, 1)].0)
input 0: dtype=float32, shape=(400,), strides=c
output 0: dtype=float32, shape=(400,), strides=c
0.4% 98.1% 0.000s 1.50e-05s 1 15 Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float64}}.0, Elemwise{ScalarSigmoid}[(0, 0)].0, HostFromGpu.0)
input 0: dtype=float64, shape=(400,), strides=c
input 1: dtype=float64, shape=(1,), strides=c
input 2: dtype=float32, shape=(400,), strides=c
input 3: dtype=float64, shape=(1,), strides=c
input 4: dtype=float64, shape=(400,), strides=c
input 5: dtype=float32, shape=(400,), strides=c
output 0: dtype=float64, shape=(400,), strides=c
0.3% 98.4% 0.000s 1.10e-05s 1 14 Elemwise{ScalarSigmoid}[(0, 0)](Elemwise{neg,no_inplace}.0)
input 0: dtype=float64, shape=(400,), strides=c
output 0: dtype=float64, shape=(400,), strides=c
0.2% 98.6% 0.000s 9.06e-06s 1 20 Elemwise{Composite{(i0 - (i1 * (i2 + (i3 * i0))))}}[(0, 0)](w, TensorConstant{(1,) of 0...0000000149}, dot.0, TensorConstant{(1,) of 0...9999999553})
input 0: dtype=float64, shape=(784,), strides=c
input 1: dtype=float64, shape=(1,), strides=c
input 2: dtype=float64, shape=(784,), strides=c
input 3: dtype=float64, shape=(1,), strides=c
output 0: dtype=float64, shape=(784,), strides=c
0.2% 98.7% 0.000s 7.15e-06s 1 16 Elemwise{gt,no_inplace}(Elemwise{ScalarSigmoid}[(0, 0)].0, TensorConstant{(1,) of 0.5})
input 0: dtype=float64, shape=(400,), strides=c
input 1: dtype=float32, shape=(1,), strides=c
output 0: dtype=int8, shape=(400,), strides=c
0.2% 98.9% 0.000s 7.15e-06s 1 0 InplaceDimShuffle{x}(b)
input 0: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(1,), strides=c
0.2% 99.1% 0.000s 6.91e-06s 1 19 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.10000000149}, Sum{acc_dtype=float64}.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float64, shape=(), strides=c
input 2: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
0.2% 99.2% 0.000s 6.91e-06s 1 9 Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
input 0: dtype=float64, shape=(400,), strides=c
output 0: dtype=float64, shape=(400,), strides=c
0.2% 99.4% 0.000s 6.91e-06s 1 6 GpuDimShuffle{1,0}(GpuFromHost.0)
input 0: dtype=float32, shape=(400, 784), strides=(784, 1)
output 0: dtype=float32, shape=(784, 400), strides=(1, 784)
0.1% 99.5% 0.000s 5.96e-06s 1 5 Elemwise{Composite{((-i0) - i1)}}[(0, 0)](dot.0, InplaceDimShuffle{x}.0)
input 0: dtype=float64, shape=(400,), strides=c
input 1: dtype=float64, shape=(1,), strides=c
output 0: dtype=float64, shape=(400,), strides=c
0.1% 99.7% 0.000s 5.01e-06s 1 17 Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
input 0: dtype=float64, shape=(400,), strides=c
output 0: dtype=float64, shape=(), strides=c
0.1% 99.8% 0.000s 5.01e-06s 1 12 Elemwise{Cast{float64}}(InplaceDimShuffle{x}.0)
input 0: dtype=int64, shape=(1,), strides=c
output 0: dtype=float64, shape=(1,), strides=c
0.1% 99.9% 0.000s 5.01e-06s 1 4 Shape_i{0}(y)
input 0: dtype=float32, shape=(400,), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 1 Apply instances account for 0.10%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max if no gc (allow_gc=False): 2469KB (2469KB)
CPU: 1242KB (1242KB)
GPU: 1227KB (1227KB)
---
Max if linker=cvm(default): 2466KB (2464KB)
CPU: 1241KB (1238KB)
GPU: 1225KB (1227KB)
---
Memory saved if views are used: 1225KB (1225KB)
Memory saved if inplace ops are used: 17KB (17KB)
Memory saved if gc is enabled: 3KB (4KB)
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1254400B [(400, 784)] c GpuFromHost(x)
1254400B [(784, 400)] v GpuDimShuffle{1,0}(GpuFromHost.0)
1254400B [(784, 400)] c HostFromGpu(GpuDimShuffle{1,0}.0)
6272B [(784,)] c dot(x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
6272B [(784,)] i Elemwise{Composite{(i0 - (i1 * (i2 + (i3 * i0))))}}[(0, 0)](w, TensorConstant{(1,) of 0...0000000149}, dot.0, TensorConstant{(1,) of 0...9999999553})
3200B [(400,)] c dot(x, w)
3200B [(400,)] i Elemwise{Composite{((-i0) - i1)}}[(0, 0)](dot.0, InplaceDimShuffle{x}.0)
3200B [(400,)] i Elemwise{ScalarSigmoid}[(0, 0)](Elemwise{neg,no_inplace}.0)
3200B [(400,)] c Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
3200B [(400,)] i Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float64}}.0, Elemwise{ScalarSigmoid}[(0, 0)].0, HostFromGpu.0)
3200B [(400,)] c Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}(y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, HostFromGpu.0, Elemwise{neg,no_inplace}.0)
1600B [(400,)] i GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[ 1.]}, GpuFromHost.0)
1600B [(400,)] c HostFromGpu(GpuElemwise{Sub}[(0, 1)].0)
1600B [(400,)] c GpuFromHost(y)
... (remaining 7 Apply account for 448B/3800192B ((0.01%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
"""
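The dominating ``Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}`` node in the output above is the logistic cross-entropy rewritten in terms of ``softplus``. A small numpy sketch (not Theano code; sizes and seed are arbitrary) of why that rewrite is valid:

```python
import numpy as np

def softplus(v):
    # Numerically stable log(1 + exp(v)).
    return np.logaddexp(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(0)
x = rng.randn(400, 784)
w = 0.01 * rng.randn(784)   # small weights keep the logits moderate
b = 0.0
y = rng.randint(0, 2, size=400).astype("float64")

z = -(x.dot(w) + b)          # cf. the Elemwise{Composite{((-i0) - i1)}} node
p_1 = sigmoid(-z)            # P(target = 1)

# The Composite node computes the cross-entropy through softplus:
#   -y*log(p_1) - (1-y)*log(1-p_1)  ==  y*softplus(z) + (1-y)*softplus(-z)
xent = y * softplus(z) + (1.0 - y) * softplus(-z)
direct = -y * np.log(p_1) - (1.0 - y) * np.log(1.0 - p_1)
assert np.allclose(xent, direct)
```

The softplus form avoids taking ``log`` of a probability that underflows to 0, which is why the optimizer prefers it.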
Exercise 5
-----------

- In the last exercises, do you see a speed up with the GPU?
- Where does it come from? (Use ``profile=True``)
- Is there something we can do to speed up the GPU version?
@@ -427,4 +532,3 @@ Known limitations

- A few hundred nodes is fine
- Disabling a few optimizations can speed up compilation
- Usually too many nodes indicates a problem with the graph
@@ -21,13 +21,13 @@ Description

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU
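The ``log(1+x)`` stability claim can be checked with plain numpy: Theano rewrites such expressions into a ``log1p``-style computation, which keeps the precision that the naive form loses:

```python
import numpy as np

x = 1e-20
naive = np.log(1 + x)    # 1 + 1e-20 rounds to 1.0 in float64
stable = np.log1p(x)     # keeps full precision for tiny x

print(naive)   # 0.0
print(stable)  # 1e-20
```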
@@ -38,7 +38,7 @@ Description

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab
@@ -79,7 +79,7 @@ Exercise 1

    f = theano.function([a], out)  # compile function
    print f([0,1,2])
    # prints `array([0,2,1026])`
    theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
    theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)
@@ -101,12 +101,12 @@ Real example

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
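The rest of this example (cost, gradients, updates) is cut off by the diff. As a rough numpy sketch of the update that the compiled ``train`` function performs — plain gradient descent on the mean cross-entropy, leaving out the small L2 penalty of the real example; all names here are illustrative:

```python
import numpy as np

rng = np.random.RandomState(42)
N, feats = 400, 784
x = rng.randn(N, feats)
y = rng.randint(size=N, low=0, high=2).astype("float64")
w = 0.01 * rng.randn(feats)
b = 0.0
lr = 0.1

def cost(w, b):
    p_1 = 1.0 / (1.0 + np.exp(-(x.dot(w) + b)))   # sigmoid(dot(x, w) + b)
    return np.mean(-y * np.log(p_1) - (1 - y) * np.log(1 - p_1))

before = cost(w, b)

# One gradient step (mirrors the dot, sigmoid and Sum nodes in the profile):
p_1 = 1.0 / (1.0 + np.exp(-(x.dot(w) + b)))
err = p_1 - y                      # d(mean cost)/d(logit), times N
w = w - lr * x.T.dot(err) / N      # cf. dot(x.T, ...) in the Apply section
b = b - lr * err.sum() / N         # cf. Sum{acc_dtype=float64}

after = cost(w, b)
assert after < before              # one step decreases the training cost
```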
@@ -176,7 +176,7 @@ Theano flags

Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="profile=True,profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``
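A minimal ``~/.theanorc`` sketch setting the same profiling flags (the ``[global]`` section holds top-level config options):

```ini
[global]
profile = True
profile_memory = True
```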
@@ -185,7 +185,7 @@ Exercise 2
-----------

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T
@@ -268,7 +268,7 @@ GPU

* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to the GPU memory space
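For example (``script.py`` is a placeholder for your own program):

```shell
# Select the first GPU and use float32 throughout:
THEANO_FLAGS="device=gpu0,floatX=float32" python script.py
```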
@@ -277,7 +277,7 @@ GPU

* Be sure to use ``floatX`` (``theano.config.floatX``) in your code

  * Cast inputs before putting them into a shared variable

* Cast "problem": int32 with float32 gives float64

  * A new casting mechanism is being developed
  * Insert a manual cast in your code or use [u]int{8,16}
  * Insert a manual cast around the mean operator (which involves a division by the length, which is an int64!)
@@ -295,7 +295,7 @@ Symbolic variables
------------------

* # Dimensions

  * T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4

* Dtype
@@ -322,7 +322,7 @@ Creating symbolic variables: Broadcastability

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **T.row** and **T.col**
@@ -358,7 +358,7 @@ Benchmarks

.. image:: ../hpcs2011_tutorial/pics/mlp.png

**Convolutional Network**:
256x256 images convolved with 6 7x7 filters,
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filters, elementwise

...
@@ -104,7 +104,7 @@ Exercise 5
-----------

- In the last exercises, do you see a speed up with the GPU?
- Where does it come from? (Use ``profile=True``)
- Is there something we can do to speed up the GPU version?

...
@@ -21,13 +21,13 @@ Description

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU
@@ -38,7 +38,7 @@ Description

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab
@@ -76,7 +76,7 @@ Exercise 1

    f = theano.function([a], out)  # compile function
    print f([0, 1, 2])
    # prints `array([0, 2, 1026])`
    theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
    theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)
@@ -133,7 +133,7 @@ Theano flags

Theano can be configured with flags. They can be defined in two ways:

* With an environment variable: ``THEANO_FLAGS="profile=True,profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``
@@ -142,7 +142,7 @@ Exercise 2
-----------

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt
@@ -225,7 +225,7 @@ GPU

* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process. See the wiki page on using multiple processes for multiple GPUs
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device

  * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one

* Shared variables with float32 dtype are by default moved to the GPU memory space
@@ -234,7 +234,7 @@ GPU

* Be sure to use ``floatX`` (``theano.config.floatX``) in your code

  * Cast inputs before putting them into a shared variable

* Cast "problem": int32 with float32 gives float64

  * Insert a manual cast in your code or use [u]int{8,16}
  * The mean operator is being worked on to make the output stay in float32.
@@ -256,7 +256,7 @@ Symbolic variables
------------------

* # Dimensions

  * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype
@@ -283,7 +283,7 @@ Creating symbolic variables: Broadcastability

Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are **tt.row** and **tt.col**

...
@@ -23,7 +23,7 @@ Theano defines the following modes by name:

- ``'DebugMode'``: A mode for debugging. See :ref:`DebugMode <debugmode>` for details.
- ``'ProfileMode'``: Deprecated, use the Theano flag :attr:`config.profile`.
- ``'DEBUG_MODE'``: Deprecated. Use the string DebugMode.
- ``'PROFILE_MODE'``: Deprecated, use the Theano flag :attr:`config.profile`.

The default mode is typically ``FAST_RUN``, but it can be controlled via the
configuration variable :attr:`config.mode`, which can be
@@ -70,4 +70,3 @@ Reference

    Return a new Mode instance like this one, but with an
    optimizer modified by requiring the given tags.
@@ -17,7 +17,7 @@ You can profile your

functions using either of the following two options:

1. Use the Theano flag :attr:`config.profile` to enable profiling.

   - To enable the memory profiler, use the Theano flag
     :attr:`config.profile_memory` in addition to :attr:`config.profile`.
   - Moreover, to enable the profiling of the Theano optimization phase,

@@ -30,8 +30,8 @@ functions using either of the following two options:

2. Pass the argument :attr:`profile=True` to the function :func:`theano.function <function.function>` and then call :attr:`f.profile.print_summary()` for a single function.

   - Use this option when you want to profile not all the
     functions but one or more specific function(s).
   - You can also combine the profiles of many functions:

   .. testcode::

       profile = theano.compile.ProfileStats()
@@ -68,6 +68,15 @@ compare equal, if their parameters differ (the scalar being

executed). So the Class section will merge more Apply nodes than the
Ops section.

Note that the profile also shows which Ops were run with a C implementation.

Developers wishing to optimize the performance of their graph should
focus on the worst offending Ops and Apply nodes, either by optimizing
an implementation, providing a missing C implementation, or by writing
a graph optimization that eliminates the offending Op altogether.
You should strongly consider emailing one of our lists about your
issue before spending too much time on this.

Here is an example output when we disable some Theano optimizations to
give you a better idea of the difference between sections. With all
optimizations enabled, there would be only one Op left in the graph.

...
...@@ -213,8 +213,8 @@ Tips for Improving Performance on GPU ...@@ -213,8 +213,8 @@ Tips for Improving Performance on GPU
frequently-accessed data (see :func:`shared()<shared.shared>`). When using frequently-accessed data (see :func:`shared()<shared.shared>`). When using
the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
eliminate transfer time for GPU ops using those variables. eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with
the ``profile=True`` flag. This should print some timing information at program
termination. Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.sandbox.cuda.
...@@ -339,7 +339,7 @@ to the exercise in section :ref:`Configuration Settings and Compiling Mode<using
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use the ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to test.

...

...@@ -1395,320 +1395,6 @@ class ProfileStats(object):
print(" Sorry, no tip for today.", file=file)
if False: # old code still to be ported from ProfileMode
def long_print(self, file=sys.stderr, fct_name=None, message=None,
n_apply_to_print=15, n_ops_to_print=20, print_apply=False):
"""
Print a readable summary of the stats.
Parameters
----------
n_apply_to_print
The number of apply to print. Default 15.
n_ops_to_print
The number of ops to print. Default 20.
"""
local_time = sum(self.apply_time.values())
print('')
print('ProfileMode.long_print()')
print('name = %s' % fct_name)
print('msg = %s' % message)
print('---------------------------')
print('')
print('Total time spent running thunks: %.3fs' % local_time)
sop_time = {}
sop_call = {}
sop_op = {}
# map each op class to Bool. True iff all applies were done in c.
sop_c = {}
for a, t in iteritems(op_time):
typ = type(a)
sop_time.setdefault(typ, 0)
sop_time[typ] += t
sop_op.setdefault(typ, 0)
sop_op[typ] += 1
sop_c.setdefault(typ, True)
sop_c[typ] = sop_c[typ] and op_cimpl.get(a, False)
sop_call[typ] = sop_call.get(typ, 0) + op_call[a]
print('\nSingle Op-wise summary: <% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> <nb_call> <nb_op> <Op name>')
sotimes = [(t * 100 / local_time, t, a, sop_c[a],
sop_call[a], sop_op[a]) for a, t in iteritems(sop_time)]
sotimes.sort(key=lambda t: (t[1], t[4], t[5]), reverse=True)
tot = 0
for f, t, a, ci, nb_call, nb_op in sotimes[:n_ops_to_print]:
if nb_call == 0:
assert t == 0
continue
tot += t
ftot = tot * 100 / local_time
if ci:
msg = '*'
else:
msg = ' '
print(' %4.1f%% %5.1f%% %5.3fs %5.3fs %.2es %s %5d %2d %s' % (f, ftot, t, tot, t / nb_call, msg, nb_call, nb_op, a))
print(' ... (remaining %i Ops account for %.2f%%(%.2fs) of the runtime)'\
% (max(0, len(sotimes) - n_ops_to_print),
sum(f for f, t, a, ci, nb_call, nb_op in
sotimes[n_ops_to_print:]),
sum(t for f, t, a, ci, nb_call, nb_op in
sotimes[n_ops_to_print:])))
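The per-Op folding done above can be sketched with plain dictionaries; the per-Apply timings below are made-up numbers, and the keys stand in for (Op name, Apply index) pairs:

```python
from collections import defaultdict

# Made-up per-Apply timings keyed by (op_name, apply_index); mirrors how
# the Op-wise summary folds per-Apply times into per-Op totals and counts.
apply_time = {("Dot", 0): 0.30, ("Dot", 1): 0.20, ("Add", 0): 0.05}

sop_time = defaultdict(float)   # total seconds per Op
sop_call = defaultdict(int)     # number of Apply nodes per Op
for (op, _), t in apply_time.items():
    sop_time[op] += t
    sop_call[op] += 1

local_time = sum(apply_time.values())
# one row per Op: (% of local_time, seconds, nb_call, name), worst first
rows = sorted(
    ((t / local_time * 100, t, sop_call[op], op) for op, t in sop_time.items()),
    reverse=True,
)
for pct, t, nb, op in rows:
    print("%5.1f%% %6.3fs %3d %s" % (pct, t, nb, op))
```

The real summary also tracks whether every Apply of an Op ran its C implementation, but the grouping logic is the same.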
total_time = time.time() - theano_imported_time
total_fct_time = sum(fct_call_time.values())
total_fct_call = sum(fct_call.values())
other_time = total_time - total_fct_time - compile_time
print()
print('Theano fct summary: <% total fct time> <total time> <time per call> <nb call> <fct name>')
for key in fct_call:
if fct_call[key] > 0:
print(' %4.1f%% %.3fs %.2es %d %s' % (
fct_call_time[key] / total_fct_time * 100,
fct_call_time[key],
fct_call_time[key] / fct_call[key],
fct_call[key], key.name))
else:
print(' NOT CALLED', key.name)
if total_fct_time > 0:
time_pr_in_fct = local_time / total_fct_time * 100
time_per_call = total_fct_time / total_fct_call
else:
time_pr_in_fct = 0
time_per_call = 0
print()
print('Time since import %.3fs' % (total_time))
print('Compile time: %.3fs %.1f%%' % (compile_time,
compile_time / total_time * 100))
print('Theano fct call %.3fs %.1f%%' % (total_fct_time,
total_fct_time / total_time *
100))
print((' Theano Op time (included in fct call, Time spent '
'running thunks) %.3fs %.1f%%(of total) %.1f%%(of fct call)' %
(local_time, local_time / total_time * 100, time_pr_in_fct)))
print('Other time since import %.3fs %.1f%%' % (other_time, other_time / total_time * 100))
print('%i Theano fct call, %.3fs per call' % (total_fct_call, time_per_call))
print()
print("List of Apply nodes that don't have float64 inputs but have float64 outputs. Useful to know if we forgot a cast when using floatX=float32 or GPU code.")
print('<Apply> <Apply position> <fct name> <inputs type> <outputs type>')
for fct in fct_call:
for idx, node in enumerate(fct.maker.fgraph.toposort()):
if any(hasattr(i, 'dtype') and i.dtype == 'float64' for i in node.outputs) and not any(hasattr(i, 'dtype') and i.dtype == 'float64' for i in node.inputs):
print(str(node), idx, fct.name, str([getattr(i, 'dtype', None) for i in node.inputs]), str([getattr(i, 'dtype', None) for i in node.outputs]))
if any([x[2].__name__.startswith("Gpu") for x in sotimes]):
cpu = []
gpu = []
trans = []
for so in sotimes:
if so[2].__name__ in ["HostFromGpu", "GpuFromHost"]:
trans.append(so)
elif so[2].__name__.startswith("Gpu"):
gpu.append(so)
else:
cpu.append(so)
sum_cpu = sum(so[1] for so in cpu)
sum_gpu = sum(so[1] for so in gpu)
sum_trans = sum(so[1] for so in trans)
print()
print("Spent %.3fs(%.3f%%) in CPU Ops, %.3fs(%.3f%%) in GPU Ops and %.3fs(%.3f%%) in transfer Ops" % (
sum_cpu, sum_cpu / local_time * 100, sum_gpu, sum_gpu / local_time * 100, sum_trans, sum_trans / local_time * 100))
print("Theano function inputs that are float64")
print("<fct name> <input name> <input type> <str input>")
for fct in fct_call:
for i in fct.input_storage:
if hasattr(i.type, 'dtype') and i.type.dtype == 'float64':
print(fct.name, i.name, i.type, i)
print()
print("Here are tips to potentially make your code run faster (if you think of new ones, suggest them on the mailing list). Test them first as they are not guaranteed to always provide a speedup.")
from theano import tensor as T
from theano.tensor.raw_random import RandomFunction
import theano
import theano.scalar as scal
scalar_op_amdlibm_no_speed_up = [scal.LT, scal.GT, scal.LE, scal.GE, scal.EQ, scal.NEQ, scal.InRange, scal.Switch, scal.OR, scal.XOR, scal.AND, scal.Invert, scal.Maximum,
scal.Minimum, scal.Add, scal.Mul, scal.Sub, scal.TrueDiv, scal.IntDiv, scal.Clip, scal.First, scal.Second, scal.Identity, scal.Cast, scal.Sgn, scal.Neg, scal.Inv, scal.Sqr]
scalar_op_amdlibm_speed_up = [scal.Mod, scal.Pow, scal.Ceil, scal.Floor, scal.RoundHalfToEven, scal.RoundHalfAwayFromZero, scal.Log, scal.Log2, scal.Log10, scal.Log1p, scal.Exp,
scal.Sqrt, scal.Abs, scal.Cos, scal.Sin, scal.Tan, scal.Tanh, scal.Cosh, scal.Sinh, T.nnet.sigm.ScalarSigmoid, T.nnet.sigm.ScalarSoftplus] # Abs, Mod in float{32,64} only
def get_scalar_ops(s):
if isinstance(s, theano.scalar.Composite):
l = []
for node in s.fgraph.toposort():
l += get_scalar_ops(node.op)
return l
else:
return [s]
def list_scalar_op(op):
if isinstance(op.scalar_op, theano.scalar.Composite):
return get_scalar_ops(op.scalar_op)
else:
return [op.scalar_op]
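The recursion in get_scalar_ops above can be sketched standalone, with a stand-in class instead of theano.scalar.Composite so it runs without Theano:

```python
# Stand-in for theano.scalar.Composite: holds its inner "ops" directly
# instead of exposing them through fgraph.toposort().
class Composite:
    def __init__(self, ops):
        self.ops = ops

def get_scalar_ops(s):
    # Recursively flatten nested composites into a flat list of scalar ops.
    if isinstance(s, Composite):
        flat = []
        for inner in s.ops:
            flat += get_scalar_ops(inner)
        return flat
    return [s]

tree = Composite(["exp", Composite(["add", "mul"]), "log"])
print(get_scalar_ops(tree))  # ['exp', 'add', 'mul', 'log']
```

This flattening is what lets the amdlibm tips below inspect every scalar op inside a fused Elemwise, not just the outer Composite.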
def amdlibm_speed_up(op):
if not isinstance(op, T.Elemwise):
return False
else:
l = list_scalar_op(op)
for s_op in l:
if s_op.__class__ in scalar_op_amdlibm_speed_up:
return True
elif s_op.__class__ not in scalar_op_amdlibm_no_speed_up:
print("We don't know if amdlibm will accelerate this scalar op.", s_op)
return False
def exp_float32_op(op):
if not isinstance(op, T.Elemwise):
return False
else:
l = list_scalar_op(op)
return any([s_op.__class__ in [scal.Exp] for s_op in l])
# tip 1
if config.floatX == 'float64':
print(" - Try the Theano flag floatX=float32")
# tip 2
if not config.lib.amdlibm and any([amdlibm_speed_up(a.op) for i, a in apply_time]):
print(" - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operations.")
# tip 3
if not config.lib.amdlibm and any([exp_float32_op(a.op) and a.inputs[0].dtype == 'float32' for i, a in apply_time]):
print(" - With the default gcc libm, exp in float32 is slower than in float64! Try the Theano flag floatX=float64, or install amdlibm and set the Theano flag lib.amdlibm=True")
# tip 4
for a, t in iteritems(apply_time):
node = a
if (isinstance(node.op, T.Dot) and
all([len(i.type.broadcastable) == 2 for i in node.inputs])):
print((" - You have a dot operation that was not optimized "
"to dot22, which is faster. Make sure the inputs are "
"float32 or float64, and that both inputs have the "
"same dtype. Currently they are: %s" %
[i.type for i in node.inputs]))
# tip 5
for a, t in iteritems(apply_time):
node = a
if isinstance(node.op, RandomFunction):
print(" - Replace the default random number generator with "
"'from theano.sandbox.rng_mrg import MRG_RandomStreams "
"as RandomStreams', as this is faster. It is still "
"experimental, but seems to work correctly.")
if config.device.startswith("gpu"):
print(" - MRG_RandomStreams is the only random number "
"generator supported on the GPU.")
break
def print_summary(self,
n_apply_to_print=config.ProfileMode.n_apply_to_print,
n_ops_to_print=config.ProfileMode.n_ops_to_print):
"""
Print three summaries showing where the time is spent. The first is an
Apply-wise summary, the second an Op-wise summary, and the third a
type-Op-wise summary.
The Apply-wise summary prints the timing information for the
worst-offending Apply nodes. These correspond to individual Op
applications within your graph which take the longest to execute (so if
you use dot twice, you will see two entries there).
In the Op-wise summary, all Apply nodes executing the same Op are grouped
together and the total execution time per Op is shown (so if you use dot
twice, you will see only one entry there, corresponding to the sum of the
time spent in each of them). If two Ops have different hash values, they
will be listed separately.
The type-Op-wise summary groups the results by type of Op, so even if
two Ops have different hash values, they will be merged.
There is a hack with the Op-wise summary. Go see it if you want to know
more.
Parameters
----------
n_apply_to_print
The number of applies to print. Default 15, or the
config.ProfileMode.n_apply_to_print flag.
n_ops_to_print
The number of ops to print. Default 20, or the
config.ProfileMode.n_ops_to_print flag.
"""
fct_call_time = self.mode.fct_call_time
fct_call = self.mode.fct_call
apply_time = self.apply_time
op_cimpl = self.op_cimpl
message = self.message
outputs_size = self.outputs_size
self.print_summary_("print_summary",
None,
None,
None,
apply_time,
op_cimpl,
message,
outputs_size,
n_apply_to_print,
n_ops_to_print)
def print_diff_summary(self, other, n_apply_to_print=15,
n_ops_to_print=20):
"""
As print_summary, but prints the difference between two different
profiling results.
TODO: We also don't print the Apply-wise summary, as it doesn't work
for now.
TODO: make comparison with GPU code.
Parameters
----------
other
The other instance of ProfileMode that we want to be compared to.
n_apply_to_print
The number of apply to print. Default 15.
n_ops_to_print
The number of ops to print. Default 20.
"""
def diff_dict(a_time, b_time_):
r = {}
b_time = copy.copy(b_time_)
for a, ta in iteritems(a_time):
r.setdefault(a, 0)
tb = b_time.pop(a, 0)
r[a] += ta - tb
# they are missing in a
for a, t in iteritems(b_time):
r.setdefault(a, 0)
r[a] += t
return r
compile_time = self.compile_time - other.compile_time
fct_call_time = diff_dict(self.fct_call_time, other.fct_call_time)
fct_call = diff_dict(self.fct_call, other.fct_call)
apply_time = diff_dict(self.apply_time, other.apply_time)
op_cimpl = self.op_cimpl and other.op_cimpl
message = self.message
outputs_size = diff_dict(self.outputs_size, other.outputs_size)
self.print_summary_(
"print_diff_summary", compile_time, fct_call_time, fct_call,
apply_time, op_cimpl, message, outputs_size,
n_apply_to_print=n_apply_to_print,
n_ops_to_print=n_ops_to_print, print_apply=False)
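The diff_dict helper above can be sketched in plain Python 3 (iteritems is Python 2); note that, as in the original, keys present only in the second profile are added as-is:

```python
def diff_dict(a_time, b_time_):
    # Per-key difference of two timing dicts: a - b for shared keys;
    # keys only in a keep their time, and (mirroring the original
    # helper) keys only in b are added as-is.
    r = {}
    b_time = dict(b_time_)        # copy so we can pop without mutating
    for a, ta in a_time.items():
        r[a] = r.get(a, 0) + ta - b_time.pop(a, 0)
    for a, tb in b_time.items():  # keys missing in a
        r[a] = r.get(a, 0) + tb
    return r

print(diff_dict({"dot": 2.0, "add": 1.0}, {"dot": 0.5, "mul": 0.25}))
```

A usage note: this is the per-key merge that print_diff_summary applies to compile_time, fct_call_time, apply_time, and outputs_size before printing.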
class ScanProfileStats(ProfileStats):
callcount = 0.0
nbsteps = 0.0

...