提交 baccd5c7 authored 作者: Amjad Almahairi's avatar Amjad Almahairi

adding output of new profiling mechanism and modifying final message

上级 5d767fbb
...@@ -89,238 +89,192 @@ print predict() ...@@ -89,238 +89,192 @@ print predict()
""" """
# 2. Profiling # 2. Profiling
#
# same code as above but run with following command lines:
# THEANO_FLAGS=mode=ProfileMode,device=gpu python program_name.py
# THEANO_FLAGS=mode=ProfileMode,device=cpu python program_name.py
# for GPU and CPU
# 2.1 Profiling output for CPU computations # 2.1 Profiling for CPU computations
# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py
$ THEANO_FLAGS=mode=ProfileMode,device=cpu python program_name.py # You'll see first the output of the script:
Used the cpu
target values for D
prediction on D
Used the cpu Used the cpu
target values for D target values for D
prediction on D prediction on D
ProfileMode.print_summary() # Followed by the output of profiling.. You'll see profiling results for each function
--------------------------- # in the script, followed by a summary for all functions.
# We'll show here only the summary:
Time since import 12.586s
Theano compile time: 0.000s (0.0% since import) Function profiling
Optimization time: 0.000s ==================
Linker time: 0.000s Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Theano fct call 5.147s (40.9% since import) Time in 10002 calls to Function.__call__: 1.590916e+00s
Theano Op time 3.595s 28.6%(since import) 69.8%(of fct call) Time in Function.fn.__call__: 1.492365e+00s (93.805%)
Theano function overhead in ProfileMode 1.552s 12.3%(since import) 30.2%(of fct call) Time in thunks: 1.408159e+00s (88.512%)
20002 Theano fct call, 0.000s per call Total compile time: 6.309664e+00s
Rest of the time since import 7.440s 59.1% Number of Apply nodes: 25
Theano Optimizer time: 4.848340e-01s
Theano fct summary: Theano validate time: 5.454302e-03s
<% total fct time> <total time> <time per call> <nb call> <fct name> Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s
49.9% 2.567s 2.57e-04s 10000 train
0.0% 0.000s 1.24e-04s 1 predict Class
0.0% 0.000s 1.26e-04s 1 predict ---
50.1% 2.579s 2.58e-04s 10000 train <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 theano.tensor.blas_c.CGemv
Single Op-wise summary: 30.1% 89.7% 0.424s 4.71e-06s C 90001 10 theano.tensor.elemwise.Elemwise
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply> <Op name> 5.5% 95.2% 0.078s 7.79e-02s Py 1 1 theano.tensor.blas.Gemv
59.3% 59.3% 2.133s 2.133s 5.33e-05s * 40002 1 6 <class 'theano.tensor.blas_c.CGemv'> 1.9% 97.1% 0.026s 1.30e-06s C 20001 3 theano.tensor.basic.Alloc
34.4% 93.8% 1.238s 3.371s 6.19e-06s * 200002 11 22 <class 'theano.tensor.elemwise.Elemwise'> 1.3% 98.4% 0.018s 1.85e-06s C 10000 1 theano.tensor.elemwise.Sum
2.8% 96.6% 0.100s 3.471s 2.51e-06s * 40002 1 6 <class 'theano.tensor.basic.Alloc'> 1.0% 99.4% 0.014s 4.78e-07s C 30001 4 theano.tensor.elemwise.DimShuffle
2.1% 98.7% 0.075s 3.546s 1.26e-06s * 60002 2 8 <class 'theano.tensor.elemwise.DimShuffle'> 0.6% 100.0% 0.008s 4.23e-07s C 20001 3 theano.compile.ops.Shape_i
0.7% 99.3% 0.024s 3.571s 6.11e-07s * 40002 1 6 <class 'theano.tensor.opt.Shape_i'> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
0.7% 100.0% 0.024s 3.595s 1.18e-06s * 20000 1 2 <class 'theano.tensor.elemwise.Sum'>
... (remaining 0 single Op account for 0.00%(0.00s) of the runtime) Ops
(*) Op is running a c implementation ---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
Op-wise summary: 59.6% 59.6% 0.839s 4.19e-05s C 20001 3 CGemv{inplace}
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op name> 15.8% 75.4% 0.223s 2.23e-05s C 10000 1 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
59.3% 59.3% 2.133s 2.133s 5.33e-05s * 40002 6 CGemv{inplace} 7.7% 83.1% 0.109s 1.09e-05s C 10000 1 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
18.1% 77.4% 0.650s 2.783s 3.25e-05s * 20000 2 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}} 5.5% 88.7% 0.078s 7.79e-02s Py 1 1 Gemv{no_inplace}
6.4% 83.9% 0.231s 3.014s 1.16e-05s * 20000 2 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)] 4.3% 92.9% 0.060s 6.00e-06s C 10000 1 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
4.0% 87.8% 0.142s 3.157s 7.11e-06s * 20000 2 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)] 1.9% 94.8% 0.026s 1.30e-06s C 20001 3 Alloc
2.8% 90.6% 0.100s 3.257s 2.51e-06s * 40002 6 Alloc 1.3% 96.1% 0.018s 1.85e-06s C 10000 1 Sum{acc_dtype=float64}
1.4% 92.1% 0.052s 3.309s 1.30e-06s * 40002 6 InplaceDimShuffle{x} 0.7% 96.8% 0.009s 4.73e-07s C 20001 3 InplaceDimShuffle{x}
1.1% 93.1% 0.038s 3.347s 1.92e-06s * 20000 2 Elemwise{Cast{float32}} 0.6% 97.4% 0.009s 8.52e-07s C 10000 1 Elemwise{sub,no_inplace}
1.1% 94.2% 0.038s 3.386s 1.91e-06s * 20000 2 Elemwise{sub,no_inplace} 0.6% 98.0% 0.008s 4.23e-07s C 20001 3 Shape_i{0}
1.0% 95.2% 0.036s 3.421s 1.79e-06s * 20000 2 Elemwise{gt,no_inplace} 0.5% 98.5% 0.007s 7.06e-07s C 10000 1 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
0.8% 96.0% 0.029s 3.450s 1.44e-06s * 20000 2 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)] 0.5% 98.9% 0.007s 6.57e-07s C 10000 1 Elemwise{neg,no_inplace}
0.8% 96.8% 0.028s 3.479s 1.42e-06s * 20000 2 Elemwise{neg,no_inplace} 0.3% 99.3% 0.005s 4.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.7% 97.5% 0.024s 3.503s 6.11e-07s * 40002 6 Shape_i{0} 0.3% 99.5% 0.004s 3.78e-07s C 10000 1 Elemwise{inv,no_inplace}
0.7% 98.1% 0.024s 3.527s 1.18e-06s * 20000 2 Sum 0.2% 99.8% 0.003s 3.44e-07s C 10000 1 Elemwise{Cast{float32}}
0.6% 98.8% 0.023s 3.550s 1.16e-06s * 20000 2 InplaceDimShuffle{1,0} 0.2% 100.0% 0.003s 3.01e-07s C 10000 1 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.6% 99.4% 0.023s 3.573s 1.15e-06s * 20000 2 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)] 0.0% 100.0% 0.000s 8.11e-06s C 1 1 Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
0.6% 100.0% 0.022s 3.595s 1.08e-06s * 20000 2 Elemwise{inv,no_inplace} ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
0.0% 100.0% 0.000s 3.595s 1.19e-05s * 2 2 Elemwise{Composite{[Composite{[Composite{[Composite{[GT(scalar_sigmoid(i0), i1)]}(neg(i0), i1)]}(sub(i0, i1), i2)]}(neg(i0), i1, i2)]}}
... (remaining 0 Op account for 0.00%(0.00s) of the runtime) Apply
(*) Op is running a c implementation ------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
Apply-wise summary: 31.6% 31.6% 0.445s 4.45e-05s 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
<% of local_time spent at this position> <cumulative %%> <apply time> <cumulative seconds> <time per call> [*] <nb_call> <Apply position> <Apply Op name> 27.9% 59.6% 0.393s 3.93e-05s 10000 17 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
14.9% 14.9% 0.536s 0.536s 5.36e-05s * 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{1.0}) 15.8% 75.4% 0.223s 2.23e-05s 10000 14 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
14.9% 29.8% 0.534s 1.070s 5.34e-05s * 10000 18 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417}) 7.7% 83.1% 0.109s 1.09e-05s 10000 15 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
14.8% 44.6% 0.532s 1.602s 5.32e-05s * 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{1.0}) 5.5% 88.7% 0.078s 7.79e-02s 1 0 Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
14.7% 59.3% 0.530s 2.132s 5.30e-05s * 10000 18 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417}) 4.3% 92.9% 0.060s 6.00e-06s 10000 13 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
9.1% 68.4% 0.327s 2.460s 3.27e-05s * 10000 13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}(y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0) 1.3% 94.2% 0.018s 1.85e-06s 10000 16 Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
9.0% 77.4% 0.323s 2.783s 3.23e-05s * 10000 13 Elemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]}}(y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0) 1.0% 95.2% 0.013s 1.34e-06s 10000 5 Alloc(TensorConstant{0.0}, Shape_i{0}.0)
3.2% 80.6% 0.116s 2.899s 1.16e-05s * 10000 16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0) 0.9% 96.1% 0.013s 1.27e-06s 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
3.2% 83.9% 0.116s 3.014s 1.16e-05s * 10000 16 Elemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0) 0.6% 96.7% 0.009s 8.52e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
2.0% 85.8% 0.071s 3.086s 7.12e-06s * 10000 14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwise{neg,no_inplace}.0) 0.5% 97.2% 0.007s 7.06e-07s 10000 9 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
2.0% 87.8% 0.071s 3.156s 7.09e-06s * 10000 14 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)](Elemwise{neg,no_inplace}.0) 0.5% 97.6% 0.007s 6.57e-07s 10000 11 Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
0.9% 88.8% 0.034s 3.190s 3.38e-06s * 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0) 0.4% 98.1% 0.006s 6.27e-07s 10000 0 InplaceDimShuffle{x}(b)
0.9% 89.7% 0.034s 3.224s 3.37e-06s * 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0) 0.4% 98.5% 0.006s 5.90e-07s 10000 1 Shape_i{0}(x)
0.5% 90.2% 0.019s 3.243s 1.93e-06s * 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0) 0.3% 98.9% 0.005s 4.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.5% 90.8% 0.019s 3.262s 1.92e-06s * 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y) 0.3% 99.1% 0.004s 3.78e-07s 10000 10 Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
0.5% 91.3% 0.019s 3.282s 1.90e-06s * 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y) 0.2% 99.4% 0.003s 3.44e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
... (remaining 35 Apply instances account for 8.71%(0.31s) of the runtime) 0.2% 99.6% 0.003s 3.19e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
(*) Op is running a c implementation 0.2% 99.8% 0.003s 3.01e-07s 10000 18 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.2% 100.0% 0.003s 2.56e-07s 10000 3 Shape_i{0}(y)
Profile of Theano functions memory: ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
We skipped 4 theano function(s). Each of them used less then 1024B(theano flags ProfileMode.min_memory_size) of total intermediate memory size
Here are tips to potentially make your code run faster # 2.1 Profiling for CPU computations
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup. # In your terminal, type:
Sorry, no tip for today. $ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py
# You'll see first the output of the script:
# 2.2 Profiling output for GPU computations
$ THEANO_FLAGS=mode=ProfileMode,device=gpu python program_name.py
Using gpu device 0: GeForce GTX 580
Used the gpu
target values for D
prediction on D
Used the gpu Used the gpu
target values for D target values for D
prediction on D prediction on D
ProfileMode.print_summary() # Profiling summary for all functions:
---------------------------
Function profiling
Time since import 25.682s ==================
Theano compile time: 0.000s (0.0% since import) Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Optimization time: 0.000s Time in 10002 calls to Function.__call__: 3.535239e+00s
Linker time: 0.000s Time in Function.fn.__call__: 3.420863e+00s (96.765%)
Theano fct call 17.052s (66.4% since import) Time in thunks: 2.865905e+00s (81.067%)
Theano Op time 14.548s 56.6%(since import) 85.3%(of fct call) Total compile time: 4.728150e-01s
Theano function overhead in ProfileMode 2.505s 9.8%(since import) 14.7%(of fct call) Number of Apply nodes: 36
20002 Theano fct call, 0.001s per call Theano Optimizer time: 4.283385e-01s
Rest of the time since import 8.630s 33.6% Theano validate time: 7.687330e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s
Theano fct summary:
<% total fct time> <total time> <time per call> <nb call> <fct name> Class
50.0% 8.526s 8.53e-04s 10000 train ---
0.0% 0.001s 1.09e-03s 1 predict <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
50.0% 8.524s 8.52e-04s 10000 train 45.7% 45.7% 1.308s 1.64e-05s C 80001 9 theano.sandbox.cuda.basic_ops.GpuElemwise
0.0% 0.001s 1.10e-03s 1 predict 17.2% 62.8% 0.492s 2.46e-05s C 20002 4 theano.sandbox.cuda.blas.GpuGemv
15.1% 77.9% 0.433s 2.17e-05s C 20001 3 theano.sandbox.cuda.basic_ops.GpuAlloc
Single Op-wise summary: 8.2% 86.1% 0.234s 1.17e-05s C 20002 4 theano.sandbox.cuda.basic_ops.HostFromGpu
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply> <Op name> 7.2% 93.3% 0.207s 2.07e-05s C 10000 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
54.8% 54.8% 7.968s 7.968s 1.33e-04s 60002 1 8 <class 'theano.sandbox.cuda.basic_ops.GpuFromHost'> 4.4% 97.7% 0.127s 1.27e-05s C 10003 4 theano.sandbox.cuda.basic_ops.GpuFromHost
16.2% 71.0% 2.358s 10.325s 1.47e-05s * 160002 9 18 <class 'theano.sandbox.cuda.basic_ops.GpuElemwise'> 0.9% 98.6% 0.025s 8.23e-07s C 30001 4 theano.sandbox.cuda.basic_ops.GpuDimShuffle
12.3% 83.3% 1.795s 12.120s 4.49e-05s * 40002 1 6 <class 'theano.sandbox.cuda.blas.GpuGemv'> 0.7% 99.3% 0.020s 9.88e-07s C 20001 3 theano.tensor.elemwise.Elemwise
7.0% 90.4% 1.024s 13.144s 2.56e-05s 40002 1 6 <class 'theano.sandbox.cuda.basic_ops.HostFromGpu'> 0.5% 99.8% 0.014s 7.18e-07s C 20001 3 theano.compile.ops.Shape_i
5.0% 95.4% 0.728s 13.872s 1.82e-05s * 40002 1 6 <class 'theano.sandbox.cuda.basic_ops.GpuAlloc'> 0.2% 100.0% 0.006s 5.78e-07s C 10000 1 theano.tensor.elemwise.DimShuffle
2.1% 97.4% 0.300s 14.171s 1.50e-05s * 20000 1 2 <class 'theano.sandbox.cuda.basic_ops.GpuSum'> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
1.3% 98.7% 0.189s 14.360s 3.15e-06s * 60002 3 8 <class 'theano.sandbox.cuda.basic_ops.GpuDimShuffle'>
0.6% 99.4% 0.094s 14.454s 2.35e-06s * 40002 2 6 <class 'theano.tensor.elemwise.Elemwise'> Ops
0.3% 99.7% 0.048s 14.503s 1.21e-06s * 40002 1 6 <class 'theano.tensor.opt.Shape_i'> ---
0.3% 100.0% 0.045s 14.548s 2.25e-06s * 20000 1 2 <class 'theano.tensor.elemwise.DimShuffle'> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
... (remaining 0 single Op account for 0.00%(0.00s) of the runtime) 17.2% 17.2% 0.492s 2.46e-05s C 20001 3 GpuGemv{inplace}
(*) Op is running a c implementation 8.2% 25.3% 0.234s 1.17e-05s C 20002 4 HostFromGpu
8.0% 33.3% 0.228s 2.28e-05s C 10001 2 GpuAlloc{memset_0=True}
Op-wise summary: 7.4% 40.7% 0.211s 2.11e-05s C 10000 1 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
<% of local_time spent on this kind of Op> <cumulative %> <self seconds> <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op name> 7.2% 47.9% 0.207s 2.07e-05s C 10000 1 GpuCAReduce{add}{1}
54.8% 54.8% 7.968s 7.968s 1.33e-04s 60002 8 GpuFromHost 7.1% 55.0% 0.205s 2.05e-05s C 10000 1 GpuAlloc
12.3% 67.1% 1.795s 9.763s 4.49e-05s * 40002 6 GpuGemv{inplace} 6.9% 62.0% 0.198s 1.98e-05s C 10000 1 GpuElemwise{sub,no_inplace}
7.0% 74.1% 1.024s 10.786s 2.56e-05s 40002 6 HostFromGpu 6.9% 68.9% 0.198s 1.98e-05s C 10000 1 GpuElemwise{inv,no_inplace}
5.0% 79.1% 0.728s 11.514s 1.82e-05s * 40002 6 GpuAlloc 6.2% 75.1% 0.178s 1.78e-05s C 10000 1 GpuElemwise{neg,no_inplace}
2.3% 81.4% 0.334s 11.848s 1.67e-05s * 20000 2 GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)] 5.6% 80.6% 0.159s 1.59e-05s C 10000 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
2.2% 83.6% 0.319s 12.167s 1.59e-05s * 20000 2 GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace} 4.4% 85.1% 0.127s 1.27e-05s C 10003 4 GpuFromHost
2.1% 85.7% 0.301s 12.468s 1.50e-05s * 20000 2 GpuElemwise{neg,no_inplace} 4.3% 89.4% 0.124s 1.24e-05s C 10000 1 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
2.1% 87.8% 0.300s 12.768s 1.50e-05s * 20000 2 GpuSum{1} 4.2% 93.6% 0.121s 1.21e-05s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
2.0% 89.8% 0.292s 13.060s 1.46e-05s * 20000 2 GpuElemwise{inv,no_inplace} 4.2% 97.7% 0.119s 1.19e-05s C 10000 1 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
1.9% 91.7% 0.283s 13.343s 1.42e-05s * 20000 2 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)] 0.5% 98.2% 0.014s 7.18e-07s C 20001 3 Shape_i{0}
1.9% 93.7% 0.281s 13.625s 1.41e-05s * 20000 2 GpuElemwise{sub,no_inplace} 0.5% 98.7% 0.013s 1.33e-06s C 10001 2 Elemwise{gt,no_inplace}
1.9% 95.5% 0.273s 13.898s 1.37e-05s * 20000 2 GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)] 0.3% 99.0% 0.010s 9.81e-07s C 10000 1 GpuDimShuffle{1,0}
1.9% 97.4% 0.273s 14.171s 1.37e-05s * 20000 2 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)] 0.3% 99.3% 0.008s 7.90e-07s C 10000 1 GpuDimShuffle{0}
1.0% 98.4% 0.141s 14.313s 7.06e-06s * 20002 4 GpuDimShuffle{x} 0.2% 99.6% 0.007s 6.97e-07s C 10001 2 GpuDimShuffle{x}
0.4% 98.8% 0.057s 14.370s 2.87e-06s * 20002 4 Elemwise{gt,no_inplace} 0.2% 99.8% 0.006s 6.50e-07s C 10000 1 Elemwise{Cast{float32}}
0.3% 99.1% 0.048s 14.418s 1.21e-06s * 40002 6 Shape_i{0} ... (remaining 3 Ops account for 0.20%(0.01s) of the runtime)
0.3% 99.4% 0.045s 14.463s 2.25e-06s * 20000 2 InplaceDimShuffle{x}
0.3% 99.7% 0.037s 14.500s 1.83e-06s * 20000 2 Elemwise{Cast{float32}} Apply
0.2% 99.8% 0.025s 14.525s 1.24e-06s * 20000 2 GpuDimShuffle{0} ------
0.2% 100.0% 0.023s 14.548s 1.14e-06s * 20000 2 GpuDimShuffle{1,0} <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
... (remaining 1 Op account for 0.00%(0.00s) of the runtime) 8.8% 8.8% 0.251s 2.51e-05s 10000 22 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
(*) Op is running a c implementation 8.4% 17.2% 0.241s 2.41e-05s 10000 7 GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
8.0% 25.1% 0.228s 2.28e-05s 10000 5 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
Apply-wise summary: 7.4% 32.5% 0.211s 2.11e-05s 10000 13 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
<% of local_time spent at this position> <cumulative %%> <apply time> <cumulative seconds> <time per call> [*] <nb_call> <Apply position> <Apply Op name> 7.2% 39.7% 0.207s 2.07e-05s 10000 21 GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
24.0% 24.0% 3.493s 3.493s 3.49e-04s 10000 1 GpuFromHost(x) 7.1% 46.9% 0.205s 2.05e-05s 10000 17 GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
23.9% 47.9% 3.479s 6.972s 3.48e-04s 10000 1 GpuFromHost(x) 6.9% 53.8% 0.198s 1.98e-05s 10000 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
4.3% 52.3% 0.629s 7.602s 6.29e-05s * 10000 24 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417}) 6.9% 60.7% 0.198s 1.98e-05s 10000 12 GpuElemwise{inv,no_inplace}(GpuFromHost.0)
4.3% 56.6% 0.629s 8.231s 6.29e-05s * 10000 24 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[Composite{[Composite{[Composite{[mul(i0, add(i1, i2))]}(i0, neg(i1), true_div(i2, i3))]}(i0, mul(i1, i2, i3), i4, i5)]}(i0, i1, i2, exp(i3), i4, i5)]}}[(0, 0)].0, TensorConstant{0.999800026417}) 6.2% 66.9% 0.178s 1.78e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
1.8% 58.4% 0.269s 8.499s 2.69e-05s * 10000 9 GpuGemv{inplace}(GpuAlloc.0, TensorConstant{1.0}, GpuFromHost.0, w, TensorConstant{1.0}) 5.6% 72.5% 0.159s 1.59e-05s 10000 19 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
1.8% 60.3% 0.268s 8.767s 2.68e-05s * 10000 9 GpuGemv{inplace}(GpuAlloc.0, TensorConstant{1.0}, GpuFromHost.0, w, TensorConstant{1.0}) 4.8% 77.3% 0.138s 1.38e-05s 10000 18 HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
1.8% 62.1% 0.266s 9.033s 2.66e-05s 10000 18 HostFromGpu(GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}.0) 4.4% 81.7% 0.126s 1.26e-05s 10000 10 GpuFromHost(Elemwise{Cast{float32}}.0)
1.8% 63.9% 0.262s 9.296s 2.62e-05s 10000 18 HostFromGpu(GpuElemwise{Composite{[Composite{[Composite{[sub(mul(i0, i1), neg(i2))]}(i0, scalar_softplus(i1), mul(i2, i3))]}(i0, i1, i2, scalar_softplus(i3))]},no_inplace}.0) 4.3% 86.0% 0.124s 1.24e-05s 10000 9 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
1.8% 65.7% 0.260s 9.555s 2.60e-05s 10000 3 GpuFromHost(y) 4.2% 90.2% 0.121s 1.21e-05s 10000 15 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
1.8% 67.5% 0.258s 9.813s 2.58e-05s 10000 3 GpuFromHost(y) 4.2% 94.4% 0.119s 1.19e-05s 10000 23 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
1.7% 69.2% 0.248s 10.061s 2.48e-05s 10000 20 HostFromGpu(GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0) 3.4% 97.7% 0.096s 9.61e-06s 10000 16 HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
1.7% 70.9% 0.247s 10.309s 2.47e-05s 10000 20 HostFromGpu(GpuElemwise{ScalarSigmoid{output_types_preference=transfer_type{0}}}[(0, 0)].0) 0.5% 98.2% 0.013s 1.33e-06s 10000 20 Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
1.6% 72.5% 0.238s 10.547s 2.38e-05s 10000 12 GpuFromHost(Elemwise{Cast{float32}}.0) 0.3% 98.5% 0.010s 9.81e-07s 10000 2 GpuDimShuffle{1,0}(x)
1.6% 74.1% 0.237s 10.785s 2.37e-05s 10000 12 GpuFromHost(Elemwise{Cast{float32}}.0) 0.3% 98.8% 0.008s 8.27e-07s 10000 1 Shape_i{0}(x)
1.3% 75.4% 0.185s 10.969s 1.85e-05s * 10000 6 GpuAlloc(CudaNdarrayConstant{[ 1.58212732e-09]}, Shape_i{0}.0) 0.3% 99.1% 0.008s 7.90e-07s 10000 14 GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
... (remaining 53 Apply instances account for 24.60%(3.58s) of the runtime) ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)
(*) Op is running a c implementation
Some info useful for gpu:
Spent 1.211s(8.324%) in cpu Op, 13.337s(91.676%) in gpu Op and 0.000s(0.000%) transfert Op
Theano function input that are float64
<fct name> <input name> <input type> <str input>
List of apply that don't have float64 as input but have float64 in outputs
(Useful to know if we forgot some cast when using floatX=float32 or gpu code)
<Apply> <Apply position> <fct name> <inputs type> <outputs type>
Profile of Theano functions memory:
(This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
We skipped 4 theano function(s). Each of them used less then 1024B(theano flags ProfileMode.min_memory_size) of total intermediate memory size
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
# 3. Conclusions # 3. Conclusions
Facts: Facts:
Examine and compare 'Single Op-wise' summaries for CPU and GPU. GPU ops 'GpuFromHost' (and 'HostFromGpu') by themselves Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize its overhead. consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
In addition, you probably need to increase the input data size (e.g. set N = 4000) to see the gain of the GPU. Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
Furthermore, notice that each of the GPU ops consumes more time than its CPU counterpart. if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.
An additional experiment also confirms that adding an 'out' instance in the GPU version only brings about a minor
improvement in this situation.
Tentative conclusion:
The large number of external training steps (10000) generates disproportionate GPU overhead costs.
Tentative solution:
Include the training steps inside the definition of the Theano function.
Implement this solution and put it to test.
""" """
Markdown 格式
0%
您添加了 0 到此讨论。请谨慎行事。
请先完成此评论的编辑!
注册 或者 后发表评论