提交 123c8a0e authored 作者: Frederic's avatar Frederic

Start of the doc how to profile Theano opt

上级 245da0cf
...@@ -615,3 +615,274 @@ Navigator ...@@ -615,3 +615,274 @@ Navigator
------------------- -------------------
WRITEME WRITEME
.. _profiling_opt:
Profiling Theano function compilation
=====================================
You find that compiling a Theano function is taking too much time? You
can get profiling information about Theano optimization. The normal
:ref:`Theano profiler <tut_profiling>` will provide you with very high
level information. The indentation show the included in/subset
relationship between sections. The top of its output look like this:
.. code-block:: none
Function profiling
==================
Message: PATH_TO_A_FILE:23
Time in 0 calls to Function.__call__: 0.000000e+00s
Total compile time: 1.131874e+01s
Number of Apply nodes: 50
Theano Optimizer time: 1.152431e+00s
Theano validate time: 2.790451e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s
Import time 1.153541e-02s
Time in all call to theano.grad() 4.732513e-02s
Explanations:
* ``Total compile time: 1.131874e+01s`` give the total time spend in inside `theano.function`.
* ``Number of Apply nodes: 50`` mean that after optimization, there is 50 apply node in the graph.
* ``Theano Optimizer time: 1.152431e+00s`` mean that we spend 1.15s in the ``theano.function`` phase where we optimize(modify) the graph to make it faster/more stable numerically/work on GPU/...
* ``Theano validate time: 2.790451e-02s`` mean that we spend 2.8e-2s in the validate subset of the optimization phase.
* ``Theano Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s`` mean that we spend 7.9e-2s in the ``theano.function`` linker phase.
* ``Import time 1.153541e-02s`` is a subset of the linker time where we import the compiled module.
* ``Time in all call to theano.grad() 4.732513e-02s`` tell that we spent a total of 4.7e-2s in all calls to theano.grad. This is outside call to ``theano.function``.
The linker phase include the generation of the C code, the time spent
by g++ to compile and the time needed by Theano to build the object we
return. The C code generation and compilation is cached, so the first
time you compile a function and the following one could take different
amount of execution time.
Detailed profiling of Theano optimizer
--------------------------------------
You can get more detailed profiling information about the Theano
optimizer phase by setting to `True` the Theano flags
:attr:`config.profile_optimizer`.
This will output something like this:
.. code-block:: none
Optimizer Profile
-----------------
SeqOptimizer OPT_FAST_RUN time 1.152s for 123/50 nodes before/after optimization
0.028s for fgraph.validate()
0.131s for callback
time - (name, class, index) - validate time
0.751816s - ('canonicalize', 'EquilibriumOptimizer', 4) - 0.004s
EquilibriumOptimizer canonicalize
time 0.751s for 14 passes
nb nodes (start, end, max) 108 81 117
time io_toposort 0.029s
time in local optimizers 0.687s
time in global optimizers 0.010s
0 - 0.050s 27 (0.000s in global opts, 0.002s io_toposort) - 108 nodes - ('local_dimshuffle_lift', 9) ('local_upcast_elemwise_constant_inputs', 5) ('local_shape_to_shape_i', 3) ('local_fill_sink', 3) ('local_fill_to_alloc', 2) ...
1 - 0.288s 26 (0.002s in global opts, 0.002s io_toposort) - 117 nodes - ('local_dimshuffle_lift', 8) ('local_fill_sink', 4) ('constant_folding', 4) ('local_useless_elemwise', 3) ('local_subtensor_make_vector', 3) ...
2 - 0.044s 13 (0.002s in global opts, 0.003s io_toposort) - 96 nodes - ('constant_folding', 4) ('local_dimshuffle_lift', 3) ('local_fill_sink', 3) ('local_useless_elemwise', 1) ('local_fill_to_alloc', 1) ...
3 - 0.045s 11 (0.000s in global opts, 0.002s io_toposort) - 91 nodes - ('constant_folding', 3) ('local_fill_to_alloc', 2) ('local_dimshuffle_lift', 2) ('local_mul_canonizer', 2) ('MergeOptimizer', 1) ...
4 - 0.035s 8 (0.002s in global opts, 0.002s io_toposort) - 93 nodes - ('local_fill_sink', 3) ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('constant_folding', 1)
5 - 0.035s 6 (0.000s in global opts, 0.002s io_toposort) - 88 nodes - ('local_fill_sink', 2) ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('local_mul_canonizer', 1)
6 - 0.038s 10 (0.001s in global opts, 0.002s io_toposort) - 95 nodes - ('local_fill_sink', 3) ('local_dimshuffle_lift', 3) ('constant_folding', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1)
7 - 0.032s 5 (0.001s in global opts, 0.002s io_toposort) - 91 nodes - ('local_fill_sink', 3) ('MergeOptimizer', 1) ('local_dimshuffle_lift', 1)
8 - 0.034s 5 (0.000s in global opts, 0.002s io_toposort) - 92 nodes - ('local_fill_sink', 3) ('MergeOptimizer', 1) ('local_greedy_distributor', 1)
9 - 0.031s 6 (0.001s in global opts, 0.002s io_toposort) - 90 nodes - ('local_fill_sink', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('local_dimshuffle_lift', 1) ('local_greedy_distributor', 1)
10 - 0.032s 5 (0.000s in global opts, 0.002s io_toposort) - 89 nodes - ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('local_fill_sink', 1)
11 - 0.030s 5 (0.000s in global opts, 0.002s io_toposort) - 88 nodes - ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('constant_folding', 1)
12 - 0.026s 1 (0.000s in global opts, 0.003s io_toposort) - 81 nodes - ('MergeOptimizer', 1)
13 - 0.031s 0 (0.000s in global opts, 0.003s io_toposort) - 81 nodes -
times - times applied - nb node created - name:
0.263s - 15 - 0 - constant_folding
0.096s - 2 - 14 - local_greedy_distributor
0.066s - 4 - 19 - local_mul_canonizer
0.046s - 28 - 57 - local_fill_sink
0.042s - 35 - 78 - local_dimshuffle_lift
0.018s - 5 - 15 - local_upcast_elemwise_constant_inputs
0.010s - 11 - 4 - MergeOptimizer
0.009s - 4 - 0 - local_useless_elemwise
0.005s - 11 - 2 - local_fill_to_alloc
0.004s - 3 - 6 - local_neg_to_mul
0.002s - 1 - 3 - local_lift_transpose_through_dot
0.002s - 3 - 4 - local_shape_to_shape_i
0.002s - 2 - 4 - local_subtensor_lift
0.001s - 3 - 0 - local_subtensor_make_vector
0.001s - 1 - 1 - local_sum_all_to_none
0.131s - in 62 optimization that where not used (display only those with a runtime > 0)
0.050s - local_add_canonizer
0.018s - local_mul_zero
0.016s - local_one_minus_erf
0.010s - local_func_inv
0.006s - local_0_dot_x
0.005s - local_track_shape_i
0.004s - local_mul_switch_sink
0.004s - local_fill_cut
0.004s - local_one_minus_erf2
0.003s - local_remove_switch_const_cond
0.003s - local_cast_cast
0.002s - local_IncSubtensor_serialize
0.001s - local_sum_div_dimshuffle
0.001s - local_div_switch_sink
0.001s - local_dimshuffle_no_inplace_at_canonicalize
0.001s - local_cut_useless_reduce
0.001s - local_reduce_join
0.000s - local_sum_sum
0.000s - local_useless_alloc
0.000s - local_reshape_chain
0.000s - local_useless_subtensor
0.000s - local_reshape_lift
0.000s - local_flatten_lift
0.000s - local_useless_slice
0.000s - local_subtensor_of_alloc
0.000s - local_subtensor_of_dot
0.000s - local_subtensor_merge
0.101733s - ('elemwise_fusion', 'SeqOptimizer', 13) - 0.000s
SeqOptimizer elemwise_fusion time 0.102s for 78/50 nodes before/after optimization
0.000s for fgraph.validate()
0.004s for callback
0.095307s - ('composite_elemwise_fusion', 'FusionOptimizer', 1) - 0.000s
FusionOptimizer
nb_iter 3
nb_replacement 10
nb_inconsistency_replace 0
validate_time 0.000249624252319
callback_time 0.00316381454468
time_toposort 0.00375390052795
0.006412s - ('local_add_mul_fusion', 'FusionOptimizer', 0) - 0.000s
FusionOptimizer
nb_iter 2
nb_replacement 3
nb_inconsistency_replace 0
validate_time 6.43730163574e-05
callback_time 0.000783205032349
time_toposort 0.0035240650177
0.090089s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 30) - 0.019s
0.048993s - ('BlasOpt', 'SeqOptimizer', 8) - 0.000s
SeqOptimizer BlasOpt time 0.049s for 81/80 nodes before/after optimization
0.000s for fgraph.validate()
0.003s for callback
0.035997s - ('gemm_optimizer', 'GemmOptimizer', 1) - 0.000s
GemmOptimizer
nb_iter 2
nb_replacement 2
nb_replacement_didn_t_remove 0
nb_inconsistency_make 0
nb_inconsistency_replace 0
time_canonicalize 0.00720071792603
time_factor_can 9.05990600586e-06
time_factor_list 0.00128507614136
time_toposort 0.00311398506165
validate_time 4.60147857666e-05
callback_time 0.00174236297607
0.004569s - ('local_dot_to_dot22', 'TopoOptimizer', 0) - 0.000s
TopoOptimizer
nb_node (start, end, changed) (81, 81, 5)
init io_toposort 0.00139284133911
loop time 0.00312399864197
callback_time 0.00172805786133
0.002283s - ('local_dot22_to_dot22scalar', 'TopoOptimizer', 2) - 0.000s
TopoOptimizer
nb_node (start, end, changed) (80, 80, 0)
init io_toposort 0.00171804428101
loop time 0.000502109527588
callback_time 0.0
0.002257s - ('local_gemm_to_gemv', 'EquilibriumOptimizer', 3) - 0.000s
EquilibriumOptimizer local_gemm_to_gemv
time 0.002s for 1 passes
nb nodes (start, end, max) 80 80 80
time io_toposort 0.001s
time in local optimizers 0.000s
time in global optimizers 0.000s
0 - 0.002s 0 (0.000s in global opts, 0.001s io_toposort) - 80 nodes -
0.002227s - ('use_c_blas', 'TopoOptimizer', 4) - 0.000s
TopoOptimizer
nb_node (start, end, changed) (80, 80, 0)
init io_toposort 0.0014750957489
loop time 0.00068998336792
callback_time 0.0
0.001632s - ('use_scipy_ger', 'TopoOptimizer', 5) - 0.000s
TopoOptimizer
nb_node (start, end, changed) (80, 80, 0)
init io_toposort 0.00138401985168
loop time 0.000202178955078
callback_time 0.0
0.031740s - ('specialize', 'EquilibriumOptimizer', 9) - 0.000s
EquilibriumOptimizer specialize
time 0.031s for 2 passes
nb nodes (start, end, max) 80 78 80
time io_toposort 0.003s
time in local optimizers 0.022s
time in global optimizers 0.004s
0 - 0.017s 6 (0.002s in global opts, 0.001s io_toposort) - 80 nodes - ('constant_folding', 2) ('local_mul_to_sqr', 1) ('local_elemwise_alloc', 1) ('local_div_to_inv', 1) ('local_mul_specialize', 1)
1 - 0.014s 0 (0.002s in global opts, 0.001s io_toposort) - 78 nodes -
times - times applied - nb node created - name:
0.003s - 1 - 1 - local_mul_specialize
0.002s - 1 - 2 - local_elemwise_alloc
0.002s - 2 - 0 - constant_folding
0.001s - 1 - 1 - local_div_to_inv
0.001s - 1 - 1 - local_mul_to_sqr
0.016s - in 69 optimization that where not used (display only those with a runtime > 0)
0.004s - crossentropy_to_crossentropy_with_softmax_with_bias
0.002s - local_one_minus_erf
0.002s - Elemwise{sub,no_inplace}(z, Elemwise{mul,no_inplace}(alpha subject to <function <lambda> at 0x7f475e4da050>, SparseDot(x, y))) -> Usmm{no_inplace}(Elemwise{neg,no_inplace}(alpha), x, y, z)
0.002s - local_add_specialize
0.001s - local_func_inv
0.001s - local_useless_elemwise
0.001s - local_abs_merge
0.001s - local_track_shape_i
0.000s - local_one_minus_erf2
0.000s - local_sum_mul_by_scalar
0.000s - local_elemwise_sub_zeros
0.000s - local_cast_cast
0.000s - local_alloc_unary
0.000s - Elemwise{log,no_inplace}(Softmax(x)) -> <function make_out_pattern at 0x7f47619a8410>(x)
0.000s - local_sum_div_dimshuffle
0.000s - local_sum_alloc
0.000s - local_dimshuffle_lift
0.000s - local_reduce_broadcastable
0.000s - local_grad_log_erfc_neg
0.000s - local_advanced_indexing_crossentropy_onehot
0.000s - local_log_erfc
0.000s - local_log1p
0.000s - local_log_add
0.000s - local_useless_alloc
0.000s - local_neg_neg
0.000s - local_neg_div_neg
...
To understand this profile here is some explanation of how optimization work:
* Optimization are organized in an hierarchy. At the top level, there
is a SeqOptimizer(Sequence Optimizer). It contain other optimizer
and apply them in the order it was told. Those sub optimizer can be
of other Type, but are global optimizer.
* Each Optimizer in the hierarchy will print some stats about
itself. The information that it print depend of the type of the
optimizer.
* The SeqOptimizer will print some stats at the start:
.. code-block:: none
Optimizer Profile
-----------------
SeqOptimizer OPT_FAST_RUN time 1.152s for 123/50 nodes before/after optimization
0.028s for fgraph.validate()
0.131s for callback
time - (name, class, index) - validate time
Then it will print with some indentation each sub optimizer profile
information. It sort them by the time they took. This isn't the
order they where used.
* ``OPT_FAST_RUN`` is the name of the optimizer
* 1.152s is the total time spent in that optimizer
* 123/50 mean that before this optimization, there was 123 apply node in the function and after only 50.
* 0.028s mean it spent that time in the fgraph.validate()
* 0.131s mean it is spent that time for callback. This is a mechanism that can trigger other execution when there is a change to the FunctionGraph.
* ``time - (name, class, index) - validate time`` Tell how the information for each sub optimizer get printed.
Markdown 格式
0%
您添加了 0 到此讨论。请谨慎行事。
请先完成此评论的编辑!
注册 或者 后发表评论