some little additions

dba9a7b7 · Olivier Breuleux · 112b3b0c
--- a/doc/topics/module_vs_op.txt
+++ b/doc/topics/module_vs_op.txt
--- a/doc/topics/debug_faq.txt
+++ b/doc/topics/debug_faq.txt
@@ -56,8 +56,9 @@ something that you're not seeing.
 I wrote a new optimization, but it's not getting used...
 ---------------------------------------------------------

-Remember that you have to register optimizations with the OptDb, for them to get
-used by the normal modes like FAST_COMPILE, FAST_RUN, and DEBUG_MODE.
+Remember that you have to register optimizations with the :ref:`optdb`
+for them to be used by the normal modes like FAST_COMPILE, FAST_RUN,
+and DEBUG_MODE.


 I wrote a new optimization, and it changed my results even though I'm pretty sure it is correct.
@@ -71,11 +72,13 @@ something that you're not seeing.
 The function I compiled is too slow, what's up?
 -----------------------------------------------

-First, make sure you're running in FAST_RUN mode, by passing ``mode='FAST_RUN'``
-to ``theano.function`` or ``theano.make``.
+First, make sure you're running in FAST_RUN mode by passing
+``mode='FAST_RUN'`` to ``theano.function`` or ``theano.make``. Some
+operations have excruciatingly slow Python implementations, which can
+considerably hurt performance in FAST_COMPILE mode.

-Second, try the theano :ref:`profilemode`.  This will tell you which Apply nodes,
-and which Ops are eating up your CPU cycles.
+Second, try the Theano :ref:`profilemode`.  This will tell you which
+Apply nodes and which Ops are eating up your CPU cycles.


 .. _faq_wraplinker:

--- a/doc/topics/index.txt
+++ b/doc/topics/index.txt
@@ -14,6 +14,5 @@ Topics
    profilemode
    debugmode
    debug_faq
-    module_vs_op
    randomstreams

--- a/doc/topics/profilemode.txt
+++ b/doc/topics/profilemode.txt
@@ -17,20 +17,31 @@ First create a ProfileMode instance.
 >>> from theano import ProfileMode
 >>> profmode = theano.ProfileMode(optimizer='fast_run', linker=theano.gof.OpWiseCLinker())

-The ProfileMode constructor takes as input an optimizer and a linker. Which optimizer 
-and linker to use will depend on the application. For example, a user wanting
-to profile the Python implementation only, should use the gof.PerformLinker (or
-"py" for short). On the other hand, a user wanting to profile his graph using
-c-implementations wherever possible should use the ``gof.OpWiseCLinker`` (or "c|py").
+The ProfileMode constructor takes as input an optimizer and a
+linker. Which optimizer and linker to use will depend on the
+application. For example, a user wanting to profile only the Python
+implementation should use ``gof.PerformLinker`` (or "py" for
+short). On the other hand, a user wanting to profile their graph using C
+implementations wherever possible should use ``gof.OpWiseCLinker``
+(or "c|py").

 In the same manner, modifying which optimizer is passed to ProfileMode
 will decide which optimizations are applied to the graph, prior to
-profiling. Changing the optimizer should be especially useful when developing
-new graph optimizations, in order to evaluate their impact on performance.
-
-Note that most users will want to use ProfileMode to optimize their graph and
-find where most of the computation time is being spent. In this context,
-'fast_run' optimizer and ``gof.OpWiseCLinker`` are the most appropriate choices.
+profiling. Changing the optimizer should be especially useful when
+developing new graph optimizations, in order to evaluate their impact
+on performance. Also keep in mind that optimizations can change the
+computation graph substantially, so you might not recognize some of
+the operations that are profiled (you did not use them explicitly, but
+an optimizer introduced them to improve performance or numerical
+stability). If you cannot easily relate the output of ProfileMode to
+the computations you defined, try setting the optimizer to None
+(but keep in mind that the computations will then be slower than if
+they were optimized).
+
+Note that most users will want to use ProfileMode to optimize their
+graph and find where most of the computation time is being spent. In
+this context, the 'fast_run' optimizer and ``gof.OpWiseCLinker`` are
+the most appropriate choices.

 Compiling your Graph with ProfileMode
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -107,16 +118,20 @@ generates the following output:
    """


-The summary has two components to it. In the first section called the Apply-wise 
-summary, timing information is provided for the worst offending Apply nodes. This 
-corresponds to individual nodes within your graph which take the longest to
-execute. In the second portion, the Op-wise summary, the execution time of 
-all Apply nodes executing the same Op are grouped together and the total
-execution time per Op is shown.
+The summary has two components. In the first section, called the
+Apply-wise summary, timing information is provided for the worst
+offending Apply nodes. These correspond to individual Op applications
+within your graph that take the longest to execute (so if you use
+``dot`` twice, you will see two entries there). In the second section,
+the Op-wise summary, the execution times of all Apply nodes executing
+the same Op are grouped together and the total execution time per Op
+is shown (so if you use ``dot`` twice, you will see only one entry
+there, corresponding to the sum of the time spent in both calls).

-Note that the ProfileMode also shows which Ops were running a c implementation.
+Note that ProfileMode also shows which Ops were running a C
+implementation.

-Developers wishing to optimize the performance of their graph, should focus on the 
-worst offending Ops. If no c-implementation exists for this op, consider writing
-a c-implementation yourself or use the mailing list, to suggest that a c-implementation
-be provided.
+Developers wishing to optimize the performance of their graph should
+focus on the worst offending Ops. If no C implementation exists for
+such an Op, consider writing one yourself or suggesting on the
+mailing list that a C implementation be provided.
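The relationship between the Apply-wise and Op-wise summaries described in the profilemode.txt hunk can be sketched in plain Python. This is not Theano's ProfileMode code: the ``(op_name, node_name, seconds)`` tuples, the ``op_wise_summary`` helper, and the timings are all hypothetical, chosen only to show how two Apply nodes running ``dot`` appear as separate Apply-wise entries but collapse into a single Op-wise entry.

```python
from collections import defaultdict

# Hypothetical Apply-wise timings: one entry per Apply node.
# The numbers are made up for illustration, not real profiler output.
apply_times = [
    ("dot", "node_1", 0.40),
    ("dot", "node_2", 0.35),
    ("tanh", "node_3", 0.10),
]


def op_wise_summary(apply_times):
    """Sum the time of every Apply node that executes the same Op.

    Returns (op_name, total_seconds) pairs, worst offender first.
    """
    totals = defaultdict(float)
    for op_name, _node_name, seconds in apply_times:
        totals[op_name] += seconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Apply-wise view: "dot" appears twice, once per Apply node.
    for op_name, node_name, seconds in apply_times:
        print(f"{node_name} ({op_name}): {seconds:.2f}s")
    # Op-wise view: both "dot" nodes are merged into one total.
    for op_name, seconds in op_wise_summary(apply_times):
        print(f"{op_name}: {seconds:.2f}s")
```

Sorting by total time mirrors the advice to focus on the worst offending Ops first.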