Clean up ProfileMode deprecation in docs

e493f9cd · Mehdi Mirza · memimo · 5a0d273c · e493f9cd · e493f9cd
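
For reviewers who set flags from Python rather than from the shell, here is a minimal sketch of moving an existing flag string from the deprecated ``ProfileMode`` settings to the ``profile`` flags used throughout this change. The helper name ``with_profiling_flags`` is hypothetical and not part of Theano. The sketch assumes only that Theano reads the ``THEANO_FLAGS`` environment variable as a comma-separated list of ``key=value`` pairs when it is first imported.

```python
import os


def with_profiling_flags(existing=""):
    """Return a THEANO_FLAGS string with profiling enabled.

    Hypothetical helper, not part of Theano. Deprecated ProfileMode
    settings are dropped, unrelated flags are kept in order, and the
    two profiling flags from this change are appended.
    """
    deprecated = ("mode=ProfileMode", "mode=PROFILE_MODE")
    kept = []
    for item in existing.split(","):
        item = item.strip()
        if not item or item in deprecated:
            continue
        key = item.split("=", 1)[0]
        # Skip stale profiling settings so they are not duplicated.
        if key in ("profile", "profile_memory") or key.startswith("ProfileMode."):
            continue
        kept.append(item)
    return ",".join(kept + ["profile=True", "profile_memory=True"])


# Assumption: this must run before `import theano` for the flags to apply.
os.environ["THEANO_FLAGS"] = with_profiling_flags(os.environ.get("THEANO_FLAGS", ""))
```

Setting the same string in the shell, as the updated docs do, should be equivalent. The helper is only a convenience for scripts that build their flags programmatically.
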
--- a/doc/cifarSC2011/advanced_theano.txt
+++ b/doc/cifarSC2011/advanced_theano.txt
@@ -123,6 +123,7 @@ Loops
 .. testcode::
+  import numpy
  import theano
  import theano.tensor as T
@@ -179,96 +180,189 @@ Inplace optimization
 Profiling
 ---------
- To replace the default mode with this mode, use the Theano flags ``mode=ProfileMode``
+- To enable profiling, use the Theano flag ``profile=True``
- To enable the memory profiling use the flags ``ProfileMode.profile_memory=True``
+- To enable memory profiling, use the flags ``profile=True,profile_memory=True``
 Theano output:
 .. code-block:: python
    """
-    Time since import 33.456s
+    Function profiling
-    Theano compile time: 1.023s (3.1% since import)
+    ==================
-      Optimization time: 0.789s
+      Message: train.py:17
-      Linker time: 0.221s
+      Time in 1 calls to Function.__call__: 5.440712e-04s
-    Theano fct call 30.878s (92.3% since import)
+      Time in Function.fn.__call__: 4.799366e-04s (88.212%)
-     Theano Op time 29.411s 87.9%(since import) 95.3%(of fct call)
+      Time in thunks: 7.891655e-05s (14.505%)
-     Theano function overhead in ProfileMode 1.466s 4.4%(since import)
+      Total compile time: 5.701292e-01s
-                                                  4.7%(of fct call)
+        Number of Apply nodes: 20
-    10001 Theano fct call, 0.003s per call
+        Theano Optimizer time: 2.405829e-01s
-    Rest of the time since import 1.555s 4.6%
+           Theano validate time: 1.702785e-03s
+        Theano Linker time (includes C, CUDA code generation/compiling): 1.597619e-02s
-    Theano fct summary:
+           Import time 1.968861e-03s
-    <% total fct time> <total time> <time per call> <nb call> <fct name>
-     100.0% 30.877s 3.09e-03s 10000 train
+    Time in all call to theano.grad() 0.000000e+00s
-      0.0% 0.000s 4.06e-04s 1 predict
+    Time since theano import 1.436s
+    Class
-    Single Op-wise summary:
+    ---
-    <% of local_time spent on this kind of Op> <cumulative %>
+    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
-        <self seconds> <cumulative seconds> <time per call> <nb_call>
+      54.4%    54.4%       0.000s       3.90e-06s     C       11      11   theano.tensor.elemwise.Elemwise
-        <nb_op> <nb_apply> <Op name>
+      17.8%    72.2%       0.000s       1.41e-05s     C        1       1   theano.compile.ops.Shape_i
-       87.3%   87.3%  25.672s  25.672s  2.57e-03s   10000  1  1 <Gemv>
+      11.5%    83.7%       0.000s       2.26e-06s     C        4       4   theano.tensor.basic.ScalarFromTensor
-        9.7% s  97.0%  2.843s  28.515s  2.84e-04s   10001  1  2 <Dot>
+       9.1%    92.7%       0.000s       3.58e-06s     C        2       2   theano.tensor.subtensor.Subtensor
-        2.4%   99.3%  0.691s  29.206s  7.68e-06s * 90001 10 10 <Elemwise>
+       3.6%    96.4%       0.000s       2.86e-06s     C        1       1   theano.tensor.elemwise.DimShuffle
-        0.4%   99.7%  0.127s  29.334s  1.27e-05s   10000  1  1 <Alloc>
+       3.6%   100.0%       0.000s       2.86e-06s     C        1       1   theano.tensor.elemwise.Sum
-        0.2%   99.9%  0.053s  29.386s  1.75e-06s * 30001  2  4 <DimShuffle>
+       ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
-        0.0%  100.0%  0.014s  29.400s  1.40e-06s * 10000  1  1 <Sum>
-        0.0%  100.0%  0.011s  29.411s  1.10e-06s * 10000  1  1 <Shape_i>
+    Ops
-    (*) Op is running a c implementation
+    ---
+    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
-    Op-wise summary:
+      17.8%    17.8%       0.000s       1.41e-05s     C        1        1   Shape_i{0}
-    <% of local_time spent on this kind of Op> <cumulative %>
+      15.1%    32.9%       0.000s       1.19e-05s     C        1        1   Elemwise{Composite{(i0 * (i1 ** i2))}}
-        <self seconds> <cumulative seconds> <time per call>
+      11.5%    44.4%       0.000s       2.26e-06s     C        4        4   ScalarFromTensor
-        <nb_call> <nb apply> <Op name>
+       9.1%    53.5%       0.000s       3.58e-06s     C        2        2   Subtensor{int64:int64:int8}
-       87.3%   87.3%  25.672s  25.672s  2.57e-03s   10000  1 Gemv{inplace}
+       8.8%    62.2%       0.000s       3.46e-06s     C        2        2   Elemwise{switch,no_inplace}
-        9.7%   97.0%  2.843s  28.515s  2.84e-04s   10001  2 dot
+       6.3%    68.6%       0.000s       2.50e-06s     C        2        2   Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)]
-        1.3%   98.2%  0.378s  28.893s  3.78e-05s * 10000  1 Elemwise{Composite{scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}
+       6.0%    74.6%       0.000s       2.38e-06s     C        2        2   Elemwise{le,no_inplace}
-        0.4%   98.7%  0.127s  29.021s  1.27e-05s   10000  1 Alloc
+       5.1%    79.8%       0.000s       4.05e-06s     C        1        1   Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i2, i1), i2, i1))}}[(0, 2)]
-        0.3%   99.0%  0.092s  29.112s  9.16e-06s * 10000  1 Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)]
+       5.1%    84.9%       0.000s       4.05e-06s     C        1        1   Elemwise{minimum,no_inplace}
-        0.1%   99.3%  0.033s  29.265s  1.66e-06s * 20001  3 InplaceDimShuffle{x}
+       3.9%    88.8%       0.000s       3.10e-06s     C        1        1   Elemwise{lt,no_inplace}
-       ... (remaining 11 Apply account for 0.7%(0.00s) of the runtime)
+       3.9%    92.7%       0.000s       3.10e-06s     C        1        1   Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i1, i2), i1, i2))}}
-    (*) Op is running a c implementation
+       3.6%    96.4%       0.000s       2.86e-06s     C        1        1   Sum{acc_dtype=float64}
+       3.6%   100.0%       0.000s       2.86e-06s     C        1        1   InplaceDimShuffle{x}
-    Apply-wise summary:
+       ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
-    <% of local_time spent at this position> <cumulative %%>
-        <apply time> <cumulative seconds> <time per call>
+    Apply
-        <nb_call> <Apply position> <Apply Op name>
+    ------
-       87.3%   87.3%  25.672s  25.672s 2.57e-03s  10000  15 Gemv{inplace}(w, TensorConstant{-0.01}, InplaceDimShuffle{1,0}.0, Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)].0, TensorConstant{0.9998})
+    <% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
-        9.7%   97.0%  2.843s  28.515s 2.84e-04s  10000   1 dot(x, w)
+      17.8%    17.8%       0.000s       1.41e-05s      1     0                     Shape_i{0}(coefficients)
-        1.3%   98.2%  0.378s  28.893s 3.78e-05s  10000   9 Elemwise{Composite{scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}(y, Elemwise{Composite{neg,sub}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
+        input 0: dtype=float32, shape=(3,), strides=c
-        0.4%   98.7%  0.127s  29.020s 1.27e-05s  10000  10 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
+        output 0: dtype=int64, shape=(), strides=c
-        0.3%   99.0%  0.092s  29.112s 9.16e-06s  10000  13 Elemwise{Composite{exp,{mul,{true_div,neg,{add,mul}}}}}[(0,0)](Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}, _op_use_c_code=True}}[(0, 0)].0, Alloc.0, y, Elemwise{Composite{neg,sub}}[(0,0)].0, Elemwise{sub,no_inplace}.0, InplaceDimShuffle{x}.0)
+      15.1%    32.9%       0.000s       1.19e-05s      1    18                     Elemwise{Composite{(i0 * (i1 ** i2))}}(Subtensor{int64:int64:int8}.0, InplaceDimShuffle{x}.0, Subtensor{int64:int64:int8}.0)
-        0.3%   99.3%  0.080s  29.192s 7.99e-06s  10000  11 Elemwise{ScalarSigmoid{output_types_preference=transfer_type{0}, _op_use_c_code=True}}[(0, 0)](Elemwise{neg,no_inplace}.0)
+        input 0: dtype=float32, shape=(3,), strides=c
-       ... (remaining 14 Apply instances account for
+        input 1: dtype=float32, shape=(1,), strides=c
-           0.7%(0.00s) of the runtime)
+        input 2: dtype=int64, shape=(3,), strides=c
+        output 0: dtype=float64, shape=(3,), strides=c
-    Profile of Theano functions memory:
+       5.1%    38.1%       0.000s       4.05e-06s      1    17                     Subtensor{int64:int64:int8}(TensorConstant{[   0    1..9998 9999]}, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
-    (This check only the output of each apply node. It don't check the temporary memory used by the op in the apply node.)
+        input 0: dtype=int64, shape=(10000,), strides=c
-    Theano fct: train
+        input 1: dtype=int64, shape=8, strides=c
-        Max without gc, inplace and view (KB) 2481
+        input 2: dtype=int64, shape=8, strides=c
-        Max FAST_RUN_NO_GC (KB) 16
+        input 3: dtype=int8, shape=1, strides=c
-        Max FAST_RUN (KB) 16
+        output 0: dtype=int64, shape=(3,), strides=c
-        Memory saved by view (KB) 2450
+       5.1%    43.2%       0.000s       4.05e-06s      1    11                     Elemwise{switch,no_inplace}(Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{0})
-        Memory saved by inplace (KB) 15
+        input 0: dtype=int8, shape=(), strides=c
-        Memory saved by GC (KB) 0
+        input 1: dtype=int8, shape=(), strides=c
-        <Sum apply outputs (bytes)> <Apply outputs memory size(bytes)>
+        input 2: dtype=int64, shape=(), strides=c
-            <created/inplace/view> <Apply node>
+        output 0: dtype=int64, shape=(), strides=c
-        <created/inplace/view> is taked from the op declaration, not ...
+       5.1%    48.3%       0.000s       4.05e-06s      1     5                     Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i2, i1), i2, i1))}}[(0, 2)](Elemwise{lt,no_inplace}.0, TensorConstant{10000}, Elemwise{minimum,no_inplace}.0, TensorConstant{0})
-             2508800B  [2508800] v InplaceDimShuffle{1,0}(x)
+        input 0: dtype=int8, shape=(), strides=c
-                6272B  [6272] i Gemv{inplace}(w, ...)
+        input 1: dtype=int64, shape=(), strides=c
-                3200B  [3200] c Elemwise{Composite{...}}(y, ...)
+        input 2: dtype=int64, shape=(), strides=c
+        input 3: dtype=int8, shape=(), strides=c
-    Here are tips to potentially make your code run faster (if you think of new ones, suggest them on the mailing list).
+        output 0: dtype=int64, shape=(), strides=c
-    Test them first, as they are not guaranteed to always provide a speedup.
+       5.1%    53.5%       0.000s       4.05e-06s      1     2                     Elemwise{minimum,no_inplace}(Shape_i{0}.0, TensorConstant{10000})
-      - Try the Theano flag floatX=float32
+        input 0: dtype=int64, shape=(), strides=c
+        input 1: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=(), strides=c
+       3.9%    57.4%       0.000s       3.10e-06s      1    16                     Subtensor{int64:int64:int8}(coefficients, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
+        input 0: dtype=float32, shape=(3,), strides=c
+        input 1: dtype=int64, shape=8, strides=c
+        input 2: dtype=int64, shape=8, strides=c
+        input 3: dtype=int8, shape=1, strides=c
+        output 0: dtype=float32, shape=(3,), strides=c
+       3.9%    61.3%       0.000s       3.10e-06s      1    14                     ScalarFromTensor(Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)].0)
+        input 0: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=8, strides=c
+       3.9%    65.3%       0.000s       3.10e-06s      1    10                     Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i2, i1), i2, i1))}}[(0, 2)].0, TensorConstant{10000})
+        input 0: dtype=int8, shape=(), strides=c
+        input 1: dtype=int8, shape=(), strides=c
+        input 2: dtype=int64, shape=(), strides=c
+        input 3: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=(), strides=c
+       3.9%    69.2%       0.000s       3.10e-06s      1     4                     Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i1, i2), i1, i2))}}(Elemwise{lt,no_inplace}.0, Elemwise{minimum,no_inplace}.0, Shape_i{0}.0, TensorConstant{0})
+        input 0: dtype=int8, shape=(), strides=c
+        input 1: dtype=int64, shape=(), strides=c
+        input 2: dtype=int64, shape=(), strides=c
+        input 3: dtype=int8, shape=(), strides=c
+        output 0: dtype=int64, shape=(), strides=c
+       3.9%    73.1%       0.000s       3.10e-06s      1     3                     Elemwise{lt,no_inplace}(Elemwise{minimum,no_inplace}.0, TensorConstant{0})
+        input 0: dtype=int64, shape=(), strides=c
+        input 1: dtype=int8, shape=(), strides=c
+        output 0: dtype=int8, shape=(), strides=c
+       3.6%    76.7%       0.000s       2.86e-06s      1    19                     Sum{acc_dtype=float64}(Elemwise{Composite{(i0 * (i1 ** i2))}}.0)
+        input 0: dtype=float64, shape=(3,), strides=c
+        output 0: dtype=float64, shape=(), strides=c
+       3.6%    80.4%       0.000s       2.86e-06s      1     9                     Elemwise{switch,no_inplace}(Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{0})
+        input 0: dtype=int8, shape=(), strides=c
+        input 1: dtype=int8, shape=(), strides=c
+        input 2: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=(), strides=c
+       3.6%    84.0%       0.000s       2.86e-06s      1     7                     Elemwise{le,no_inplace}(Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i2, i1), i2, i1))}}[(0, 2)].0, TensorConstant{0})
+        input 0: dtype=int64, shape=(), strides=c
+        input 1: dtype=int8, shape=(), strides=c
+        output 0: dtype=int8, shape=(), strides=c
+       3.6%    87.6%       0.000s       2.86e-06s      1     1                     InplaceDimShuffle{x}(x)
+        input 0: dtype=float32, shape=(), strides=c
+        output 0: dtype=float32, shape=(1,), strides=c
+       2.7%    90.3%       0.000s       2.15e-06s      1    12                     ScalarFromTensor(Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)].0)
+        input 0: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=8, strides=c
+       2.4%    92.7%       0.000s       1.91e-06s      1    15                     ScalarFromTensor(Elemwise{switch,no_inplace}.0)
+        input 0: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=8, strides=c
+       2.4%    95.2%       0.000s       1.91e-06s      1    13                     ScalarFromTensor(Elemwise{switch,no_inplace}.0)
+        input 0: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=8, strides=c
+       2.4%    97.6%       0.000s       1.91e-06s      1     8                     Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i1, i2), i1, i2))}}.0, Shape_i{0}.0)
+        input 0: dtype=int8, shape=(), strides=c
+        input 1: dtype=int8, shape=(), strides=c
+        input 2: dtype=int64, shape=(), strides=c
+        input 3: dtype=int64, shape=(), strides=c
+        output 0: dtype=int64, shape=(), strides=c
+       2.4%   100.0%       0.000s       1.91e-06s      1     6                     Elemwise{le,no_inplace}(Elemwise{Composite{Switch(i0, Switch(LT((i1 + i2), i3), i3, (i1 + i2)), Switch(LT(i1, i2), i1, i2))}}.0, TensorConstant{0})
+        input 0: dtype=int64, shape=(), strides=c
+        input 1: dtype=int8, shape=(), strides=c
+        output 0: dtype=int8, shape=(), strides=c
+       ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
+    Memory Profile
+    (Sparse variables are ignored)
+    (For values in brackets, it's for linker = c|py
+    ---
+        Max if no gc (allow_gc=False): 0KB (0KB)
+        CPU: 0KB (0KB)
+        GPU: 0KB (0KB)
+    ---
+        Max if linker=cvm(default): 0KB (0KB)
+        CPU: 0KB (0KB)
+        GPU: 0KB (0KB)
+    ---
+        Memory saved if views are used: 0KB (0KB)
+        Memory saved if inplace ops are used: 0KB (0KB)
+        Memory saved if gc is enabled: 0KB (0KB)
+    ---
+        <Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
+       ... (remaining 20 Apply account for  171B/171B ((100.00%)) of the Apply with dense outputs sizes)
+        All Apply nodes have output sizes that take less than 1024B.
+        <created/inplace/view> is taken from the Op's declaration.
+        Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
+    Here are tips to potentially make your code run faster
+                     (if you think of new ones, suggest them on the mailing list).
+                     Test them first, as they are not guaranteed to always provide a speedup.
+      Sorry, no tip for today.
    """
 Exercise 5
 -----------
 - In the last exercises, do you see a speed up with the GPU?
- Where does it come from? (Use ProfileMode)
+- Where does it come from? (Use profile=True)
 - Is there something we can do to speed up the GPU version?
@@ -427,4 +521,3 @@ Known limitations
  - A few hundreds nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually too many nodes indicates a problem with the graph
--- a/doc/cifarSC2011/theano.txt
+++ b/doc/cifarSC2011/theano.txt
@@ -21,13 +21,13 @@ Description
 * Mathematical symbolic expression compiler
 * Dynamic C/CUDA code generation
 * Efficient symbolic differentiation
  * Theano computes derivatives of functions with one or many inputs.
 * Speed and stability optimizations
  * Gives the right answer for ``log(1+x)`` even if x is really tiny.
 * Works on Linux, Mac and Windows
 * Transparent use of a GPU
@@ -38,7 +38,7 @@ Description
 * Extensive unit-testing and self-verification
  * Detects and diagnoses many types of errors
 * On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives
  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab
@@ -79,7 +79,7 @@ Exercise 1
  f = theano.function([a], out)   # compile function
  print f([0,1,2])
  # prints `array([0,2,1026])`
  theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
  theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)
@@ -101,12 +101,12 @@ Real example
  import theano
  import theano.tensor as T
  rng = numpy.random
  N = 400
  feats = 784
  D = (rng.randn(N, feats), rng.randint(size=N,low=0, high=2))
  training_steps = 10000
  # Declare Theano symbolic variables
  x = T.matrix("x")
  y = T.vector("y")
@@ -176,7 +176,7 @@ Theano flags
 Theano can be configured with flags. They can be defined in two ways
-* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
+* With an environment variable: ``THEANO_FLAGS="profile=True,profile_memory=True"``
 * With a configuration file that defaults to ``~/.theanorc``
@@ -185,7 +185,7 @@ Exercise 2
 -----------
 .. code-block:: python
    import numpy
    import theano
    import theano.tensor as T
@@ -268,7 +268,7 @@ GPU
 * Only 32 bit floats are supported (being worked on)
 * Only 1 GPU per process
 * Use the Theano flag ``device=gpu`` to tell to use the GPU device
 * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
 * Shared variables with float32 dtype are by default moved to the GPU memory space
@@ -277,7 +277,7 @@ GPU
 * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
 * Cast inputs before putting them into a shared variable
 * Cast "problem": int32 with float32 to float64
  * A new casting mechanism is being developed
  * Insert manual cast in your code or use [u]int{8,16}
  * Insert manual cast around the mean operator (which involves a division by the length, which is an int64!)
@@ -295,7 +295,7 @@ Symbolic variables
 ------------------
 * # Dimensions
 * T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4
 * Dtype
@@ -322,7 +322,7 @@ Creating symbolic variables: Broadcastability
 Details regarding symbolic broadcasting...
 * Broadcastability must be specified when creating the variable
 * The only shorcut with broadcastable dimensions are: **T.row** and **T.col**
@@ -358,7 +358,7 @@ Benchmarks
 .. image:: ../hpcs2011_tutorial/pics/mlp.png
-**Convolutional Network**: 
+**Convolutional Network**:
 256x256 images convolved with 6 7x7 filters,
 downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filter, elementwise

--- a/doc/crei2013/advanced_theano.txt
+++ b/doc/crei2013/advanced_theano.txt
@@ -104,7 +104,7 @@ Exercise 5
 -----------
 - In the last exercises, do you see a speed up with the GPU?
- Where does it come from? (Use ProfileMode)
+- Where does it come from? (Use profile=True)
 - Is there something we can do to speed up the GPU version?

--- a/doc/crei2013/theano.txt
+++ b/doc/crei2013/theano.txt
@@ -21,13 +21,13 @@ Description
 * Mathematical symbolic expression compiler
 * Dynamic C/CUDA code generation
 * Efficient symbolic differentiation
  * Theano computes derivatives of functions with one or many inputs.
 * Speed and stability optimizations
  * Gives the right answer for ``log(1+x)`` even if x is really tiny.
 * Works on Linux, Mac and Windows
 * Transparent use of a GPU
@@ -38,7 +38,7 @@ Description
 * Extensive unit-testing and self-verification
  * Detects and diagnoses many types of errors
 * On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives
  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab
@@ -76,7 +76,7 @@ Exercise 1
  f = theano.function([a], out)   # compile function
  print f([0, 1, 2])
  # prints `array([0, 2, 1026])`
  theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
  theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)
@@ -133,7 +133,7 @@ Theano flags
 Theano can be configured with flags. They can be defined in two ways
-* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
+* With an environment variable: ``THEANO_FLAGS="profile=True,profile_memory=True"``
 * With a configuration file that defaults to ``~/.theanorc``
@@ -142,7 +142,7 @@ Exercise 2
 -----------
 .. code-block:: python
    import numpy
    import theano
    import theano.tensor as tt
@@ -225,7 +225,7 @@ GPU
 * Only 32 bit floats are supported (being worked on)
 * Only 1 GPU per process. Wiki page on using multiple process for multiple GPU
 * Use the Theano flag ``device=gpu`` to tell to use the GPU device
 * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
 * Shared variables with float32 dtype are by default moved to the GPU memory space
@@ -234,7 +234,7 @@ GPU
 * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
 * Cast inputs before putting them into a shared variable
 * Cast "problem": int32 with float32 to float64
  * Insert manual cast in your code or use [u]int{8,16}
  * The mean operator is worked on to make the output stay in float32.
@@ -256,7 +256,7 @@ Symbolic variables
 ------------------
 * # Dimensions
 * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4
 * Dtype
@@ -283,7 +283,7 @@ Creating symbolic variables: Broadcastability
 Details regarding symbolic broadcasting...
 * Broadcastability must be specified when creating the variable
 * The only shorcut with broadcastable dimensions are: **tt.row** and **tt.col**

--- a/doc/library/compile/mode.txt
+++ b/doc/library/compile/mode.txt
@@ -23,7 +23,7 @@ Theano defines the following modes by name:
 - ``'DebugMode'``: A mode for debugging. See :ref:`DebugMode <debugmode>` for details.
 - ``'ProfileMode'``: Deprecated, use the Theano flag :attr:`config.profile`.
 - ``'DEBUG_MODE'``: Deprecated. Use the string DebugMode.
- ``'PROFILE_MODE'``: Deprecated. Use the string ProfileMode.
+- ``'PROFILE_MODE'``: Deprecated, use the Theano flag :attr:`config.profile`.
 The default mode is typically ``FAST_RUN``, but it can be controlled via the
 configuration variable :attr:`config.mode`, which can be
@@ -70,4 +70,3 @@ Reference
        Return a new Mode instance like this one, but with an
        optimizer modified by requiring the given tags.
--- a/doc/tutorial/profiling.txt
+++ b/doc/tutorial/profiling.txt
@@ -17,7 +17,7 @@ You can profile your
 functions using either of the following two options:
-1. Use Theano flag :attr:`config.profile` to enable profiling. 
+1. Use Theano flag :attr:`config.profile` to enable profiling.
    - To enable the memory profiler use the Theano flag:
      :attr:`config.profile_memory` in addition to :attr:`config.profile`.
    - Moreover, to enable the profiling of Theano optimization phase,
@@ -30,8 +30,8 @@ functions using either of the following two options:
 2. Pass the argument :attr:`profile=True` to the function :func:`theano.function <function.function>`. And then call :attr:`f.profile.print_summary()` for a single function.
    - Use this option when you want to profile not all the
      functions but one or more specific function(s).
-    - You can also combine the profile of many functions: 
+    - You can also combine the profile of many functions:
      .. testcode::
          profile = theano.compile.ProfileStats()
@@ -68,6 +68,15 @@ compare equal, if their parameters differ (the scalar being
 executed). So the class section will merge more Apply nodes then the
 Ops section.
+Note that the profile also shows which Ops were running a C implementation.
+Developers wishing to optimize the performance of their graph should
+focus on the worst offending Ops and Apply nodes – either by optimizing
+an implementation, providing a missing C implementation, or by writing
+a graph optimization that eliminates the offending Op altogether.
+You should strongly consider emailing one of our lists about your
+issue before spending too much time on this.
 Here is an example output when we disable some Theano optimizations to
 give you a better idea of the difference between sections. With all
 optimizations enabled, there would be only one op left in the graph.

--- a/doc/tutorial/using_gpu.txt
+++ b/doc/tutorial/using_gpu.txt
@@ -213,8 +213,8 @@ Tips for Improving Performance on GPU
  frequently-accessed data (see :func:`shared()<shared.shared>`).  When using
  the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
  eliminate transfer time for GPU ops using those variables.
-* If you aren't happy with the performance you see, try building your functions with
+* If you aren't happy with the performance you see, try running your script with
-  ``mode='ProfileMode'``. This should print some timing information at program
+  the ``profile=True`` flag. This should print some timing information at program
  termination. Is time being used sensibly?   If an op or Apply is
  taking more time than its share, then if you know something about GPU
  programming, have a look at how it's implemented in theano.sandbox.cuda.
@@ -339,7 +339,7 @@ to the exercise in section :ref:`Configuration Settings and Compiling Mode<using
 Is there an increase in speed from CPU to GPU?
-Where does it come from? (Use ``ProfileMode``)
+Where does it come from? (Use the ``profile=True`` flag.)
 What can be done to further increase the speed of the GPU version? Put your ideas to test.