Commit 8c449ca8 authored by Pascal Lamblin

Extends a bit the documentation on automatic differentiation.

Parent 8bbbc82c
@@ -137,17 +137,23 @@ following methods:
the gradient of the Op's output but rather the gradient of some
other criterion C with respect to the Op's input.
If the outputs of your op are :math:`[f_1, \ldots, f_n]`, then
``output_derivatives`` gives
:math:`[\mathrm{grad}_{f_1}(C), \mathrm{grad}_{f_2}(C), \ldots, \mathrm{grad}_{f_n}(C)]`.
If the inputs of your op are :math:`[x_1, \ldots, x_m]`, then your Op.grad
should return :math:`[\mathrm{grad}_{x_1}(C), \mathrm{grad}_{x_2}(C), \ldots, \mathrm{grad}_{x_m}(C)]`,
where :math:`(\mathrm{grad}_{y} z)_i = \frac{\partial z}{\partial y_i}`
(and :math:`i` can have any number of dimensions).
(Note: in the case where :math:`i` is 2-dimensional, this definition of
grad is different from the standard mathematical definition of the
gradient of a scalar with respect to a matrix, where you would transpose
the indices.)
In other words, :func:`grad` does not return
:math:`\frac{\partial f_i}{\partial x_j}`, but
:math:`\frac{\partial C}{\partial x_j} =
\frac{\partial C}{\partial f_i} \cdot \frac{\partial f_i}{\partial x_j}`.
Both the partial derivation and that multiplication have to be done by
:func:`grad`.
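As an illustrative sketch of this convention (plain NumPy, not actual Theano code; the op and the helper name ``square_grad`` are hypothetical), consider an elementwise square op, :math:`f(x) = x^2`. Its grad receives the gradient of the external cost :math:`C` wrt the op's output and returns the gradient of :math:`C` wrt the op's input, applying the chain rule itself:

```python
import numpy as np

def square_grad(inputs, output_derivatives):
    """Sketch of an Op.grad for an elementwise square, f(x) = x ** 2.

    `output_derivatives` holds grad_f(C), the gradient of the external
    cost C wrt this op's output.  We return grad_x(C), one entry per
    input, via the chain rule: dC/dx_i = dC/df_i * df_i/dx_i = g_i * 2 * x_i.
    """
    (x,) = inputs
    (g,) = output_derivatives  # gradient of C wrt the op's single output
    return [2.0 * x * g]       # one symbolic gradient per input

# If C = sum(f), then grad_f(C) is all ones, so grad_x(C) = 2 * x.
x = np.array([1.0, 2.0, 3.0])
g = np.ones_like(x)
(gx,) = square_grad([x], [g])
```

Note that the returned value is the full product :math:`\frac{\partial C}{\partial f} \cdot \frac{\partial f}{\partial x}`, not the Jacobian of the op alone.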
At a bare minimum, a new Op must define ``make_node`` and ``perform``, which have no defaults.
......
@@ -18,11 +18,15 @@ awkward to use when :func:`tensor.grad` can do the job.
.. function:: grad_sources_inputs(sources, graph_inputs, warn_type=True)
A gradient source is a pair (``v``, ``g_v``), in which ``v`` is
a `Variable`, and ``g_v`` is a `Variable` that is a gradient wrt
``v``. More specifically, ``g_v`` is the gradient of an external
scalar cost, ``cost`` (that is not explicitly used), wrt ``v``.
This function traverses the graph backward from the ``r`` sources,
calling ``op.grad(...)`` for all ops with some non-None gradient
on an output, to compute gradients of ``cost`` wrt intermediate
variables and ``graph_inputs``.
The ``op.grad(...)`` functions are called like this:
@@ -30,14 +34,20 @@ awkward to use when :func:`tensor.grad` can do the job.
op.grad(op.inputs[:], [total_gradient(v) for v in op.outputs])
This call to ``op.grad`` should return a list or tuple: one symbolic
gradient per input. These gradients represent the gradients of
the same implicit ``cost`` mentioned above, wrt ``op.inputs``. Note
that this is **not** the same as the gradient of ``op.outputs`` wrt
``op.inputs``.
If ``op`` has a single input, then ``op.grad`` should return a list
or tuple of length 1.
For each input wrt which ``op`` is not differentiable, it should
return ``None`` instead of a `Variable` instance.
If a source ``r`` receives a gradient from another source ``r2``,
then the effective gradient on ``r`` is the sum of both gradients.
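This summation can be sketched in plain Python (a toy model, not Theano's actual implementation; the ``total_gradient`` dictionary and the sample pairs are hypothetical): when several sources contribute a gradient to the same variable, the contributions are accumulated by addition before being propagated further.

```python
from collections import defaultdict

# Toy sketch of combining gradients from several sources on the same
# variable: each contribution is summed into a running total.
total_gradient = defaultdict(float)

# Hypothetical (variable-name, gradient) pairs; "r" receives two
# contributions, so its effective gradient is their sum.
sources = [("r", 1.5), ("r2", 0.5), ("r", 2.0)]
for var, g in sources:
    total_gradient[var] += g
```

In the real traversal the values are symbolic `Variable` instances rather than floats, but the accumulation rule is the same.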
:type sources: list of pairs of Variable: (v, gradient-on-v) to
initialize the total_gradient dictionary
......
@@ -1105,10 +1105,14 @@ Gradient / Differentiation
Return symbolic gradients for one or more variables with respect to some
cost.
For more information about how automatic differentiation works in Theano,
see :mod:`gradient`. For information on how to implement the gradient of
a certain Op, see :func:`grad`.
:type cost: 0-d tensor variable
:type wrt: tensor variable or list of tensor variables
:type g_cost: same as type of `cost`
:type consider_constant: list of variables
:type warn_type: bool
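As a numeric illustration of what this function computes (plain NumPy, not Theano; the ``cost`` helper here is hypothetical): for the scalar cost :math:`C(x) = \sum_i x_i^2`, the symbolic gradient wrt ``x`` is :math:`2x`, which can be checked against a central finite difference.

```python
import numpy as np

def cost(x):
    """Hypothetical scalar cost: C(x) = sum(x_i ** 2)."""
    return (x ** 2).sum()

x = np.array([0.5, -1.0, 2.0])
analytic = 2.0 * x  # what grad(cost, x) would return, evaluated at x

# Central finite difference along each coordinate as a sanity check.
eps = 1e-6
numeric = np.array([
    (cost(x + eps * np.eye(len(x))[i]) - cost(x - eps * np.eye(len(x))[i]))
    / (2 * eps)
    for i in range(len(x))
])
```

In Theano the same quantity would be built symbolically and compiled, rather than evaluated numerically as done here.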
......