Commit dec92d76 authored by Eric Larsen, committed by Frederic

Revise docs: keepdims and Op's grad contract

Parent d320f322
@@ -19,12 +19,14 @@ following methods:
.. function:: make_node(*inputs)
    This method is responsible for creating output Variables of a
    suitable symbolic Type to serve as the outputs of this Op's
    application. The Variables found in ``*inputs`` must be operated on
    using Theano's symbolic language to compute the symbolic output
    Variables. This method should put these outputs into an Apply
    instance, and return the Apply instance.
    This method creates an Apply node representing the application of
    the Op on the inputs provided. If the Op cannot be applied to
    these inputs, it must raise an appropriate exception.
    The inputs of the Apply instance returned by this call must be
@@ -33,25 +35,27 @@ following methods:
.. function:: perform(node, inputs, output_storage)
    This method computes the function associated to this Op. ``node`` is
    an Apply node created by the Op's ``make_node`` method. ``inputs``
    is a list of references to data to operate on using non-symbolic
    statements (i.e., statements in Python, NumPy, or C).
    ``output_storage`` is a list of storage cells where the variables of
    the computation must be put.

    More specifically:
    - ``node``: This is a reference to an Apply node which was
      previously obtained via the ``Op``'s ``make_node`` method. It is
      typically not used in simple Ops, but it contains symbolic
      information that could be required for complex Ops.
    - ``inputs``: This is a list of data from which the values stored in
      ``output_storage`` are to be computed using non-symbolic language.
    - ``output_storage``: This is a list of storage cells where the
      output is to be stored. A storage cell is a one-element list. It
      is forbidden to change the length of the list(s) contained in
      ``output_storage``. There is one storage cell for each output of
      the Op.
    The data put in ``output_storage`` must match the type of the
    symbolic output. This is a situation where the ``node`` argument
    can come in handy.
@@ -96,45 +100,65 @@ following methods:
.. function:: grad(inputs, output_gradients)
    Optional (but needed to have it work with {tensor,sparse}.grad()).
    If the Op being defined is differentiable, its gradient may be
    specified symbolically in this method. Both ``inputs`` and
    ``output_gradients`` are lists of symbolic Theano Variables, and
    those must be operated on using Theano's symbolic language. The grad
    method must return a list containing one Variable (or ``None``) for
    each input. Each returned Variable represents the gradient with
    respect to that input, computed based on the symbolic gradients with
    respect to each output.
    If the output is not differentiable with respect to any of the
    inputs, then this method should be defined to return
    ``[None for i in inputs]``. If this method is not defined, then
    Theano assumes it has been forgotten. Symbolic differentiation will
    fail on a graph that includes this Op.
    It must be understood that the grad method is not meant to return
    the gradient of the Op's output but rather the gradient of some
    other scalar criterion C with respect to the Op's input.
    In essence, the grad method must simply implement, through symbolic
    Variables and operations, the chain rule of differential calculus.
    The chain rule is the mathematical procedure that allows one to
    calculate the total derivative :math:`\frac{d C}{d x}` of the final
    scalar symbolic Variable C with respect to a primitive symbolic
    Variable x found in the list ``inputs``, based on the knowledge of
    the total derivative :math:`\frac{d C}{d f}` of C with respect to a
    symbolic Variable f that is returned by the Op (this is provided in
    ``output_gradients``), as well as the knowledge of the total
    derivative :math:`\frac{d f}{d x}` of the latter with respect to the
    primitive Variable (this has to be computed).
    In mathematics, the total derivative of a scalar variable (C) with
    respect to a vector of scalar variables (x), i.e. the gradient, is
    customarily represented as the row vector of the partial
    derivatives, whereas the total derivative of a vector of scalar
    variables (f) with respect to another (x) is customarily represented
    by the matrix of the partial derivatives, i.e. the Jacobian matrix.
    In this convenient setting, the chain rule states that the gradient
    of the final scalar variable C with respect to the primitive scalar
    variables in x, through those in f, is simply given by the matrix
    product :math:`\frac{d C}{d x} = \frac{d C}{d f} \cdot \frac{d f}{d x}`.
    Here, the chain rule must be implemented in a similar but slightly
    more complex setting: Theano provides in the list
    ``output_gradients`` one gradient for each of the Variables returned
    by the Op. Where f is one such particular Variable, the
    corresponding gradient found in ``output_gradients`` and
    representing :math:`\frac{d C}{d f}` is provided with a shape
    similar to that of f, and thus not necessarily as a row vector of
    scalars. Furthermore, for each Variable x in the Op's list of input
    variables ``inputs``, the returned gradient representing
    :math:`\frac{d C}{d x}` must have a shape similar to that of
    Variable x.
    If the output list of the Op is :math:`[f_1, ..., f_n]`, then the
    list ``output_gradients`` is
    :math:`[grad_{f_1}(C), grad_{f_2}(C), ..., grad_{f_n}(C)]`. If
    ``inputs`` consists of the list :math:`[x_1, ..., x_m]`, then
    Op.grad should return the list
    :math:`[grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C)]`, where
    :math:`(grad_{y}(Z))_i = \frac{\partial Z}{\partial y_i}` (and
    :math:`i` can stand for multiple dimensions).
    In other words, :func:`grad` does not return
    :math:`\frac{d f_i}{d x_j}`, but instead the appropriate dot product
    specified by the chain rule:
    :math:`\frac{d C}{d x_j} = \frac{d C}{d f_i} \cdot \frac{d f_i}{d x_j}`.
    Both the partial differentiation and the multiplication have to be
    performed by :func:`grad`.
.. function:: infer_shape(node, shapes)
...
@@ -650,16 +650,17 @@ Reductions
.. function:: max(x, axis=None, keepdims=False)
    :Parameter: *x* - symbolic Tensor (or compatible)
    :Parameter: *axis* - axis or axes along which to compute the maximum
    :Parameter: *keepdims* - (boolean) If this is set to True, the axes
        which are reduced are left in the result as dimensions with size
        one. With this option, the result will broadcast correctly
        against the original tensor.
    :Returns: maximum of *x* along *axis*

    :note: see maximum for elemwise max

    axis can be:

    * *None* - in which case the maximum is computed along all axes
      (like numpy)
    * an *int* - computed along this axis
    * a *list of ints* - computed along these axes
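    The keepdims semantics described here match NumPy's; a quick sketch
    of why keeping the reduced axis helps broadcasting (the array values
    are arbitrary examples):

    ```python
    import numpy as np

    x = np.array([[1.0, 5.0, 2.0],
                  [4.0, 3.0, 6.0]])

    # With keepdims=True the reduced axis survives with size one:
    m = x.max(axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
    print(m.ravel())                   # -> [5. 6.]

    # Because of that size-one axis, the result broadcasts cleanly
    # against the original tensor, with no manual reshape:
    normalized = x / m
    print(normalized.max(axis=1))      # -> [1. 1.]
    ```

    Without ``keepdims``, ``x / x.max(axis=1)`` would fail here, since a
    shape ``(2,)`` result does not broadcast against ``(2, 3)``.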
.. function:: argmax(x, axis=None, keepdims=False)
@@ -687,16 +688,17 @@ Reductions
.. function:: min(x, axis=None, keepdims=False)
    :Parameter: *x* - symbolic Tensor (or compatible)
    :Parameter: *axis* - axis or axes along which to compute the minimum
    :Parameter: *keepdims* - (boolean) If this is set to True, the axes
        which are reduced are left in the result as dimensions with size
        one. With this option, the result will broadcast correctly
        against the original tensor.
    :Returns: minimum of *x* along *axis*

    :note: see minimum for elemwise min

    axis can be:

    * *None* - in which case the minimum is computed along all axes
      (like numpy)
    * an *int* - computed along this axis
    * a *list of ints* - computed along these axes
.. function:: argmin(x, axis=None, keepdims=False)
...