提交 ce0a8359 authored 作者: nouiz's avatar nouiz

Merge pull request #897 from lamblin/complex_gradient_proposal

Add proposal of math framework for complex grad
.. complex_grad_proposal:
===========================================
Proposal for gradient wrt complex variables
===========================================
This is a proposal to handle gradients of a scalar, real variable
(usually, a cost) with respect to tensor variables, of complex (and
real) type, in an optimization perspective.
Derivative of complex variables is usually studied only for so-called
*analytical* complex functions, which have a particular structure in
their partial derivatives. However, we do not want to limit ourselves
to analytical functions, and we make other assumptions (that the final
cost is real-valued, for instance), so **we will adopt a different
convention** for gradients than what is usually used in the literature.
Gradient (re-)definition
========================
We are interested in the case where we have a final real-valued
cost, :math:`C`, and a graph of mathematical expressions, including
real-valued and complex-valued variables (scalars, vectors, matrices,
higher-order tensors), and we want to compute the gradient of :math:`C`,
wrt some variables in that graph, using gradient back-propagation.
In the case where some variables are complex, the usual chain rule
cannot be applied, except in some cases.
For each real-valued variable :math:`r` (not necessarily scalar,
it could be a matrix, for instance), in particular :math:`\Re
v` and :math:`\Im v`, *partial derivatives* can be defined:
:math:`\frac{\partial C}{\partial r}` has the same number of dimensions
and shape as :math:`r`. We will limit that notation to real-valued
variables only, this way, the partial derivative itself will be
real-valued too. We will **not** use that notation for the complex
derivative of analytical complex functions.
For any real-valued intermediate variable :math:`t`, the usual chain
rule applies:
.. math::
\frac{\partial C}{\partial r} = \frac{\partial C}{\partial t} \frac{\partial t}{\partial r}
If :math:`z` is a complex variable, with :math:`\Re z = x` and
:math:`\Im z = y`, we can consider :math:`x` and :math:`y` as free
variables, and then:
.. math::
\frac{\partial C}{\partial r} = \frac{\partial C}{\partial x} \frac{\partial x}{\partial r} + \frac{\partial C}{\partial y} \frac{\partial y}{\partial r}
If we want to use an algorithm similar to gradient backpropagation,
we can see that, here, we need to have both :math:\frac{\partial
C}{\partial \Re t} and :math:\frac{\partial C}{\partial \Im t}, in order
to compute :math:`\frac{\partial C}{\partial r}.
For each variable :math:`v` in the expression graph, let us denote
:math:`\nabla_C(v)` the *gradient* of :math:`C` with respect to
:math:`v`. It is a tensor with the same dimensions as :math:`v`, and can
be complex-valued. We define:
.. math::
\nabla_C(v) = \frac{\partial C}{\partial \Re v} + i \frac{\partial C}{\partial \Im v}
This is the tensor that we are going to back-propagate through the
computation graph.
Generalized chain rule
======================
Using the definition above, if we have two complex variables :math:`z = x + iy` and :math:`t = r + is` (with :math:`x, y, r, s` all real-valued):
.. math::
\nabla_C(z) &= \frac{\partial C}{\partial \Re z} + i \frac{\partial C}{\partial \Im z} \\
&= \frac{\partial C}{\partial x} + i \frac{\partial C}{\partial y}
\nabla_C(t) &= \frac{\partial C}{\partial \Re t} + i \frac{\partial C}{\partial \Im t} \\
&= \frac{\partial C}{\partial r} + i \frac{\partial C}{\partial s} \\
&= \left(\frac{\partial C}{\partial x} \frac{\partial x}{\partial r} +
\frac{\partial C}{\partial y} \frac{\partial y}{\partial r}\right) +
i \left(\frac{\partial C}{\partial x} \frac{\partial x}{\partial s} +
\frac{\partial C}{\partial y} \frac{\partial y}{\partial s}\right) \\
&= \frac{\partial C}{\partial x} \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right) +
\frac{\partial C}{\partial y} \left(\frac{\partial y}{\partial r} + i \frac{\partial y}{\partial s}\right) \\
&= \Re \left(\nabla_C(z)\right) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right) +
\Im \left(\nabla_C(z)\right) \left(\frac{\partial y}{\partial r} + i \frac{\partial y}{\partial s}\right)
This formula can be used whether or not :math:`C` is an analytical
function of :math:`z` or :math:`t`, and whether or not :math:`z` is an
analytical function of :math:`t`.
Special cases
=============
Real-valued input variable
--------------------------
If variable :math:`x` is defined as real-valued, it can sometimes
be useful to have the value of :math:`\nabla_C(z)` instead of only
:math:`\frac{\partial C}{\partial x}`, because the imaginary part
contains information on how the cost would change if :math:`y` was not
constrained to be 0.
Real-valued intermediate variable
---------------------------------
When :math:`x` is an intermediate variable, however, the gradient of
:math:`C` wrt :math:`t` must not be backpropagated through :math:`y`.
Therefore, we have:
.. math::
\nabla_C(t) &= \frac{\partial C}{\partial r} + i \frac{\partial C}{\partial s} \\
&= \frac{\partial C}{\partial x} \frac{\partial x}{\partial r} +
i \frac{\partial C}{\partial x} \frac{\partial x}{\partial s} \\
&= \Re \left(\nabla_C(z)\right) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right)
The imaginary part of :math:`\nabla_C(z)` is ignored, because
:math:`\Im z` is constrained to be 0.
Analytic functions
------------------
If :math:`z` is the output of an analytic function of :math:`t`, some
simplifications are possible. Analytic functions include, for instance,
polynomial functions, the exponential function. Most complex functions,
however, are not: absolute value, real part, imaginary part, complex
conjugate, etc.
Analytic (or holomorphic) functions satisfy the Cauchy-Riemann equations:
.. math::
\frac{\partial \Re z}{\partial \Re t} = \frac{\partial \Im z}{\partial \Im t} \text{ and } \frac{\partial \Re z}{\partial \Im t} = - \frac{\partial \Im z}{\partial \Re t}
Or, in our case:
.. math::
\frac{\partial x}{\partial r} = \frac{\partial y}{\partial t} \text{ and } \frac{\partial x}{\partial s} = - \frac{\partial y}{\partial r}
This leads to:
.. math::
\nabla_C(t) &= \Re \left(\nabla_C(z)\right) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right) +
\Im \left(\nabla_C(z)\right) \left(\frac{\partial y}{\partial r} + i \frac{\partial y}{\partial s}\right) \\
&= \Re \left(\nabla_C(z)\right) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right) +
\Im \left(\nabla_C(z)\right) \left(- \frac{\partial x}{\partial s} + i \frac{\partial x}{\partial r}\right) \\
&= \Re \left(\nabla_C(z)\right) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right) +
i \Im \left(\nabla_C(z)\right) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right) \\
\nabla_C(t) &= \nabla_C(z) \left(\frac{\partial x}{\partial r} + i \frac{\partial x}{\partial s}\right)
= - i \nabla_C(z) \left(\frac{\partial y}{\partial r} + i \frac{\partial y}{\partial s}\right)
Finite differences
==================
In order to verify that the mathematical formula for a gradient, or its
implementation, is correct, we usually use a finite-differenciation
approach. If :math:`C` is our real scalar cost, and :math:`x` a
real-valued scalar variable, then:
.. math::
\frac{\partial C}{\partial x} \approx \frac{C(x + \varepsilon) - C(x)}{\varepsilon}
where :math:`\varepsilon` is also a real scalar, of small magnitude
(typically :math:`10^{-6}` to :math:`10^{-4}`). If :math:`x` is a
tensor, then this approximation has to be made for each element
:math:`x_i` independently (a different :math:`\varepsilon_i` could be used
each time, but usually they are all equal to :math:`\varepsilon`).
For a complex scalar variable :math:`z = x + iy`:
.. math::
\nabla_C(z) &= \frac{\partial C}{\partial x} + i \frac{\partial C}{\partial y}\\
\nabla_C(z) &\approx \frac{C(z + \delta) - C(z)}{\delta} + i \frac{C(z + i \varepsilon) - C(z)}{\varepsilon}
Both partial derivative have to be estimated independently, using
generally :math:`\delta = \varepsilon`.
Markdown 格式
0%
您添加了 0 到此讨论。请谨慎行事。
请先完成此评论的编辑!
注册 或者 后发表评论