Commit 4c685afb authored by Rémi Louf, committed by Brandon T. Willard

Remove mentions of `aesara.tensor.signal` and `aesara.tensor.nnet` in documentation

Parent cf4709d8
......@@ -209,29 +209,6 @@ Here is an example showing how to use :func:`verify_grad` on an :class:`Op` inst
rng = np.random.default_rng(42)
aesara.gradient.verify_grad(at.Flatten(), [a_val], rng=rng)
Here is another example, showing how to verify the gradient w.r.t. a subset of
an :class:`Op`'s inputs. This is useful in particular when the gradient w.r.t. some of
the inputs cannot be computed by finite difference (e.g. for discrete inputs),
which would cause :func:`verify_grad` to crash.
.. testcode::
def test_crossentropy_softmax_grad():
op = at.nnet.crossentropy_softmax_argmax_1hot_with_bias
def op_with_fixed_y_idx(x, b):
# Input `y_idx` of this `Op` takes integer values, so we fix them
# to some constant array.
# Although this `Op` has multiple outputs, we can return only one.
# Here, we return the first output only.
return op(x, b, y_idx=np.asarray([0, 2]))[0]
x_val = np.asarray([[-1, 0, 1], [3, 2, 1]], dtype='float64')
b_val = np.asarray([1, 2, 3], dtype='float64')
rng = np.random.default_rng(42)
aesara.gradient.verify_grad(op_with_fixed_y_idx, [x_val, b_val], rng=rng)
.. note::
Although :func:`verify_grad` is defined in :mod:`aesara.gradient`, unittests
......
......@@ -104,7 +104,7 @@
"\n",
"wy = th.shared(rng.normal(0, 1, (nhiddens, noutputs)))\n",
"by = th.shared(np.zeros(noutputs), borrow=True)\n",
"y = at.nnet.softmax(at.dot(h, wy) + by)\n",
"y = at.math.softmax(at.dot(h, wy) + by)\n",
"\n",
"predict = th.function([x], y)"
]
......
......@@ -67,7 +67,7 @@ hidden layer and a softmax output layer.
wy = th.shared(rng.normal(0, 1, (nhiddens, noutputs)))
by = th.shared(np.zeros(noutputs), borrow=True)
y = at.nnet.softmax(at.dot(h, wy) + by)
y = at.math.softmax(at.dot(h, wy) + by)
predict = th.function([x], y)
......
.. _libdoc_neighbours:
===================================================================
:mod:`sandbox.neighbours` -- Neighbours Ops
===================================================================
.. module:: sandbox.neighbours
:platform: Unix, Windows
:synopsis: Neighbours Ops
.. moduleauthor:: LISA
:ref:`Moved <libdoc_tensor_nnet_neighbours>`
......@@ -18,9 +18,7 @@ They are grouped into the following sections:
:maxdepth: 1
basic
nnet/index
random/index
signal/index
utils
elemwise
extra_ops
......
.. _libdoc_tensor_nnet_basic:
======================================================
:mod:`basic` -- Basic Ops for neural networks
======================================================
.. module:: aesara.tensor.nnet.basic
:platform: Unix, Windows
:synopsis: Ops for neural networks
.. moduleauthor:: LISA
- Sigmoid
- :func:`sigmoid`
- :func:`ultra_fast_sigmoid`
- :func:`hard_sigmoid`
- Others
- :func:`softplus`
- :func:`softmax`
- :func:`softsign`
- :func:`relu() <aesara.tensor.nnet.relu>`
- :func:`elu() <aesara.tensor.nnet.elu>`
- :func:`selu() <aesara.tensor.nnet.selu>`
- :func:`binary_crossentropy`
- :func:`sigmoid_binary_crossentropy`
- :func:`.categorical_crossentropy`
- :func:`h_softmax() <aesara.tensor.nnet.h_softmax>`
- :func:`confusion_matrix <aesara.tensor.nnet.confusion_matrix>`
.. function:: sigmoid(x)
Returns the standard sigmoid nonlinearity applied to x
:Parameters: *x* - symbolic Tensor (or compatible)
:Return type: same as x
:Returns: element-wise sigmoid: :math:`sigmoid(x) = \frac{1}{1 + \exp(-x)}`.
:note: see :func:`ultra_fast_sigmoid` or :func:`hard_sigmoid` for faster versions.
Speed comparison for 100M float64 elements on a Core2 Duo @ 3.16 GHz:
- hard_sigmoid: 1.0s
- ultra_fast_sigmoid: 1.3s
- sigmoid (with amdlibm): 2.3s
- sigmoid (without amdlibm): 3.7s
Precision: sigmoid(with or without amdlibm) > ultra_fast_sigmoid > hard_sigmoid.
.. image:: sigmoid_prec.png
Example:
.. testcode::
import aesara.tensor as at
x, y, b = at.dvectors('x', 'y', 'b')
W = at.dmatrix('W')
y = at.sigmoid(at.dot(W, x) + b)
.. note:: The underlying code will return an exact 0 or 1 if an
element of x is too small or too big.
.. function:: ultra_fast_sigmoid(x)
Returns an approximate standard :func:`sigmoid` nonlinearity applied to ``x``.
:Parameters: ``x`` - symbolic Tensor (or compatible)
:Return type: same as ``x``
:Returns: approximated element-wise sigmoid: :math:`sigmoid(x) = \frac{1}{1 + \exp(-x)}`.
:note: To automatically change all :func:`sigmoid`\ :class:`Op`\s to this version, use
the Aesara rewrite `local_ultra_fast_sigmoid`. This can be done
with the Aesara flag ``optimizer_including=local_ultra_fast_sigmoid``.
This rewrite is done late, so it should not affect stabilization rewrites.
.. note:: The underlying code will return 0.00247262315663 as the
minimum value and 0.997527376843 as the maximum value. So it
never returns 0 or 1.
.. note:: Using `ultra_fast_sigmoid` directly in the graph will
disable the stabilization rewrites associated with it, but using
the rewrite to insert it won't disable the
stability rewrites.
.. function:: hard_sigmoid(x)
Returns an approximate standard :func:`sigmoid` nonlinearity applied to ``x``.
:Parameters: ``x`` - symbolic Tensor (or compatible)
:Return type: same as ``x``
:Returns: approximated element-wise sigmoid: :math:`sigmoid(x) = \frac{1}{1 + \exp(-x)}`.
:note: To automatically change all :func:`sigmoid`\ :class:`Op`\s to this version, use
the Aesara rewrite `local_hard_sigmoid`. This can be done
with the Aesara flag ``optimizer_including=local_hard_sigmoid``.
This rewrite is done late, so it should not affect
stabilization rewrites.
.. note:: The underlying code will return an exact 0 or 1 if an
element of ``x`` is too small or too big.
.. note:: Using `hard_sigmoid` directly in the graph will
disable the stabilization rewrites associated with it, but using
the rewrite to insert it won't disable the
stability rewrites.
.. function:: softplus(x)
Returns the softplus nonlinearity applied to x
:Parameter: *x* - symbolic Tensor (or compatible)
:Return type: same as x
:Returns: element-wise softplus: :math:`softplus(x) = \log_e{\left(1 + \exp(x)\right)}`.
.. note:: The underlying code will return an exact 0 if an element of x is too small.
.. testcode::
x, y, b = at.dvectors('x', 'y', 'b')
W = at.dmatrix('W')
y = at.nnet.softplus(at.dot(W,x) + b)
.. function:: softsign(x)
Return the element-wise softsign activation function
:math:`\varphi(x) = \frac{x}{1+|x|}`
.. function:: softmax(x)
Returns the softmax function of x:
:Parameter: *x* symbolic **2D** Tensor (or compatible).
:Return type: same as x
:Returns: a symbolic 2D tensor whose ijth element is :math:`softmax_{ij}(x) = \frac{\exp{x_{ij}}}{\sum_k\exp(x_{ik})}`.
The softmax function will, when applied to a matrix, compute the softmax values row-wise.
:note: This also supports Hessian-free optimization. The code of
the softmax `Op` is more numerically stable because it uses this code:
.. code-block:: python
e_x = exp(x - x.max(axis=1, keepdims=True))
out = e_x / e_x.sum(axis=1, keepdims=True)
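The effect of this stabilization can be checked with a plain-Python sketch (illustration only, not the Aesara implementation):

```python
import math

def softmax_row(row):
    # Subtract the row maximum before exponentiating, as in the
    # stabilized code above, so large inputs cannot overflow exp().
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

out = softmax_row([1.0, 2.0, 3.0])
# Each softmax row sums to one.
assert abs(sum(out) - 1.0) < 1e-12
# Shifting all inputs by a constant leaves the result unchanged,
# even for shifts that would overflow a naive exp(x).
shifted = softmax_row([1001.0, 1002.0, 1003.0])
assert all(abs(a - b) < 1e-12 for a, b in zip(out, shifted))
```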
Example of use:
.. testcode::
x, y, b = at.dvectors('x', 'y', 'b')
W = at.dmatrix('W')
y = at.nnet.softmax(at.dot(W,x) + b)
.. autofunction:: aesara.tensor.nnet.relu
.. autofunction:: aesara.tensor.nnet.elu
.. autofunction:: aesara.tensor.nnet.selu
.. function:: binary_crossentropy(output,target)
Computes the binary cross-entropy between a target and an output:
:Parameters:
* *target* - symbolic Tensor (or compatible)
* *output* - symbolic Tensor (or compatible)
:Return type: same as target
:Returns: a symbolic tensor, where the following is applied element-wise :math:`crossentropy(t,o) = -(t\cdot log(o) + (1 - t) \cdot log(1 - o))`.
The following block implements a simple auto-associator with a
sigmoid nonlinearity and a reconstruction error which corresponds
to the binary cross-entropy (note that this assumes that x will
contain values between 0 and 1):
.. testcode::
x, y, b, c = at.dvectors('x', 'y', 'b', 'c')
W = at.dmatrix('W')
V = at.dmatrix('V')
h = at.sigmoid(at.dot(W, x) + b)
x_recons = at.sigmoid(at.dot(V, h) + c)
recon_cost = at.nnet.binary_crossentropy(x_recons, x).mean()
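The element-wise formula itself can be sketched in plain Python (illustration only; the real `Op` works symbolically on tensors):

```python
import math

def binary_crossentropy(output, target):
    # Element-wise -(t*log(o) + (1 - t)*log(1 - o)), as in the formula above.
    return [-(t * math.log(o) + (1 - t) * math.log(1 - o))
            for o, t in zip(output, target)]

costs = binary_crossentropy([0.9, 0.1], [1.0, 0.0])
# Both predictions are "90% confident and correct", so both cost -log(0.9).
assert all(abs(c + math.log(0.9)) < 1e-12 for c in costs)
```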
.. function:: sigmoid_binary_crossentropy(output,target)
Computes the binary cross-entropy between a target and the sigmoid of an output:
:Parameters:
* *target* - symbolic Tensor (or compatible)
* *output* - symbolic Tensor (or compatible)
:Return type: same as target
:Returns: a symbolic tensor, where the following is applied element-wise :math:`crossentropy(o,t) = -(t\cdot log(sigmoid(o)) + (1 - t) \cdot log(1 - sigmoid(o)))`.
It is equivalent to `binary_crossentropy(sigmoid(output), target)`,
but with more efficient and numerically stable computation, especially when
taking gradients.
The following block implements a simple auto-associator with a
sigmoid nonlinearity and a reconstruction error which corresponds
to the binary cross-entropy (note that this assumes that x will
contain values between 0 and 1):
.. testcode::
x, y, b, c = at.dvectors('x', 'y', 'b', 'c')
W = at.dmatrix('W')
V = at.dmatrix('V')
h = at.sigmoid(at.dot(W, x) + b)
x_precons = at.dot(V, h) + c
# final reconstructions are given by sigmoid(x_precons), but we leave
# them unnormalized as sigmoid_binary_crossentropy applies sigmoid
recon_cost = at.nnet.sigmoid_binary_crossentropy(x_precons, x).mean()
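Why the fused version is preferable can be seen from a plain-Python sketch. The stable rewriting below is a common one, used here as an assumption for illustration; it is not necessarily the exact code the `Op` uses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive(o, t):
    # binary_crossentropy(sigmoid(o), t), computed literally.
    s = sigmoid(o)
    return -(t * math.log(s) + (1 - t) * math.log(1 - s))

def stable(o, t):
    # A common numerically stable rewriting of the same quantity:
    # max(o, 0) - o*t + log1p(exp(-|o|)).
    return max(o, 0.0) - o * t + math.log1p(math.exp(-abs(o)))

# The two forms agree for moderate logits.
for o, t in [(2.0, 1.0), (-3.0, 0.0), (0.5, 1.0)]:
    assert abs(naive(o, t) - stable(o, t)) < 1e-9

# The stable form also survives logits where sigmoid saturates to
# exactly 1.0 in float arithmetic; naive() would take log(0) here.
assert stable(40.0, 0.0) > 39.0
```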
.. function:: categorical_crossentropy(coding_dist,true_dist)
Return the cross-entropy between an approximating distribution and a true distribution.
The cross entropy between two probability distributions measures the average number of bits
needed to identify an event from a set of possibilities, if a coding scheme is used based
on a given probability distribution q, rather than the "true" distribution p. Mathematically, this
function computes :math:`H(p,q) = - \sum_x p(x) \log(q(x))`, where
p=true_dist and q=coding_dist.
:Parameters:
* *coding_dist* - symbolic 2D Tensor (or compatible). Each row
represents a distribution.
* *true_dist* - symbolic 2D Tensor **OR** symbolic vector of ints. In
the case of an integer vector argument, each element represents the
position of the '1' in a 1-of-N encoding (aka "one-hot" encoding)
:Return type: tensor of rank one-less-than `coding_dist`
.. note:: An application of the scenario where *true_dist* has a
1-of-N representation is in classification with softmax
outputs. If `coding_dist` is the output of the softmax and
`true_dist` is a vector of correct labels, then the function
will compute ``y_i = - \log(coding_dist[i, one_of_n[i]])``,
which corresponds to computing the neg-log-probability of the
correct class (which is typically the training criterion in
classification settings).
.. testsetup::
import aesara
o = aesara.tensor.ivector()
.. testcode::
y = at.nnet.softmax(at.dot(W, x) + b)
cost = at.nnet.categorical_crossentropy(y, o)
# o is either the above-mentioned 1-of-N vector or 2D tensor
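For the integer-vector case, the computation reduces to the sketch below (plain Python, illustration only):

```python
import math

def categorical_crossentropy(coding_dist, labels):
    # One cost per row: -log(q[i, labels[i]]), the 1-of-N case described above.
    return [-math.log(row[lbl]) for row, lbl in zip(coding_dist, labels)]

q = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1]]
costs = categorical_crossentropy(q, [0, 1])
assert abs(costs[0] + math.log(0.7)) < 1e-12
assert abs(costs[1] + math.log(0.8)) < 1e-12
```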
.. autofunction:: aesara.tensor.nnet.h_softmax
.. _libdoc_tensor_nnet_batchnorm:
=======================================
:mod:`batchnorm` -- Batch Normalization
=======================================
.. module:: tensor.nnet.batchnorm
:platform: Unix, Windows
:synopsis: Batch Normalization
.. moduleauthor:: LISA
.. autofunction:: aesara.tensor.nnet.batchnorm.batch_normalization_train
.. autofunction:: aesara.tensor.nnet.batchnorm.batch_normalization_test
.. seealso:: cuDNN batch normalization: :class:`aesara.gpuarray.dnn.dnn_batch_normalization_train`, :class:`aesara.gpuarray.dnn.dnn_batch_normalization_test`.
.. autofunction:: aesara.tensor.nnet.batchnorm.batch_normalization
.. _libdoc_blocksparse:
===============================================================================
:mod:`blocksparse` -- Block sparse dot operations (gemv and outer)
===============================================================================
.. module:: tensor.nnet.blocksparse
:platform: Unix, Windows
:synopsis: Block sparse dot
.. moduleauthor:: LISA
.. automodule:: aesara.tensor.nnet.blocksparse
:members:
.. _libdoc_tensor_nnet_conv:
==========================================================
:mod:`conv` -- Ops for convolutional neural nets
==========================================================
.. note::
Two similar implementations exist for conv2d:
:func:`signal.conv2d <aesara.tensor.signal.conv.conv2d>` and
:func:`nnet.conv2d <aesara.tensor.nnet.conv2d>`.
The former implements a traditional
2D convolution, while the latter implements the convolutional layers
present in convolutional neural networks (where filters are 3D and pool
over several input channels).
.. module:: conv
:platform: Unix, Windows
:synopsis: ops for signal processing
.. moduleauthor:: LISA
The recommended user interfaces are:
- :func:`aesara.tensor.nnet.conv2d` for 2d convolution
- :func:`aesara.tensor.nnet.conv3d` for 3d convolution
With these interfaces, Aesara will automatically use the fastest
implementation in many cases. On the CPU, the implementation is
GEMM-based.
This auto-tuning has the inconvenience that the first call is much
slower, as it tries and times each available implementation. If you
benchmark, it is therefore important to exclude the first call from
your timing.
Implementation Details
======================
This section gives more implementation details. Most of the time you do
not need to read it, as Aesara selects the implementation for you.
- Implemented operators for neural network 2D / image convolution:
- :func:`nnet.conv.conv2d <aesara.tensor.nnet.conv.conv2d>`.
old 2d convolution. DO NOT USE ANYMORE.
For each element in a batch, it first creates a
`Toeplitz <http://en.wikipedia.org/wiki/Toeplitz_matrix>`_ matrix in a CUDA kernel.
Then, it performs a ``gemm`` call to multiply this Toeplitz matrix and the filters
(hence the name: MM is for matrix multiplication).
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
``(no of channels * filter width * filter height, output width * output height)``.
- :func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`
This is a CPU-only 2d correlation implementation taken from
`caffe's cpp implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp>`_.
It does not flip the kernel.
- Implemented operators for neural network 3D / video convolution:
- :func:`Corr3dMM <aesara.tensor.nnet.corr3d.Corr3dMM>`
This is a CPU-only 3d correlation implementation based on
the 2d version (:func:`CorrMM <aesara.tensor.nnet.corr.CorrMM>`).
It does not flip the kernel. As it provides a gradient, you can use it as a
replacement for nnet.conv3d. For convolutions done on CPU,
nnet.conv3d will be replaced by Corr3dMM.
- :func:`conv3d2d <aesara.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses the conv2d with data reshaping.
It is faster in some corner cases than conv3d. It flips the kernel.
.. autofunction:: aesara.tensor.nnet.conv2d
.. autofunction:: aesara.tensor.nnet.conv2d_transpose
.. autofunction:: aesara.tensor.nnet.conv3d
.. autofunction:: aesara.tensor.nnet.conv3d2d.conv3d
.. autofunction:: aesara.tensor.nnet.conv.conv2d
.. automodule:: aesara.tensor.nnet.abstract_conv
:members:
.. _libdoc_tensor_nnet_ctc:
==================================================================================
:mod:`aesara.tensor.nnet.ctc` -- Connectionist Temporal Classification (CTC) loss
==================================================================================
.. note::
Usage of connectionist temporal classification (CTC) loss Op, requires that
the `warp-ctc <https://github.com/baidu-research/warp-ctc>`_ library is
available. In case the warp-ctc library is not in your compiler's library path,
the ``config.ctc__root`` configuration option must be appropriately set to the
directory containing the warp-ctc library files.
.. note::
This interface is the preferred interface.
.. note::
Unfortunately, Windows platforms are not yet supported by the underlying
library.
.. module:: aesara.tensor.nnet.ctc
:platform: Unix
:synopsis: Connectionist temporal classification (CTC) loss Op, using the warp-ctc library
.. moduleauthor:: `João Victor Risso <https://github.com/joaovictortr>`_
.. autofunction:: aesara.tensor.nnet.ctc.ctc
.. autoclass:: aesara.tensor.nnet.ctc.ConnectionistTemporalClassification
.. _libdoc_tensor_nnet:
==================================================
:mod:`nnet` -- Ops related to neural networks
==================================================
.. module:: aesara.tensor.nnet
:platform: Unix, Windows
:synopsis: various ops relating to neural networks
.. moduleauthor:: LISA
Aesara was originally developed for machine learning applications, particularly
for the topic of deep learning. As such, our lab has developed many functions
and ops which are particular to neural networks and deep learning.
.. toctree::
:maxdepth: 1
conv
basic
neighbours
batchnorm
blocksparse
ctc
.. _libdoc_tensor_nnet_neighbours:
=======================================================================
:mod:`neighbours` -- Ops for working with images in convolutional nets
=======================================================================
.. module:: aesara.tensor.nnet.neighbours
:platform: Unix, Windows
:synopsis: Ops for working with images in conv nets
.. moduleauthor:: LISA
Functions
=========
.. autofunction:: aesara.tensor.nnet.neighbours.images2neibs
.. autofunction:: aesara.tensor.nnet.neighbours.neibs2images
See also
========
- :ref:`indexing`
- :ref:`lib_scan`
.. _libdoc_tensor_signal_conv:
======================================================
:mod:`conv` -- Convolution
======================================================
.. note::
Two similar implementations exist for conv2d:
:func:`signal.conv2d <aesara.tensor.signal.conv.conv2d>` and
:func:`nnet.conv2d <aesara.tensor.nnet.conv.conv2d>`.
The former implements a traditional
2D convolution, while the latter implements the convolutional layers
present in convolutional neural networks (where filters are 3D and pool
over several input channels).
.. module:: aesara.tensor.signal.conv
:platform: Unix, Windows
:synopsis: ops for performing convolutions
.. moduleauthor:: LISA
.. autofunction:: aesara.tensor.signal.conv.conv2d
.. function:: fft(*todo)
[James has some code for this, but hasn't gotten it into the source tree yet.]
.. _libdoc_tensor_signal_downsample:
======================================================
:mod:`downsample` -- Down-Sampling
======================================================
.. module:: downsample
:platform: Unix, Windows
:synopsis: ops for performing various forms of downsampling
.. moduleauthor:: LISA
.. note::
This module is deprecated. Use the functions in :mod:`aesara.tensor.signal.pool` instead.
.. _libdoc_tensor_signal:
=====================================================
:mod:`signal` -- Signal Processing
=====================================================
Signal Processing
-----------------
.. module:: signal
:platform: Unix, Windows
:synopsis: various ops for performing basic signal processing
(convolutions, subsampling, fft, etc.)
.. moduleauthor:: LISA
The signal subpackage contains ops which are useful for performing various
forms of signal processing.
.. toctree::
:maxdepth: 1
conv
pool
downsample
.. _libdoc_tensor_signal_pool:
======================================================
:mod:`pool` -- Down-Sampling
======================================================
.. module:: pool
:platform: Unix, Windows
:synopsis: ops for performing various forms of downsampling
.. moduleauthor:: LISA
.. seealso:: :func:`aesara.tensor.nnet.neighbours.images2neibs`
.. autofunction:: aesara.tensor.signal.pool.pool_2d
.. autofunction:: aesara.tensor.signal.pool.max_pool_2d_same_size
.. autofunction:: aesara.tensor.signal.pool.pool_3d
......@@ -127,9 +127,6 @@ Could lower the memory usage, but raise computation time:
- :attr:`config.scan__allow_gc` = True
- :attr:`config.scan__allow_output_prealloc` = False
- Use :func:`batch_normalization()
<aesara.tensor.nnet.batchnorm.batch_normalization>`. It uses less memory
than building a corresponding Aesara graph.
- Disable one or more scan rewrites:
- ``optimizer_excluding=scan_pushout_seqs_ops``
- ``optimizer_excluding=scan_pushout_dot1``
......
.. _conv_arithmetic:
===============================
Convolution arithmetic tutorial
===============================
.. note::
This tutorial is adapted from an existing `convolution arithmetic guide
<https://arxiv.org/abs/1603.07285>`_ [#]_, with an added emphasis on
Aesara's interface.
Also, note that the signal processing community has a different nomenclature
and a well established literature on the topic, but for this tutorial
we will stick to the terms used in the machine learning community. For a
signal processing point of view on the subject, see for instance *Winograd,
Shmuel. Arithmetic complexity of computations. Vol. 33. Siam, 1980*.
About this tutorial
===================
Learning to use convolutional neural networks (CNNs) for the first time is
generally an intimidating experience. A convolutional layer's output shape is
affected by the shape of its input as well as the choice of kernel shape, zero
padding and strides, and the relationship between these properties is not
trivial to infer. This contrasts with fully-connected layers, whose output size
is independent of the input size. Additionally, so-called transposed
convolutional layers (also known as fractionally strided convolutional layers,
or -- wrongly -- as deconvolutions) have been employed in more and more work as
of late, and their relationship with convolutional layers has been explained
with various degrees of clarity.
The relationship between a convolution operation's input shape, kernel size,
stride, padding and its output shape can be confusing at times.
The tutorial's objective is threefold:
* Explain the relationship between convolutional layers and transposed
convolutional layers.
* Provide an intuitive understanding of the relationship between input shape,
kernel shape, zero padding, strides and output shape in convolutional and
transposed convolutional layers.
* Clarify Aesara's API on convolutions.
Refresher: discrete convolutions
================================
The bread and butter of neural networks is *affine transformations*: a
vector is received as input and is multiplied with a matrix to produce an
output (to which a bias vector is usually added before passing the result
through a nonlinearity). This is applicable to any type of input, be it an
image, a sound clip or an unordered collection of features: whatever their
dimensionality, their representation can always be flattened into a vector
before the transformation.
Images, sound clips and many other similar kinds of data have an intrinsic
structure. More formally, they share these important properties:
* They are stored as multi-dimensional arrays.
* They feature one or more axes for which ordering matters (e.g., width and
height axes for an image, time axis for a sound clip).
* One axis, called the channel axis, is used to access different views of the
data (e.g., the red, green and blue channels of a color image, or the left
and right channels of a stereo audio track).
These properties are not exploited when an affine transformation is applied; in
fact, all the axes are treated in the same way and the topological information
is not taken into account. Still, taking advantage of the implicit structure of
the data may prove very handy in solving some tasks, like computer vision and
speech recognition, and in these cases it would be best to preserve it. This is
where discrete convolutions come into play.
A discrete convolution is a linear transformation that preserves this notion of
ordering. It is sparse (only a few input units contribute to a given output
unit) and reuses parameters (the same weights are applied to multiple locations
in the input).
Here is an example of a discrete convolution:
.. figure:: conv_arithmetic_figures/numerical_no_padding_no_strides.*
:figclass: align-center
The light blue grid is called the *input feature map*. A *kernel* (shaded area)
of value
.. math::
\begin{pmatrix}
0 & 1 & 2 \\
2 & 2 & 0 \\
0 & 1 & 2
\end{pmatrix}
slides across the input feature map. At each location, the product between each
element of the kernel and the input element it overlaps is computed and the
results are summed up to obtain the output in the current location. The final
output of this procedure is a matrix called *output feature map* (in green).
This procedure can be repeated using different kernels to form as many output
feature maps (a.k.a. *output channels*) as desired. Note also that to keep the
drawing simple a single input feature map is being represented, but it is not
uncommon to have multiple feature maps stacked one onto another (an example of
this is what was referred to earlier as *channels* for images and sound clips).
.. note::
While there is a distinction between convolution and cross-correlation from
a signal processing perspective, the two become interchangeable when the
kernel is learned. For the sake of simplicity and to stay consistent with
most of the machine learning literature, the term *convolution* will be
used in this tutorial.
If there are multiple input and output feature maps, the collection of kernels
forms a 4D array (``output_channels, input_channels, filter_rows,
filter_columns``). For each output channel, each input channel is convolved with
a distinct part of the kernel and the resulting set of feature maps is summed
elementwise to produce the corresponding output feature map. The result of this
procedure is a set of output feature maps, one for each output channel, that is
the output of the convolution.
The convolution depicted above is an instance of a 2-D convolution, but can be
generalized to N-D convolutions. For instance, in a 3-D convolution, the kernel
would be a *cuboid* and would slide across the height, width and depth of the
input feature map.
The collection of kernels defining a discrete convolution has a shape
corresponding to some permutation of :math:`(n, m, k_1, \ldots, k_N)`, where
.. math::
\begin{split}
n &\equiv \text{number of output feature maps},\\
m &\equiv \text{number of input feature maps},\\
k_j &\equiv \text{kernel size along axis $j$}.
\end{split}
The following properties affect the output size :math:`o_j` of a convolutional
layer along axis :math:`j`:
* :math:`i_j`: input size along axis :math:`j`,
* :math:`k_j`: kernel size along axis :math:`j`,
* :math:`s_j`: stride (distance between two consecutive positions of the
kernel) along axis :math:`j`,
* :math:`p_j`: zero padding (number of zeros concatenated at the beginning and
at the end of an axis) along axis :math:`j`.
For instance, here is a :math:`3 \times 3` kernel applied to a
:math:`5 \times 5` input padded with a :math:`1 \times 1` border of zeros using
:math:`2 \times 2` strides:
.. figure:: conv_arithmetic_figures/numerical_padding_strides.*
:figclass: align-center
The analysis of the relationship between convolutional layer properties is eased
by the fact that they don't interact across axes, i.e., the choice of kernel
size, stride and zero padding along axis :math:`j` only affects the output size
of axis :math:`j`. Because of that, this section will focus on the following
simplified setting:
* 2-D discrete convolutions (:math:`N = 2`),
* square inputs (:math:`i_1 = i_2 = i`),
* square kernel size (:math:`k_1 = k_2 = k`),
* same strides along both axes (:math:`s_1 = s_2 = s`),
* same zero padding along both axes (:math:`p_1 = p_2 = p`).
This facilitates the analysis and the visualization, but keep in mind that the
results outlined here also generalize to the N-D and non-square cases.
Aesara terminology
==================
Aesara has its own terminology, which differs slightly from the convolution
arithmetic guide's. Here's a simple conversion table for the two:
+------------------+----------------------------------------------------------------------------------------------------+
| Aesara | Convolution arithmetic |
+==================+====================================================================================================+
| ``filters`` | 4D collection of kernels |
+------------------+----------------------------------------------------------------------------------------------------+
| ``input_shape`` | (batch size (``b``), input channels (``c``), input rows (``i1``), input columns (``i2``)) |
+------------------+----------------------------------------------------------------------------------------------------+
| ``filter_shape`` | (output channels (``c1``), input channels (``c2``), filter rows (``k1``), filter columns (``k2``)) |
+------------------+----------------------------------------------------------------------------------------------------+
| ``border_mode`` | ``'valid'``, ``'half'``, ``'full'`` or (:math:`p_1`, :math:`p_2`) |
+------------------+----------------------------------------------------------------------------------------------------+
| ``subsample`` | (``s1``, ``s2``) |
+------------------+----------------------------------------------------------------------------------------------------+
For instance, the convolution shown above would correspond to the following
Aesara call:
.. code-block:: python
output = aesara.tensor.nnet.conv2d(
input, filters, input_shape=(1, 1, 5, 5), filter_shape=(1, 1, 3, 3),
border_mode=(1, 1), subsample=(2, 2))
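As a sanity check on this call, the output spatial dimensions can be computed by hand with the standard convolution output-size formula :math:`o = \lfloor (i + 2p - k) / s \rfloor + 1` (a plain-Python sketch, not Aesara code):

```python
def conv_out_size(i, k, p, s):
    # Standard convolution output-size formula:
    # o = floor((i + 2p - k) / s) + 1
    return (i + 2 * p - k) // s + 1

# The call above: 5x5 input, 3x3 kernel, padding (1, 1), strides (2, 2).
assert conv_out_size(5, 3, 1, 2) == 3
# So `output` has shape (1, 1, 3, 3), matching the figure.
```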
Convolution arithmetic
======================
No zero padding, unit strides
-----------------------------
The simplest case to analyze is when the kernel just slides across every
position of the input (i.e., :math:`s = 1` and :math:`p = 0`).
Here is an example for :math:`i = 4` and :math:`k = 3`:
.. figure:: conv_arithmetic_figures/no_padding_no_strides.*
:figclass: align-center
One way of defining the output size in this case is by the number of possible
placements of the kernel on the input. Let's consider the width axis: the kernel
starts on the leftmost part of the input feature map and slides by steps of one
until it touches the right side of the input. The size of the output will be
equal to the number of steps made, plus one, accounting for the initial position
of the kernel. The same logic applies for the height axis.
More formally, the following relationship can be inferred:
.. admonition:: Relationship 1
For any :math:`i` and :math:`k`, and for :math:`s = 1` and :math:`p = 0`,
.. math::
o = (i - k) + 1.
This translates to the following Aesara code:
.. code-block:: python
output = aesara.tensor.nnet.conv2d(
input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
border_mode=(0, 0), subsample=(1, 1))
# output.shape[2] == (i1 - k1) + 1
# output.shape[3] == (i2 - k2) + 1
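The placement-counting argument can be checked with a naive plain-Python 1-D sliding window (correlation-style, no kernel flip; illustration only):

```python
def conv1d_valid(signal, kernel):
    # Slide the kernel over every full placement on the input (s = 1, p = 0).
    k = len(kernel)
    return [sum(s * w for s, w in zip(signal[j:j + k], kernel))
            for j in range(len(signal) - k + 1)]

i, k = 4, 3
out = conv1d_valid([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])
# Relationship 1: o = (i - k) + 1
assert len(out) == (i - k) + 1 == 2
```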
Zero padding, unit strides
--------------------------
To factor in zero padding (i.e., only restricting to :math:`s = 1`), let's
consider its effect on the effective input size: padding with :math:`p` zeros
changes the effective input size from :math:`i` to :math:`i + 2p`. In the
general case, Relationship 1 can then be used to infer the following
relationship:
.. admonition:: Relationship 2
For any :math:`i`, :math:`k` and :math:`p`, and for :math:`s = 1`,
.. math::
o = (i - k) + 2p + 1.
This translates to the following Aesara code:
.. code-block:: python
output = aesara.tensor.nnet.conv2d(
input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
border_mode=(p1, p2), subsample=(1, 1))
# output.shape[2] == (i1 - k1) + 2 * p1 + 1
# output.shape[3] == (i2 - k2) + 2 * p2 + 1
Here is an example for :math:`i = 5`, :math:`k = 4` and :math:`p = 2`:
.. figure:: conv_arithmetic_figures/arbitrary_padding_no_strides.*
:figclass: align-center
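Relationship 2 can likewise be checked with a naive plain-Python sketch that zero-pads the input before sliding the kernel (illustration only):

```python
def conv1d_padded(signal, kernel, p):
    # Zero-pad both ends, then slide over every full placement (s = 1).
    padded = [0.0] * p + list(signal) + [0.0] * p
    k = len(kernel)
    return [sum(s * w for s, w in zip(padded[j:j + k], kernel))
            for j in range(len(padded) - k + 1)]

i, k, p = 5, 4, 2
out = conv1d_padded([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0], p)
# Relationship 2: o = (i - k) + 2p + 1
assert len(out) == (i - k) + 2 * p + 1 == 6
```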
Special cases
-------------
In practice, two specific instances of zero padding are used quite extensively
because of their respective properties. Let's discuss them in more detail.
Half (same) padding
^^^^^^^^^^^^^^^^^^^
Having the output size be the same as the input size (i.e., :math:`o = i`) can
be a desirable property:
.. admonition:: Relationship 3
For any :math:`i` and for :math:`k` odd (:math:`k = 2n + 1, \quad n \in
\mathbb{N}`), :math:`s = 1` and :math:`p = \lfloor k / 2 \rfloor = n`,
.. math::
\begin{split}
o &= i + 2 \lfloor k / 2 \rfloor - (k - 1) \\
&= i + 2n - 2n \\
&= i.
\end{split}
This translates to the following Aesara code:
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
        border_mode='half', subsample=(1, 1))
    # output.shape[2] == i1
    # output.shape[3] == i2
This is sometimes referred to as *half* (or *same*) padding. Here is an example
for :math:`i = 5`, :math:`k = 3` and (therefore) :math:`p = 1`:
.. figure:: conv_arithmetic_figures/same_padding_no_strides.*
:figclass: align-center
Note that half padding also works for even-valued :math:`k` and for :math:`s >
1`, but in that case the property that the output size is the same as the input
size is lost. Some frameworks also implement the ``same`` convolution slightly
differently (e.g., in Keras :math:`o = (i + s - 1) // s`).
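The Keras-style ``same`` output size mentioned above can be sketched in plain Python (``same_out_size`` is an illustrative helper, not a Keras API):

```python
def same_out_size(i, s):
    # Keras-style "same" padding: o = (i + s - 1) // s, i.e. ceil(i / s)
    return (i + s - 1) // s

# With s = 1 the input size is preserved; with s > 1 it is divided
# (rounded up), regardless of the kernel size.
print(same_out_size(5, 1))  # → 5
print(same_out_size(5, 2))  # → 3
```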
Full padding
^^^^^^^^^^^^
While convolving a kernel generally *decreases* the output size with respect to
the input size, sometimes the opposite is required. This can be achieved with
proper zero padding:
.. admonition:: Relationship 4
For any :math:`i` and :math:`k`, and for :math:`p = k - 1` and
:math:`s = 1`,
.. math::
\begin{split}
o &= i + 2(k - 1) - (k - 1) \\
&= i + (k - 1).
\end{split}
This translates to the following Aesara code:
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
        border_mode='full', subsample=(1, 1))
    # output.shape[2] == i1 + (k1 - 1)
    # output.shape[3] == i2 + (k2 - 1)
This is sometimes referred to as *full* padding, because in this setting every
possible partial or complete superimposition of the kernel on the input feature
map is taken into account. Here is an example for :math:`i = 5`, :math:`k = 3`
and (therefore) :math:`p = 2`:
.. figure:: conv_arithmetic_figures/full_padding_no_strides.*
:figclass: align-center
No zero padding, non-unit strides
---------------------------------
All relationships derived so far only apply for unit-strided convolutions.
Incorporating non-unit strides requires another inference leap. To facilitate
the analysis, let's momentarily ignore zero padding (i.e., :math:`s > 1` and
:math:`p = 0`). Here is an example for :math:`i = 5`, :math:`k = 3` and :math:`s
= 2`:
.. figure:: conv_arithmetic_figures/no_padding_strides.*
:figclass: align-center
Once again, the output size can be defined in terms of the number of possible
placements of the kernel on the input. Let's consider the width axis: the kernel
starts as usual on the leftmost part of the input, but this time it slides by
steps of size :math:`s` until it touches the right side of the input. The size
of the output is again equal to the number of steps made, plus one, accounting
for the initial position of the kernel. The same logic applies for the height
axis.
From this, the following relationship can be inferred:
.. admonition:: Relationship 5
For any :math:`i`, :math:`k` and :math:`s`, and for :math:`p = 0`,
.. math::
o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1.
This translates to the following Aesara code:
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
        border_mode=(0, 0), subsample=(s1, s2))
    # output.shape[2] == (i1 - k1) // s1 + 1
    # output.shape[3] == (i2 - k2) // s2 + 1
The floor function accounts for the fact that sometimes the last
possible step does *not* coincide with the kernel reaching the end of the
input, i.e., some input units are left out.
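The effect of the floor can be checked numerically: with :math:`k = 3` and :math:`s = 2`, inputs of size 5 and 6 yield the same output size (``strided_out_size`` is a hypothetical helper for illustration):

```python
def strided_out_size(i, k, s):
    # Relationship 5: o = floor((i - k) / s) + 1, with p = 0
    return (i - k) // s + 1

# i = 5 and i = 6 both give an output of size 2: the extra input unit
# of the size-6 input is left out by the last kernel placement.
print(strided_out_size(5, 3, 2))  # → 2
print(strided_out_size(6, 3, 2))  # → 2
```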
Zero padding, non-unit strides
------------------------------
The most general case (convolving over a zero padded input using non-unit
strides) can be derived by applying Relationship 5 on an effective input of size
:math:`i + 2p`, in analogy to what was done for Relationship 2:
.. admonition:: Relationship 6
For any :math:`i`, :math:`k`, :math:`p` and :math:`s`,
.. math::
o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1.
This translates to the following Aesara code:
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
        border_mode=(p1, p2), subsample=(s1, s2))
    # output.shape[2] == (i1 - k1 + 2 * p1) // s1 + 1
    # output.shape[3] == (i2 - k2 + 2 * p2) // s2 + 1
As before, the floor function means that in some cases a convolution will
produce the same output size for multiple input sizes. More specifically, if
:math:`i + 2p - k` is a multiple of :math:`s`, then any input size :math:`j = i
+ a, \quad a \in \{0,\ldots,s - 1\}` will produce the same output size. Note
that this ambiguity applies only for :math:`s > 1`.
Here is an example for :math:`i = 5`, :math:`k = 3`, :math:`s = 2` and :math:`p
= 1`:
.. figure:: conv_arithmetic_figures/padding_strides.*
:figclass: align-center
Here is an example for :math:`i = 6`, :math:`k = 3`, :math:`s = 2` and :math:`p
= 1`:
.. figure:: conv_arithmetic_figures/padding_strides_odd.*
:figclass: align-center
Interestingly, despite having different input sizes these convolutions share the
same output size. While this doesn't affect the analysis for *convolutions*,
this will complicate the analysis in the case of *transposed convolutions*.
Transposed convolution arithmetic
=================================
The need for transposed convolutions generally arises from the desire to use a
transformation going in the opposite direction of a normal convolution, i.e.,
from something that has the shape of the output of some convolution to
something that has the shape of its input while maintaining a connectivity
pattern that is compatible with said convolution. For instance, one might use
such a transformation as the decoding layer of a convolutional autoencoder or to
project feature maps to a higher-dimensional space.
Once again, the convolutional case is considerably more complex than the
fully-connected case, which only requires transposing the weight matrix.
However, since every convolution boils down to an efficient implementation of
a matrix operation, the insights gained from the fully-connected case are
useful in solving the convolutional case.
As with convolution arithmetic, the discussion of transposed convolution
arithmetic is simplified by the fact that transposed convolution properties
don't interact across axes.
This section will focus on the following setting:
* 2-D transposed convolutions (:math:`N = 2`),
* square inputs (:math:`i_1 = i_2 = i`),
* square kernel size (:math:`k_1 = k_2 = k`),
* same strides along both axes (:math:`s_1 = s_2 = s`),
* same zero padding along both axes (:math:`p_1 = p_2 = p`).
Once again, the results outlined generalize to the N-D and non-square cases.
Convolution as a matrix operation
---------------------------------
Take for example the convolution presented in the *No zero padding, unit
strides* subsection:
.. figure:: conv_arithmetic_figures/no_padding_no_strides.*
:figclass: align-center
If the input and output were to be unrolled into vectors from left to right, top
to bottom, the convolution could be represented as a sparse matrix
:math:`\mathbf{C}` where the non-zero elements are the elements :math:`w_{i,j}`
of the kernel (with :math:`i` and :math:`j` being the row and column of the
kernel respectively):
.. math::
\begin{pmatrix}
w_{0,0} & 0 & 0 & 0 \\
w_{0,1} & w_{0,0} & 0 & 0 \\
w_{0,2} & w_{0,1} & 0 & 0 \\
0 & w_{0,2} & 0 & 0 \\
w_{1,0} & 0 & w_{0,0} & 0 \\
w_{1,1} & w_{1,0} & w_{0,1} & w_{0,0} \\
w_{1,2} & w_{1,1} & w_{0,2} & w_{0,1} \\
0 & w_{1,2} & 0 & w_{0,2} \\
w_{2,0} & 0 & w_{1,0} & 0 \\
w_{2,1} & w_{2,0} & w_{1,1} & w_{1,0} \\
w_{2,2} & w_{2,1} & w_{1,2} & w_{1,1} \\
0 & w_{2,2} & 0 & w_{1,2} \\
0 & 0 & w_{2,0} & 0 \\
0 & 0 & w_{2,1} & w_{2,0} \\
0 & 0 & w_{2,2} & w_{2,1} \\
0 & 0 & 0 & w_{2,2} \\
\end{pmatrix}^T
(*Note: the matrix has been transposed for formatting purposes.*) This linear
operation takes the input matrix flattened as a 16-dimensional vector and
produces a 4-dimensional vector that is later reshaped as the :math:`2 \times 2`
output matrix.
Using this representation, the backward pass is easily obtained by transposing
:math:`\mathbf{C}`; in other words, the error is backpropagated by multiplying
the loss with :math:`\mathbf{C}^T`. This operation takes a 4-dimensional vector
as input and produces a 16-dimensional vector as output, and its connectivity
pattern is compatible with :math:`\mathbf{C}` by construction.
Notably, the kernel :math:`\mathbf{w}` defines both the matrices
:math:`\mathbf{C}` and :math:`\mathbf{C}^T` used for the forward and backward
passes.
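This unrolling can be verified with a small NumPy sketch (shapes as in the example above: a :math:`3 \times 3` kernel on a :math:`4 \times 4` input; the loop-based construction of :math:`\mathbf{C}` is purely illustrative, not how convolutions are implemented):

```python
import numpy as np

# Build the 4x16 matrix C for a 3x3 kernel sliding over a 4x4 input
# (the unrolled convolution described above).
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))
x = rng.normal(size=(4, 4))

C = np.zeros((4, 16))
for out_r in range(2):
    for out_c in range(2):
        row = out_r * 2 + out_c
        for i in range(3):
            for j in range(3):
                C[row, (out_r + i) * 4 + (out_c + j)] = w[i, j]

# Direct valid convolution (cross-correlation) for comparison
direct = np.array([[(x[r:r + 3, c:c + 3] * w).sum() for c in range(2)]
                   for r in range(2)])
assert np.allclose(C @ x.ravel(), direct.ravel())

# The transpose maps the 4-dimensional output space back to the
# 16-dimensional input space, as used in the backward pass.
back = C.T @ direct.ravel()
assert back.shape == (16,)
```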
Transposed convolution
----------------------
Let's now consider what would be required to go the other way around, i.e., map
from a 4-dimensional space to a 16-dimensional space, while keeping the
connectivity pattern of the convolution depicted above. This operation is known
as a *transposed convolution*.
Transposed convolutions -- also called *fractionally strided convolutions* --
work by swapping the forward and backward passes of a convolution. One way to
put it is to note that the kernel defines a convolution, but whether it's a
direct convolution or a transposed convolution is determined by how the forward
and backward passes are computed.
For instance, the kernel :math:`\mathbf{w}` defines a convolution whose forward
and backward passes are computed by multiplying with :math:`\mathbf{C}` and
:math:`\mathbf{C}^T` respectively, but it *also* defines a transposed
convolution whose forward and backward passes are computed by multiplying with
:math:`\mathbf{C}^T` and :math:`(\mathbf{C}^T)^T = \mathbf{C}` respectively.
.. note::
The transposed convolution operation can be thought of as the gradient of
*some* convolution with respect to its input, which is usually how
transposed convolutions are implemented in practice.
Finally note that it is always possible to implement a transposed
convolution with a direct convolution. The disadvantage is that it usually
involves adding many columns and rows of zeros to the input, resulting in a
much less efficient implementation.
Building on what has been introduced so far, this section will proceed somewhat
backwards with respect to the convolution arithmetic section, deriving the
properties of each transposed convolution by referring to the direct
convolution with which it shares the kernel, and defining the equivalent direct
convolution.
No zero padding, unit strides, transposed
-----------------------------------------
The simplest way to think about a transposed convolution is by computing the
output shape of the direct convolution for a given input shape first, and then
inverting the input and output shapes for the transposed convolution.
Let's consider the convolution of a :math:`3 \times 3` kernel on a :math:`4
\times 4` input with unitary stride and no padding (i.e., :math:`i = 4`,
:math:`k = 3`, :math:`s = 1` and :math:`p = 0`). As depicted in the convolution
below, this produces a :math:`2 \times 2` output:
.. figure:: conv_arithmetic_figures/no_padding_no_strides.*
:figclass: align-center
The transpose of this convolution will then have an output of shape :math:`4
\times 4` when applied on a :math:`2 \times 2` input.
Another way to obtain the result of a transposed convolution is to apply an
equivalent -- but much less efficient -- direct convolution. The example
described so far could be tackled by convolving a :math:`3 \times 3` kernel over
a :math:`2 \times 2` input padded with a :math:`2 \times 2` border of zeros
using unit strides (i.e., :math:`i' = 2`, :math:`k' = k`, :math:`s' = 1` and
:math:`p' = 2`), as shown here:
.. figure:: conv_arithmetic_figures/no_padding_no_strides_transposed.*
:figclass: align-center
Notably, the kernel's and stride's sizes remain the same, but the input of the
equivalent (direct) convolution is now zero padded.
.. note::
Although equivalent to applying the transposed matrix, this visualization
adds a lot of zero multiplications in the form of zero padding. This is done
here for illustration purposes, but it is inefficient, and software
implementations will normally not perform the useless zero multiplications.
One way to understand the logic behind zero padding is to consider the
connectivity pattern of the transposed convolution and use it to guide the
design of the equivalent convolution. For example, the top left pixel of the
input of the direct convolution only contributes to the top left pixel of the
output, the top right pixel is only connected to the top right output pixel,
and so on.
To maintain the same connectivity pattern in the equivalent convolution it is
necessary to zero pad the input in such a way that the first (top-left)
application of the kernel only touches the top-left pixel, i.e., the padding
has to be equal to the size of the kernel minus one.
Proceeding in the same fashion it is possible to determine similar observations
for the other elements of the image, giving rise to the following relationship:
.. admonition:: Relationship 7
A convolution described by :math:`s = 1`, :math:`p = 0` and :math:`k` has an
associated transposed convolution described by :math:`k' = k`, :math:`s' =
s` and :math:`p' = k - 1` and its output size is
.. math::
o' = i' + (k - 1).
In other words,
.. code-block:: python

    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, filter_shape=(c1, c2, k1, k2), border_mode=(0, 0),
        subsample=(1, 1))
    # input.shape[2] == output.shape[2] + (k1 - 1)
    # input.shape[3] == output.shape[3] + (k2 - 1)
Interestingly, this corresponds to a fully padded convolution with unit strides.
Zero padding, unit strides, transposed
--------------------------------------
Knowing that the transpose of a non-padded convolution is equivalent to
convolving a zero padded input, it would be reasonable to suppose that the
transpose of a zero padded convolution is equivalent to convolving an input
padded with *less* zeros.
It is indeed the case, as shown in here for :math:`i = 5`, :math:`k = 4` and
:math:`p = 2`:
.. figure:: conv_arithmetic_figures/arbitrary_padding_no_strides_transposed.*
:figclass: align-center
Formally, the following relationship applies for zero padded convolutions:
.. _Relationship8:
.. admonition:: Relationship 8
A convolution described by :math:`s = 1`, :math:`k` and :math:`p` has an
associated transposed convolution described by :math:`k' = k`, :math:`s' =
s` and :math:`p' = k - p - 1` and its output size is
.. math::
o' = i' + (k - 1) - 2p.
In other words,
.. code-block:: python

    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, filter_shape=(c1, c2, k1, k2), border_mode=(p1, p2),
        subsample=(1, 1))
    # input.shape[2] == output.shape[2] + (k1 - 1) - 2 * p1
    # input.shape[3] == output.shape[3] + (k2 - 1) - 2 * p2
Special cases
-------------
Half (same) padding, transposed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By applying the same inductive reasoning as before, it is reasonable to expect
that the equivalent convolution of the transpose of a half padded convolution
is itself a half padded convolution, given that the output size of a half
padded convolution is the same as its input size. Thus the following relation
applies:
.. admonition:: Relationship 9
A convolution described by :math:`k = 2n + 1, \quad n \in \mathbb{N}`,
:math:`s = 1` and :math:`p = \lfloor k / 2 \rfloor = n` has an associated
transposed convolution described by :math:`k' = k`, :math:`s' = s` and
:math:`p' = p` and its output size is
.. math::
\begin{split}
o' &= i' + (k - 1) - 2p \\
&= i' + 2n - 2n \\
&= i'.
\end{split}
In other words,
.. code-block:: python

    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, filter_shape=(c1, c2, k1, k2), border_mode='half',
        subsample=(1, 1))
    # input.shape[2] == output.shape[2]
    # input.shape[3] == output.shape[3]
Here is an example for :math:`i = 5`, :math:`k = 3` and (therefore) :math:`p =
1`:
.. figure:: conv_arithmetic_figures/same_padding_no_strides_transposed.*
:figclass: align-center
Full padding, transposed
^^^^^^^^^^^^^^^^^^^^^^^^
Knowing that the equivalent convolution of the transpose of a non-padded
convolution involves full padding, it is unsurprising that the equivalent of
the transpose of a fully padded convolution is a non-padded convolution:
.. admonition:: Relationship 10
A convolution described by :math:`s = 1`, :math:`k` and :math:`p = k - 1`
has an associated transposed convolution described by :math:`k' = k`,
:math:`s' = s` and :math:`p' = 0` and its output size is
.. math::
\begin{split}
o' &= i' + (k - 1) - 2p \\
&= i' - (k - 1)
\end{split}
In other words,
.. code-block:: python

    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, filter_shape=(c1, c2, k1, k2), border_mode='full',
        subsample=(1, 1))
    # input.shape[2] == output.shape[2] - (k1 - 1)
    # input.shape[3] == output.shape[3] - (k2 - 1)
Here is an example for :math:`i = 5`, :math:`k = 3` and (therefore) :math:`p =
2`:
.. figure:: conv_arithmetic_figures/full_padding_no_strides_transposed.*
:figclass: align-center
No zero padding, non-unit strides, transposed
---------------------------------------------
Using the same kind of inductive logic as for zero padded convolutions, one
might expect that the transpose of a convolution with :math:`s > 1` involves an
equivalent convolution with :math:`s < 1`. As will be explained, this is a valid
intuition, which is why transposed convolutions are sometimes called
*fractionally strided convolutions*.
Here is an example for :math:`i = 5`, :math:`k = 3` and :math:`s = 2`:
.. figure:: conv_arithmetic_figures/no_padding_strides_transposed.*
:figclass: align-center
This should help understand what fractional strides involve: zeros
are inserted *between* input units, which makes the kernel move around at
a slower pace than with unit strides.
.. note::
Doing so is inefficient and real-world implementations avoid useless
multiplications by zero, but conceptually it is how the transpose of a
strided convolution can be thought of.
For the moment, it will be assumed that the convolution is non-padded (:math:`p
= 0`) and that its input size :math:`i` is such that :math:`i - k` is a multiple
of :math:`s`. In that case, the following relationship holds:
.. _Relationship11:
.. admonition:: Relationship 11
A convolution described by :math:`p = 0`, :math:`k` and :math:`s` and whose
input size is such that :math:`i - k` is a multiple of :math:`s`, has an
associated transposed convolution described by :math:`\tilde{i}'`, :math:`k'
= k`, :math:`s' = 1` and :math:`p' = k - 1`, where :math:`\tilde{i}'` is the
size of the stretched input obtained by adding :math:`s - 1` zeros between
each input unit, and its output size is
.. math::
o' = s (i' - 1) + k.
In other words,
.. code-block:: python

    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, filter_shape=(c1, c2, k1, k2), border_mode=(0, 0),
        subsample=(s1, s2))
    # input.shape[2] == s1 * (output.shape[2] - 1) + k1
    # input.shape[3] == s2 * (output.shape[3] - 1) + k2
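The input stretching described in Relationship 11 can be sketched with NumPy (an illustration of the concept, not an efficient implementation — real implementations avoid materializing the zeros):

```python
import numpy as np

# Insert s - 1 zeros between the units of a 2x2 input (s = 2), as in
# the fractionally strided picture above.
i, s = 2, 2
x = np.arange(1.0, i * i + 1).reshape(i, i)
stretched_size = s * (i - 1) + 1
stretched = np.zeros((stretched_size, stretched_size))
stretched[::s, ::s] = x
# Convolving `stretched` with k' = k, s' = 1 and p' = k - 1 then yields
# an output of size s * (i - 1) + k.
```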
Zero padding, non-unit strides, transposed
------------------------------------------
When the convolution's input size :math:`i` is such that :math:`i + 2p - k` is a
multiple of :math:`s`, the analysis can be extended to the zero padded case by
combining :ref:`Relationship 8 <Relationship8>` and
:ref:`Relationship 11 <Relationship11>`:
.. admonition:: Relationship 12
A convolution described by :math:`k`, :math:`s` and :math:`p` and whose
input size :math:`i` is such that :math:`i + 2p - k` is a multiple of
:math:`s` has an associated transposed convolution described by
:math:`\tilde{i}'`, :math:`k' = k`, :math:`s' = 1` and :math:`p' = k - p -
1`, where :math:`\tilde{i}'` is the size of the stretched input obtained by
adding :math:`s - 1` zeros between each input unit, and its output size is
.. math::
o' = s (i' - 1) + k - 2p.
In other words,
.. code-block:: python

    o_prime1 = s1 * (output.shape[2] - 1) + k1 - 2 * p1
    o_prime2 = s2 * (output.shape[3] - 1) + k2 - 2 * p2
    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, input_shape=(b, c1, o_prime1, o_prime2),
        filter_shape=(c1, c2, k1, k2), border_mode=(p1, p2),
        subsample=(s1, s2))
Here is an example for :math:`i = 5`, :math:`k = 3`, :math:`s = 2` and :math:`p
= 1`:
.. figure:: conv_arithmetic_figures/padding_strides_transposed.*
:figclass: align-center
The constraint on the size of the input :math:`i` can be relaxed by introducing
another parameter :math:`a \in \{0, \ldots, s - 1\}` that makes it possible to distinguish
between the :math:`s` different cases that all lead to the same :math:`i'`:
.. admonition:: Relationship 13
A convolution described by :math:`k`, :math:`s` and :math:`p` has an
associated transposed convolution described by :math:`a`,
:math:`\tilde{i}'`, :math:`k' = k`, :math:`s' = 1` and :math:`p' = k - p -
1`, where :math:`\tilde{i}'` is the size of the stretched input obtained by
adding :math:`s - 1` zeros between each input unit, and :math:`a = (i + 2p -
k) \mod s` represents the number of zeros added to the top and right edges
of the input, and its output size is
.. math::
o' = s (i' - 1) + a + k - 2p.
In other words,
.. code-block:: python
o_prime1 = s1 * (output.shape[2] - 1) + a1 + k1 - 2 * p1
o_prime2 = s2 * (output.shape[3] - 1) + a2 + k2 - 2 * p2
input = aesara.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
output, filters, input_shape=(b, c1, o_prime1, o_prime2),
filter_shape=(c1, c2, k1, k2), border_mode=(p1, p2),
subsample=(s1, s2))
Here is an example for :math:`i = 6`, :math:`k = 3`, :math:`s = 2` and :math:`p
= 1`:
.. figure:: conv_arithmetic_figures/padding_strides_odd_transposed.*
:figclass: align-center
Miscellaneous convolutions
==========================
Dilated convolutions
--------------------
Those familiar with the deep learning literature may have noticed the term
"dilated convolutions" (or "atrous convolutions", from the French expression
*convolutions à trous*) appear in recent papers. Here we attempt to provide an
intuitive understanding of dilated convolutions. For a more in-depth description
and to understand in what contexts they are applied, see `Chen et al. (2014)
<https://arxiv.org/abs/1412.7062>`_ [#]_; `Yu and Koltun (2015)
<https://arxiv.org/abs/1511.07122>`_ [#]_.
Dilated convolutions "inflate" the kernel by inserting spaces between the kernel
elements. The dilation "rate" is controlled by an additional hyperparameter
:math:`d`. Implementations may vary, but there are usually :math:`d - 1` spaces
inserted between kernel elements such that :math:`d = 1` corresponds to a
regular convolution.
To understand the relationship tying the dilation rate :math:`d` and the output
size :math:`o`, it is useful to think of the impact of :math:`d` on the
*effective kernel size*. A kernel of size :math:`k` dilated by a factor
:math:`d` has an effective size
.. math::
\hat{k} = k + (k - 1)(d - 1).
This can be combined with Relationship 6 to form the following relationship for
dilated convolutions:
.. admonition:: Relationship 14
For any :math:`i`, :math:`k`, :math:`p` and :math:`s`, and for a dilation
rate :math:`d`,
.. math::
o = \left\lfloor \frac{i + 2p - k - (k - 1)(d - 1)}{s} \right\rfloor + 1.
This translates to the following Aesara code using the ``filter_dilation``
parameter:
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
        border_mode=(p1, p2), subsample=(s1, s2), filter_dilation=(d1, d2))
    # output.shape[2] == (i1 + 2 * p1 - k1 - (k1 - 1) * (d1 - 1)) // s1 + 1
    # output.shape[3] == (i2 + 2 * p2 - k2 - (k2 - 1) * (d2 - 1)) // s2 + 1
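These relationships can be checked in plain Python (hypothetical helper functions): for instance, :math:`i = 7`, :math:`k = 3`, :math:`d = 2`, :math:`s = 1` and :math:`p = 0` give an effective kernel size of 5 and an output size of 3.

```python
def effective_kernel_size(k, d):
    # A kernel of size k dilated by d spans k + (k - 1)(d - 1) units
    return k + (k - 1) * (d - 1)

def dilated_out_size(i, k, s, p, d):
    # Relationship 14
    return (i + 2 * p - effective_kernel_size(k, d)) // s + 1

print(effective_kernel_size(3, 2))      # → 5
print(dilated_out_size(7, 3, 1, 0, 2))  # → 3
```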
Here is an example for :math:`i = 7`, :math:`k = 3`, :math:`d = 2`, :math:`s =
1` and :math:`p = 0`:
.. figure:: conv_arithmetic_figures/dilation.*
:figclass: align-center
.. [#] Dumoulin, Vincent, and Visin, Francesco. "A guide to convolution
   arithmetic for deep learning". arXiv preprint arXiv:1603.07285 (2016).
.. [#] Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin
   and Yuille, Alan L. "Semantic image segmentation with deep convolutional
   nets and fully connected CRFs". arXiv preprint arXiv:1412.7062 (2014).
.. [#] Yu, Fisher and Koltun, Vladlen. "Multi-scale context aggregation by
   dilated convolutions". arXiv preprint arXiv:1511.07122 (2015).
Grouped Convolutions
--------------------
In grouped convolutions with :math:`n` groups, the input and kernel are split
along their channel axes to form :math:`n` distinct groups. Each group performs
its convolution independently of the others, yielding :math:`n` separate
outputs, which are then concatenated to give the final output. A few examples
of works using grouped convolutions are `Krizhevsky et al. (2012)
<https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks>`_ [#]_;
`Xie et al. (2016) <https://arxiv.org/abs/1611.05431>`_ [#]_.

A special case of grouped convolutions is when :math:`n` equals the number of
input channels. This is called a depth-wise (or channel-wise) convolution.
Depth-wise convolutions also form part of separable convolutions.

Grouped convolutions can be used as follows:
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2 // n, k1, k2),
        border_mode=(p1, p2), subsample=(s1, s2), filter_dilation=(d1, d2), num_groups=n)
    # output.shape[0] == b
    # output.shape[1] == c1
    # output.shape[2] == (i1 + 2 * p1 - k1 - (k1 - 1) * (d1 - 1)) // s1 + 1
    # output.shape[3] == (i2 + 2 * p2 - k2 - (k2 - 1) * (d2 - 1)) // s2 + 1
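The channel bookkeeping can be sketched in plain Python (``grouped_shapes`` is an illustrative helper, not an Aesara API): with :math:`n` groups, each kernel sees only :math:`c2 / n` input channels.

```python
def grouped_shapes(c_in, c_out, n, k):
    # With n groups, each group sees c_in // n input channels and
    # produces c_out // n output channels, so each kernel has
    # c_in // n input channels.
    assert c_in % n == 0 and c_out % n == 0
    filter_shape = (c_out, c_in // n, k, k)
    per_group_out = c_out // n
    return filter_shape, per_group_out

print(grouped_shapes(4, 8, 2, 3))  # → ((8, 2, 3, 3), 4)
```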
.. [#] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. "ImageNet
Classification with Deep Convolutional Neural Networks".
Advances in Neural Information Processing Systems 25 (NIPS 2012)
.. [#] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He.
"Aggregated Residual Transformations for Deep Neural Networks".
arxiv preprint arXiv:1611.05431 (2016).
Separable Convolutions
----------------------
Separable convolutions consist of two consecutive convolution operations. The
first is a depth-wise convolution, which convolves each channel of the input
separately. Its output is then given as input to a point-wise convolution, a
special case of a general convolution with :math:`1 \times 1` filters, which
mixes the channels to give the final output.

As shown in this diagram, modified from `Vanhoucke(2014)`_ [#]_, the
depth-wise convolution is performed with :math:`c2` single-channel depth-wise
filters, where each of the :math:`c1` input channels is convolved separately
with its own kernels, contributing :math:`c2 / c1` channels to the
intermediate output. The intermediate output then undergoes a point-wise
convolution with :math:`c3` :math:`1 \times 1` filters, which mixes its
channels to give the final output.
.. image:: conv_arithmetic_figures/sep2D.jpg
:align: center
Separable convolutions are used as follows:

.. code-block:: python

    output = aesara.tensor.conv.separable_conv2d(
        input, depthwise_filters, pointwise_filters, num_channels=c1,
        input_shape=(b, c1, i1, i2), depthwise_filter_shape=(c2, 1, k1, k2),
        pointwise_filter_shape=(c3, c2, 1, 1), border_mode=(p1, p2),
        subsample=(s1, s2), filter_dilation=(d1, d2))
    # output.shape[0] == b
    # output.shape[1] == c3
    # output.shape[2] == (i1 + 2 * p1 - k1 - (k1 - 1) * (d1 - 1)) // s1 + 1
    # output.shape[3] == (i2 + 2 * p2 - k2 - (k2 - 1) * (d2 - 1)) // s2 + 1
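One reason separable convolutions are attractive is the reduction in parameter count; a rough comparison in plain Python (illustrative helpers, ignoring biases and assuming one depth-wise filter per input channel):

```python
def standard_conv_params(c_in, c_out, k):
    # Each of the c_out filters spans all c_in channels
    return c_out * c_in * k * k

def separable_conv_params(c_in, c_out, k):
    # Depth-wise (one k x k filter per input channel) + 1x1 point-wise
    return c_in * k * k + c_out * c_in

print(standard_conv_params(32, 64, 3))   # → 18432
print(separable_conv_params(32, 64, 3))  # → 2336
```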
.. _Vanhoucke(2014): http://vincent.vanhoucke.com/publications/vanhoucke-iclr14.pdf

.. [#] Vincent Vanhoucke. "Learning Visual Representations at Scale",
   International Conference on Learning Representations (2014).
Quick reference
===============
.. admonition:: Convolution relationship
A convolution specified by
* input size :math:`i`,
* kernel size :math:`k`,
* stride :math:`s`,
* padding size :math:`p`,
has an output size given by
.. math::
o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1.
In Aesara, this translates to
.. code-block:: python

    output = aesara.tensor.conv.conv2d(
        input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
        border_mode=(p1, p2), subsample=(s1, s2))
    # output.shape[2] == (i1 + 2 * p1 - k1) // s1 + 1
    # output.shape[3] == (i2 + 2 * p2 - k2) // s2 + 1
.. admonition:: Transposed convolution relationship
A transposed convolution specified by
* input size :math:`i`,
* kernel size :math:`k`,
* stride :math:`s`,
* padding size :math:`p`,
has an output size given by
.. math::
o = s (i - 1) + a + k - 2p, \quad a \in \{0, \ldots, s - 1\}
where :math:`a` is a user-specified quantity used to distinguish between the
:math:`s` different possible output sizes.
Unless :math:`s = 1`, Aesara requires that :math:`a` is implicitly passed
via an ``input_shape`` argument. For instance, if :math:`i = 3`,
:math:`k = 4`, :math:`s = 2`, :math:`p = 0` and :math:`a = 1`, then
:math:`o = 2 (3 - 1) + 1 + 4 = 9` and the Aesara code would look like
.. code-block:: python

    input = aesara.tensor.conv.abstract_conv.conv2d_grad_wrt_inputs(
        output, filters, input_shape=(9, 9), filter_shape=(c1, c2, 4, 4),
        border_mode='valid', subsample=(2, 2))
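The two quick-reference formulas can be checked against each other in plain Python (hypothetical helpers): choosing :math:`a = (i + 2p - k) \bmod s` makes the transposed convolution recover the original input size.

```python
def conv_out_size(i, k, s, p):
    # Convolution relationship
    return (i + 2 * p - k) // s + 1

def conv_transpose_out_size(o, k, s, p, a=0):
    # Transposed convolution relationship
    return s * (o - 1) + a + k - 2 * p

# Round trip for i = 5, k = 3, s = 2, p = 1
i, k, s, p = 5, 3, 2, 1
o = conv_out_size(i, k, s, p)
a = (i + 2 * p - k) % s
print(o)                                      # → 3
print(conv_transpose_out_size(o, k, s, p, a)) # → 5
```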