Commit e6a3b009 authored by Alexandre de Brébisson, committed by Mathieu Germain

Improve docstring of h_softmax and add basic example (#5095)

Parent bfde042c
@@ -2235,16 +2235,25 @@ def h_softmax(x, batch_size, n_outputs, n_classes, n_outputs_per_class,
              W1, b1, W2, b2, target=None):
""" Two-level hierarchical softmax. """ Two-level hierarchical softmax.
The architecture is composed of two softmax layers: the first predicts the This function implements a two-layer hierarchical softmax. It is commonly
class of the input x while the second predicts the output of the input x in used as an alternative of the softmax when the number of outputs is
the predicted class. important (it is common to use it for millions of outputs). See
More explanations can be found in the original paper [1]_. reference [1]_ for more information about the computational gains.
If target is specified, it will only compute the outputs of the The `n_outputs` outputs are organized in `n_classes` classes, each class
corresponding targets. Otherwise, if target is None, it will compute all containing the same number `n_outputs_per_class` of outputs.
the outputs. For an input `x` (last hidden activation), the first softmax layer predicts
its class and the second softmax layer predicts its output among its class.
The outputs are grouped in the same order as they are initially defined.
If `target` is specified, it will only compute the outputs of the
corresponding targets. Otherwise, if `target` is `None`, it will compute
all the outputs.
The outputs are grouped in classes in the same order as they are initially
defined: if `n_outputs=10` and `n_classes=2`, then the first class is
composed of the outputs labeled `{0,1,2,3,4}` while the second class is
composed of `{5,6,7,8,9}`. If you need to change the classes, you have to
re-label your outputs.
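The consecutive grouping rule described above can be sketched in plain numpy (`output_to_class` is a hypothetical helper for illustration, not part of the Theano API):

```python
import numpy as np

def output_to_class(labels, n_outputs_per_class):
    """Map flat output labels to (class index, index within the class).

    Outputs are grouped consecutively: with n_outputs=10 and n_classes=2,
    outputs {0..4} fall in class 0 and outputs {5..9} fall in class 1.
    """
    labels = np.asarray(labels)
    return labels // n_outputs_per_class, labels % n_outputs_per_class

# With n_outputs=10, n_classes=2, n_outputs_per_class=5:
classes, within = output_to_class([0, 4, 5, 9], 5)
# classes -> [0, 0, 1, 1], within -> [0, 4, 0, 4]
```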
.. versionadded:: 0.7.1
@@ -2267,7 +2276,8 @@ def h_softmax(x, batch_size, n_outputs, n_classes, n_outputs_per_class,
    probabilities of the classes.
b1: tensor of shape (n_classes,)
    the bias vector of the first softmax layer.
W2: tensor of shape (n_classes, number of features of the input x,
    n_outputs_per_class)
    the weight matrix of the second softmax, which maps the input x to
    the probabilities of the outputs.
b2: tensor of shape (n_classes, n_outputs_per_class)
@@ -2281,22 +2291,74 @@ def h_softmax(x, batch_size, n_outputs, n_classes, n_outputs_per_class,
Returns
-------
tensor of shape (`batch_size`, `n_outputs`) or (`batch_size`, 1)
    Output tensor of the two-layer hierarchical softmax for input `x`.
    Depending on the argument `target`, it can have two different shapes.
    If `target` is not specified (`None`), then all the outputs are
    computed and the returned tensor has shape (`batch_size`, `n_outputs`).
    Otherwise, when `target` is specified, only the corresponding outputs
    are computed and the returned tensor thus has shape (`batch_size`, 1).

Notes
-----
The product of `n_outputs_per_class` and `n_classes` has to be greater
than or equal to `n_outputs`. If it is strictly greater, the irrelevant
outputs will be ignored.

`n_outputs_per_class` and `n_classes` have to be the same as the
corresponding dimensions of the tensors `W1`, `b1`, `W2` and `b2`.

The most computationally efficient configuration is when
`n_outputs_per_class` and `n_classes` are equal to the square root of
`n_outputs`.
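The square-root claim above can be checked numerically: with a single target, the per-example cost is one softmax over `n_classes` plus one over `n_outputs_per_class` (the `cost` helper below is a hypothetical counting sketch, not Theano code):

```python
import math

n_outputs = 100

def cost(n_classes):
    # Size of the class softmax plus size of one within-class softmax.
    n_outputs_per_class = math.ceil(n_outputs / n_classes)
    return n_classes + n_outputs_per_class

best = min(range(1, n_outputs + 1), key=cost)
# best == 10 == sqrt(100): the square-root split minimizes the cost.
```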
Examples
--------
The following example builds a simple hierarchical softmax layer.
>>> import numpy as np
>>> import theano
>>> from theano import tensor
>>> from theano.tensor.nnet import h_softmax
>>>
>>> # Parameters
>>> batch_size = 32
>>> n_outputs = 100
>>> dim_x = 10 # dimension of the input
>>> n_classes = int(np.ceil(np.sqrt(n_outputs)))
>>> n_outputs_per_class = n_classes
>>> output_size = n_classes * n_outputs_per_class
>>>
>>> # First level of h_softmax
>>> W1 = theano.shared(np.asarray(
... np.random.normal(0, 0.001, (dim_x, n_classes))))
>>> b1 = theano.shared(np.asarray(np.zeros((n_classes,))))
>>>
>>> # Second level of h_softmax
>>> W2 = np.asarray(np.random.normal(0, 0.001,
... size=(n_classes, dim_x, n_outputs_per_class)))
>>> W2 = theano.shared(W2)
>>> b2 = theano.shared(
... np.asarray(np.zeros((n_classes, n_outputs_per_class))))
>>>
>>> # We can now build the graph to compute a loss function, typically the
>>> # negative log-likelihood:
>>>
>>> x = tensor.matrix('x')
>>> target = tensor.imatrix('target')
>>>
>>> # This only computes the output corresponding to the target.
>>> # The complexity is O(n_classes + n_outputs_per_class).
>>> y_hat_tg = h_softmax(x, batch_size, output_size, n_classes,
... n_outputs_per_class, W1, b1, W2, b2, target)
>>>
>>> negll = -tensor.mean(tensor.log(y_hat_tg))
>>>
>>> # We may need to compute all the outputs (at test time usually):
>>>
>>> # This computes all the outputs.
>>> # The complexity is O(n_classes * n_outputs_per_class).
>>> output = h_softmax(x, batch_size, output_size, n_classes,
... n_outputs_per_class, W1, b1, W2, b2)
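For reference, the factorization that `h_softmax` computes, p(output) = p(class) * p(output | class), can be written out in plain numpy. This is an illustrative dense sketch under the docstring's shape conventions (`h_softmax_np` is hypothetical and does not reproduce the computational savings of the Theano op, since it evaluates every class even when `target` is given):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def h_softmax_np(x, W1, b1, W2, b2, target=None):
    """Two-level hierarchical softmax, dense numpy sketch.

    x: (batch, dim), W1: (dim, n_classes), b1: (n_classes,),
    W2: (n_classes, dim, n_outputs_per_class),
    b2: (n_classes, n_outputs_per_class).
    """
    class_probs = softmax(x @ W1 + b1)            # (batch, n_classes)
    # Per-class softmax over outputs: (batch, n_classes, n_outputs_per_class)
    out_probs = softmax(np.einsum('bd,cdo->bco', x, W2) + b2)
    joint = class_probs[:, :, None] * out_probs   # p(class) * p(output|class)
    flat = joint.reshape(x.shape[0], -1)          # (batch, n_classes * n_outputs_per_class)
    if target is None:
        return flat
    return flat[np.arange(x.shape[0]), target][:, None]

rng = np.random.RandomState(0)
x = rng.normal(size=(4, 10))
W1, b1 = rng.normal(size=(10, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 10, 3)), np.zeros((3, 3))
all_probs = h_softmax_np(x, W1, b1, W2, b2)
# Each row of all_probs sums to 1: a valid distribution over the 9 outputs.
```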
References
----------
...