Commit 77e4a15b authored Mar 22, 2015 by Sina Honari
adding the tutorial for the case where a subset of the weight matrix should get updated.
Parent d2c39c4e
Showing 2 changed files with 69 additions and 0 deletions (+69 -0)
doc/tutorial/faq.txt +68 -0
doc/tutorial/index.txt +1 -0
doc/tutorial/faq.txt
0 → 100644
.. _faq:

===========================
Frequently Asked Questions
===========================

Updating a Subset of Weights
=============================
If you want to update only the subset of a weight matrix (such as
some rows or some columns) that was used in the forward propagation
of the current iteration, then the cost function should be defined
so that it depends only on that subset of weights.
For example, if you want to learn a lookup table, e.g. one used for
word embeddings, where each row is a vector of weights representing
the embedding the model has learned for a word, then in each
iteration only the rows whose corresponding words were used in the
forward propagation should get updated. Here is how the Theano
function should be written:
# defining a shared variable for the lookup table
>>> lookup_table = theano.shared(matrix_ndarray)
# getting a subset of the table (some rows or some
# columns) by passing an integer vector of indices
# corresponding to those rows or columns
>>> subset = lookup_table[vector_of_indices]
# From now on, use only 'subset'.
# Do not call lookup_table[vector_of_indices] again:
# that creates new variables and causes problems with grad.
# defining a cost which depends only on 'subset'
# and not on the entire lookup_table
>>> cost = something that depends on subset
>>> g = theano.grad(cost, subset)
# There are two ways to update the parameters:
# either inc_subtensor or set_subtensor.
# inc_subtensor is recommended; some Theano optimizations
# convert between the two, but not in all cases.
# Note the minus sign: gradient descent steps against the gradient.
>>> new_table = inc_subtensor(subset, -g * lr)
# OR
>>> new_table = set_subtensor(subset, subset - g * lr)
# Note that only this kind of indexing is covered here,
# not inc_subtensor or set_subtensor with other types of indexing.
# defining the theano function
>>> f = theano.function(..., updates=[(lookup_table, new_table)])
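For concreteness, here is a minimal, self-contained sketch of the
recipe above. The table shape, the toy sum-of-squares cost, and the
learning rate are illustrative assumptions, not part of the recipe:

# A runnable sketch of the pattern above; the sizes and the
# toy cost are illustrative assumptions only.
>>> import numpy
>>> import theano
>>> import theano.tensor as T
# a lookup table of 10 words with 5-dimensional embeddings
>>> matrix_ndarray = numpy.random.randn(10, 5).astype(theano.config.floatX)
>>> lookup_table = theano.shared(matrix_ndarray, name='lookup_table')
>>> vector_of_indices = T.ivector('indices')
>>> subset = lookup_table[vector_of_indices]
# a toy cost that depends only on the selected rows
>>> cost = T.sqr(subset).sum()
>>> g = theano.grad(cost, subset)
>>> lr = 0.1
>>> new_table = T.inc_subtensor(subset, -lr * g)
>>> f = theano.function([vector_of_indices], cost,
...                     updates=[(lookup_table, new_table)])
# each call updates only the rows named in 'indices'
>>> f(numpy.array([0, 3, 7], dtype='int32'))

Only rows 0, 3 and 7 of lookup_table change after the call; all
other rows keep their previous values.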
Note that you can also compute the gradient of the cost function
w.r.t. the entire lookup_table; that gradient will have nonzero rows
only for the rows that were selected during forward propagation. If
you use plain gradient descent to update the parameters, this causes
no problems beyond unnecessary computation: you update the lookup
table with many all-zero gradient rows. If you want to use a
different optimization method such as rmsprop or Hessian-Free
optimization, however, there are real issues. In rmsprop, you keep
an exponentially decaying average of squared gradients and divide
the current gradient by its square root to rescale the update step
component-wise. For a lookup table row corresponding to a rare word,
the gradient is very often zero, so the squared-gradient history of
that row decays towards zero, and when a nonzero gradient finally
arrives, dividing by the near-zero root produces an unreasonably
large step. With Hessian-Free optimization, the many zero rows and
columns make the Hessian singular; even a single one is enough to
make it non-invertible. In general, it is better to compute the
gradient only w.r.t. those lookup table rows or columns that are
actually used during the forward propagation.
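If you nevertheless want an adaptive method such as rmsprop on a
lookup table, one workaround, suggested here as an assumption rather
than something the text above prescribes, is to keep the
squared-gradient accumulator as a shared matrix of the same shape as
the table and update only its selected rows, so a rare word's history
is left untouched between occurrences instead of decaying towards
zero. A sketch reusing the names from the example above; decay, eps
and lr are illustrative hyperparameters:

# rmsprop restricted to the selected rows; 'decay', 'eps' and 'lr'
# are illustrative hyperparameters, not prescribed values.
>>> acc = theano.shared(numpy.zeros((10, 5), dtype=theano.config.floatX))
>>> acc_subset = acc[vector_of_indices]
>>> decay, eps, lr = 0.9, 1e-6, 0.1
# decaying average of squared gradients, for the selected rows only
>>> new_acc_subset = decay * acc_subset + (1 - decay) * T.sqr(g)
>>> new_acc = T.set_subtensor(acc_subset, new_acc_subset)
# rescale the gradient component-wise and step against it
>>> step = lr * g / T.sqrt(new_acc_subset + eps)
>>> new_table = T.inc_subtensor(subset, -step)
>>> f = theano.function([vector_of_indices], cost,
...                     updates=[(lookup_table, new_table), (acc, new_acc)])

Because unselected rows of both the table and the accumulator are
untouched, the history of a rare word's row no longer drifts to zero
while the word is absent.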
doc/tutorial/index.txt
...
...
@@ -46,3 +46,4 @@ you out.
 extending_theano_c
 python-memory-management
 multi_cores
+faq