The standard method for processing temporal sequences is to employ a recurrent net with feedback connections. The feedback connections allow for a short-term memory of information from earlier in a sequence. The present work suggests a novel approach to building a short-term memory by employing fast weights that can be set and reset by a separate 'memory controller' net. Fast weights can hold on to information over time because they remain essentially invariant unless they are explicitly modified.
One potential advantage of the method over the more conventional recurrent net algorithms is that it does not necessarily require full-fledged units (experiencing some sort of feedback) for storing temporal information. A single weight may be sufficient. Because there are many more weights than units in most networks, this property represents a potential for storage efficiency. For related reasons, the novel representation of past inputs is well-suited for solving certain problems involving temporary variable binding in a natural manner: the current input to the fast-weight memory may be viewed as a representation of the addresses of a set of variables, while the corresponding output may be viewed as a representation of the current contents of this set of variables. In contrast with recurrent nets, temporary bindings can be established very naturally by temporary connectivity patterns instead of temporary activation patterns (see section 3.2 for an illustrative experiment).
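To illustrate the point that a single weight can act as the short-term memory, consider the following toy sketch. It is ours, not the architecture proposed here; the function `step` and all numeric values are assumptions. One piece of information is written into a single fast weight and retrieved many steps later, without any recurrent activations carrying it through time:

```python
import numpy as np

fast_w = 0.0                       # a single fast weight acting as short-term memory

def step(x, write=None):
    """One time step: the 'memory controller' may explicitly modify the fast
    weight, which is then used for ordinary feedforward processing."""
    global fast_w
    if write is not None:          # explicit modification by the controller
        fast_w = write
    return np.tanh(fast_w * x)     # no recurrent activations are involved

step(1.0, write=0.8)               # store information early in the sequence
for _ in range(100):               # many intervening steps leave the weight invariant
    step(0.3)
print(step(1.0))                   # the stored information is still retrievable
```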
For initialization reasons we introduce an additional time step 0 at the beginning of an episode. At time step 0, each fast weight variable w_ab of a directed connection from unit a to unit b of the fast-weight net is set to a value determined by the memory controller's outputs, as described below. At time step t > 0, the fast weights obtained at the previous step are used to compute the output of the fast-weight net according to the usual activation spreading rules for back-propagation networks (e.g. [Werbos, 1974]). After this, each weight variable w_ab is modified as a function of the controller's current outputs (equation (1)).
Equation (1) is essentially identical to Möller and Thrun's equation (1) in [Möller and Thrun, 1990]. Unlike [Möller and Thrun, 1990], however, the current paper derives an exact gradient descent algorithm for time-varying inputs and outputs for this kind of architecture.
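To make the time structure concrete, here is a schematic NumPy sketch. The network sizes, the tanh controller with one output per fast weight (the simple interface described further below), the fact that both nets receive the same input, and the additive update standing in for equation (1) are all our own assumptions; only the ordering of operations (set the fast weights at step 0, propagate through the fast-weight net using the weights from the previous step, then let the controller modify them) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                      # assumed sizes of the fast-weight net
n_fast = n_in * n_out                   # one fast weight per directed connection

C = rng.normal(scale=0.1, size=(n_in, n_fast))   # assumed slow weights of the controller

def controller(x):
    """The slowly learning memory controller; here: one output per fast weight."""
    return np.tanh(x @ C)

def fast_net(x, w_fast):
    """The fast-weight net: ordinary feedforward activation spreading."""
    return np.tanh(x @ w_fast.reshape(n_in, n_out))

xs = rng.normal(size=(5, n_in))         # one episode of inputs (fed to both nets here)

w_fast = controller(xs[0])              # time step 0: fast weights set from controller outputs
for x in xs[1:]:
    y = fast_net(x, w_fast)             # use the previous step's fast weights ...
    w_fast = w_fast + controller(x)     # ... then let the controller modify them
                                        # (a simple additive stand-in for equation (1))
```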
For all weights w_ij of the memory controller (from unit i to unit j) we are interested in the increment prescribed by gradient descent in the error function. Here η is a constant learning rate.
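One common way to write such an increment, with a per-time-step error E(t) (our notation, not necessarily the paper's original one), is

\[
\Delta w_{ij}(t) \;=\; -\,\eta\,\frac{\partial E(t)}{\partial w_{ij}} .
\]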
At each time step t, the factor required for this increment is obtained by applying the chain rule to the fast-weight dynamics defined above.
A simple interface between the memory controller and the fast-weight net would provide one output unit of the controller for each fast weight variable w_ab (this one-output-unit-per-weight scheme is the interface referred to as (4) below).
A disadvantage of (4) is that the number of output units in the controller grows in proportion to the number of weights in the fast-weight net.
An alternative is the following: provide an output unit in the controller for each unit in the fast-weight net from which at least one fast weight originates, and call the set of these output units FROM; provide an output unit in the controller for each unit in the fast-weight net to which at least one fast weight leads, and call the set of these output units TO.
For each fast weight variable w_ab we now have a unit FROM_a in FROM and a unit TO_b in TO. At time t, the quantity through which the controller modifies w_ab is defined as a function of the activations of FROM_a and TO_b; this function is differentiable with respect to all its parameters, and its definition is the interface referred to as (5) below. As a representative example, we will focus on the special case in which this function is the multiplication operator, so that the quantity modifying w_ab at time t is the product of the activations of FROM_a and TO_b at time t.
One way to interpret the FROM/TO architecture is to view the memory controller as a device for creating temporary associations by giving two parameters to the short-term memory: the first parameter is an activation pattern over FROM representing the key of a temporary association pair, the second parameter is an activation pattern over TO representing the corresponding entry. Note that both key and entry may involve hidden units.
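The key/entry view can be sketched as follows. This is a minimal NumPy illustration; the additive outer-product write and the sizes are our assumptions, the text above only fixes that each fast weight w_ab is modified by a multiplicative combination of FROM_a's and TO_b's activations.

```python
import numpy as np

n_from, n_to = 8, 8
W = np.zeros((n_from, n_to))            # fast weights w_ab, with a in FROM and b in TO

def write(key, entry):
    """Create a temporary association: each w_ab is changed by FROM_a * TO_b."""
    global W
    W = W + np.outer(key, entry)        # additive outer-product write (our assumption)

def read(key):
    """Retrieve the stored entry by ordinary activation spreading through W."""
    return key @ W

rng = np.random.default_rng(1)
key, entry = rng.normal(size=n_from), rng.normal(size=n_to)
write(key, entry)
print(np.allclose(read(key), entry * (key @ key)))   # retrieval up to the scale factor key.key
```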
(4) and (5) differ in the way that error signals are obtained at the memory controller's output units:
If (4) is employed, then we use conventional back-propagation to compute the error signals of the controller's output units, each of which is responsible for exactly one fast weight.
If (5) is employed, note that each output unit in FROM or TO influences many fast weights, so that its error signal has to collect contributions from all fast weights it helps to modify (equation (6)).
Conventional back-propagation can be used to compute the required derivatives for each output unit and for all time steps t. The results can be kept in appropriate variables, which makes it easy to solve (6) in a second pass.
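For the multiplicative special case, the accumulation of error signals at the FROM and TO units can be sketched as follows. The naming is ours; dE_dwrite stands for the error derivatives with respect to the per-weight products, which conventional back-propagation is assumed to have provided in a first pass.

```python
import numpy as np

def output_unit_gradients(dE_dwrite, FROM, TO):
    """Error signals at the controller's FROM/TO output units (multiplicative case).

    dE_dwrite[a, b] holds the error derivative w.r.t. the product FROM[a] * TO[b]
    that modifies fast weight w_ab; each output unit collects contributions from
    every fast weight it helps to modify (product rule)."""
    dE_dFROM = dE_dwrite @ TO           # dE/dFROM_a = sum_b dE_dwrite[a, b] * TO[b]
    dE_dTO   = FROM @ dE_dwrite         # dE/dTO_b   = sum_a dE_dwrite[a, b] * FROM[a]
    return dE_dFROM, dE_dTO

rng = np.random.default_rng(2)
FROM, TO = rng.normal(size=4), rng.normal(size=3)
dE_dFROM, dE_dTO = output_unit_gradients(rng.normal(size=(4, 3)), FROM, TO)
```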
The algorithm is local in time: its update-complexity per time step does not depend on the length of the input sequence. However, it is not local in space (see [Schmidhuber, 1990b] for a definition of locality in space and time).