The standard method for processing temporal sequences is to employ a recurrent net with feedback connections. The feedback connections allow for a short-term memory of information seen earlier in a sequence. The present work suggests a novel approach to building a short-term memory: the *fast weights* of a network $F$ can be set and reset by a second network $S$ acting as 'memory controller'. Fast weights can hold on to information over time because they remain essentially invariant unless they are explicitly modified.

One potential advantage of the method over the more conventional recurrent net algorithms is that it does not necessarily require full-fledged units (experiencing some sort of feedback) for storing temporal information. A single weight may be sufficient. Because there are many more weights than units in most networks, this property represents a potential for storage efficiency. For related reasons, the novel representation of past inputs is well-suited for solving certain problems involving *temporary variable binding* in a natural manner: $F$'s current input may be viewed as a representation of the *addresses* of a set of variables; $F$'s current output may be viewed as the representation of the current *contents* of this set of variables. In contrast with recurrent nets, temporary bindings can be established very naturally by temporary *connectivity patterns* instead of temporary *activation patterns* (see section 3.2 for an illustrative experiment).
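To make the binding idea concrete, here is a minimal NumPy sketch (not from the paper; all names are hypothetical): a key pattern is temporarily bound to an entry pattern by a rank-one change of a fast weight matrix, and the entry is later retrieved by presenting the key as input.

```python
import numpy as np

rng = np.random.default_rng(0)

# A temporary binding stored as a *connectivity pattern*:
# a rank-one update of a fast weight matrix W.
n_in, n_out = 8, 8
W = np.zeros((n_out, n_in))          # fast weights, reset at episode start

key = rng.standard_normal(n_in)      # "address" of a variable
key /= np.linalg.norm(key)           # normalized so retrieval is exact
entry = rng.standard_normal(n_out)   # current "contents" of the variable

W += np.outer(entry, key)            # bind: temporary connectivity pattern

# Later in the episode: presenting the address as input retrieves the
# contents as output (here the retrieving net is a single linear layer).
retrieved = W @ key
print(np.allclose(retrieved, entry)) # True: the binding persists over time
```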

For initialization reasons we introduce an additional time step 0 at the beginning of an episode. At time step 0 each weight variable $w_{ab}$ of a directed connection from unit $a$ to unit $b$ of $F$ is set to $\Box_{ab}(0)$, where $\Box_{ab}(t)$ denotes $S$'s modification signal for $w_{ab}$ at time $t$ (a function of $S$'s outputs as described below). At time step $t > 0$, the $w_{ab}(t-1)$ are used to compute the output of $F$ according to the usual activation spreading rules for back-propagation networks (e.g. [Werbos, 1974]). After this, each weight variable $w_{ab}$ is altered according to

$$w_{ab}(t) = w_{ab}(t-1) + \Box_{ab}(t). \qquad (1)$$
Equation (1) is essentially identical to Möller and Thrun's equation (1) in [Möller and Thrun, 1990]. Unlike [Möller and Thrun, 1990], however, the current paper derives an exact gradient descent algorithm for time-varying inputs and outputs for this kind of architecture.
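The following NumPy sketch illustrates the dynamics of one episode under (1). For brevity it assumes (this is an assumption of the sketch, not a requirement of the method) that $S$ is a single linear layer emitting one modification signal per fast weight, and that $F$ is a single layer as well:

```python
import numpy as np

rng = np.random.default_rng(1)
n_s_in, n_f_in, n_f_out = 4, 3, 2

# Slow net S (weights fixed during the episode); here S is linear
# and emits one modification signal box_ab per fast weight of F.
W_s = 0.1 * rng.standard_normal((n_f_out * n_f_in, n_s_in))

def box(x_s):
    """S's modification signals, one per fast weight w_ab of F."""
    return (W_s @ x_s).reshape(n_f_out, n_f_in)

episode = [rng.standard_normal(n_s_in) for _ in range(5)]  # S's input stream
x_f = rng.standard_normal(n_f_in)                          # F's input

W_f = box(episode[0])         # time step 0: fast weights set from S's outputs

for x_s in episode[1:]:
    y_f = np.tanh(W_f @ x_f)  # F's output, computed with the w_ab(t-1)
    W_f = W_f + box(x_s)      # eq. (1): additive modification by S
```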

For all weights $w_{ij}$ of $S$ (from unit $i$ to unit $j$) we are interested in the increment

$$\Delta w_{ij} = -\eta \sum_t \frac{\partial E(t)}{\partial w_{ij}} = -\eta \sum_t \sum_{a,b} \frac{\partial E(t)}{\partial w_{ab}(t-1)} \, \frac{\partial w_{ab}(t-1)}{\partial w_{ij}}, \qquad (2)$$

where $E(t)$ denotes the error of $F$'s output at time $t$ and the sums over $t$ run over the episode. Here $\eta$ is a constant learning rate. At each time step $t$, the factor $\partial E(t) / \partial w_{ab}(t-1)$ can be computed by conventional back-propagation (e.g. [Werbos, 1974]). For $\partial w_{ab}(t) / \partial w_{ij}$ we obtain from (1) the recursion

$$\frac{\partial w_{ab}(t)}{\partial w_{ij}} = \frac{\partial w_{ab}(t-1)}{\partial w_{ij}} + \frac{\partial \Box_{ab}(t)}{\partial w_{ij}}.$$

We can employ a method similar to the one described in [Robinson and Fallside, 1987] and [Williams and Zipser, 1989]: for each $w_{ab}$ and each $w_{ij}$ we introduce a variable $p_{ab}^{ij}$ (initialized to zero at the beginning of an episode, before time step 0) which is updated at each time step $t$:

$$p_{ab}^{ij}(t) = p_{ab}^{ij}(t-1) + \frac{\partial \Box_{ab}(t)}{\partial w_{ij}}. \qquad (3)$$

After the update at time $t$ we have $p_{ab}^{ij}(t) = \partial w_{ab}(t) / \partial w_{ij}$.
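In NumPy, the bookkeeping of (2) and (3) might look as follows (a sketch under the assumptions above; the arrays of derivatives are taken as given and would be obtained as described below for the two interfaces):

```python
import numpy as np

def episode_gradient(dE_dWf, dbox_dw, eta):
    """Accumulate the slow-weight increments of eq. (2) over one episode.

    dE_dWf[t]  : dE(t)/dw_ab(t-1), shape (n_out, n_in); entry 0 is unused
                 because time step 0 only initializes the fast weights.
    dbox_dw[t] : dbox_ab(t)/dw_ij, shape (n_out, n_in, n_w).
    Returns the total increment for all slow weights w_ij, shape (n_w,).
    """
    n_out, n_in, n_w = dbox_dw[0].shape
    p = np.zeros((n_out, n_in, n_w))   # p_ab^ij, zero at episode start
    delta_w = np.zeros(n_w)

    for t in range(len(dbox_dw)):
        if t > 0:
            # eq. (2): dE(t)/dw_ij = sum_ab dE(t)/dw_ab(t-1) * p_ab^ij(t-1)
            delta_w -= eta * np.einsum('ab,abw->w', dE_dWf[t], p)
        p = p + dbox_dw[t]             # eq. (3): update the p variables
    return delta_w
```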

A simple interface between $S$ and $F$ would provide one output unit of $S$ for each weight variable $w_{ab}$ of $F$, where

$$\Box_{ab}(t) = o_{ab}(t), \qquad (4)$$

$o_{ab}(t)$ denoting the activation at time $t$ of the output unit of $S$ associated with $w_{ab}$.

A disadvantage of (4) is that the number of output units in $S$ grows in proportion to the number of fast weights in $F$. An alternative is the following: Provide an output unit in $S$ for each unit in $F$ from which at least one fast weight originates. Call the set of these output units *FROM*. Provide an output unit in $S$ for each unit in $F$ to which at least one fast weight leads. Call the set of these output units *TO*. For each weight variable $w_{ab}$ we now have a unit $FROM_a \in$ *FROM* and a unit $TO_b \in$ *TO*. At time $t$, define $\Box_{ab}(t) = g(FROM_a(t), TO_b(t))$, where $g$ is differentiable with respect to all its parameters and $FROM_a(t)$, $TO_b(t)$ denote the corresponding activations at time $t$. As a representative example we will focus on the special case of $g$ being the multiplication operator:

$$\Box_{ab}(t) = FROM_a(t) \, TO_b(t). \qquad (5)$$

One way to interpret the *FROM/TO* architecture is to view $S$ as a device for creating temporary associations by giving two parameters to the short-term memory: the first parameter is an activation pattern over *FROM* representing a *key* to a temporary association pair; the second parameter is an activation pattern over *TO* representing the corresponding *entry*. Note that both key and entry may involve hidden units.
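Under the multiplication operator of (5), the modification signals at time $t$ form the outer product of the *FROM* and *TO* activation patterns, so each fast weight update of (1) is a rank-one matrix change. A minimal sketch (hypothetical names; the storage layout `W_f[b, a]` is an assumption of the sketch):

```python
import numpy as np

def fast_weight_update(W_f, from_act, to_act):
    """Eq. (1) combined with eq. (5): box_ab(t) = FROM_a(t) * TO_b(t).

    W_f      : fast weights w_ab stored as W_f[b, a], shape (n_to, n_from)
    from_act : activation pattern over FROM (the *key*),  shape (n_from,)
    to_act   : activation pattern over TO (the *entry*),  shape (n_to,)
    """
    return W_f + np.outer(to_act, from_act)  # rank-one additive change
```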

(4) and (5) differ in the way that error signals are obtained at $S$'s output units: If (4) is employed, then we use conventional back-propagation through $S$ to compute the $\partial \Box_{ab}(t) / \partial w_{ij}$ in (3). If (5) is employed, note that

$$\frac{\partial \Box_{ab}(t)}{\partial w_{ij}} = FROM_a(t) \, \frac{\partial TO_b(t)}{\partial w_{ij}} + TO_b(t) \, \frac{\partial FROM_a(t)}{\partial w_{ij}}. \qquad (6)$$

Conventional back-propagation can be used to compute $\partial FROM_a(t) / \partial w_{ij}$ and $\partial TO_b(t) / \partial w_{ij}$ for each output unit and for all $w_{ij}$. The results can be kept in appropriate variables. This makes it easy to evaluate (6) in a second pass.
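For a concrete instance of this two-pass scheme, assume (purely for this sketch) that $S$ is a single linear layer whose output layer lists the *FROM* units first; the derivatives of the output activations with respect to $S$'s weights then have a closed form, and (6) combines them by the product rule:

```python
import numpy as np

def dbox_dw_linear_S(x_s, from_act, to_act, n_from, n_to):
    """Eq. (6) for a *linear* S (an assumption of this sketch).

    S's output k is sum_j W_s[k, j] * x_s[j], so d out_k / d W_s[i, j]
    equals x_s[j] if k == i and 0 otherwise.
    Returns dbox[a, b, i, j] = d box_ab(t) / d W_s[i, j].
    """
    n_out_s = n_from + n_to
    # First pass: derivatives of each output unit w.r.t. all slow weights,
    # kept in variables for the second pass.
    d_out = np.zeros((n_out_s, n_out_s, len(x_s)))
    for k in range(n_out_s):
        d_out[k, k, :] = x_s
    d_from = d_out[:n_from]            # d FROM_a / d W_s[i, j]
    d_to = d_out[n_from:]              # d TO_b   / d W_s[i, j]

    # Second pass, eq. (6): product rule for box_ab = FROM_a * TO_b.
    return (np.einsum('a,bij->abij', from_act, d_to)
            + np.einsum('b,aij->abij', to_act, d_from))
```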

The algorithm is *local in time*: its update complexity per time step is $O(|W_S| \, |W_F|)$ (where $|W_S|$ and $|W_F|$ denote the number of $S$'s weights and the number of fast weights of $F$, respectively) and does not depend on the length of the episode. However, the algorithm is not local in space (see [Schmidhuber, 1990b] for a definition of locality in space and time).
