The standard method for processing temporal sequences is to employ
a recurrent net with feedback connections. The feedback connections
allow for a short-term memory of information earlier in a sequence.
The present work suggests a novel approach to building a short-term
memory by employing fast weights that can be set and reset by a
'memory controller': a slowly learning network S learns to manipulate the
fast weights of a second network F. Fast weights can hold on to information
over time because they remain essentially invariant unless they are
explicitly modified.
One potential advantage of the method over more conventional
recurrent net algorithms is that it does not necessarily
require full-fledged units (with some sort of feedback)
for storing temporal information; a single weight may be sufficient.
Because most networks have many more weights than units,
this property promises a gain in storage efficiency. For related reasons,
the novel representation of past inputs
is well-suited for solving, in a natural manner, certain problems involving
temporary variable binding:
F's current input may be viewed as a representation of
the addresses of a set of variables;
F's current output may be viewed as a representation of
the current contents of this set of variables.
In contrast with recurrent nets, temporary bindings can be established very
naturally by temporary connectivity patterns instead of temporary
activation patterns
(see section 3.2 for an illustrative experiment).
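To make the variable-binding picture concrete, here is a minimal sketch (not the experiment of section 3.2, and with hypothetical names such as `bind` and `recall`) in which two address/content pairs are stored as a temporary connectivity pattern of a single linear layer F and then retrieved by presenting the addresses; the outer-product write anticipates the multiplicative FROM/TO interface of equation (5) below.

```python
import numpy as np

# A temporary binding lives in F's connectivity (its fast weights), not in activations.
n_addr, n_content = 4, 3
W_fast = np.zeros((n_content, n_addr))          # fast weights of F: the short-term memory

def bind(W, address, content):
    """Write one address->content pair as an outer-product change of the fast weights."""
    return W + np.outer(content, address)       # a temporary connectivity pattern

def recall(W, address):
    """Read a binding by propagating the address through F's fast weights."""
    return W @ address

x1, x2 = np.eye(n_addr)[0], np.eye(n_addr)[1]   # two distinct variable 'addresses'
v1 = np.array([1., 0., 0.])                     # current 'contents' of the two variables
v2 = np.array([0., 2., 1.])
W_fast = bind(W_fast, x1, v1)
W_fast = bind(W_fast, x2, v2)
print(recall(W_fast, x1))                       # ~ v1
print(recall(W_fast, x2))                       # ~ v2
```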
For initialization reasons we introduce an additional time step 0
at the beginning of an episode.
At time step 0, each weight variable $w_{ab}$
of a directed connection from unit $a$ to unit $b$ of F
is set to $w_{ab}(0)$
(a function of S's outputs as described below).
At time step $t > 0$, the $w_{ab}(t-1)$
are used to compute the output of F
according to the usual activation spreading rules for
back-propagation networks (e.g. [Werbos, 1974]).
After this, each weight variable $w_{ab}$
is altered
according to
$$w_{ab}(t) = \sigma\big(w_{ab}(t-1),\, \Box_{ab}(t)\big) \qquad (1)$$

Here $\Box_{ab}(t)$ denotes the contribution of S's outputs to the modification of $w_{ab}$ at time $t$ (specified by (4) and (5) below), and $\sigma$ is differentiable with respect to both of its arguments.
Equation (1) is essentially identical to Möller and Thrun's equation (1) in [Möller and Thrun, 1990]. Unlike [Möller and Thrun, 1990], however, the current paper derives an exact gradient descent algorithm for time-varying inputs and outputs for this kind of architecture.
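The order of operations described above (F first propagates activations through the weights $w_{ab}(t-1)$, then each $w_{ab}$ is altered) can be summarized in a per-time-step sketch. It is only an illustration under assumptions not fixed by the text: both nets are single linear layers, the modification signals come from the simple interface of equation (4) below, $\sigma$ is taken to be plain addition, and $w_{ab}(0) = 0$; all identifiers are hypothetical.

```python
import numpy as np

def step(S_forward, F_forward, W_fast, x_t, f_in_t, sigma=lambda w, box: w + box):
    """One time step t > 0 of the fast-weight memory system (hypothetical sketch).

    S_forward(x_t) -> one modification signal per fast weight (interface (4)),
    F_forward(W_fast, f_in_t) -> F's output computed with the weights w_ab(t-1),
    sigma -> the update function of eq. (1); plain addition is assumed here.
    """
    out_t = F_forward(W_fast, f_in_t)              # F uses w_ab(t-1) first ...
    box = S_forward(x_t).reshape(W_fast.shape)     # ... then S's outputs are read out ...
    W_fast = sigma(W_fast, box)                    # ... and each w_ab is altered per eq. (1)
    return out_t, W_fast

# Toy instantiation: both nets are single linear layers.
rng = np.random.default_rng(0)
n_in, n_from, n_to = 5, 4, 3
Ws = 0.1 * rng.standard_normal((n_to * n_from, n_in))   # slow weights of S
W_fast = np.zeros((n_to, n_from))                       # fast weights of F, w_ab(0) = 0 assumed

for t in range(1, 4):
    x_t = rng.standard_normal(n_in)                # input to S at time t
    f_in_t = rng.standard_normal(n_from)           # input to F at time t
    out_t, W_fast = step(lambda x: Ws @ x, lambda W, v: W @ v, W_fast, x_t, f_in_t)
```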
For all weights $w_{ij}$ of S
(from unit $i$
to unit $j$)
we are interested in the increment
$$\Delta w_{ij} = -\eta \sum_{t} \sum_{a,b} \frac{\partial E(t)}{\partial w_{ab}(t-1)}\, \frac{\partial w_{ab}(t-1)}{\partial w_{ij}}, \qquad (2)$$

where $E(t)$ denotes F's error at time step $t$.
Here $\eta$ is a constant learning rate.
At each time step $t$, the factor $\partial w_{ab}(t) / \partial w_{ij}$ can be updated incrementally:

$$\frac{\partial w_{ab}(t)}{\partial w_{ij}} = \frac{\partial \sigma}{\partial w_{ab}(t-1)}\, \frac{\partial w_{ab}(t-1)}{\partial w_{ij}} + \frac{\partial \sigma}{\partial \Box_{ab}(t)}\, \frac{\partial \Box_{ab}(t)}{\partial w_{ij}}. \qquad (3)$$
A simple interface between S
and F
would provide one output unit in S, with activation $y_{ab}(t)$ at time $t$,
for each weight variable $w_{ab}$ of F,
where
$$\Box_{ab}(t) = y_{ab}(t). \qquad (4)$$
A disadvantage of (4) is that the
number of output units in S
grows in proportion to the number of weights in F.
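Under the same simplifying assumptions (linear nets, interface (4), additive $\sigma$, squared error, zero fast weights at time 0), the bookkeeping behind equations (2)-(4) can be sketched as follows: the factors $\partial w_{ab}(t)/\partial w_{ij}$ of (3) are held in an array that is updated once per time step, and (2) accumulates their products with the error derivatives obtained by back-propagation through F. This is an illustrative sketch, not the exact procedure of the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_from, n_to = 3, 2, 2
Ws = 0.1 * rng.standard_normal((n_to * n_from, n_in))  # slow weights w_ij of S (linear S assumed)
Wf = np.zeros((n_to, n_from))                          # fast weights w_ab of F (linear F assumed)
dWf_dWs = np.zeros(Wf.shape + Ws.shape)                # the factors of eq. (3), one per (a, b, i, j)
grad_Ws = np.zeros_like(Ws)                            # accumulates the sum of eq. (2) (without -eta)

for t in range(1, 6):                                  # a toy episode
    x = rng.standard_normal(n_in)                      # S's input at time t
    f_in = rng.standard_normal(n_from)                 # F's input at time t
    target = rng.standard_normal(n_to)                 # desired F output (squared error assumed)

    out = Wf @ f_in                                    # F's output uses w_ab(t-1)
    dE_dWf = np.outer(out - target, f_in)              # dE(t)/dw_ab(t-1) by backprop through F
    grad_Ws += np.einsum('ab,abij->ij', dE_dWf, dWf_dWs)   # accumulate eq. (2)

    # Interface (4): output unit (a, b) of S provides the modification signal for w_ab.
    box = (Ws @ x).reshape(n_to, n_from)
    dbox_dWs = np.zeros_like(dWf_dWs)
    for a in range(n_to):
        for b in range(n_from):
            dbox_dWs[a, b, a * n_from + b, :] = x      # d box_ab / d Ws[row(a,b), j] = x_j
    Wf = Wf + box                                      # additive special case of eq. (1)
    dWf_dWs = dWf_dWs + dbox_dWs                       # eq. (3) for the additive case

Ws -= 0.01 * grad_Ws                                   # eq. (2) with learning rate eta = 0.01
```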
An alternative is the following:
Provide an output unit in S
for each unit in F from which
at least one fast weight originates; call the set of
these output units FROM.
Provide an output unit in S
for each unit in F to which
at least one fast weight leads; call the set of
these output units TO.
For each weight variable $w_{ab}$
we now have a unit $FROM_a$ in
FROM and a unit $TO_b$ in
TO. At time $t$, define
$\Box_{ab}(t) = g\big(y_{FROM_a}(t), y_{TO_b}(t)\big)$, where $y_k(t)$ denotes the
activation of output unit $k$ at time $t$ and where $g$
is differentiable with respect to all its parameters. As a
representative example
we will focus on the special case
of $g$
being the multiplication operator:
$$\Box_{ab}(t) = y_{FROM_a}(t)\, y_{TO_b}(t). \qquad (5)$$
One way to interpret the FROM/TO architecture is to view S
as a device for creating temporary associations by giving two parameters
to the short-term memory: the first parameter is an activation
pattern over FROM representing a key to
a temporary association pair; the second
parameter is an activation
pattern over TO representing the corresponding entry.
Note that both
key and entry may involve hidden units.
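As a small illustration of equation (5) (again with hypothetical names and an assumed additive update), note that S has to emit only $|FROM| + |TO|$ activations, yet these define one modification signal for every fast weight; with an additive $\sigma$ the result is an outer-product write of the key/entry pair into F's connectivity.

```python
import numpy as np

n_from, n_to = 4, 3

def modification_signals(from_acts, to_acts):
    """box_ab(t) = y_FROM_a(t) * y_TO_b(t) for every fast weight w_ab (eq. (5))."""
    return np.outer(to_acts, from_acts)   # one entry per fast weight of F

key = np.array([1., 0., 0., 0.])          # activation pattern over FROM (the key)
entry = np.array([0.5, -1., 2.])          # activation pattern over TO (the entry)
box = modification_signals(key, entry)    # only len(key) + len(entry) numbers were needed

W_fast = np.zeros((n_to, n_from))         # fast weights of F
W_fast = W_fast + box                     # assumed additive update (1): writes the association
print(W_fast @ key)                       # ~ entry, since the key has unit norm
```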
(4) and (5) differ in the way that error signals are
obtained at S's output units:
If (4) is employed, then we use conventional back-propagation to
compute $\partial \Box_{ab}(t) / \partial w_{ij}$
in (3).
If (5) is employed, note that
$$\frac{\partial \Box_{ab}(t)}{\partial w_{ij}} = y_{TO_b}(t)\, \frac{\partial y_{FROM_a}(t)}{\partial w_{ij}} + y_{FROM_a}(t)\, \frac{\partial y_{TO_b}(t)}{\partial w_{ij}}. \qquad (6)$$
Conventional back-propagation can be used to compute
$\partial y_k(t) / \partial w_{ij}$ for each output
unit $k$ in FROM $\cup$ TO
and for all $w_{ij}$. The results
can be kept in $(|FROM| + |TO|) \cdot n_S$
variables, where $n_S$ is the number of weights of S.
This makes it easy to solve (6) in a second pass.
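The two-pass scheme can be sketched for the special case of a linear S (an assumption made only to keep the example short; a nonlinear S would obtain the same quantities by one conventional back-propagation pass per FROM/TO output unit): the first pass stores $\partial y_k(t)/\partial w_{ij}$ for every output unit $k$ in FROM $\cup$ TO, and the second pass combines the stored values according to (6).

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_from, n_to = 3, 4, 2
Ws = 0.1 * rng.standard_normal((n_from + n_to, n_in))   # linear S; rows = FROM units, then TO units
x = rng.standard_normal(n_in)                            # S's input at time t
y = Ws @ x                                               # output activations of S
y_from, y_to = y[:n_from], y[n_from:]

# First pass: for a linear S, dy_k/dWs[i, j] = x[j] if i == k, else 0; store one slot per output unit.
dy_dWs = np.zeros((n_from + n_to,) + Ws.shape)
for k in range(n_from + n_to):
    dy_dWs[k, k, :] = x

# Second pass: combine the stored derivatives according to eq. (6).
dbox_dWs = np.zeros((n_from, n_to) + Ws.shape)           # [a, b] holds d box_ab / d w_ij
for a in range(n_from):                                  # a indexes FROM units
    for b in range(n_to):                                # b indexes TO units
        # product rule applied to box_ab = y_FROM_a * y_TO_b
        dbox_dWs[a, b] = y_to[b] * dy_dWs[a] + y_from[a] * dy_dWs[n_from + b]
```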
The algorithm is local in time; its
update-complexity per time step is of the order of the number of
fast weights times the number of slow weights.
However, it is not local in space
(see [Schmidhuber, 1990b] for a definition of
locality in space and time).