Dynamics of conventional recurrent nets.
To save indices, I consider a single discrete sequence of real-valued input vectors $x(1), x(2), \ldots, x(t_{max})$, each with dimension $n_x$. In what follows, if $v$ denotes a vector, then $v_i$ denotes the $i$-th component of $v$. If $u$ denotes a unit, then $u(t)$ denotes the activation of the unit at time $t$. If $w_{uv}$ denotes the directed connection from unit $v$ to unit $u$, then $w_{uv}(t)$ denotes $w_{uv}$'s real-valued weight at time $t$.
Throughout the remainder of this paper, unquantified variables are assumed to take on their maximal range.
The dynamic evolution of a traditional discrete time recurrent net (e.g. [2], [7]) is given by

$$ u(t) = f_u\Big( \sum_{v} w_{uv}(t)\, v(t-1) \Big) \qquad (1) $$

for each non-input unit $u$, where $f_u$ is a differentiable activation function, the sum ranges over all units $v$ connected to $u$, and the activations of the input units at time $t$ are clamped to the components of $x(t)$.
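For concreteness, here is a minimal sketch of one update step of equation (1), assuming a logistic activation function, a dense weight matrix, and the convention that the first $n_x$ units are input units clamped to $x(t)$; the names and shapes are illustrative, not taken from the paper.

```python
# Minimal sketch of one application of equation (1), assuming a logistic
# activation function and that the first n_x units are input units clamped to
# the current input vector x(t). Names and shapes are illustrative.
import jax.numpy as jnp


def recurrent_step(W, y_prev, x_t, n_x):
    """W      : (n_units, n_units) matrix, W[u, v] = weight of connection v -> u
    y_prev : (n_units,) activations at time t-1
    x_t    : (n_x,) input vector at time t"""
    net = W @ y_prev                      # weighted sum for every unit
    y = 1.0 / (1.0 + jnp.exp(-net))       # logistic activation f_u
    return y.at[:n_x].set(x_t)            # input units take on the components of x(t)
```

Iterating `recurrent_step` over $t = 1, \ldots, t_{max}$ yields the activation trajectory that enters the objective below.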
Typical objective function. There may exist specified target values $d_u(t)$ for certain output units $u$ at certain times $t$. We define

$$ E = \frac{1}{2} \sum_{t} \sum_{u} \big( d_u(t) - u(t) \big)^2, \qquad (2) $$

where the inner sum ranges over all output units $u$ for which a target is specified at time $t$.
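A corresponding sketch of the objective (2), assuming the targets are supplied as (time, unit, target value) triples; this representation is an assumption made only for illustration.

```python
# Sketch of the objective (2), assuming targets are given as (t, u, d) triples:
# target value d for output unit u at time t. Only these pairs contribute to E.
def objective(activations, targets):
    """activations: (t_max + 1, n_units) array with activations[t, u] = u(t)."""
    E = 0.0
    for t, u, d in targets:
        E = E + 0.5 * (d - activations[t, u]) ** 2
    return E
```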
Complexity of traditional recurrent net algorithms.
Let $n_{var}$ be the number of time-varying variables for storing temporal events. Let $R_{ops}$ be the ratio between the number of operations per time step (for an exact gradient based supervised sequence learning algorithm) and $n_{var}$. Let $R_{mem}$ be the ratio between the maximum number of storage cells necessary for learning arbitrary sequences and $n_{var}$. With conventional recurrent nets, $n_{var}$ equals the number of units, denoted by $n$ in what follows.
The fastest known exact gradient based learning algorithm for minimizing (2) with fully recurrent networks is `back-propagation through time' (BPTT, e.g. [3]). With BPTT, each time step requires $O(n^2)$ operations (one per weight), and $R_{ops} = O(n)$. BPTT's disadvantage is that it requires $O(n \cdot t_{max})$ storage for the activation history - which means that $R_{mem}$ is infinite: BPTT is not a fixed-size storage algorithm.
The most well-known fixed-size storage learning algorithm for minimizing (2) with fully recurrent nets is RTRL [2][8]. With RTRL, each time step requires $O(n^4)$ operations, so $R_{ops} = O(n^3)$ (much worse than with BPTT), and the required storage is $O(n^3)$, so $R_{mem} = O(n^2)$ (much better than with BPTT).
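The asymptotic statements above can be summarized by tabulating the well-known leading-order operation and storage counts; the sketch below only reproduces these counts for a net with $n$ units (it does not implement either algorithm).

```python
# Leading-order operation and storage counts behind R_ops and R_mem for a
# conventional fully recurrent net with n units (so n_var = n). These are the
# standard asymptotic figures quoted above, not measurements.
def complexity_ratios(n, t_max):
    bptt_ops, bptt_mem = n ** 2, n * t_max   # one op per weight; full activation history
    rtrl_ops, rtrl_mem = n ** 4, n ** 3      # updating / storing the n^3 sensitivities
    return {
        "BPTT": {"R_ops": bptt_ops / n, "R_mem": bptt_mem / n},
        "RTRL": {"R_ops": rtrl_ops / n, "R_mem": rtrl_mem / n},
    }
```

BPTT's $R_{mem}$ grows with $t_{max}$ and is therefore unbounded for arbitrary sequences, while RTRL's $R_{mem} = n^2$ does not depend on the sequence length.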
Contribution of this paper.
The contribution of this paper is an extension of the conventional dynamics (as in equation (1)) plus a corresponding exact gradient based learning algorithm whose $R_{ops}$ is as good as with BPTT and whose $R_{mem}$ is as good as with RTRL. The basic idea is: the weights themselves are included in the set of time-varying variables that can store temporal information ($n_{var}$ then equals the number of weights, $O(n^2)$, rather than the number of units).
Section 2 describes the novel network dynamics that allow the network to actively and quickly manipulate its own weights (via so-called intra-sequence weight changes) by creating certain appropriate internal activation patterns during observation of an input sequence.
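To give a flavour of what an intra-sequence weight change could look like before Section 2 is presented, the sketch below uses a Hebb-like outer-product update gated by a scalar $g$ produced from the network's own activations; both the outer-product form and the gate are illustrative assumptions, not the dynamics actually defined in Section 2.

```python
# Hypothetical intra-sequence weight change: the activations at time t produce
# an additive update of the weight matrix in effect at time t+1. The Hebb-like
# outer-product form and the scalar gate g are assumptions for illustration;
# the actual update dynamics are the subject of Section 2.
import jax.numpy as jnp


def intra_sequence_update(W_t, y_t, g):
    """W_t : (n_units, n_units) weights in effect at time t
    y_t : (n_units,) activations at time t
    g   : scalar produced by the network itself, gating the write strength"""
    return W_t + g * jnp.outer(y_t, y_t)
```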
Section 3 derives an exact supervised sequence learning algorithm (for creating inter-sequence weight changes, which affect only the initial weights at the beginning of each training sequence) forcing the network to use its association-building capabilities to minimize $E$.
The gradient-based learning algorithm for inter-sequence weight
changes takes into account the fact
that intra-sequence weight changes at some point in time
may influence both activations and
intra-sequence weight changes at later points in time.
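The dependency structure described here can be illustrated by unrolling the (illustrative) dynamics sketched above and differentiating $E$ with respect to the initial weights; the code below reuses `recurrent_step`, `intra_sequence_update`, and `objective` from the earlier sketches and uses automatic differentiation merely to expose the chain of influences, whereas Section 3 derives an exact algorithm for this gradient.

```python
# Unroll the illustrative dynamics (recurrent_step + intra_sequence_update from
# the sketches above) and differentiate E with respect to the initial weights.
# Automatic differentiation is used here only to expose how an intra-sequence
# weight change at time t influences later activations and later weight
# changes; Section 3 derives an exact algorithm for this gradient.
import jax
import jax.numpy as jnp


def unrolled_error(W_init, xs, targets, n_x):
    """xs: (t_max, n_x) input sequence; targets: list of (t, u, d) triples."""
    n_units = W_init.shape[0]
    W, y = W_init, jnp.zeros(n_units)
    activations = [y]
    for x_t in xs:
        y = recurrent_step(W, y, x_t, n_x)     # equation (1) with the current weights
        g = 0.01 * jnp.tanh(jnp.sum(y))        # illustrative self-produced gate
        W = intra_sequence_update(W, y, g)     # intra-sequence weight change
        activations.append(y)
    return objective(jnp.stack(activations), targets)


# Toy usage: an inter-sequence weight change would be a gradient step on W_init.
n_units, n_x, t_max = 8, 3, 5
W_init = 0.1 * jax.random.normal(jax.random.PRNGKey(0), (n_units, n_units))
xs = jnp.ones((t_max, n_x))
targets = [(t_max, n_units - 1, 1.0)]          # one target: last unit at the final step
grad_W_init = jax.grad(unrolled_error)(W_init, xs, targets, n_x)
```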