I will keep the architecture and the objective function from section 1
but I will modify the system dynamics.
Recall that unquantified
variables are assumed to range over their maximal domain.
For our single training sequence with
discrete time steps $t = 1, 2, \ldots$, the
system dynamics (explanation follows below)
are defined by

$$ y_i(t) = f_i\Big( \sum_j w_{ij}(t-1)\, y_j(t-1) \Big) \qquad (3) $$

$$ w_{ij}(t) = w_{ij}(t-1) + g\big(y_i(t)\big)\, h\big(y_j(t-1)\big) \qquad (4) $$

where $y_i(t)$ denotes the activation of unit $i$ at time $t$, $w_{ij}(t)$ denotes the weight on the connection from unit $j$ to unit $i$ after time step $t$, and input units take their activations directly from the current input vector.
Equation (3) is just the conventional recurrent net update rule (1).
Unlike with conventional recurrent nets, however, the weights do
not remain constant during sequence processing:
Equation (4) says that connections between units active at
successive time steps
are immediately strengthened or weakened essentially in proportion
to pre-synaptic
and post-synaptic activity.
These intra-sequence weight changes are
modulated by the non-linear functions $g$ and $h$
and may be negative (anti-Hebb-like)
or zero as well as positive.
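To make equations (3) and (4) concrete, here is a minimal Python/NumPy sketch of one pass through the training sequence. The names (run_sequence, n_in) and the convention of clamping the first n_in units to the current input vector are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def run_sequence(W, inputs, n_in, f, g, h):
    """One pass through a training sequence under dynamics (3)-(4).

    W       -- n x n fast-weight matrix, modified *during* the sequence
    inputs  -- list of input vectors (length n_in), one per time step
    f, g, h -- elementwise non-linearities; f squashes into (0, 1)
    """
    y_prev = np.zeros(W.shape[0])  # activations at time t-1
    for x in inputs:
        # Eq. (3): conventional recurrent update with the *current* weights.
        y = f(W @ y_prev)
        # Input units are clamped to the current input vector
        # (one possible convention, assumed here for illustration).
        y[:n_in] = x
        # Eq. (4): each w_ij changes in proportion to post-synaptic
        # activity y_i(t) and pre-synaptic activity y_j(t-1), modulated
        # by g and h; changes may be negative, zero, or positive.
        W = W + np.outer(g(y), h(y_prev))
        y_prev = y
    return W, y_prev
```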
Let us assume that all input vectors and all $f_i$
are such
that all units can take on only activations between 0 and 1.
$g$ and $h$
are meant to specify the upper and lower thresholds that determine
how strongly units have to be excited or inhibited to contribute to
intra-sequence weight changes. A reasonable
choice for $g$ and $h$
is one where both
are strongly negative only if their argument is close to 0
and strongly positive only if their argument is close to 1.
Both $g$ and $h$
should return values close to 0 for arguments from
the largest part of the interval between 0 and 1. This implies
hardly any intra-sequence weight changes for connections
between units that have non-extreme activations during successive time steps.
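One concrete shape with these properties can be built from two shifted logistic sigmoids, as in the sketch below; the function name spotlight and the parameters lo, hi, and gain are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def spotlight(x, lo=0.1, hi=0.9, gain=20.0):
    """Illustrative candidate for g and h: strongly negative only for
    arguments close to 0, strongly positive only for arguments close
    to 1, and near 0 on most of the interval in between."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sigmoid(gain * (x - hi)) - sigmoid(gain * (lo - x))

# Non-extreme activations produce almost no intra-sequence weight change:
print(np.round(spotlight(np.array([0.0, 0.1, 0.5, 0.9, 1.0])), 3))
# -> approximately [-0.881 -0.5  0.    0.5   0.881]
```

With this shape, the outer product in equation (4) is essentially zero unless both the pre-synaptic and the post-synaptic unit sit near one of the extremes of their activation range.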
The overall effect is that only connections between units that are exceptionally active or exceptionally inactive during successive time steps can be significantly modified. Intra-sequence weight changes essentially occur only if the network `pays a lot of attention' to certain units by strongly exciting them or strongly inhibiting them. Weights to units that are not `illuminated by adaptive internal spotlights of attention' essentially remain invariant and participate only in `automatic processing' as opposed to `active intra-sequence learning'. The remainder of this paper derives an exact gradient-based algorithm designed to adjust the system (via inter-sequence weight changes) such that it creates appropriate intra-sequence weight changes at appropriate time steps.