The initial state vector $y^p(0)$ is the same for all sequences $p$. The input at time $t$ of sequence $p$ is the concatenation of the external input $x^p(t)$ and the last internal state $y^p(t-1)$. The output is the new internal state $y^p(t)$ itself.
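As a minimal sketch of this input construction (the dimensions, the shared zero initial state, and the names `step_input`, `s0` are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 4, 3

# the same initial state vector for all sequences
s0 = np.zeros(n_code)

def step_input(x_t, s_prev):
    """Network input at time t: the external input concatenated
    with the last internal state."""
    return np.concatenate([x_t, s_prev])

x_t = rng.normal(size=n_in)   # external input of one sequence at t = 1
u_1 = step_input(x_t, s0)     # the first step always starts from s0
```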

We minimize and maximize essentially the same
objective functions as described above. That is,
for the $i$-th module *which now needs
recurrent connections to itself and the other modules*,
there is again an adaptive predictor $P_i$ *which need not be recurrent*.
$P_i$'s input at time $t$ is the concatenation
of the outputs $y_k(t)$ of all units $k \neq i$.
$P_i$'s one-dimensional output $P_i(t)$
is trained to equal the expectation of the
output $y_i(t)$, given the outputs of the other units,

$$E\left( y_i(t) \mid \{ y_k(t),\ k \neq i \} \right),$$

by defining $P_i$'s error function as

$$E_{P_i} = \sum_p \sum_t \left( P_i(t) - y_i(t) \right)^2 .$$
In addition, all units are trained to take on values that
maximize the predictors' prediction errors,
where the units' objective function is defined analogously to the respective stationary cases.
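The two opposing objectives can be sketched as follows. The linear predictors, the learning rate, and the random stand-in for the code units' outputs are assumptions for illustration only; the full system would also propagate the (sign-flipped) error gradient back into the representational modules:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3                       # training patterns, code units
y = rng.normal(size=(n, k))         # stand-in for the code units' outputs

# one adaptive predictor per unit: a linear map from the other units
W = np.zeros((k, k - 1))
lr = 0.1

def predictor_error(i, y, W):
    """P_i's one-dimensional output and mean squared error for unit i."""
    others = np.delete(y, i, axis=1)   # outputs of all units except i
    p = others @ W[i]                  # P_i's output
    return p, np.mean((p - y[:, i]) ** 2)

# the predictors MINIMIZE their squared prediction errors ...
for i in range(k):
    others = np.delete(y, i, axis=1)
    p, _ = predictor_error(i, y, W)
    W[i] -= lr * 2 * others.T @ (p - y[:, i]) / n
# ... while the code units would be trained to MAXIMIZE the same
# errors (that half of the loop is omitted in this sketch).
```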

The only way a unit can protect itself from being predictable
from the other units
is to store properties of the input sequences that are independent
of aspects stored by the other units.
In other words, this method will tend to throw away redundant
temporal information, much as the systems in
(Schmidhuber, 1992a) and (Schmidhuber, 1992b).
For computing weight changes, each module
looks back only to the last time step. In the on-line case,
this implies an *entirely local* learning algorithm. Still,
even when there are long time lags, the algorithm theoretically
may learn unique representations of *extended* sequences,
as can be seen by induction over the length of the longest training
sequence:

*1. The network can learn
unique representations of the beginnings of all sequences.*

*2. Suppose all sequences and sub-sequences with length $n$ are
uniquely represented in the internal state. Then, by looking back only
one time step at a time, the network can learn unique representations of all
sub-sequences with length $n+1$.*

The argument neglects all on-line effects and possible cross-talk.

On-line variants of the system described above were implemented by Daniel Prelinger. Preliminary experiments were conducted with the resulting recurrent systems. These experiments demonstrated that there are entirely local sequence learning methods that allow for learning unique representations of all subsequences of non-trivial sequences (like a sequence consisting of 8 consecutive presentations of the same input pattern represented by the activation of a single input unit). Best results were obtained by introducing additional modifications (like other error functions than mean squared error for the representational modules). A future paper will elaborate on sequence learning by predictability minimization.
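A toy illustration of why feedback of the last internal state makes this possible even for the 8-identical-patterns example: with fixed (here random, untrained) weights, the internal state already differs from step to step, so each prefix length obtains its own representation that learning can then shape:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_code = 1, 4
Wx = rng.normal(size=(n_code, n_in))    # input weights (untrained)
Ws = rng.normal(size=(n_code, n_code))  # recurrent weights (untrained)

s = np.zeros(n_code)        # same initial state for every sequence
x = np.ones(n_in)           # a single input unit, active at every step
states = []
for t in range(8):          # 8 consecutive presentations of the same pattern
    s = np.tanh(Wx @ x + Ws @ s)   # new state sees input AND last state
    states.append(s.copy())
states = np.array(states)
```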
