We need two conventional
recurrent networks:
The *automatizer* A and the *chunker* C.
At each time step A receives the current external input.
A's error function is threefold: One term forces
it to emit certain
desired target outputs at certain times.
If there is a target, then it
becomes part of the next input. The second term forces A
at every time step
to predict its own next non-target input.
The third (crucial) term will be explained
below.
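
A minimal PyTorch-style sketch may make the three-term error function concrete. The class name, the use of a plain RNN cell, the mean-squared error, and the layer sizes are illustrative assumptions, not the original formulation; the third head anticipates the term explained below (predicting C's state).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Automatizer(nn.Module):
    """Illustrative automatizer A: one recurrent cell with three output heads,
    one per term of its error function."""
    def __init__(self, input_size, hidden_size, target_size, chunker_state_size):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.target_head = nn.Linear(hidden_size, target_size)          # term 1: desired target outputs
        self.next_input_head = nn.Linear(hidden_size, input_size)       # term 2: next non-target input
        self.chunker_head = nn.Linear(hidden_size, chunker_state_size)  # term 3: predict C's state (see below)

    def forward(self, x, h):
        h = self.cell(x, h)
        return h, self.target_head(h), self.next_input_head(h), self.chunker_head(h)

def automatizer_error(target_pred, target, next_input_pred, next_input,
                      chunker_state_pred, chunker_state):
    """Sum of the three terms; the first term applies only at steps with a target."""
    e1 = F.mse_loss(target_pred, target) if target is not None else 0.0
    e2 = F.mse_loss(next_input_pred, next_input)
    e3 = F.mse_loss(chunker_state_pred, chunker_state)
    return e1 + e2 + e3
```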

If and only if A makes an error concerning the first or
second term of its error function,
the unpredicted
input (including a potentially available teaching vector)
*along with a unique representation of the current time step* will become
the new input to C.
Before this new input can be processed,
C (whose last input may have
occurred many time steps earlier)
is trained to predict this higher-level
input from its current internal state and its last input
(employing a conventional recurrent net algorithm).
After this, C performs an activation update that
contributes to a higher-level internal
representation of the input history.
Note that,
according to the principle of history compression,
C is fed with an *unambiguous reduced description of the input history*.
The information deducible by means of A's predictions
can be considered *redundant*.
(The beginning of an episode
usually is not predictable; therefore
it has to be fed to the chunking level, too.)
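
The event-driven update of C might look roughly as follows. This is only a sketch: the class name, the concatenated one-hot style time code, the mean-squared error, and the single gradient step stand in for whatever conventional recurrent net algorithm is actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Chunker(nn.Module):
    """Illustrative chunker C: predicts its own next (higher-level) input."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.next_input_head = nn.Linear(hidden_size, input_size)

    def forward(self, x, h):
        h = self.cell(x, h)
        return h, self.next_input_head(h)

def chunker_step(C, optimizer, c_hidden, c_last_input, unpredicted_input, time_code):
    """Called only when A mispredicts. `unpredicted_input` already includes any
    available teaching vector; `time_code` uniquely represents the current step."""
    new_input = torch.cat([unpredicted_input, time_code], dim=-1)

    # 1. Train C to predict this new higher-level input from its current
    #    internal state and its last input (one gradient step shown).
    _, predicted = C(c_last_input, c_hidden.detach())
    loss = F.mse_loss(predicted, new_input)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 2. Only then does C perform its activation update, extending its
    #    reduced description of the input history.
    c_hidden, _ = C(new_input, c_hidden.detach())
    return c_hidden, new_input
```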

Since C's "credit assignment paths" will often
be short compared to those of A,
C will often be able to develop useful internal
representations of previous unexpected
input events.
Due to the final term of its error function,
A will be forced to reproduce these internal
representations, *by predicting C's state*.
Therefore A will be able to create useful internal
representations by itself in an *early* stage of
processing a given sequence; it will often receive meaningful
error signals long before errors of the first or
second kind occur.
These internal representations in turn must carry the
discriminating information that enables
A to improve
its low-level predictions.
Therefore the chunker will receive fewer and fewer
inputs, since more and more inputs become
predictable by the automatizer. This is the *collapsing operation*.
Ideally, the chunker will become obsolete after some time.
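
Putting the pieces together, one on-line step of the two-net system might look as follows. This builds on the Automatizer and Chunker sketches above; the error tolerance, the per-step optimizers, the ever-present target vector, and the truncated gradient handling are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def chunking_system_step(A, A_opt, C, C_opt, a_hidden, c_hidden, c_last_input,
                         x, next_x, target, time_code, tolerance=0.1):
    """One on-line step; returns updated states and whether C was invoked."""
    a_hidden, target_pred, next_input_pred, chunker_state_pred = A(x, a_hidden)

    # Errors of the first or second kind trigger the chunker.
    err1 = F.mse_loss(target_pred, target).item() > tolerance
    err2 = F.mse_loss(next_input_pred, next_x).item() > tolerance
    if err1 or err2:
        # The unpredicted input (with the teaching vector) plus the unique
        # time code becomes C's new input.
        payload = torch.cat([next_x, target], dim=-1)
        c_hidden, c_last_input = chunker_step(C, C_opt, c_hidden, c_last_input,
                                              payload, time_code)

    # A's full error: terms 1 and 2 plus the third term, which trains A to
    # predict C's state. As A absorbs C's representations, err1/err2 fire less
    # often and C is invoked less and less -- the collapsing operation.
    loss_A = (F.mse_loss(target_pred, target)
              + F.mse_loss(next_input_pred, next_x)
              + F.mse_loss(chunker_state_pred, c_hidden.detach()))
    A_opt.zero_grad()
    loss_A.backward()
    A_opt.step()

    return a_hidden.detach(), c_hidden, c_last_input, (err1 or err2)
```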

It must be emphasized that,
unlike with the incremental
creation of a multi-level
predictor hierarchy described in section 3, there is no formal proof that
the 2-net *on-line* version
is free of instabilities. For instance, one can imagine
situations where A unlearns previously learned predictions
because of the third term of its error function. Relative weighting
of the different terms in A's error function
represents an ad hoc remedy for this potential problem. In the
experiments (see below), relative weighting was not necessary.
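
Such a weighted error function would be a small change to the sketch above; the weights shown here are hypothetical knobs, and with all of them at 1.0 the expression reduces to the plain unweighted sum.

```python
def weighted_automatizer_error(e_target, e_next_input, e_chunker_state,
                               w1=1.0, w2=1.0, w3=1.0):
    # Ad hoc relative weighting of A's three error terms; with all weights at
    # 1.0 this is the unweighted sum used when no reweighting is needed.
    return w1 * e_target + w2 * e_next_input + w3 * e_chunker_state
```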
