We need two conventional recurrent networks: the automatizer A and the chunker C. At each time step, A receives the current external input. A's error function is threefold: one term forces it to emit certain desired target outputs at certain times. If there is a target, it becomes part of the next input. The second term forces A, at every time step, to predict its own next non-target input. The third (crucial) term will be explained below.
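The three error terms can be sketched as a single squared-error sum. This is a minimal illustration, not the paper's notation: the function name, the mask encoding "targets at certain times", and the vector arguments are all assumptions.

```python
import numpy as np

def automatizer_loss(pred_target, target, target_mask,
                     pred_next_input, next_input,
                     pred_chunker_state, chunker_state):
    """Illustrative threefold error of the automatizer A:
    1. desired target outputs at certain times (selected by target_mask),
    2. prediction of A's own next non-target input,
    3. reproduction of the chunker C's internal state (explained below).
    """
    e1 = np.sum(target_mask * (pred_target - target) ** 2)
    e2 = np.sum((pred_next_input - next_input) ** 2)
    e3 = np.sum((pred_chunker_state - chunker_state) ** 2)
    return e1 + e2 + e3
```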
If and only if A makes an error with respect to the first or second term of its error function, the unpredicted input (including a potentially available teaching vector), together with a unique representation of the current time step, becomes the new input to C. Before this new input can be processed, C (whose last input may have occurred many time steps earlier) is trained to predict it from its current internal state and its last input (employing a conventional recurrent net algorithm). After this, C performs an activation update, which contributes to a higher-level internal representation of the input history. Note that, in accordance with the principle of history compression, C is fed an unambiguous reduced description of the input history: the information deducible from A's predictions can be considered redundant. (The beginning of an episode is usually not predictable, so it, too, has to be fed to the chunking level.)
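The event-driven feed to C can be sketched as follows. All names are assumptions for illustration; `a_predicts` stands in for A's first two error terms, and the unique time representation is modeled as a one-hot vector.

```python
import numpy as np

def one_hot_time(t, max_steps):
    """Unique (one-hot) representation of the current time step."""
    v = np.zeros(max_steps)
    v[t] = 1.0
    return v

def run_episode(inputs, a_predicts, max_steps):
    """Return the reduced description of the episode: only those inputs
    that A failed to predict, each concatenated with a unique time code,
    are passed on to the chunker C."""
    chunker_stream = []
    for t, x in enumerate(inputs):
        if not a_predicts(t, x):  # A erred on its first or second term
            chunker_stream.append(
                np.concatenate([x, one_hot_time(t, max_steps)]))
    return chunker_stream
```

As A improves, `a_predicts` succeeds more often and the stream handed to C shrinks, which is exactly the compression effect described above.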
Since C's `credit assignment paths' will often be short compared to those of A, C will often be able to develop useful internal representations of previous unexpected input events. Owing to the third term of its error function, A is forced to reproduce these internal representations by predicting C's state. Therefore A will be able to create useful internal representations by itself at an early stage of processing a given sequence; it will often receive meaningful error signals long before errors of the first or second kind occur. These internal representations, in turn, must carry the discriminating information that enables A to improve its low-level predictions. Consequently the chunker will receive fewer and fewer inputs, since more and more inputs become predictable by the automatizer. This is the collapsing operation. Ideally, the chunker will become obsolete after some time.
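The third-term training signal can be illustrated with a toy delta-rule step that pushes a linear readout of A's hidden state toward C's current state. This is purely a sketch under assumed names; the paper's actual training uses a full recurrent net algorithm, not this simplified linear readout.

```python
import numpy as np

def imitate_chunker_state(W, a_hidden, c_state, lr=0.1):
    """One delta-rule update making A's readout W @ a_hidden approximate
    the chunker's internal state c_state. Returns the updated weights and
    the squared imitation error (the third error term)."""
    pred = W @ a_hidden
    err = pred - c_state
    W_new = W - lr * np.outer(err, a_hidden)
    return W_new, float(np.sum(err ** 2))
```

Repeated updates of this kind drive the imitation error down, so A gradually absorbs the higher-level representation that C built from the unexpected events.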
It must be emphasized that, unlike the incremental creation of a multi-level predictor hierarchy described in Section 3, the 2-net on-line version comes with no formal proof that it is free of instabilities. For instance, one can imagine situations where A unlearns previously learned predictions because of the third term of its error function. Relative weighting of the different terms in A's error function represents an ad hoc remedy for this potential problem. In the experiments (see below), relative weighting was not necessary.
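The ad hoc remedy amounts to a weighted sum of the three error terms. The weights below are illustrative assumptions, not values from the paper; down-weighting the third (imitation) term is one way to keep it from overwriting previously learned predictions.

```python
def weighted_error(e_target, e_pred, e_imitate, lam=(1.0, 1.0, 0.1)):
    """Hypothetical relative weighting of A's three error terms; a small
    lam[2] reduces the risk that the imitation term causes A to unlearn
    its target and prediction behavior."""
    return lam[0] * e_target + lam[1] * e_pred + lam[2] * e_imitate
```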