We start by training a recurrent predictor network with 5 input units, 5 output units, 1 bias unit, and a learning rate of 0.5 on 100 training sequences. The predictor sees the original sequences, one symbol at a time. At every time step, it tries to predict the next element (the symbol whose input representation is closest to the predictor's real-valued output vector is taken as the deterministic prediction).
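
As an illustration only (not the exact implementation used in the experiments), the following Python sketch shows how such a deterministic prediction can be obtained, assuming one-hot input representations for a hypothetical 5-symbol alphabet:

\begin{verbatim}
# Sketch: map the predictor's real-valued output vector to the symbol whose
# input representation is nearest (assumption: one-hot codes, Euclidean distance).
import numpy as np

SYMBOL_CODES = np.eye(5)   # assumed one-hot codes for a 5-symbol alphabet

def deterministic_prediction(output_vector):
    """Return the index of the symbol whose code is closest to the output."""
    distances = np.linalg.norm(SYMBOL_CODES - output_vector, axis=1)
    return int(np.argmin(distances))

# Example: a noisy output vector is mapped to the closest symbol code.
print(deterministic_prediction(np.array([0.1, 0.8, 0.2, 0.05, 0.1])))  # -> 1
\end{verbatim}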

Then we ``freeze'' the predictor (never change its weights again).
Now the training sequences for a separate recurrent classifier with
6 input units, 5 hidden units, 1 bias unit, and learning rate 0.5
are ``filtered through'' the predictor:
the classifier is sequentially fed the more compact sequence
descriptions
consisting of the unexpected inputs only. Each input is concatenated
with a representation
of the time that went by since the last unexpected input
(the exponentially decaying activation
of a special input unit is reset to 1 whenever there is
an unexpected input).
The compact sequence conveys all the information contained
in the original sequence, and the classifier learns
the correct classifications in fewer than 10000 training sequences.
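
The following Python sketch illustrates one way this filtering step could be realized. The callable frozen_predictor, the one-hot symbol codes, and the decay rate 0.5 of the time-tag unit are assumptions made for the sake of the example, not details reported above:

\begin{verbatim}
# Sketch: keep only the unexpected inputs, each tagged with the activation of a
# special, exponentially decaying input unit that is reset to 1 at every surprise.
import numpy as np

ALPHABET = 5
SYMBOL_CODES = np.eye(ALPHABET)          # assumed one-hot input representations

def compress_sequence(symbols, frozen_predictor, decay=0.5):
    """Return (symbol code + time tag) vectors for the unexpected inputs only.

    frozen_predictor(sym) is assumed to consume one symbol, update its own
    recurrent state, and return the predicted next symbol; decay is hypothetical.
    """
    compressed = []
    time_unit = 1.0                      # special time-tag input unit
    expected = None                      # no prediction before the first symbol
    for sym in symbols:
        if sym != expected:              # unexpected input: pass it on
            tagged = np.concatenate([SYMBOL_CODES[sym], [time_unit]])  # 5 + 1 inputs
            compressed.append(tagged)
            time_unit = 1.0              # reset the decaying unit
        else:
            time_unit *= decay           # decay while inputs remain predictable
        expected = frozen_predictor(sym) # frozen weights: prediction only, no learning
    return compressed

# Toy stand-in for the frozen predictor: always predicts a repetition of the
# current symbol (purely hypothetical; the real predictor is a trained recurrent net).
print(len(compress_sequence([0, 0, 0, 3, 3, 3, 1], lambda s: s)))  # -> 3 unexpected inputs
\end{verbatim}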

Recall that conventional recurrent networks failed to learn the task within training sequences. In this special case, the speed-up factor obtained by adaptive redundancy reduction is at least .

More details and extensions of the principle above can be found in [21], [23], [29], and especially in [24]. The next section describes a somewhat different kind of neural predictor for compressing natural text (as opposed to artificial symbol strings).