We start with 100 training sequences for a recurrent predictor network with 5 input units, 5 output units, 1 bias unit, and a learning rate of 0.5. The predictor sees the original sequences, one symbol at a time. At every time step, it tries to predict the next element (the symbol whose input representation is closest to the predictor's real-valued output vector is taken as the deterministic prediction).
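The nearest-representation decision above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the one-hot encodings and the example output vector are assumptions chosen only to show the mechanism:

```python
# Hypothetical one-hot input representations for a 5-symbol alphabet
# (assumption: the actual encodings are not specified in the text).
REPRESENTATIONS = {
    "a": [1, 0, 0, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [0, 0, 1, 0, 0],
    "d": [0, 0, 0, 1, 0],
    "e": [0, 0, 0, 0, 1],
}

def deterministic_prediction(output):
    """Return the symbol whose input representation is closest
    (squared Euclidean distance) to the predictor's real-valued output."""
    def dist(rep):
        return sum((r - o) ** 2 for r, o in zip(rep, output))
    return min(REPRESENTATIONS, key=lambda s: dist(REPRESENTATIONS[s]))

print(deterministic_prediction([0.1, 0.8, 0.2, 0.0, 0.1]))  # prints "b"
```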
Then we ``freeze'' the predictor (never change its weights again). Now training sequences for a separate recurrent classifier with 6 input units, 5 hidden units, 1 bias unit, and learning rate 0.5 are ``filtered through'' the predictor: the classifier is sequentially fed the more compact sequence descriptions, which consist of the unexpected inputs only. Each such input is concatenated with a representation of the time that has passed since the last unexpected input (the exponentially decaying activation of a special input unit is reset to 1 whenever an unexpected input occurs). The compact sequence conveys all the information of the original sequence, and the classifier learns the correct classifications within fewer than 10000 training sequences.
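The filtering step described above can be sketched as follows. This is a hedged illustration under stated assumptions: the decay rate and the toy predictor interface (`predict` maps the symbol history to a predicted next symbol) are not given in the text and are chosen here only to make the mechanism concrete:

```python
DECAY = 0.5  # assumed decay rate of the time-representing unit

def compress(sequence, predict):
    """Filter a symbol sequence through a frozen predictor.

    `predict(history)` returns the predicted next symbol. Only unexpected
    symbols are kept, each paired with the activation of a special unit
    that decays exponentially and is reset to 1 at every unexpected input,
    so the classifier still sees how much time passed between events.
    """
    compressed = []
    tag = 0.0      # activation of the special time-representing input unit
    history = []
    for symbol in sequence:
        expected = predict(history)
        tag *= DECAY               # the activation decays at every time step
        if symbol != expected:
            compressed.append((symbol, tag))  # unexpected input + elapsed-time tag
            tag = 1.0              # reset whenever an unexpected input occurs
        history.append(symbol)
    return compressed

# Toy frozen predictor: always predicts a repetition of the last symbol.
def repeat_last(history):
    return history[-1] if history else None

print(compress("aaabba", repeat_last))
# [('a', 0.0), ('b', 0.125), ('a', 0.25)]
```

Only the three unexpected symbols survive, and the second element of each pair encodes (via the decayed activation) how long ago the previous unexpected input arrived.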
Recall that conventional recurrent networks failed to learn the task within training sequences. In this special case, the speed-up factor obtained by adaptive redundancy reduction is at least .
More details and extensions of the principle above can be found in , , , and especially in . The next section describes a somewhat different kind of neural predictor for compressing natural text (as opposed to artificial symbol strings).