Our results show that neural networks are promising tools for loss-free data compression. It was demonstrated that even off-line methods based on small time windows can lead to excellent compression ratios. We have hardly begun, however, to exhaust the potential of the basic approach.
A disadvantage of the off-line technique is that it is off-line: the predictor does not adapt to the specific text file it sees. Instead it relies on regularities extracted during the training phase, and on its ability to generalize. This tends to make it language-specific. English texts or C code should be compressed with a predictor different from the one used for German texts (unless one makes the effort to train the predictor on texts from many different sources, of course).
As mentioned in section II, this limitation is not essential. It is straightforward to construct on-line variants of all three methods described in the previous sections. With these on-line variants, the predictor continues to learn during compression. A typical on-line variant proceeds like this: both the sender and the receiver start with exactly the same initial predictor. Whenever the sender sees a new character, it encodes it using its current predictor. The code is sent to the receiver, who decodes it with its own identical copy of the predictor. Both the sender and the receiver use exactly the same learning protocol to modify their weights (for instance: after every 1000 processed characters, retrain the predictor on the last 10000 symbols). Because the receiver has already decoded every character the protocol trains on, both predictors remain identical at all times. The modified weights therefore need not be sent from the sender to the receiver and do not have to be taken into account when computing the average compression ratio. Especially with long unknown text files, the on-line variant should make a big difference. Initial experiments with on-line variants of METHODs 2 and 3 led to additional significant improvements of the compression ratio.
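The synchronization argument behind the on-line protocol can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from our experiments: a simple count-based order-1 predictor stands in for the neural predictor, each character is coded as its rank in the predictor's current ordering (instead of the coding schemes of METHODs 1 to 3), and all names (`CountPredictor`, `Endpoint`, `RETRAIN_EVERY`, `WINDOW`) are hypothetical. The only point is that sender and receiver, applying the same deterministic learning protocol to the same decoded history, never need to exchange weights.

```python
from collections import Counter, deque

# Assumed alphabet: printable ASCII plus newline (illustrative only).
ALPHABET = [chr(c) for c in range(32, 127)] + ["\n"]
RETRAIN_EVERY = 1000   # retrain after every 1000 processed characters
WINDOW = 10000         # retrain on the last 10000 symbols


class CountPredictor:
    """Stand-in for the neural predictor: order-1 character counts."""

    def __init__(self):
        self.counts = {}  # previous character -> Counter of next characters

    def ranking(self, context):
        """Return all symbols, most probable first, with deterministic ties."""
        c = self.counts.get(context, Counter())
        return sorted(ALPHABET, key=lambda s: (-c[s], s))

    def retrain(self, recent):
        """Rebuild the counts from scratch on the recent symbol window."""
        self.counts = {}
        for prev, nxt in zip(recent, recent[1:]):
            self.counts.setdefault(prev, Counter())[nxt] += 1


class Endpoint:
    """Sender and receiver are both instances of this class."""

    def __init__(self):
        self.predictor = CountPredictor()
        self.history = deque(maxlen=WINDOW)
        self.seen = 0
        self.context = None

    def _update(self, symbol):
        # Identical learning protocol on both sides.
        self.history.append(symbol)
        self.context = symbol
        self.seen += 1
        if self.seen % RETRAIN_EVERY == 0:
            self.predictor.retrain(list(self.history))

    def encode(self, symbol):
        # Code the symbol with the CURRENT predictor, then learn from it.
        rank = self.predictor.ranking(self.context).index(symbol)
        self._update(symbol)
        return rank

    def decode(self, rank):
        # Invert the code with the same predictor state, then learn.
        symbol = self.predictor.ranking(self.context)[rank]
        self._update(symbol)
        return symbol


text = "the cat sat on the mat. " * 200
sender, receiver = Endpoint(), Endpoint()
codes = [sender.encode(ch) for ch in text]
decoded = "".join(receiver.decode(r) for r in codes)
assert decoded == text  # lossless, although no weights were transmitted
```

In this sketch the ranks become small once the predictor has adapted to the text, so a subsequent entropy coder applied to the ranks would need fewer bits per character. A real on-line variant would replace the count table with the network and the rank code with the coding stage of the chosen method.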
The main disadvantage of both on-line and off-line variants, however, is their computational complexity. Our current off-line implementations are clearly slower than standard techniques, by a factor of about 1000 (but we did not attempt to optimize our systems with respect to speed). The complexity of an on-line method is typically even worse than that of the corresponding off-line method (the precise slow-down factor depends on the nature of the learning protocol, of course). For this reason, the promising on-line variants in particular can be recommended only if special neural net hardware is available. Note, however, that there are many commercial data compression applications which rely on specialized electronic chips.
There are a few obvious directions for future experimental research: (1) Use larger time windows or recurrent nets - they seem to be promising even for off-line methods (see the last paragraph of section IV). (2) Thoroughly test the potential of on-line methods. Both (1) and (2) should greatly benefit from fast hardware.
Finally we mention that there are additional interesting applications of neural predictors of conditional probabilities. See  for a method that uses a predictor of conditional probabilities to modulate the sequence processing strategy of a separate recurrent network . This can greatly improve 's ability to detect correlations between events separated by long time lags. See  for a method that uses predictors of conditional probabilities to develop factorial codes of environmental input patterns - codes with the property that the code components are statistically independent (see  and  for applications). This can be useful in conjunction with statistical classifiers that assume statistical independence of their input variables.