All our methods are instances of a strategy known as ``predictive coding'' or ``model-based coding''. We use a neural predictor network which is trained to approximate the conditional probability distribution of possible characters, given previous characters. 's outputs are fed into coding algorithms that generate short codes for characters with low information content (characters with high predicted probability) and long codes for characters conveying a lot of information (highly unpredictable characters).
Why not use a look-up table instead of a network? Because look-up tables tend to be extremely inefficient. A look-up table requires entries for all the conditional probabilities of possible characters, given previous characters. In addition, a special procedure is required for dealing with previously unseen combinations of input characters. In contrast, the size of a neural net typically grows in proportion to (assuming the number of hidden units grows in proportion to the number of inputs), and its inherent ``generalization capability'' is going to take care of previously unseen combinations of input characters (hopefully by coming up with good predicted probabilities).
We will make the distinction between on-line and off-line variants of our approach. With off-line methods, is trained on a separate set of training files. After training, the weights are frozen and copies of are installed at all machines functioning as message receivers or senders. From then on, is used to encode and decode unknown files without being changed any more. The weights become part of the code of the compression algorithm. The storage occupied by the network weights does not have to be taken into account to measure the performance on unknown files - just like the code for a conventional data compression algorithm does not have to be taken into account.
The on-line variants are based on the insight that even if the predictor learns during compression, the modified weights need not be sent from the sender to the receiver across the communication channel - as long as the predictor employed for decoding uses exactly the same initial conditions and learning algorithm as the predictor used for encoding (this observation goes back to Shannon). Since on-line methods can adapt to the statistical properties of specific files, they promise significantly better performance than off-line methods. But there is a price to pay: on-line methods tend to be computationally more expensive.
Section IV will show that even off-line methods can sometimes achieve excellent results. We will briefly come back to on-line methods in the final section of this paper.