We applied our off-line methods to German newspaper articles (simulations were run and evaluated by the second author). We compared the results to those obtained with standard encoding techniques provided by the UNIX operating system, namely ``pack'', ``compress'', and ``gzip''. The corresponding decoding algorithms are ``unpack'', ``uncompress'', and ``gunzip'', respectively. ``pack'' is based on Huffman coding (e.g. [4]), while ``compress'' and ``gzip'' are based on the asymptotically ``optimal'' Lempel-Ziv technique [2]. It should be noted that ``pack'', ``compress'', and ``gzip'' ought to be classified as on-line methods: they adapt to the specific text file they see. In contrast, the competing ``neural'' methods ran off-line, due to time limitations. Our comparison was therefore unfair in the sense that it was biased against the ``neural'' methods. See section V, however, for on-line ``neural'' alternatives.
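For concreteness, the kind of baseline measurement involved can be sketched as follows. This Python fragment is only an illustration and not part of the original experimental setup; the file names and the definition of the compression ratio as original size divided by compressed size are our assumptions.
\begin{verbatim}
# Illustration only (not part of the original setup): measuring the
# compression ratio achieved by gzip on a text file, with the ratio
# taken as original size / compressed size (an assumption).
import gzip

def gzip_ratio(path):
    with open(path, "rb") as f:
        data = f.read()
    return len(data) / len(gzip.compress(data))

if __name__ == "__main__":
    for name in ["article01.txt", "article02.txt"]:  # hypothetical file names
        print(name, round(gzip_ratio(name), 2))
\end{verbatim}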
The training set for the predictor consisted of 40 articles from the newspaper Münchner Merkur, each containing between 10000 and 20000 characters. The alphabet included upper case and lower case letters, digits, punctuation marks, and special German letters such as ``ö'', ``ü'', and ``ä''.
The predictor network had 430 hidden units.
A ``true'' unit with constant activation
1.0 was connected to all hidden and output units.
The learning rate was 0.2.
The training phase consisted of 25 sweeps through the training set, which took 3 days of computation time on an HP 700 workstation.
Why just 25 sweeps?
On a separate test set, numbers of sweeps
between 20 and 40 were empirically found to lead to
acceptable performance. Note that a single sweep actually
provides many different training examples for the predictor.
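The following sketch illustrates a predictor of this general kind: a feedforward network with one hidden layer, a constant ``true'' unit with activation 1.0 feeding the hidden and output units, a learning rate of 0.2, and training by repeated sweeps over the text. It is only a rough, self-contained illustration; the training text, alphabet, window size, weight initialization, and the use of a softmax output trained by cross-entropy are our assumptions and do not reproduce the original implementation.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

text = "dies ist ein kleiner beispieltext " * 20   # placeholder training text
alphabet = sorted(set(text))
A = len(alphabet)
idx = {c: i for i, c in enumerate(alphabet)}

n = 5                  # time-window size (placeholder, not given here)
H = 430                # hidden units, as in the text
lr = 0.2               # learning rate, as in the text

# weight matrices; the extra column/row handles the constant 1.0 ("true") unit
W1 = rng.normal(0.0, 0.1, (H, n * A + 1))
W2 = rng.normal(0.0, 0.1, (A, H + 1))

def one_hot_window(chars):
    x = np.zeros(n * A + 1)
    for k, c in enumerate(chars):
        x[k * A + idx[c]] = 1.0
    x[-1] = 1.0                        # "true" unit with constant activation 1.0
    return x

def forward(x):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))    # sigmoid hidden layer
    h1 = np.append(h, 1.0)                 # "true" unit for the output layer
    z = W2 @ h1
    p = np.exp(z - z.max())
    return h1, p / p.sum()                 # distribution over the alphabet

for sweep in range(25):                    # 25 sweeps, as in the text
    for t in range(n, len(text)):
        x = one_hot_window(text[t - n:t])
        h1, p = forward(x)
        target = np.zeros(A)
        target[idx[text[t]]] = 1.0
        dz = p - target                    # softmax / cross-entropy gradient
        dW2 = np.outer(dz, h1)
        dh = (W2.T @ dz)[:-1] * h1[:-1] * (1.0 - h1[:-1])   # through the sigmoid
        dW1 = np.outer(dh, x)
        W2 -= lr * dW2
        W1 -= lr * dW1
\end{verbatim}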
The test set consisted of 20 newspaper articles (from the same newspaper), each containing between 10000 and 20000 characters. Of course, the test set did not overlap with the training set. Table 1 lists the average compression ratios and the corresponding variances. Our methods achieved ``excellent'' performance (according to Held's statement quoted in the introduction). Even METHOD 3 led to an ``excellent'' compression ratio, although it does not make use of all the information about the conditional probabilities. The best performance was obtained with METHOD 2, which outperformed the strongest conventional competitor, the UNIX ``gzip'' program based on the asymptotically optimal Lempel-Ziv algorithm. Note that the variance increases (but always remains within acceptable limits) as compression performance improves.
------------ INSERT TABLE 1 HERE ----------------
The hidden units were actually necessary to achieve good performance. A network without hidden units was not able to achieve average compression ratios exceeding 2.0. The precise number of hidden units did not appear to be critical, however: a network with 300 hidden units achieved performance similar to that of the network described above.
How does a neural net trained on articles from Münchner Merkur perform on articles from other sources? Without retraining the neural predictor, we applied all competing methods to 10 articles from another German newspaper (the Frankenpost). The results are given in table 2.
------------ INSERT TABLE 2 HERE ----------------
The Frankenpost articles were harder to compress for all algorithms, but the relative performance of the methods remained comparable.
Note that we used quite a small time window. In general, larger time windows make more information available to the predictor; this in turn improves prediction quality and increases the compression ratio. Therefore we expect to obtain even better results with larger time windows and with recurrent predictor networks (recurrent nets are less limited than the time-window approach: in principle they can emit predictions based on all previous characters).
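The following fragment sketches, again purely as an illustration, why a recurrent predictor is not restricted to a fixed window: its hidden state is updated after every character, so each prediction can in principle depend on the entire previous history. The architecture, sizes, and the omission of training are our simplifications; this is not the architecture used here.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A, H = 80, 64                      # alphabet and hidden size (placeholders)
Wxh = rng.normal(0.0, 0.1, (H, A))
Whh = rng.normal(0.0, 0.1, (H, H))
Why = rng.normal(0.0, 0.1, (A, H))

def predict_sequence(char_indices):
    """Return, for every position, a distribution over the next character."""
    h = np.zeros(H)
    distributions = []
    for i in char_indices:
        x = np.zeros(A); x[i] = 1.0
        h = np.tanh(Wxh @ x + Whh @ h)   # state summarizes all previous characters
        z = Why @ h
        p = np.exp(z - z.max())
        distributions.append(p / p.sum())
    return distributions
\end{verbatim}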
Another reason for optimism is a performance comparison with three human subjects who had to predict characters (randomly selected from the test files) from a given number of preceding characters. Given the same time-window size as our predictor, the humans were able to predict 52 percent of all characters correctly, while our predictor predicted 49 percent correctly (the character with the highest predicted probability was taken as the prediction).
With larger time windows, the humans were able to predict about 59 percent of all characters, and with still larger windows about 63 percent. We expect that the predictor's performance will remain close to human performance for larger time windows as well.
More training data, however, are required to avoid overfitting.
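For completeness, the prediction-accuracy measure used in the comparison above (the character with the highest predicted probability counts as the prediction) can be written down as follows; the interface of the predict function and the idx mapping are hypothetical.
\begin{verbatim}
import numpy as np

def top1_accuracy(text, n, predict, idx):
    """Fraction of positions whose most probable predicted successor is correct.

    predict(context) is assumed to return a probability distribution over the
    alphabet; idx maps characters to alphabet positions (hypothetical)."""
    hits = 0
    positions = range(n, len(text))
    for t in positions:
        p = predict(text[t - n:t])
        if int(np.argmax(p)) == idx[text[t]]:
            hits += 1
    return hits / len(positions)
\end{verbatim}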