With the conventional algorithm, with various learning rates, and with more than 1,000,000 training sequences, it was not possible to obtain a significant performance improvement on the target unit. Even a similar task involving time lags of only 5 steps required many hundreds of thousands of training sequences.
In contrast, the chunking system solved the 20-step task rather quickly, using an efficient approximation of the BPTT method in which error was propagated at most 3 steps into the past (although there was a 20-step time lag!). No unique representations of time steps were necessary for this task. 13 out of 17 test runs required fewer than 5,000 training sequences; the remaining runs required fewer than 35,000 training sequences.
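For readers unfamiliar with truncated BPTT, the following minimal sketch (Python/NumPy; the network size, learning rate, and squared-error loss are illustrative assumptions, not the setup used in these experiments) shows what "propagating error at most 3 steps into the past" means for a plain recurrent net:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 8, 1
W_in  = rng.normal(0, 0.1, (n_hid, n_in))
W_rec = rng.normal(0, 0.1, (n_hid, n_hid))
W_out = rng.normal(0, 0.1, (n_out, n_hid))

def train_on_sequence(xs, ys, lr=0.1, trunc=3):
    """Online training on one sequence; gradients flow back at most `trunc` steps."""
    global W_in, W_rec, W_out
    h = np.zeros(n_hid)
    cache = []                                      # (input, previous state, new state) of recent steps
    for x, y in zip(xs, ys):
        h_prev = h
        h = np.tanh(W_in @ x + W_rec @ h_prev)
        cache = (cache + [(x, h_prev, h)])[-trunc:]  # keep only the last `trunc` steps
        d_out = (W_out @ h) - y                     # squared-error gradient at the output
        d_h = W_out.T @ d_out                       # error reaching the current hidden state
        W_out -= lr * np.outer(d_out, h)
        for xc, hp, hc in reversed(cache):          # truncated backward pass
            d_pre = d_h * (1.0 - hc ** 2)           # through the tanh nonlinearity
            d_h = W_rec.T @ d_pre                   # error for the previous hidden state
            W_in  -= lr * np.outer(d_pre, xc)
            W_rec -= lr * np.outer(d_pre, hp)

With trunc=3 the per-step cost of credit assignment stays small; a 20-step dependency can nevertheless be learned here because the relevant symbols reach the chunker in a much shorter sequence, as described below.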
Typically, A quickly learned to predict the `easy' symbols. This led to a greatly reduced input sequence for C, which then had little difficulty learning to predict the target values at the end of the sequences. After a while, A was able to mimic C's internal representations, which in turn allowed it to learn correct target predictions by itself.
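The division of labour described above can be sketched roughly as follows (Python/NumPy; the class names, the prediction threshold, and the simplistic one-step learning rules are assumptions made only to show the data flow, not the architecture actually used):

import numpy as np

rng = np.random.default_rng(1)

class SimpleRNN:
    """Tiny recurrent predictor; trained here with crude one-step updates (illustrative)."""
    def __init__(self, n_in, n_hid, n_out):
        self.W_in  = rng.normal(0, 0.1, (n_hid, n_in))
        self.W_rec = rng.normal(0, 0.1, (n_hid, n_hid))
        self.W_out = rng.normal(0, 0.1, (n_out, n_hid))
        self.h = np.zeros(n_hid)
        self.h_prev, self.x_prev = np.zeros(n_hid), np.zeros(n_in)

    def reset(self):
        self.h = np.zeros(len(self.h))

    def step(self, x):
        self.h_prev, self.x_prev = self.h, x
        self.h = np.tanh(self.W_in @ x + self.W_rec @ self.h_prev)
        return self.W_out @ self.h                  # prediction of the next input / target

    def learn(self, target, lr=0.05):
        err = (self.W_out @ self.h) - target        # adjust output weights only (placeholder for BPTT)
        self.W_out -= lr * np.outer(err, self.h)

    def mimic(self, target_h, lr=0.05):
        # one gradient step pulling the current hidden state toward a target state
        d_pre = (self.h - target_h) * (1.0 - self.h ** 2)
        self.W_in  -= lr * np.outer(d_pre, self.x_prev)
        self.W_rec -= lr * np.outer(d_pre, self.h_prev)

def run_sequence(A, C, symbols, target, threshold=0.3):
    """A sees every symbol; only symbols A fails to predict are passed up to C."""
    A.reset(); C.reset()
    prev = np.zeros_like(symbols[0])
    for x in symbols:
        pred = A.step(prev)                         # A predicts the next symbol
        if np.max(np.abs(pred - x)) > threshold:    # unpredicted symbol -> forward to the chunker
            C.step(x)
        A.learn(x)
        prev = x
    C.learn(target)                                 # C learns the target from its short sequence
    A.mimic(C.h)                                    # A is also trained to mimic C's internal state

The point of the sketch (which assumes equal hidden sizes for A and C) is only the division of labour: once A predicts the `easy' symbols, C's input sequence shrinks, and the mimicry step is what eventually lets A make correct target predictions on its own.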
A's final weight matrix often looked like the kind of matrix one would hope to obtain from the conventional algorithm: there were hidden units that learned to bridge the 20-step time lags by means of strong self-connections.
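To see why strong self-connections help (a simplified, linear argument; the actual units are nonlinear), consider a unit with self-connection weight $w$ and activation $h_t = w\,h_{t-1} + net_t$. An input contribution from 20 steps earlier is scaled by $w^{20}$: for $w = 0.5$ this is below $10^{-6}$, whereas for $w$ close to $1$ the information is essentially preserved across the whole lag.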
The chunking system needed less computation per time step than the conventional method, yet it required far fewer training sequences.