The experiment will show that LSTM can solve non-trivial, complex long time lag problems involving distributed, high-precision, continuous-valued representations.
Task.
Each element of each input sequence is a pair of components. The first component is a real value randomly chosen from the interval $[-1,1]$; the second is either 1.0, 0.0, or -1.0, and is used as a marker: at the end of each sequence, the task is to output the sum of the first components of those pairs that are marked by second components equal to 1.0. The value $T$ is used to determine the average sequence length, which is a randomly chosen integer between $T$ and $T + T/10$.
Within a given sequence, exactly two pairs are marked, as follows: we first randomly select and mark one of the first ten pairs (whose first component is called $X_1$). Then we randomly select and mark one of the first $T/2 - 1$ still unmarked pairs (whose first component is called $X_2$). The second components of all remaining pairs are zero, except for those of the first and final pair, which are -1 ($X_1$ is set to zero in the rare case where the first pair of the sequence gets marked).
An error signal is generated only at the sequence end: the target is $0.5 + (X_1 + X_2)/4.0$ (the sum $X_1 + X_2$ scaled to the interval $[0,1]$). A sequence is processed correctly if the absolute error at the sequence end is below 0.04.
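To make the data generation concrete, the following Python sketch builds one such training sequence under the description above. The function name, the use of NumPy, and the handling of the rare first-pair case and of "still unmarked" pairs are illustrative assumptions, not the authors' code.
\begin{verbatim}
import numpy as np

def generate_adding_sequence(T, rng=None):
    """Build one input sequence and target for the adding task (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    # Sequence length: a random integer between T and T + T/10.
    length = int(rng.integers(T, T + T // 10 + 1))

    # First components: real values drawn uniformly from [-1, 1].
    values = rng.uniform(-1.0, 1.0, size=length)
    # Second components (markers): zero by default, -1 for first and final pair.
    markers = np.zeros(length)
    markers[0] = markers[-1] = -1.0

    # Mark one of the first ten pairs; its first component is X1.
    i = int(rng.integers(0, 10))
    markers[i] = 1.0
    if i == 0:                       # rare case: the first pair got marked
        values[0] = 0.0              # assumed reading: X1 is set to zero
    x1 = values[i]

    # Mark one of the first T/2 - 1 still unmarked pairs; its component is X2.
    candidates = [j for j in range(T // 2 - 1) if markers[j] == 0.0]
    j = int(rng.choice(candidates))
    markers[j] = 1.0
    x2 = values[j]

    inputs = np.stack([values, markers], axis=1)   # shape (length, 2)
    target = 0.5 + (x1 + x2) / 4.0                 # sum scaled to [0, 1]
    return inputs, target
\end{verbatim}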
Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 memory cell blocks of size 2 (a cell block size of 1 also works well). The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (the hidden layer is fully connected).
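A minimal single-step sketch of this topology is given below, assuming the original LSTM cell dynamics (an additive constant-error-carousel state update with input and output gates, no forget gates), tanh squashing, and a sigmoid output unit. The variable names, activation choices, and seed are illustrative assumptions, not the authors' implementation.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_blocks, block_size = 2, 2, 2     # 2 inputs, 2 memory cell blocks of size 2
n_cells = n_blocks * block_size          # 4 memory cells
n_hidden = n_cells + 2 * n_blocks        # cell + input-gate + output-gate outputs

rng = np.random.default_rng(0)
fan_in = n_in + n_hidden + 1             # inputs + recurrent hidden units + bias
W_cell = rng.uniform(-0.1, 0.1, (n_cells, fan_in))
W_ig   = rng.uniform(-0.1, 0.1, (n_blocks, fan_in))   # one input gate per block
W_og   = rng.uniform(-0.1, 0.1, (n_blocks, fan_in))   # one output gate per block
W_out  = rng.uniform(-0.1, 0.1, n_cells + 1)          # output unit sees cells only

def step(x, h_prev, s_prev):
    z = np.concatenate([x, h_prev, [1.0]])            # input, recurrent, bias
    ig = sigmoid(W_ig @ z)                            # input gate activations
    og = sigmoid(W_og @ z)                            # output gate activations
    g = np.tanh(W_cell @ z)                           # squashed cell input
    s = s_prev + np.repeat(ig, block_size) * g        # additive CEC state update
    y_c = np.repeat(og, block_size) * np.tanh(s)      # gated cell outputs
    h = np.concatenate([y_c, ig, og])                 # fully connected hidden layer
    y = sigmoid(W_out @ np.append(y_c, 1.0))          # network output in (0, 1)
    return y, h, s
\end{verbatim}
Processing a sequence simply carries the pair (h, s) forward from step to step, starting from zero vectors.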
State drift versus initial bias.
Note that the task requires storing the precise values of real numbers for long durations -- the system must learn to protect memory cell contents against even minor ``internal state drifts''. Our simple but highly effective way of solving drift problems at the beginning of learning is to initially bias the input gates towards zero.
There is no need to fine-tune the initial bias: with logistic sigmoid activation functions, the precise initial bias hardly matters, because vastly different initial bias values produce almost the same near-zero activations. In fact, the system itself learns to generate the most appropriate input gate bias. To study the significance of the drift problem, we make the task more difficult by biasing all non-input units, thus artificially inducing internal state drifts.
Weights (including bias weights) are randomly initialized in the range $[-0.1, 0.1]$. The first (second) input gate bias is initialized with $-3.0$ ($-6.0$) (recall that the precise initialization values hardly matter, as confirmed by additional experiments).
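The claim that the precise bias value hardly matters follows directly from the logistic function; the short check below (an illustration, not taken from the paper) shows that widely different negative biases all yield near-zero initial input-gate activations.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Initial input-gate activation at the start of training (zero net input
# apart from the bias) for several candidate bias values.
for bias in (-1.0, -3.0, -6.0, -10.0):
    print(f"bias {bias:6.1f} -> initial gate activation {sigmoid(bias):.5f}")
# -3.0 gives about 0.047 and -6.0 about 0.0025: either way the gate starts
# almost closed, shielding the memory cells from early state drift.
\end{verbatim}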
Training/Testing. The learning rate is 0.5. Training examples are generated on-line. Training is stopped once the average training error is below 0.01 and the 2000 most recent sequences have been processed correctly (see the definition above).
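A small sketch of this stopping rule follows; the window-based bookkeeping and the class name are assumptions (the paper does not specify how the running average is computed), while the thresholds and window size are those stated above.
\begin{verbatim}
from collections import deque

class StoppingRule:
    """Stop training once the recent average error is below 0.01 and the
    2000 most recent sequences were processed correctly (abs. error < 0.04)."""

    def __init__(self, window=2000, avg_threshold=0.01, correct_threshold=0.04):
        self.errors = deque(maxlen=window)
        self.avg_threshold = avg_threshold
        self.correct_threshold = correct_threshold

    def update(self, abs_error):
        """Record one sequence-end absolute error; return True when training may stop."""
        self.errors.append(abs_error)
        if len(self.errors) < self.errors.maxlen:
            return False
        all_correct = all(e < self.correct_threshold for e in self.errors)
        avg_ok = sum(self.errors) / len(self.errors) < self.avg_threshold
        return all_correct and avg_ok
\end{verbatim}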
Results.
With a test set consisting of 2560 randomly chosen sequences,
the average test set error was always below 0.01, and
there were never more than 3 incorrectly processed sequences.
The following results are means of 10 trials:
For $T = 100$ ($T = 500$, $T = 1000$), training was stopped after 74,000 (209,000; 853,000) training sequences, and then only 1 (0, 1) of the test sequences was not processed correctly. For $T = 1000$, the number of required training examples varied between 370,000 and 2,020,000, exceeding 700,000 in only 3 cases.
The experiment demonstrates, even for very long time lags, that: (1) LSTM is able to work well with distributed representations; (2) LSTM is able to perform calculations involving high-precision, continuous values. Such tasks are impossible to solve within reasonable time by other algorithms: the main problem of gradient-based approaches (including TDNN and pseudo-Newton methods) is their inability to deal with very long minimal time lags (vanishing gradient). A main problem of ``global'' and ``discrete'' approaches (RS, Bengio and Frasconi's EM approach, discrete error propagation) is their inability to deal with high-precision, continuous values.