In this subsection, LSTM solves another task that cannot be solved at all by any other recurrent net learning algorithm we are aware of.
Task 2a: two relevant, widely separated symbols.
The goal is to classify sequences. Elements are represented locally (binary input vectors with only one non-zero bit). The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a, b, c, d\}$ except for two elements at positions $t_1$ and $t_2$ that are either $X$ or $Y$. The sequence length is randomly chosen between 100 and 110, $t_1$ is randomly chosen between 10 and 20, and $t_2$ is randomly chosen between 50 and 60. There are 4 sequence classes $Q, R, S, U$ (locally represented targets) which depend on the temporal order of $X$ and $Y$. The rules are: $X, X \rightarrow Q$; $X, Y \rightarrow R$; $Y, X \rightarrow S$; $Y, Y \rightarrow U$.
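For concreteness, the following sketch (not part of the original description) generates Task 2a training pairs under the assumptions above; the one-hot layout and helper names such as \texttt{generate\_task2a} are our own choices.
\begin{verbatim}
import random
import numpy as np

# Alphabet: distractors a-d, start symbol E, trigger symbol B,
# relevant symbols X, Y -- one input unit per symbol (8 in total).
SYMBOLS = ['a', 'b', 'c', 'd', 'E', 'B', 'X', 'Y']
SYM_IDX = {s: i for i, s in enumerate(SYMBOLS)}
# Temporal order of the two relevant symbols -> class index (Q, R, S, U).
CLASSES_2A = {('X', 'X'): 0, ('X', 'Y'): 1, ('Y', 'X'): 2, ('Y', 'Y'): 3}

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def generate_task2a():
    """Return (list of one-hot input vectors, one-hot target over 4 classes)."""
    length = random.randint(100, 110)     # sequence length
    t1 = random.randint(10, 20)           # first relevant position
    t2 = random.randint(50, 60)           # second relevant position
    seq = [random.choice('abcd') for _ in range(length)]
    seq[0] = 'E'                          # start symbol
    seq[-1] = 'B'                         # trigger symbol at the end
    relevant = (random.choice('XY'), random.choice('XY'))
    seq[t1], seq[t2] = relevant
    inputs = [one_hot(SYM_IDX[s], len(SYMBOLS)) for s in seq]
    target = one_hot(CLASSES_2A[relevant], 4)   # class depends only on temporal order
    return inputs, target
\end{verbatim}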
Task 2b: three relevant, widely separated symbols.
Again, the goal is to classify sequences. Elements are represented locally. The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a, b, c, d\}$ except for three elements at positions $t_1$, $t_2$, and $t_3$ that are either $X$ or $Y$. The sequence length is randomly chosen between 100 and 110, $t_1$ is randomly chosen between 10 and 20, $t_2$ is randomly chosen between 33 and 43, and $t_3$ is randomly chosen between 66 and 76. There are 8 (locally represented) sequence classes $Q, R, S, U, V, A, B, C$ which depend on the temporal order of the $X$s and $Y$s. The rules are: $X, X, X \rightarrow Q$; $X, X, Y \rightarrow R$; $X, Y, X \rightarrow S$; $X, Y, Y \rightarrow U$; $Y, X, X \rightarrow V$; $Y, X, Y \rightarrow A$; $Y, Y, X \rightarrow B$; $Y, Y, Y \rightarrow C$.
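Task 2b differs only in the third relevant position and the eight-way class map; a minimal extension of the sketch above (again our own illustration, reusing its helpers) is:
\begin{verbatim}
from itertools import product

# All orderings of X/Y at positions t1, t2, t3 -> class indices 0..7
# (corresponding to Q, R, S, U, V, A, B, C in the text).
CLASSES_2B = {triple: i for i, triple in enumerate(product('XY', repeat=3))}

def generate_task2b():
    """Like generate_task2a, but with three relevant positions and 8 classes."""
    length = random.randint(100, 110)
    t1, t2, t3 = random.randint(10, 20), random.randint(33, 43), random.randint(66, 76)
    seq = [random.choice('abcd') for _ in range(length)]
    seq[0], seq[-1] = 'E', 'B'
    relevant = tuple(random.choice('XY') for _ in range(3))
    for pos, sym in zip((t1, t2, t3), relevant):
        seq[pos] = sym
    inputs = [one_hot(SYM_IDX[s], len(SYMBOLS)) for s in seq]
    target = one_hot(CLASSES_2B[relevant], 8)
    return inputs, target
\end{verbatim}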
With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.
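In code, this correctness criterion could be checked with a small helper like the following (an assumption of ours, reused in the training sketch below):
\begin{verbatim}
def classified_correctly(output, target, threshold=0.3):
    """Correct iff every output unit's final absolute error is below the threshold."""
    return all(abs(o - t) < threshold for o, t in zip(output, target))
\end{verbatim}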
Architecture. We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2 for task 2a (2b), and 4 (8) output units for task 2a (2b). Again, all non-input units are biased, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells, and gate units (i.e., the hidden layer is fully connected; less connectivity works as well).
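A rough modern analogue of this architecture can be sketched with a standard LSTM layer; note that \texttt{torch.nn.LSTM} is not the original memory-cell-block formulation (it adds forget gates and wires the hidden layer differently), so the sketch below is only an approximation, with layer sizes chosen for task 2a.
\begin{verbatim}
import torch
import torch.nn as nn

class TemporalOrderNet(nn.Module):
    """8 input units, a small recurrent hidden layer (2 blocks of size 2 = 4 cells
    for task 2a), and 4 output units; the output layer reads the hidden state only."""
    def __init__(self, n_inputs=8, n_hidden=4, n_classes=4):
        super().__init__()
        self.rnn = nn.LSTM(n_inputs, n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, 8)
        _, (h_last, _) = self.rnn(x)       # only the state after the trigger symbol is read out
        return torch.sigmoid(self.readout(h_last[-1]))
\end{verbatim}
For task 2b one would use 3 blocks of size 2 (\texttt{n\_hidden=6}) and 8 output units.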
Training / Testing.
The learning rate is 0.5 (0.1) for experiment 2a (2b).
Training examples are generated on-line.
Training is stopped once the average training error is below 0.1 and the 2000 most recent sequences have been classified correctly.
Weights are initialized in $[-0.1, 0.1]$. The first (second) input gate bias is initialized with $-2.0$ ($-4.0$) (again, the precise initialization values do not matter much).
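The online training procedure and stopping rule can be made concrete with a sketch like the following; it is our own illustration, assumes the generator, network, and correctness check sketched above, and uses a simple exponential running average of the error where the original leaves the averaging unspecified.
\begin{verbatim}
from collections import deque

def train_online(net, generate, learning_rate=0.5, window=2000):
    """Generate examples online; stop when the average training error is below 0.1
    and all of the last `window` sequences were classified correctly."""
    opt = torch.optim.SGD(net.parameters(), lr=learning_rate)
    recent = deque(maxlen=window)          # correctness of the most recent sequences
    avg_error, n_seen = 1.0, 0
    while not (avg_error < 0.1 and len(recent) == window and all(recent)):
        inputs, target = generate()
        x = torch.tensor(np.stack(inputs), dtype=torch.float32).unsqueeze(0)  # (1, T, 8)
        y = torch.tensor(target, dtype=torch.float32).unsqueeze(0)
        out = net(x)
        loss = ((out - y) ** 2).mean()     # error signal only at the end of the sequence
        opt.zero_grad()
        loss.backward()
        opt.step()
        recent.append(classified_correctly(out.squeeze(0).tolist(), target))
        avg_error = 0.999 * avg_error + 0.001 * loss.item()  # running average (assumption)
        n_seen += 1
    return n_seen
\end{verbatim}
For task 2a this would be invoked, e.g., as \texttt{train\_online(TemporalOrderNet(), generate\_task2a, learning\_rate=0.5)}.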
Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than 3 incorrectly classified sequences. The following results are means of 20 trials: For task 2a (2b), training was stopped (see stopping criterion in previous paragraph) after on average 31,390 (571,100) training sequences, and then only 1 (2) of the 2560 test sequences were not classified correctly (see definition above). Obviously, LSTM is able to extract information conveyed by the temporal order of widely separated inputs.
Conclusion. For non-trivial tasks (where RS is infeasible), we recommend LSTM.