The task is to observe and classify input sequences from two classes; there is only one input unit or input line. See, e.g., Bengio et al. (1994); Bengio and Frasconi (1994); Lin et al. (1995). The latch problem was designed to show how gradient descent fails. We tested two variants, both with 3 free parameters.
Latch variant I. The recurrent net itself has a single free parameter: the recurrent weight of a single tanh unit. The remaining 2 free parameters are the initial input values, one for each sequence class. The output targets at sequence end are +0.8 and -0.8. Gaussian noise with mean zero and variance 0.2 is added to each sequence element except the first. Hence a large positive recurrent weight is necessary for long-term storage of the single bit of information identifying the class, which is determined by the initial input; the initial inputs' absolute values must be large enough to latch the recurrent unit.
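For concreteness, the following Python sketch implements variant I under assumptions the text leaves open (sequence length T = 100, zero initial activation, zero underlying signal after the first element, and the input entering the unit's net input directly); it is a minimal illustration, not the original experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = np.sqrt(0.2)            # Gaussian noise with variance 0.2

def make_sequence(first_element, T=100):
    """Noise-free, class-identifying first element, then noisy elements."""
    x = rng.normal(0.0, NOISE_STD, size=T)   # assumed zero signal + noise
    x[0] = first_element                     # noise is added to all but the first
    return x

def final_activation(w_rec, x):
    """Single tanh unit; its only weight is the recurrent one."""
    h = 0.0                                  # assumed zero initial activation
    for x_t in x:
        h = np.tanh(w_rec * h + x_t)         # input enters the net input directly
    return h                                 # compared against the targets +0.8 / -0.8

# e.g. w_rec = 5.0 with initial inputs +3.0 / -3.0 latches the unit reliably
```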
Results. RG solves the task within only 6 trials on average (mean of 40 simulations). This is better than the 1600 trials reported in Bengio et al. (1994) with several methods. RG's success is due to the few parameters and the fact that in search space it almost suffices to get the parameter signs right (there are only 8 sign combinations).
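The sign argument can be checked directly. The sketch below, reusing make_sequence and final_activation from above, fixes large magnitudes (the values 10.0 and 3.0 are arbitrary assumptions) and enumerates all 8 sign combinations to see which ones latch both classes correctly.

```python
import itertools

# Only the sign pattern of the 3 free parameters varies; magnitudes are fixed.
for s_w, s_a, s_b in itertools.product((+1, -1), repeat=3):
    w_rec = 10.0 * s_w                       # recurrent weight
    x_a, x_b = 3.0 * s_a, 3.0 * s_b          # initial inputs for class 1 / class 2
    y_a = final_activation(w_rec, make_sequence(x_a))
    y_b = final_activation(w_rec, make_sequence(x_b))
    solved = y_a > 0.5 and y_b < -0.5        # loose form of the +0.8 / -0.8 targets
    print(f"signs ({s_w:+d}, {s_a:+d}, {s_b:+d}) -> solved: {solved}")
```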
Latch variant II. Sequences from class 1 start with 1.0; others start with -1.0. The targets at sequence end are 1.0 and -1.0. The recurrent net has a single unit with tanh activation function and 3 incoming connections: one from itself, one from the input, and one bias connection (the inputs are not free parameters). Gaussian noise is added as in variant I.
Results. RG solves variant II within 22 trials on average. This is better than the 1600 trials reported by Bengio et al. (1994) with several methods on the simpler variant I.
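A hedged sketch of the RG procedure itself on variant II: guess all 3 weights at random, test, repeat. The sampling interval [-100, 100], the test-set size, and the sign-based acceptance criterion are assumptions; the text does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sequence(first, T=100):
    x = rng.normal(0.0, np.sqrt(0.2), size=T)   # noise, as in variant I
    x[0] = first                                # 1.0 for class 1, -1.0 for class 2
    return x

def net_output(w_rec, w_in, b, x):
    """Single tanh unit with recurrent, input, and bias connections."""
    h = 0.0
    for x_t in x:
        h = np.tanh(w_rec * h + w_in * x_t + b)
    return h

def solves(params, n_test=20):
    """Accept a guess if the output sign matches the target on all test sequences."""
    return all(np.sign(net_output(*params, make_sequence(first))) == target
               for target, first in ((+1.0, 1.0), (-1.0, -1.0))
               for _ in range(n_test))

trials = 0
while True:
    trials += 1
    params = rng.uniform(-100.0, 100.0, size=3)   # guess all 3 weights (assumed range)
    if solves(params):
        break
print(f"solved after {trials} random guesses")
```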
2-sequence problem. Only the first N real-valued sequence elements convey relevant information about the class (the inputs are not free parameters). Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2. Again we will consider two cases: N = 1, and N = 3 (used in Bengio et al. 1994).
In the first case (N = 1) we set the first sequence element to 1.0 for class 1, and to -1.0 for class 2. The output neuron has a sigmoid activation function; the target at sequence end is 1.0 for class 1 and 0.0 for class 2. In case N = 3, two random 3-vectors are chosen in advance -- they represent the initial subsequences determining the sequence class. N = 3 was chosen because the probability of obtaining random initial patterns with significant differences increases with N.
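A sketch of the 2-sequence generator under stated assumptions (sequence length T = 100, and the domain [-1, 1] for the two class-defining 3-vectors, neither of which the text specifies):

```python
import numpy as np

rng = np.random.default_rng(0)

# The two class-defining initial 3-vectors, chosen once in advance.
# (Drawing them uniformly from [-1, 1]^3 is an assumption.)
PROTOTYPES = {1: rng.uniform(-1.0, 1.0, size=3),
              2: rng.uniform(-1.0, 1.0, size=3)}

def make_2seq(cls, N=3, T=100):
    """2-sequence problem input; the sequence length T is assumed."""
    x = rng.normal(0.0, np.sqrt(0.2), size=T)   # positions t > N: pure noise
    if N == 1:
        x[0] = 1.0 if cls == 1 else -1.0        # case N = 1
    else:
        x[:N] = PROTOTYPES[cls]                 # case N = 3
    return x

# Targets at sequence end (sigmoid output): 1.0 for class 1, 0.0 for class 2.
```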
Previous Results. The best method among the six tested by Bengio et al. (1994) solved the 2-sequence problem (N = 3) after 6400 sequence presentations, with a final classification error rate of 0.06. In more recent work, Bengio and Frasconi (1994) were able to improve these results: an EM approach was reported to solve the problem within 2900 trials.
Results with RG. RG with architecture A2 and N = 1 solves the problem within 718 trials on average. RG with A1 requires 1247 trials, and reaches zero classification error on all trials. With N = 3, however, RG requires around 68000 trials on average to find a solution (mean of 40 simulations).
If we ignore the fact that RG trials require much less computation time than EM trials, then EM seems faster.
RG's difficulties stem from the continuous nature of the search space.
With N = 3, two tasks must be learned simultaneously: recognizing a multi-dimensional pattern (class 1 or 2), and latching it (storing one bit for the long term). The pattern recognition task is harder for RG.
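As an illustration of how the classification error rates above could be measured, here is a hedged sketch reusing make_2seq from the generator above; the single recurrent tanh unit feeding a sigmoid output is a stand-in for architectures A1/A2, whose details are not given in this excerpt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(w_rec, w_in, b, w_out, x):
    """Assumed stand-in net: one recurrent tanh unit -> sigmoid output unit."""
    h = 0.0
    for x_t in x:
        h = np.tanh(w_rec * h + w_in * x_t + b)
    return 1 if sigmoid(w_out * h) > 0.5 else 2   # threshold the output at 0.5

def classification_error(params, n=100):
    """Fraction of misclassified test sequences for one guessed weight vector."""
    wrong = sum(classify(*params, make_2seq(c)) != c
                for c in (1, 2) for _ in range(n // 2))
    return wrong / n
```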