Daniel Prelinger and Jeff Rink implemented on-line and off-line systems based on section 6 (see Schmidhuber, 1991, and Prelinger, 1992, for details). The purpose of this section, however, is not to compare on-line and off-line versions but to demonstrate that both can lead to satisfactory results.
With the off-line version, the sufficient number of consecutive epochs in PASS 1 was taken to be 5.
With the on-line system, at any given time, the same single input pattern was used in both PASS 1 and PASS 2. The learning rates of all predictors were 10 times higher than the learning rates of the representational modules. An additional modification for escaping certain cases of local minima was introduced (see Schmidhuber (1991) and Prelinger (1992)).
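To make the interplay of the two passes concrete, the following is a minimal numpy sketch, not the implementation of Prelinger and Rink: the purely linear predictors, the network sizes, the number of training sweeps, and the particular learning rates (with the predictors' rate 10 times the representational rate, as stated above) are illustrative assumptions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_code = 4, 3          # illustrative sizes
X = np.eye(n_in)             # one local input pattern per row

W = rng.normal(0.0, 0.5, (n_code, n_in))        # representational weights
V = rng.normal(0.0, 0.1, (n_code, n_code - 1))  # one linear predictor per code unit

eta_pred, eta_rep = 0.5, 0.05   # predictor rate 10 times the representational rate

def others(y, i):               # activations of all code units except unit i
    return np.delete(y, i)

for sweep in range(2000):
    for x in X:
        y = sigmoid(W @ x)
        # PASS 1: train each predictor to predict code unit i from the
        # remaining code units (5 consecutive steps, mimicking the
        # off-line version's 5 epochs).
        for _ in range(5):
            for i in range(n_code):
                p_i = V[i] @ others(y, i)
                V[i] -= eta_pred * (p_i - y[i]) * others(y, i)
        # PASS 2: the representational module performs gradient ascent
        # on the very error the predictors just tried to minimize.
        p = np.array([V[i] @ others(y, i) for i in range(n_code)])
        e = p - y
        g = -2.0 * e                            # direct effect of y_k on the error
        for i in range(n_code):                 # indirect effect through p_i
            g += 2.0 * e[i] * np.insert(V[i], i, 0.0)
        W += eta_rep * (g * y * (1.0 - y))[:, None] * x[None, :]
\end{verbatim}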
The significance of non-linearities.
In many experiments it turned out that the inclusion of hidden units led to better performance. Assume that there are three representational units and that there is an XOR-like relationship between the activations of the first two representational units and the third one. A linear predictor could not possibly detect this relationship; therefore the representational modules could not be encouraged to remove the redundancy.
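The claim about linear predictors can be verified directly. In the following sketch (an illustration with hand-picked threshold units rather than trained ones), the best linear predictor of the third unit from the first two is forced to output 0.5 for every pattern, while two hidden units suffice to represent the XOR relationship exactly.

\begin{verbatim}
import numpy as np

# Activations of the first two representational units, and the
# XOR-valued third unit they are supposed to predict.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0], dtype=float)

# Best linear predictor (with bias) in the least-squares sense.
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, t, rcond=None)
print(A @ w)   # [0.5 0.5 0.5 0.5] -- no better than a constant guess

# Two hidden threshold units represent XOR exactly.
h1 = (X.sum(axis=1) > 0.5).astype(float)   # fires for 01, 10, 11
h2 = (X.sum(axis=1) > 1.5).astype(float)   # fires only for 11
print(h1 - h2)                             # [0. 1. 1. 0.] -- exact XOR
\end{verbatim}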
The next subsections list some selected experiments with both the on-line
and the off-line method.
In what follows, the term `local input representation' means that there are $n$ different binary inputs over $n$ input units, each with exactly one non-zero bit. The term `distributed input representation' means that there are $2^n$ different binary inputs, corresponding to all binary patterns over the $n$ input units.
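For concreteness, the two pattern sets can be generated as follows ($n = 3$ is an arbitrary illustrative choice):

\begin{verbatim}
import itertools
import numpy as np

n = 3  # number of input units (illustrative)

# Local representation: n patterns, each with exactly one non-zero bit.
local = np.eye(n)

# Distributed representation: all 2**n binary patterns over the n units.
distributed = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

print(local.shape)        # (3, 3)
print(distributed.shape)  # (8, 3)
\end{verbatim}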
With all experiments, a representational unit was considered binary if the absolute difference between each of its possible activations and either the maximal or the minimal activation permitted by its activation function never exceeded 0.05.
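This binarity criterion amounts to a simple test. The following sketch assumes a logistic activation function, whose minimal and maximal permitted activations are 0 and 1.

\begin{verbatim}
import numpy as np

def is_binary(activations, lo=0.0, hi=1.0, tol=0.05):
    # A unit counts as binary if every activation it ever takes on lies
    # within tol of either the minimal (lo) or the maximal (hi)
    # activation permitted by its activation function.
    a = np.asarray(activations)
    return bool(np.all((np.abs(a - lo) <= tol) | (np.abs(a - hi) <= tol)))

print(is_binary([0.01, 0.97, 0.99]))  # True
print(is_binary([0.01, 0.60, 0.99]))  # False
\end{verbatim}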
Local maxima.
With some of the experiments,
multiples of 10,000 training epochs were employed.
In many cases, however, the representational units
settled into a stable code long before the training phase was over
(even if the code
corresponded to a sub-optimal solution).
The repetitive method based on varying weight initializations (section 6) sometimes allowed shorter overall learning times (with epoch numbers of the order of a few thousand per repetition). A high number of repetitions increases the probability that a factorial code is found.
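A sketch of such a restart scheme is given below. The selection rule, keeping the run whose final code leaves the predictors with the largest error, is an assumption made here for illustration; `train_once' stands for one complete run of the two-pass procedure.

\begin{verbatim}
import numpy as np

def train_with_restarts(train_once, n_restarts=20, epochs=3000, seed0=0):
    # Hypothetical wrapper: run the training procedure from several
    # random weight initializations and keep the least predictable code.
    # train_once(rng, epochs) is assumed to return (code, score), where
    # score is the predictors' final total error on the learned code.
    best_score, best_code = -np.inf, None
    for r in range(n_restarts):
        rng = np.random.default_rng(seed0 + r)   # fresh initialization
        code, score = train_once(rng, epochs)
        if score > best_score:
            best_score, best_code = score, code
    return best_code
\end{verbatim}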
Again it should be emphasized, however, that
learning speed and methods for dealing with local maxima
are not the main objective
of this paper.