Daniel Prelinger and Jeff Rink implemented on-line and off-line systems based on section 6 (for details see Schmidhuber, 1991, and Prelinger, 1992). The purpose of this section, however, is not to compare the on-line and off-line versions but to demonstrate that both can lead to satisfactory results.
With the off-line version, the sufficient number of consecutive epochs in PASS 1 was taken to be 5.
With the on-line system, at any given time, the same single input pattern was used in both PASS 1 and PASS 2. The learning rates of all predictors were 10 times higher than the learning rates of the representational modules. An additional modification for escaping certain cases of local minima was introduced (see Schmidhuber, 1991, and Prelinger, 1992).
The significance of non-linearities. In many experiments, the inclusion of hidden units led to better performance. Assume that $\dim(y) = 3$ and that there is an XOR-like relationship between the activations of the first two representational units and the third one. A linear predictor could not possibly detect this relationship. The representational modules would therefore receive no pressure to remove the redundancy.
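The XOR argument can be checked numerically. The following is a minimal sketch, not the experimental setup of the paper: it assumes three binary code units with four equiprobable patterns, fits the best linear predictor (with bias) of the third unit by least squares, and compares it with a hand-picked hidden layer of two threshold units. All variable names are illustrative.

```python
import numpy as np

# Four equiprobable patterns (an assumption for illustration):
# y1, y2 are the first two code units, y3 = XOR(y1, y2) is redundant.
y12 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y3 = np.logical_xor(y12[:, 0], y12[:, 1]).astype(float)

# Best linear predictor (with bias) of y3 from (y1, y2), by least squares.
A = np.hstack([y12, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y3, rcond=None)
linear_pred = A @ coef
linear_mse = float(np.mean((linear_pred - y3) ** 2))

# A hand-picked hidden layer of two threshold units is enough:
# h_or = OR(y1, y2), h_and = AND(y1, y2), and y3 = h_or - h_and.
h_or = (y12.sum(axis=1) >= 1).astype(float)
h_and = (y12.sum(axis=1) >= 2).astype(float)
hidden_pred = h_or - h_and
hidden_mse = float(np.mean((hidden_pred - y3) ** 2))

print("linear predictions:", linear_pred)
print("linear MSE:", linear_mse, "hidden-unit MSE:", hidden_mse)
```

The linear fit predicts 0.5 for every pattern, which is exactly what a predictor would output if the third unit were independent of the first two, so it supplies no signal about the redundancy. The predictor with hidden units reproduces the third unit without error.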
The next subsections list some selected experiments with both the on-line and the off-line method. In what follows, the term `local input representation' means that there are $\dim(x)$ different binary inputs, each with only one non-zero bit. The term `distributed input representation' means that there are $2^{\dim(x)}$ different binary inputs. In all experiments, a representational unit was considered to be binary if the absolute difference between each of its possible activations and either the maximal or the minimal activation permitted by its activation function never exceeded 0.05.
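The two input representations and the binarity criterion can be sketched as follows. This is an illustration under stated assumptions: `distributed' is read as the set of all binary vectors of the given length, the activation function is taken to have range [0, 1] (as for a logistic unit), and the function names `local_inputs`, `distributed_inputs`, and `is_binary` are hypothetical.

```python
import itertools
import numpy as np

def local_inputs(dim_x):
    """One-hot patterns: dim_x inputs, each with a single non-zero bit."""
    return np.eye(dim_x)

def distributed_inputs(dim_x):
    """All binary vectors of length dim_x (assumed reading of 'distributed')."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=dim_x)))

def is_binary(activations, lo=0.0, hi=1.0, tol=0.05):
    """True if every observed activation lies within tol of lo or of hi.

    lo and hi are the extreme activations permitted by the unit's
    activation function (assumed here to be 0 and 1).
    """
    a = np.asarray(activations, dtype=float)
    dist = np.minimum(np.abs(a - lo), np.abs(a - hi))
    return bool(np.all(dist <= tol))

print(local_inputs(4).shape)         # (4, 4)
print(distributed_inputs(3).shape)   # (8, 3)
print(is_binary([0.01, 0.97, 0.04])) # True
print(is_binary([0.02, 0.5]))        # False
```

The check is applied per unit to the activations that unit takes across all input patterns, so a single intermediate value such as 0.5 is enough to reject the unit as non-binary.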
Local maxima. In some of the experiments, multiples of 10,000 training epochs were employed. In many cases, however, the representational units settled into a stable code long before the training phase was over (even if the code corresponded to a sub-optimal solution). The repetitive method based on varying weight initializations (section 6) sometimes allowed shorter overall learning times (using epoch counts per trial on the order of a few thousand). A high number of repetitions increases the probability that a factorial code is found. Again it should be emphasized, however, that learning speed and methods for dealing with local maxima are not the main objective of this paper.
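The effect of repetitions can be made concrete with a small calculation. The sketch below assumes that each repetition with a fresh weight initialization succeeds independently with the same probability; this independence assumption and the example success rate of 0.2 are illustrative and are not taken from the experiments. The function name is hypothetical.

```python
def prob_at_least_one_success(p_single, n_repetitions):
    """Probability that at least one of n independent trials succeeds.

    p_single: assumed per-trial probability of finding a factorial code.
    n_repetitions: number of trials with fresh weight initializations.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_repetitions < 0:
        raise ValueError("n_repetitions must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_repetitions

# Hypothetical per-trial success rate of 0.2.
for n in (1, 5, 10, 20):
    print(n, round(prob_at_least_one_success(0.2, n), 3))
```

Under these assumptions the probability of finding a factorial code grows monotonically with the number of repetitions, which is why several short trials can be preferable to a single long one that has settled into a sub-optimal code.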