Next: LATCH AND 2-SEQUENCE PROBLEMS Up: EVALUATING LONG-TERM DEPENDENCY BENCHMARK Previous: INTRODUCTION

RANDOM GUESSING (RG)

Given a particular network architecture, RG works as follows:

REPEAT randomly initialize the weights UNTIL the resulting net happens to classify all training sequences correctly. Then test on a separate test set.

For more sophisticated weight guessing biased towards nets with low Levin complexity, see Schmidhuber (1997).
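The REPEAT ... UNTIL loop of plain RG can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `random_guess` and `classifies_all` are hypothetical, weights are drawn uniformly (the text only says "randomly"), and the `max_trials` cap is added so the sketch always terminates.

```python
import random

def random_guess(n_weights, classifies_all, low=-100.0, high=100.0,
                 max_trials=1_000_000, rng=None):
    """Draw weight vectors until one classifies every training sequence.

    classifies_all: hypothetical callable taking a weight list and
    returning True if all training sequences are classified correctly.
    Returns (weights, number_of_trials), or (None, max_trials) on failure.
    """
    rng = rng or random.Random()
    for trial in range(1, max_trials + 1):
        # Assumption: independent uniform draws from [low, high].
        weights = [rng.uniform(low, high) for _ in range(n_weights)]
        if classifies_all(weights):
            return weights, trial
    return None, max_trials
```

The number of trials returned is the natural cost measure for RG, since each trial only requires forward passes through the training set.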

Architectures. We use two architectures A1 and A2 suitable for many widely used ``benchmark'' problems: A1 is a recurrent, fully connected net with 1 input, 1 output, and $n$ hidden units. Each hidden unit has a bias weight on a connection from a ``true'' unit with constant activation 1.0. A2 is like A1 with $n=10$, but less densely connected: each hidden unit sees the input unit, the output unit, and itself; the output unit sees all other units; all units are biased. We will indicate where we also use different architectures or experimental setups.
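To make the two connectivity patterns concrete, the following sketch builds a boolean mask of existing weights for A1 and A2. Several details are assumptions of the sketch rather than statements from the text: the unit ordering, reading "fully connected" as "every hidden and output unit receives input from the input unit, all hidden units, and the output unit (self-connections included)", and leaving the A1 output unit without a bias because only hidden-unit biases are mentioned for A1. The function names are hypothetical.

```python
def connectivity(arch, n=10):
    """Return mask[target][source], True where a weight exists.

    Assumed unit order: 0 = input, 1..n = hidden, n+1 = output,
    n+2 = constant "true" unit used for biases.
    """
    inp, out, true = 0, n + 1, n + 2
    hidden = range(1, n + 1)
    size = n + 3
    mask = [[False] * size for _ in range(size)]
    if arch == "A1":
        # Assumed reading of "fully connected": all non-input units
        # see the input, every hidden unit, and the output unit.
        for tgt in [*hidden, out]:
            for src in [inp, *hidden, out]:
                mask[tgt][src] = True
        for h in hidden:          # only hidden units are biased in A1
            mask[h][true] = True
    elif arch == "A2":
        for h in hidden:          # input, output, itself, and bias
            for src in (inp, out, h, true):
                mask[h][src] = True
        for src in [inp, *hidden, true]:  # output sees all other units
            mask[out][src] = True
    else:
        raise ValueError(f"unknown architecture: {arch}")
    return mask

def count_weights(mask):
    """Number of adjustable weights implied by the mask."""
    return sum(sum(row) for row in mask)
```

Under these assumptions, A2 has far fewer weights to guess than A1 with the same number of hidden units, which is the sense in which it is "less densely connected".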

Initialization. In all our experiments, we randomly initialize weights in [-100.0,100.0]. Activations of all units (except for the ``true'' unit) are set to 0 at the beginning of each sequence.

Training and testing. All the ``benchmark'' problems below require the net to classify two types of pattern sequences that are fed sequentially (pattern by pattern) into the net via its input units (in the standard way). To achieve a uniform setting, all sequence lengths are randomly chosen between 500 and 600 (most previously reported experiments used shorter training/test sequences). Training sets consist of 100 sequences, 50 from class 1 (target 0) and 50 from class 2 (target 1). Unless mentioned otherwise, binary inputs are -1.0 and 1.0. A sequence counts as correctly classified if the absolute error at sequence end is below 0.1. We stop the search once a random weight matrix correctly classifies all training sequences. Then we test on the separate test set (100 sequences). In this sense all our results measure generalization performance.
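The success criterion can be sketched as follows. The sketch resets all activations to 0 at the start of each sequence (the ``true'' unit stays at 1.0) and accepts a weight matrix only if the absolute error at sequence end is below 0.1 for every sequence. The logistic activation function, the synchronous update order, the unit ordering, and the names `sigmoid`, `run_sequence`, and `classifies_all` are assumptions made for illustration; this section does not specify them. Missing connections are represented by zero weights.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function (assumed activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def run_sequence(W, sequence, n):
    """Feed one sequence through the net and return the final output.

    W[target][source] is the weight matrix. Assumed unit order:
    0 = input, 1..n = hidden, n+1 = output, n+2 = constant "true" unit.
    """
    size = n + 3
    act = [0.0] * size      # activations reset at sequence start
    act[n + 2] = 1.0        # "true" unit has constant activation 1.0
    for x in sequence:
        act[0] = x
        new = act[:]
        # Assumed synchronous update: every unit reads the current input
        # and the previous activations of hidden and output units.
        for tgt in range(1, n + 2):
            net = sum(W[tgt][src] * act[src] for src in range(size))
            new[tgt] = sigmoid(net)
        act = new
    return act[n + 1]

def classifies_all(W, dataset, n, tol=0.1):
    """True if |output - target| < tol at sequence end for every pair."""
    return all(abs(run_sequence(W, seq, n) - target) < tol
               for seq, target in dataset)
```

Here `dataset` is a list of `(sequence, target)` pairs with targets 0 or 1; the same check applied to the separate test set gives the generalization result.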

Our main motivation for using the same architectures and experimental conditions for many different problems is to preempt the criticism of problem-specific fine-tuning. For instance, in our experiments we do not keep adapting the architecture until we find the best one.

Evaluation. All results in the remainder of this note are averages of 10 or more simulations. In all our simulations, RG correctly classifies at least 99% of all test set sequences; average absolute test errors are always below 0.02, and in most cases below 0.005.


Juergen Schmidhuber 2003-02-19