Consider again figure 1. The visual scene consisted of a dark object on a white background, placed on a 512 x 512 pixel field. Instead of using hundreds of thousands of input units (as in a straightforward but inefficient static approach), only 40 input units were employed. These were located on the fovea (a two-dimensional artificial retina), which was controlled by the activations of four output units of the control network: there was one output unit for each of the directions `up', `down', `left', and `right'. At each time step the activation of each output unit was mapped (by a multiplication operation) to the interval between 0 pixels and 20 pixels. The result was interpreted as the length of a vector pointing in the corresponding direction. A move was computed by adding the four vectors. The fovea diameter was about equal to the object size. Figure 1 shows a typical visual scene and the receptive fields of the 40 input units. At a given time step the activation of an input unit was computed by simply averaging the values of the pixels (black = 1, white = 0) covered by its receptive field. All non-input units in the system employed the logistic activation function $1/(1+e^{-x})$. The model network had a layer of input units, a layer of 40 hidden units, and a layer of 40 output units. The control network had a layer of 20 hidden units. Both networks were fully forward-connected. Unlike with the more complicated situations described in , we did not allow internal feedback within either network. In the beginning all weights were randomly initialized between -0.1 and 0.1. Both learning rates were set equal to 0.1.
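For concreteness, the following Python sketch illustrates this sensing/acting interface. The receptive-field layout, the coordinate conventions, and the helper names are our own assumptions made for illustration; they are not the exact setup used in the experiment.

```python
import numpy as np

FIELD = 512          # side length of the pixel field
MAX_STEP = 20        # an output activation of 1.0 corresponds to a 20-pixel move

def input_activations(image, receptive_fields, fovea_pos):
    """Average the pixel values (black=1, white=0) covered by each receptive field.
    Pixels outside the 512 x 512 plane count as white (0)."""
    acts = np.zeros(len(receptive_fields))
    for i, field in enumerate(receptive_fields):
        vals = []
        for (dx, dy) in field:                      # offsets relative to the fovea center
            x, y = int(fovea_pos[0] + dx), int(fovea_pos[1] + dy)
            vals.append(image[y, x] if 0 <= x < FIELD and 0 <= y < FIELD else 0.0)
        acts[i] = np.mean(vals)
    return acts

def fovea_move(controller_outputs):
    """Map the four output activations (up, down, left, right), each in [0, 1],
    to vectors of length 0..20 pixels and add them to obtain the move."""
    up, down, left, right = MAX_STEP * np.asarray(controller_outputs)
    # x: right minus left, y: up minus down (the sign convention is arbitrary)
    return np.array([right - left, up - down])
```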
In this experiment there were two separate training phases, one for the model network and one for the control network. The model network was trained first: for 50000 training cycles the fovea was randomly placed in the environment of the object, and a move was generated according to a uniform distribution of possible controller outputs. As mentioned above, we deviated from `real' gradient descent by changing the model network's weights after each training cycle. After this training phase the model network's average error was about 10 percent. Then the model network's weights were fixed and the control network's training phase (involving 20000 `trials') began.
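A minimal sketch of this first training phase, reusing the helpers from the previous sketch, might look as follows. The toy scene, the assumption that the model network receives the current 40 retina activations together with the four move activations and predicts the next 40 retina activations, and the online squared-error update are illustrative assumptions, not details taken from the experiment.

```python
def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical; the paper's receptive-field layout differs):
image = np.zeros((FIELD, FIELD)); image[200:260, 240:300] = 1.0   # dark object on white background
object_center = np.array([270.0, 230.0])
receptive_fields = [[(dx + 5 * (k % 8) - 20, dy + 5 * (k // 8) - 12)
                     for dx in range(5) for dy in range(5)]
                    for k in range(40)]                            # 40 square patches

# Model network: 44 inputs (40 retina + 4 move units), 40 hidden units, 40 output units.
W1 = rng.uniform(-0.1, 0.1, (40, 44))
W2 = rng.uniform(-0.1, 0.1, (40, 40))
lr = 0.1

for cycle in range(50000):
    fovea = object_center + rng.uniform(-40.0, 40.0, 2)   # random placement near the object
    move = rng.uniform(0.0, 1.0, 4)                        # uniformly distributed controller outputs
    x = np.concatenate([input_activations(image, receptive_fields, fovea), move])
    next_fovea = fovea + fovea_move(move)
    target = input_activations(image, receptive_fields, next_fovea)  # next input = prediction target

    h = logistic(W1 @ x)
    pred = logistic(W2 @ h)
    # Weights are changed after every cycle (online update rather than `real' gradient descent).
    d_out = (pred - target) * pred * (1.0 - pred)
    d_hid = (W2.T @ d_out) * h * (1.0 - h)
    W2 -= lr * np.outer(d_out, h)
    W1 -= lr * np.outer(d_hid, x)
```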
At the beginning of each `trial' the object occupied a randomly chosen position in the pixel field. Again the fovea was randomly placed near the object such that the latter was partially overlapped by some of the receptive fields of the input units (figure 1). Then the control network generated a fovea trajectory. Whenever the fovea left the pixel plane, its receptive fields received `white' zero input. The final input error was determined as described above, and the `unfolding in time' algorithm was applied. During training the number of time steps per trajectory was set equal to 5 (this corresponds to the number of `layers' of the `unfolded' network). After training, 50 time steps per trajectory were allowed.
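The unfolding step can be sketched as follows, again building on the previous code. For illustration we assume that the model's prediction is fed back as the controller's next input during the unfolded pass, that only the controller's weights are adapted while the model stays frozen, and that `desired_final_input` is a hypothetical stand-in for the desired final input mentioned above; the experiment's exact error injection may differ.

```python
# Control network: 40 retina inputs -> 20 hidden units -> 4 move outputs, logistic throughout.
V1 = rng.uniform(-0.1, 0.1, (20, 40))
V2 = rng.uniform(-0.1, 0.1, (4, 20))
T = 5                                    # unfolded time steps during training

# Hypothetical target pattern: the input seen with the fovea centered on the object.
desired_final_input = input_activations(image, receptive_fields, object_center)

def controller_trial(start_pos):
    global V1, V2
    # ---- forward pass: unfold controller and frozen model for T steps ----
    i_seq = [input_activations(image, receptive_fields, start_pos)]
    g_seq, a_seq, h_seq = [], [], []
    for t in range(T):
        g = logistic(V1 @ i_seq[t]); a = logistic(V2 @ g)
        x = np.concatenate([i_seq[t], a])
        h = logistic(W1 @ x)
        i_next = logistic(W2 @ h)        # the model's prediction serves as the next input
        g_seq.append(g); a_seq.append(a); h_seq.append(h); i_seq.append(i_next)

    # ---- backward pass: propagate the final input error through all T copies ----
    gV1 = np.zeros_like(V1); gV2 = np.zeros_like(V2)
    d_i = i_seq[T] - desired_final_input              # dE/d(final input)
    for t in reversed(range(T)):
        d_out = d_i * i_seq[t + 1] * (1 - i_seq[t + 1])
        d_h = (W2.T @ d_out) * h_seq[t] * (1 - h_seq[t])
        d_x = W1.T @ d_h                               # gradient w.r.t. [retina input, move]
        d_a = d_x[40:] * a_seq[t] * (1 - a_seq[t])
        gV2 += np.outer(d_a, g_seq[t])
        d_g = (V2.T @ d_a) * g_seq[t] * (1 - g_seq[t])
        gV1 += np.outer(d_g, i_seq[t])
        d_i = d_x[:40] + V1.T @ d_g                    # error flowing into the input at time t
    # Only the controller's weights are changed; the frozen model merely routes the gradient.
    V1 -= lr * gV1
    V2 -= lr * gV2

for trial in range(20000):
    controller_trial(object_center + rng.uniform(-40.0, 40.0, 2))
```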
The system described above was able to learn (without a teacher) correct sequences of fovea movements, although the model network often made erroneous predictions. (The precondition for a successful trajectory was a partial overlap between the area covered by the object and the area covered by the `retina' at the beginning of the trajectory.) By the end of a successful trajectory the fovea had typically moved onto the target part of the object. (In figure 4 the fovea center at a given time step is indicated by the center of the corresponding arrow.) The accuracy was nearly perfect: in most cases the difference between the desired position and the actual position was no greater than one or two pixels.
Note that the fovea typically did not find the shortest path to the target. It could not, because it saw only a part of the scene and usually did not receive enough information to determine the direction to the target. Instead, it often developed a preference for edges. This is presumably due to the fact that with many of our training objects it is a good strategy to follow the outer boundary line until a new visual cue comes into sight.
Each of the 50-step trajectories depicted in the figures took about one second of real time on a SUN SPARCstation (including graphics output). A fully parallel approach to similar target detection problems (considering all pixels at a single time step) would require orders of magnitude more execution time, and probably many more training cycles to solve the problem of translation invariance; due to limited computer time, however, we were not able to test this experimentally.