Consider again figure 1.
A visual scene consisted of a dark object on a white background,
placed on a 512 x 512 pixel field.
Instead of using hundreds of thousands of input units
(as a straightforward but inefficient static approach would require),
only 40 input units were employed.
These sat on the fovea (a two-dimensional
artificial retina), which was controlled by
the activations of four
output units of the control network: one output unit for
each of the directions `up', `down', `left', and `right'.
At each time step the activation of each output node was mapped
(by a multiplication operation) to the interval between 0 pixels
and 20 pixels.
The result was interpreted as the length of a vector pointing
in the corresponding direction.
A move was computed by adding
the four vectors.
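The move computation described above can be sketched as follows; this is a minimal sketch, with the direction ordering and function names being assumptions not fixed by the text:

```python
import numpy as np

# Unit vectors for the four controller output directions
# (screen coordinates assumed: y grows downward).
DIRECTIONS = {
    "up":    np.array([0.0, -1.0]),
    "down":  np.array([0.0,  1.0]),
    "left":  np.array([-1.0, 0.0]),
    "right": np.array([ 1.0, 0.0]),
}

MAX_STEP = 20.0  # each activation is mapped to a length of 0..20 pixels

def fovea_move(activations):
    """Map each output activation (in [0, 1]) to a vector of length
    0..20 pixels in its direction, and add the four vectors."""
    move = np.zeros(2)
    for name, act in activations.items():
        move += act * MAX_STEP * DIRECTIONS[name]
    return move
```

For instance, a `right' activation of 1.0 and an `up' activation of 0.5 yield a move of 20 pixels right and 10 pixels up.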
The fovea diameter
was about equal to the object size. Figure 1
shows a typical visual scene and the receptive fields of
the 40 input units. At a given time step the activation of
an input unit was computed by simply averaging the values of the
pixels (black = 1, white = 0) covered by its receptive field.
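The input-unit activations can be computed as sketched below; the square receptive fields and the function name are illustrative assumptions. Receptive fields falling outside the pixel plane read `white' (zero), as described later in the text:

```python
import numpy as np

def input_activation(image, top, left, size):
    """Activation of one input unit: the average of the pixel values
    (black = 1, white = 0) covered by its square receptive field.
    Pixels outside the plane are treated as white (zero)."""
    h, w = image.shape
    total = 0.0
    for r in range(top, top + size):
        for c in range(left, left + size):
            if 0 <= r < h and 0 <= c < w:
                total += image[r, c]
    return total / (size * size)
```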
All non-input units in the system employed the logistic activation
function $f(x) = \frac{1}{1 + e^{-x}}$.
The model network had a layer of
input units, a layer
of 40 hidden units,
and a layer of 40 output units.
The control network had a layer of 20 hidden units.
Both networks were fully forward-connected.
Unlike in the more complicated situations
described in
[8][12][10][11],
we did not allow internal feedback within
either network.
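The forward pass of such a fully forward-connected network can be sketched as below. This is a minimal sketch, not the authors' code; the controller layer sizes (40 retina inputs, 20 hidden units, 4 outputs) follow the text, while the function names are assumptions:

```python
import numpy as np

def logistic(x):
    # Logistic activation, used by all non-input units.
    return 1.0 / (1.0 + np.exp(-x))

def init_layer(n_in, n_out, rng):
    # Weights randomly initialized between -0.1 and 0.1, as in the text.
    return rng.uniform(-0.1, 0.1, size=(n_out, n_in))

def forward(layers, x):
    # Fully forward-connected: each layer feeds only the next,
    # with no internal feedback.
    for W in layers:
        x = logistic(W @ x)
    return x

rng = np.random.default_rng(0)
# Hypothetical control network: 40 retina inputs, 20 hidden, 4 outputs.
controller = [init_layer(40, 20, rng), init_layer(20, 4, rng)]
out = forward(controller, np.zeros(40))
```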
In the beginning all weights were randomly initialized
between -0.1 and 0.1.
The learning rates of both networks
were set equal to 0.1.
In this experiment there were two separate training phases,
one for the model network
and one for the control network.
First
the model network was trained: for 50000 training cycles
the fovea was randomly placed in the environment
of the object, and a move was generated according to a uniform
distribution of possible controller outputs.
As mentioned above, we deviated from `real' gradient descent
by changing
the model network's weights after each training cycle.
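The first training phase can be sketched as an online loop like the one below. The environment dynamics (`world`), the shortened cycle count, and the exact input coding are illustrative assumptions; only the layer sizes, the learning rate of 0.1, and the per-cycle weight update come from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
LR = 0.1  # learning rate of 0.1, as in the text

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical model network: 44 inputs (40 retina + 4 move units),
# 40 hidden units, 40 output units predicting the next retina input.
W1 = rng.uniform(-0.1, 0.1, size=(40, 44))
W2 = rng.uniform(-0.1, 0.1, size=(40, 40))

def world(x, move):
    # Stand-in for the real environment: returns the retina
    # activations after the fovea move (hypothetical dynamics).
    return np.clip(x + 0.01 * move.sum(), 0.0, 1.0)

for cycle in range(1000):  # the paper used 50000 cycles
    x = rng.uniform(0.0, 1.0, 40)    # random fovea placement
    move = rng.uniform(0.0, 1.0, 4)  # uniformly random controller output
    inp = np.concatenate([x, move])
    h = logistic(W1 @ inp)
    y = logistic(W2 @ h)             # predicted next retina input
    target = world(x, move)          # observed next retina input
    # Online backprop: weights change after every cycle, which is
    # the stated deviation from `real' gradient descent.
    dy = (y - target) * y * (1.0 - y)
    dh = (W2.T @ dy) * h * (1.0 - h)
    W2 -= LR * np.outer(dy, h)
    W1 -= LR * np.outer(dh, inp)
```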
After the training phase
the model network's average error was about 10
percent. Now
the model network's weights were fixed and
the control network's training phase
(involving 20000 `trials') began.
At the beginning of each `trial' the object
occupied a randomly chosen position
in the pixel field.
Again the fovea was randomly placed near the object such that the latter
was partially overlapped by some of the receptive fields of the
input units (figure 1). Then
the control network generated a fovea trajectory.
Whenever the fovea left the pixel plane its receptive
fields
received `white' zero
input.
The final input error was determined as described above,
and the `unfolding in time' algorithm was applied.
During training the trajectory length was set equal to 5 time steps
(this determines the number of
`layers'
in the `unfolded' network). After training, 50 time steps
per trajectory were allowed.
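The second training phase can be sketched as backpropagation through the unfolded controller-model loop. The sketch below uses hypothetical single-layer stand-ins for the two networks (the paper's networks also had hidden layers) and a hypothetical target pattern; what it illustrates is unfolding for 5 time steps, passing gradients through the frozen model, and updating only the controller:

```python
import numpy as np

rng = np.random.default_rng(2)
LR, T = 0.1, 5  # learning rate 0.1; 5 unfolded time steps, as in the text

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical single-layer stand-ins for the two networks.
Wc = rng.uniform(-0.1, 0.1, size=(4, 40))   # controller: retina -> move
Wm = rng.uniform(-0.1, 0.1, size=(40, 44))  # frozen model: (retina, move) -> next retina

target = np.full(40, 0.5)  # hypothetical desired final retina input

for trial in range(200):
    x = rng.uniform(0.0, 1.0, 40)  # random fovea placement near the object
    xs, ms = [x], []
    for t in range(T):             # forward: unfold the loop for T steps
        m = logistic(Wc @ x)
        x = logistic(Wm @ np.concatenate([x, m]))
        ms.append(m)
        xs.append(x)
    gWc = np.zeros_like(Wc)
    gx = xs[-1] - target           # error at the final input only
    for t in reversed(range(T)):   # backward through the unfolded `layers'
        d_pre = gx * xs[t + 1] * (1.0 - xs[t + 1])
        g_cat = Wm.T @ d_pre       # gradients pass through the frozen model
        d_m = g_cat[40:] * ms[t] * (1.0 - ms[t])
        gWc += np.outer(d_m, xs[t])
        gx = g_cat[:40] + Wc.T @ d_m
    Wc -= LR * gWc                 # only the controller's weights change
```

Note that `Wm` is never updated in this phase, matching the text: the model's weights stay fixed while its error signal trains the controller.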
The system described above was able to learn (without a teacher) correct sequences of fovea movements, although the model network often made erroneous predictions. (The precondition for a successful trajectory was a partial overlap between the area covered by the object and the area covered by the `retina' at the beginning of the trajectory.) At the end of a successful trajectory the fovea had typically moved towards the target part of the object. (In figure 4 the fovea center at a given time step is given by the center of the corresponding arrow.) The accuracy was nearly perfect: in most cases the difference between the desired position and the actual position was no greater than one or two pixels.
Note that the fovea typically did not find the shortest path to the target. It could not, because it saw only a part of the scene and usually did not receive enough information to determine the direction to the target. Instead it often developed a preference for edges. This is presumably because, with many of our training objects, it is a good strategy to follow the outer boundary line until a new visual cue comes into sight.
Each of the 50-step trajectories depicted in the figures took about one second of real time on a SUN SPARCstation (including graphics output). A fully parallel approach to similar target detection problems (considering all pixels at a single time step) would require orders of magnitude more execution time (and probably many more training cycles to solve the problem of translation invariance; due to limited computer time, however, we were not able to test this experimentally).