Multiple resolutions offer the potential of applying maximal resolution to any part of the pixel plane, simply by moving the fovea center there. The low-resolution parts of the fovea are useful for detecting coarse structure in the visual scene; the high-resolution parts are useful for detecting details. Later on we will show that the low-resolution parts may trigger system actions which lead the high-resolution parts of the fovea to potentially relevant parts of the plane. These actions may be viewed as attention-guiding actions.
At each time step of a multi-step target detection process, we extract one input value for our learning system from each receptive field. Such an input value is simply the average of the pixel values currently covered by the corresponding field (there is no sophisticated pre-processing).
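As a minimal sketch of this extraction step (in Python; the function and argument names are illustrative, since the paper specifies no implementation), assuming the receptive fields are given as boolean masks over the pixel plane:

    import numpy as np

    def retina_inputs(plane, fields):
        """One input value per receptive field: the plain average of the
        pixels the field currently covers (no sophisticated pre-processing).

        plane  -- 2-D array holding the pixel plane
        fields -- sequence of boolean masks (same shape as plane), one per
                  receptive field, positioned by the current fovea placement
        """
        return np.array([plane[mask].mean() for mask in fields])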
The goal is adaptive target detection. In our case the targets are pre-defined details of 2-dimensional objects which can be arbitrarily translated and/or rotated on the pixel plane. At the beginning of each target detection task the fovea is placed on a randomly chosen part of the pixel plane. Then there is a limited number of time steps during which the system can generate a finite sequence of fovea movements. At the end of the target detection process the fovea should be placed directly above the target (showing the correct rotation). The final position and rotation of the fovea represent the desired information. During training, the only goal-specific information is given by predefined desired input values, which correspond to the input values obtained by placing the fovea directly on the target. The final input values should match the desired input values. No informed teacher provides knowledge about useful fovea movements, making this a `reward-only-at-goal' task.
The first network of our learning system is called the controller C. The input values provided by the retina are the inputs of C. At each time step of a multi-step target detection process, C produces outputs (actions) which serve to control movements of the artificial retina on the pixel plane. In general, a movement causes new input values. This kind of feedback is called external feedback. See figure 2.
The problem is to find a mapping from retina inputs to control actions such that the system finds the target at the end of each finite target detection process. Note that we are not looking for a system that finds the shortest path from the current position of the retina to the target. In fact, with many practical problems the current fovea input will not provide enough information for determining the direction to the target. In such cases we want the system to learn to generate moves causing new inputs that allow it to continue with `more informed' moves. This can be interpreted as active perception and attention shifting.
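The following sketch shows this perception-action cycle with external feedback (the `environment' object, with its `reset' and `step' methods, is a hypothetical stand-in for the fovea simulation, not part of the paper):

    def detection_process(controller, environment, n_steps):
        # External feedback loop (cf. figure 2): each action moves the
        # fovea, and the environment answers with new input values.
        x = environment.reset()           # fovea starts at a random position
        for _ in range(n_steps):
            action = controller(x)        # fovea movement (e.g. translation/rotation)
            x = environment.step(action)  # external feedback: new retina input
        return x                          # final input, compared with the desired input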
Our desired mapping has to be implemented by C. Note that C cannot be trained by simple supervised learning. Simple supervised learning would require an external teacher providing the desired output actions at each time step. In our case, however, the only external information is about `how the target looks': there is only one final desired input. (Control theory calls this a `terminal control problem'.) From the difference between the desired and the actual input at the end of a trajectory we would somehow like to generate gradient information for the output units of the controller. This would require propagating errors from the input units back `through the environment', which is not possible. The remedy is to consider C's output units as hidden units of a larger dynamic recurrent network obtained by the following procedure:
1. Introduce an additional model network M (with a separate learning procedure) for `bridging the gap' between the outputs of the controller and its inputs at the next time step: M is trained to emulate the visible environmental dynamics by predicting the next input, given the current controller input and output. See figure 2. M serves to approximate the environmental dynamics by a differentiable mapping which will be used for the temporal credit assignment process of C. (We will see that M need not be a perfect predictor to allow C to discover perfect solutions.) No informed teacher is required for M: M can be trained by generating random fovea movements and observing the consequences.
2. Now identify C's output units with the corresponding input units of M, and identify M's output units with the corresponding input units of C. See figure 3. Gradient information for the weights of the control network can now be back-propagated from M's final prediction through the model network down into the control network, back through the model network, etc., according to the `unfolding in time' algorithm [18] [9]. An important difference to conventional `back-propagation through time' (with a single recurrent network) is that the weights of the model network remain fixed during this procedure. (A sketch of both steps follows below.)
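A minimal sketch of both steps (in PyTorch, chosen here for its automatic differentiation; the layer sizes, the toy environment `env' -- with the same hypothetical `reset'/`step' interface as above plus a `random_action' method -- and the placeholder `x_desired' are illustrative assumptions, not taken from the paper):

    import torch
    import torch.nn as nn

    n_in, n_act, n_hid, n_steps = 16, 3, 32, 5   # retina inputs; actions; hidden units; moves per trial

    C = nn.Sequential(nn.Linear(n_in, n_hid), nn.Tanh(), nn.Linear(n_hid, n_act))       # controller
    M = nn.Sequential(nn.Linear(n_in + n_act, n_hid), nn.Tanh(), nn.Linear(n_hid, n_in))  # model network

    # Step 1: train M to predict the next input from the current input and
    # a random fovea movement; no informed teacher is needed.
    opt_M = torch.optim.SGD(M.parameters(), lr=0.1)
    for _ in range(10000):
        x = env.reset()                        # current retina input (tensor)
        a = env.random_action()                # random fovea movement
        x_next = env.step(a)                   # observed consequence
        loss_M = ((M(torch.cat([x, a])) - x_next) ** 2).sum()
        opt_M.zero_grad(); loss_M.backward(); opt_M.step()

    # Step 2: unfold the combined C-M network in time and back-propagate the
    # final desired-input error into C; the weights of M stay frozen.
    for p in M.parameters():
        p.requires_grad_(False)                # model weights remain fixed
    x_desired = torch.zeros(n_in)              # placeholder: what the fovea sees on target
    opt_C = torch.optim.SGD(C.parameters(), lr=0.1)
    for _ in range(1000):
        x = env.reset()
        for _ in range(n_steps):               # C's outputs feed M, M's outputs feed C
            a = C(x)
            x = M(torch.cat([x, a]))           # differentiable stand-in for the environment
        loss_C = ((x - x_desired) ** 2).sum()  # error only at the end of the trajectory
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()

One design choice worth noting: in this sketch M's own predictions replace the real inputs during the unfolding; a variant feeds the actually observed inputs forward and uses the frozen M only to pass the error backwards through time.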
In different contexts and with different degrees of generality the basic principle for credit assignment by system realization and `gradient descent through a frozen model network' has been previously described by Werbos [19], Jordan [2], Munro [5], Robinson & Fallside [8], Nguyen & Widrow [6], and Schmidhuber [10] [11] [15].
The only work by other authors that also addresses the problem of learning active perception in reactive environments (and that we are aware of) is the work of Whitehead and Ballard [20]. Their system uses adaptive actions that can bind `markers' to certain features of an environmental state. Markers dynamically mask or emphasize inputs from the visible environment. This is analogous to our fovea-guiding actions which dynamically change the input such that certain environmental details become visible, while others disappear. With Whitehead and Ballard's system the learning of active perception is based on an adaptive control technique for delayed reinforcement learning called `Q-learning' [16].
Our approach implements an adaptive control technique for `reward-only-at-goal' tasks which is quite different from those reinforcement learning control architectures used by Whitehead and Ballard. Our approach is gradient-based. It tries to provide an error gradient for the controller outputs by propagating the final input error through time-varying instances of a differentiable approximation of the environment (namely the model network).
If objects in a visual scene may occupy random positions, then it will be impossible for the model network to exactly predict future fovea inputs from previous ones. Unlike with, e.g., the `truck backer upper' [6], neither C nor M ever `sees' the complete state of the environment, but only some local details.
But this is what active perception is good for: the focus of attention should be shifted to parts of the scene that allow the system to acquire more detailed (domain-dependent) information about how to go on in the target detection process. The main task of the model network is to help the controller move the fovea into regions of the plane which allow it to continue with more informed moves. (Although one cannot exactly predict what one will see after moving one's eyes to the door, one is setting the stage for additional eye-movements that help to recognize an entering person.)
If the things one attends to never provided unpredicted information, the concept of attention would make no sense. One might say that if the situation were such that the model network could be trained to always make perfect predictions, there would be no need for a model network: in that case a single network would be able to store all information about the environment. Thus in all interesting cases the model network necessarily remains imperfect.
So unlike with the `truck backer upper' problem [6], it is not intended to make M a perfect predictor whose output could replace the input from the environment (in that case not much would be gained compared to the static approach to target detection).
But an imperfect model network can still capture enough of the environmental dynamics to allow the controller to learn perfect solutions, as will be seen in the experimental section. The reason is: it suffices if the inner products of the approximated gradients for C (based on an inaccurate model) and the true gradients (according to a hypothetical perfect model) are always positive (see also [2]). Even if these inner products are not always positive but only `in most cases', performance improvement can be expected.
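To make this argument explicit (a standard first-order expansion, not specific to this paper): let $E(w)$ denote the final input error as a function of the controller weights $w$, let $g = \nabla_w E$ be the true gradient (hypothetical perfect model), and let $\hat{g}$ be the gradient obtained through the inaccurate model. A weight update $\Delta w = -\eta \hat{g}$ with small learning rate $\eta > 0$ changes the error by
\[
E(w + \Delta w) - E(w) \approx \langle \Delta w, g \rangle = -\eta \, \langle \hat{g}, g \rangle ,
\]
which is negative whenever $\langle \hat{g}, g \rangle > 0$; every such update still reduces the error to first order.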