Multiple resolutions make it possible to apply maximal resolution to any part of the pixel plane simply by moving the fovea center there. The low-resolution parts of the fovea are useful for detecting coarse structure in the visual scene. The high-resolution parts are useful for detecting details. Later on we will show that the low-resolution parts may trigger system actions which lead the high-resolution parts of the fovea to potentially relevant parts of the plane. These actions may be viewed as attention-guiding actions.
At each time step of a multi-step target detection process, we extract one input value for our learning system from each receptive field. Such an input value is simply the average value of the pixels currently covered by the corresponding field (there is no sophisticated pre-processing).
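The averaging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the square field shape, the `(row_offset, col_offset, size)` field layout, the zero value for pixels outside the image, and the names `field_average` and `retina_inputs` are all assumptions made for the example.

```python
def field_average(image, top, left, size):
    """Mean pixel value of the size x size square with top-left corner (top, left).

    Pixels outside the image count as 0 (an assumption of this sketch).
    """
    total = 0.0
    for r in range(top, top + size):
        for c in range(left, left + size):
            if 0 <= r < len(image) and 0 <= c < len(image[0]):
                total += image[r][c]
    return total / (size * size)


def retina_inputs(image, fields, center_row, center_col):
    """One input value per receptive field.

    Each field is (row_offset, col_offset, size), relative to the fovea center.
    """
    return [field_average(image, center_row + dr, center_col + dc, size)
            for dr, dc, size in fields]


# A 4x4 binary pixel plane and two hypothetical fields anchored at the
# fovea center: a fine 1x1 field and a coarser 2x2 field.
image = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
fields = [(0, 0, 1), (0, 0, 2)]
inputs = retina_inputs(image, fields, 1, 1)
```

Moving the fovea center changes which pixels each field covers, and therefore changes the whole input vector. No other feature extraction takes place.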
The goal is adaptive target detection. In our case the targets are pre-defined details of 2-dimensional objects which can be arbitrarily translated and/or rotated on the pixel plane. At the beginning of each target detection task the fovea is placed on a randomly chosen part of the pixel plane. The system then has a limited number of time steps in which to generate a finite sequence of fovea movements. At the end of the target detection process the fovea should be placed directly above the target (showing the correct rotation). The final position and rotation of the fovea represent the desired information. During training, the only goal-specific information is given by predefined desired input values, namely those input values obtained by placing the fovea directly on the target. The final input values should match the desired input values. No informed teacher provides knowledge about useful fovea movements, which makes the task a `reward-only-at-goal' task.
The first network of our learning system is called the controller. The input values provided by the retina are the inputs of the controller. At each time step of a multi-step target detection process the controller produces outputs (actions) which serve to control movements of the artificial retina on the pixel plane. In general, a movement causes new input values. This kind of feedback is called external feedback. See figure 2.
The problem is to find a mapping from retina inputs to control actions that makes the system find the target at the end of each finite target detection process. Note that we are not looking for a system that finds the shortest path from the current position of the retina to the target. In fact, with many practical problems the current fovea input will not provide enough information for determining the direction to the target. In such cases we want the system to learn to generate moves causing new inputs that allow it to continue with `more informed' moves. This can be interpreted as active perception and attention shifting.
Our desired mapping has to be implemented by the controller. Note that the controller cannot be trained by simple supervised learning. Simple supervised learning would require an external teacher providing the desired output actions at each time step. In our case, however, the only external information is about `how the target looks'. There is only one final desired input. (Control theory calls this a `terminal control problem'.)
From the difference between the desired and the actual input at the end of a trajectory we would like to generate gradient information for the output units of the controller. This would require propagating errors from the input units back `through the environment', which is not possible. The remedy is to consider the controller's output units as hidden units of a larger dynamic recurrent network obtained by the following procedure:
1. Introduce an additional model network (with a separate learning procedure) for `bridging the gap' between the outputs of the controller and its inputs at the next time step: the model network is trained to emulate the visible environmental dynamics by predicting the next input, given the current controller input and output. See figure 2. The model network serves to approximate the environmental dynamics by a differentiable mapping which will be used for the temporal credit assignment process of the controller. (We will see that the model network need not be a perfect predictor to allow the controller to discover perfect solutions.) No informed teacher is required for the model network: it can be trained by generating random fovea movements and observing the consequences.
2. Now identify the controller's output units with the corresponding input units of the model network, and identify the model network's output units with the corresponding input units of the controller. See figure 3. Gradient information for the weights of the control network can now be back-propagated from the model network's final prediction through the model network down into the control network and back through the model network etc. according to the `unfolding in time' algorithm. An important difference from conventional `back-propagation through time' (with a single recurrent network) is that the weights of the model network remain fixed during this procedure.
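The two-step procedure above can be sketched on a deliberately tiny problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the environment is one-dimensional (the observed input `s` is the fovea's offset from the target, and an action `a` shifts it), the model has two weights `m1`, `m2`, the controller has a single weight `w`, and the forward pass uses the real environment while only the backward pass uses the frozen model.

```python
import random

random.seed(0)


def environment(s, a):
    """True dynamics, unknown to the learner: the action shifts the offset."""
    return s + a


# Step 1: train the model (prediction m1*s + m2*a) on random movements.
m1, m2 = 0.0, 0.0
for _ in range(500):
    s, a = random.uniform(-1, 1), random.uniform(-1, 1)
    err = (m1 * s + m2 * a) - environment(s, a)   # prediction error
    m1 -= 0.1 * err * s
    m2 -= 0.1 * err * a

# Step 2: train the controller a = w*s; m1 and m2 stay fixed from here on.
w, T, desired = 0.0, 3, 0.0
for _ in range(2000):
    s = random.uniform(-1, 1)          # random starting offset
    states = []
    for step in range(T):              # forward pass through the environment
        states.append(s)
        s = environment(s, w * s)
    delta = s - desired                # error exists only for the final input
    grad_w = 0.0
    for s_t in reversed(states):       # backward pass, unfolded in time
        grad_w += delta * m2 * s_t     # path: w -> action -> predicted input
        delta *= m1 + m2 * w           # path: input at t -> input at t+1
    w -= 0.1 * grad_w


def final_offset(s0):
    """Offset from the target after T controller-driven moves."""
    s = s0
    for step in range(T):
        s = environment(s, w * s)
    return s
```

In this toy setting the controller never receives a desired action. The only training signal is the mismatch between the final and the desired input, and `m1`, `m2` merely supply the derivatives needed to carry that signal back to `w`.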
In different contexts and with different degrees of generality the basic principle for credit assignment by system realization and `gradient descent through a frozen model network' has been previously described by Werbos, Jordan, Munro, Robinson & Fallside, Nguyen & Widrow, and Schmidhuber.
The only work by other authors that we are aware of which also addresses the problem of learning active perception in reactive environments is that of Whitehead and Ballard. Their system uses adaptive actions that can bind `markers' to certain features of an environmental state. Markers dynamically mask or emphasize inputs from the visible environment. This is analogous to our fovea-guiding actions which dynamically change the input such that certain environmental details become visible, while others disappear. With Whitehead and Ballard's system the learning of active perception is based on an adaptive control technique for delayed reinforcement learning called `Q-learning'.
Our approach implements an adaptive control technique for `reward-only-at-goal' tasks which is quite different from those reinforcement learning control architectures used by Whitehead and Ballard. Our approach is gradient-based. It tries to provide an error gradient for the controller outputs by propagating the final input error through time-varying instances of a differentiable approximation of the environment (namely the model network).
If objects in a visual scene may occupy random positions then it will be impossible for the model network to predict exactly the future fovea inputs from previous ones. Unlike with e.g. the `truck backer upper', neither the controller nor the model network ever `sees' the complete state of the environment, but only some local details. But this is what active perception is good for: The focus of attention should be shifted to parts of the scene that make it possible to acquire more detailed (domain-dependent) information about how to go on in the target detection process. The main task of the model network is to help the controller to move the fovea into regions of the plane which allow it to continue with more informed moves. (Although one cannot exactly predict what one will see after moving one's eyes to the door, one is setting the stage for additional eye-movements that help to recognize an entering person.) If the things one attends to never provided unpredicted information, the concept of attention would make no sense. One might say that if the situation were such that the model network could be trained to always make perfect predictions, there would be no need for a model network. In that case a single network would be able to store all information about the environment. Thus for all interesting cases the model network necessarily has to remain imperfect.
So unlike with the `truck backer upper' problem, it is not intended to make a perfect predictor whose output could replace the input from the environment (in that case not much would be gained compared to the static approach to target detection). But an imperfect model network can still capture enough of the environmental dynamics to allow the controller to learn perfect solutions, as will be seen in the experimental section. The reason is: It suffices if the inner products of the approximated gradients for the controller (based on an inaccurate model) and the true gradients (according to a hypothetical perfect model) are always positive. Even if these inner products are not always positive but only `in most cases', performance improvement can be expected.
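The inner-product condition follows from a first-order argument: a small step of size alpha against an approximated gradient g' changes the error by roughly -alpha times the inner product of g' with the true gradient g, which is negative whenever that inner product is positive. The sketch below illustrates this numerically. The quadratic error function, the 60-degree rotation standing in for `an inaccurate model', and all names are assumptions chosen for the example.

```python
import math


def error(w):
    """Toy error surface E(w) = 0.5 * |w|^2 with its minimum at the origin."""
    return 0.5 * (w[0] ** 2 + w[1] ** 2)


def true_gradient(w):
    return [w[0], w[1]]


def approx_gradient(w, angle=math.radians(60)):
    """Deliberately wrong gradient: the true one rotated by 60 degrees."""
    g = true_gradient(w)
    c, s = math.cos(angle), math.sin(angle)
    return [c * g[0] - s * g[1], s * g[0] + c * g[1]]


w = [1.0, -2.0]
errors = [error(w)]
inner_products = []
for _ in range(50):
    g, g_hat = true_gradient(w), approx_gradient(w)
    inner_products.append(g[0] * g_hat[0] + g[1] * g_hat[1])
    # Descend along the inaccurate gradient only.
    w = [w[0] - 0.1 * g_hat[0], w[1] - 0.1 * g_hat[1]]
    errors.append(error(w))
```

Although every update direction is off by 60 degrees, the inner product with the true gradient stays positive and the error shrinks at each step. The path to the minimum is longer than with the exact gradient, but it still leads there.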