In the experiments reported in the preceding sections, there were separate training phases for the model network and the control network. The search element that reinforcement learning systems usually incorporate through probabilistic activation rules was hidden in the random search of the first phase.
For realistic large-scale applications it is highly desirable that the model network and the control network learn in parallel. In general, the model network will not be able to explore all possible combinations of inputs and actions and their consequences. The control network should therefore start learning while the model network still holds an incomplete representation of the external dynamics. The model network should concentrate on those parts of the external dynamics that are necessary for achieving the control network's goals. Just as Kohonen's self-organizing feature maps dedicate more storage capacity to a fine-grained representation of common, similar inputs, the model network should dedicate more storage capacity and time to fine-grained modeling of those aspects of the world that are likely to be relevant for the system's main goal. (See  for more reasons for parallel on-line learning of the model network and the control network.)
We conducted some experiments with on-line learning. Two interacting conventional deterministic networks in the style of  and  turned out not to be appropriate. A deterministic system usually soon became trapped in a state where the controller never shifted the fovea towards regions that would have allowed the model network to collect new relevant information about the external world. We call this the deadlock problem.
To attack the deadlock problem, we modified the controller to provide it with explicit search capabilities. Each output unit was replaced by a small network consisting of two units, one giving the mean and the other giving the variance for a random number generator that produced random numbers according to a continuous distribution. (We approximated a Gauss distribution by a Bernoulli distribution.) Weight gradients were computed by applying Williams' concept of `back-propagation through random number generators'.
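The mean/variance output unit described above can be illustrated with a minimal sketch. It assumes an exact Gaussian sampler written as y = mean + sqrt(variance) * z with z drawn from a standard normal distribution, which is not the approximation used in the experiments. The function names `sample_gaussian_unit` and `backprop_through_sampler` are hypothetical and chosen only for this illustration. Under that assumption the derivatives of the sampled output with respect to the mean and the variance have a simple closed form, so an error signal arriving at the sampled output can be passed on to the two units that feed the random number generator.

```python
import math
import random


def sample_gaussian_unit(mean, variance, rng):
    """Sample a stochastic output and return its local derivatives.

    Assumption for this sketch: y = mean + sqrt(variance) * z, z ~ N(0, 1).
    The variance must be strictly positive.
    """
    std = math.sqrt(variance)
    z = rng.gauss(0.0, 1.0)           # the random source is treated as a fixed input
    y = mean + std * z                # sampled output of the unit
    dy_dmean = 1.0                    # d y / d mean
    dy_dvariance = z / (2.0 * std)    # d y / d variance, via the chain rule on sqrt
    return y, dy_dmean, dy_dvariance


def backprop_through_sampler(dloss_dy, dy_dmean, dy_dvariance):
    """Pass an error signal at the sampled output back to the mean and variance units."""
    return dloss_dy * dy_dmean, dloss_dy * dy_dvariance


if __name__ == "__main__":
    rng = random.Random(0)
    y, dm, dv = sample_gaussian_unit(mean=0.5, variance=0.04, rng=rng)
    grad_mean, grad_variance = backprop_through_sampler(2.0, dm, dv)
    print(y, grad_mean, grad_variance)
```

In a full system the two returned gradients would be propagated further back into the weights that produce the mean and the variance. A larger variance makes the controller's outputs more exploratory, which is the property that helps it escape the deadlock described above.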
Within 100,000 trials such an on-line learning system was able to learn appropriate fovea trajectories (such as the one in figure 4). As expected, after training the model network was a good predictor only for those situations with which the controller was typically confronted.
In these experiments, the on-line approach did not significantly improve efficiency. The main contribution of this section is therefore the demonstration that introducing probabilistic output units can make on-line learning possible.