Figure 1 shows a second important module: an adaptive evaluator network which receives as input a start state and a goal state , and is trained to produce an output that indicates whether knows a program that leads from the start state to or `close' to the goal state. An output of means that there is an appropriate sub-program, an output of means that there is no such sub-program. An output between and means that there is a sub-program that leads from the start state to a state that comes close to the goal in a certain sense. This measure of closeness has to be given by some evaluative process that may be adaptive or not (it might be an adaptive critic based on TD-methods, for instance). represents the system's current model of its own capabilities. We assume that has learned to correctly predict that each of the already existing sub-programs works. We also assume that is able to predict the closeness of an end state of a sub-program to a goal state, given a start state. In the simplest case, can be trained in an exploratory phase during which various combinations of start and goal states are given to the program executer.
Finally, the system contains an adaptive network which serves as a sub-goal generator. With a given task specified by a start/goal pair , the sub-goal generator receives as input the concatenation of and .
The output of the sub-goal generator is a list of sub-goals
Like the goal, a sub-goal is an activation patterns
describing the desired external input at the end of some sub-program,
which also is the start input for the next sub-program.
After training the subgoal generator, the
list of sub-goals should satisfy the following
How does the sub-goal generator, which initially is a tabula rasa, learn to generate appropriate sub-goals? We take copies of , and connect them to the sub-goal generator as shown in figure 2. The desired output of each of the copies is . For all positive outputs of some copies, error gradients are propagated through 's copies down into the sub-goal generator. (as well as its copies, of course) remains unchanged during this procedure. Only the weights of the sub-goal generator change according to
where is the learning rate of the sub-goal generator, is its weight vector, and its increment. For a given problem the procedure is iterated until the complete error is zero (corresponding to a solution obtained by combining the two sub-programs), or until a local minimum is reached (no solution found). The gradient descent procedure is used for a search in sub-goal space. This works because the effects of programs are made differentiable with respect to program names ( = start/goal combinations). This contrasts the approach of `back-propagation through time' which makes the effects of programs differentiable with respect to whole action sequences (instead of selectively focussing on more abstract short representations of action sequences).