Figure 1 shows a second important module: an adaptive evaluator network $E$ which receives as input a start state $s$ and a goal state $g$, and is trained to produce an output $E(s, g)$ that indicates whether the program executor knows a program that leads from the start state to, or `close' to, the goal state. An output of $1$ means that there is an appropriate sub-program, an output of $0$ means that there is no such sub-program. An output between $0$ and $1$ means that there is a sub-program that leads from the start state to a state that comes close to the goal in a certain sense. This measure of closeness has to be given by some evaluative process that may be adaptive or not (it might be an adaptive critic based on TD-methods, for instance).
$E$ represents the system's current model of its own capabilities. We assume that $E$ has learned to correctly predict that each of the already existing sub-programs works. We also assume that $E$ is able to predict the closeness of the end state of a sub-program to a goal state, given a start state. In the simplest case, $E$ can be trained in an exploratory phase during which various combinations of start and goal states are given to the program executor.
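Such an exploratory phase can be sketched as follows. This is a minimal illustration, assuming $E$ is a small feed-forward network mapping a concatenated (start, goal) pair to a scalar in $(0, 1)$; the dimensions, the stand-in closeness measure, and all names below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 4       # size of a state's activation pattern (assumed)
HIDDEN = 16

# one-hidden-layer evaluator network E with a sigmoid output unit
W1 = rng.normal(0.0, 0.1, (2 * STATE_DIM, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, 1))
b2 = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evaluator(start, goal):
    """E: (start, goal) -> scalar in (0, 1), plus hidden activations."""
    h = np.tanh(np.concatenate([start, goal]) @ W1)
    return sigmoid(h @ W2 + b2)[0], h

# Stand-in for the evaluative process that scores closeness: 1 when a
# sub-program ends exactly in the goal state, falling off with distance.
# A real system would obtain this from the program executor (possibly
# via an adaptive critic).
def closeness_target(start, goal):
    return np.exp(-np.sum((start - goal) ** 2))

lr, losses = 0.2, []
for step in range(3000):
    s = rng.uniform(-1, 1, STATE_DIM)
    g = rng.uniform(-1, 1, STATE_DIM)
    y = closeness_target(s, g)
    out, h = evaluator(s, g)
    err = out - y
    losses.append(err ** 2)
    # plain stochastic gradient descent on the squared error
    d_out = err * out * (1.0 - out)
    d_h = (W2[:, 0] * d_out) * (1.0 - h ** 2)
    W2 -= lr * np.outer(h, d_out)
    b2 -= lr * d_out
    W1 -= lr * np.outer(np.concatenate([s, g]), d_h)
```

After enough random start/goal combinations, $E$'s predictions approximate the closeness scores, which is all that the later sub-goal search requires of it.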
Finally, the system contains an adaptive network $S$ which serves as a sub-goal generator. With a given task specified by a start/goal pair $(s, g)$, the sub-goal generator receives as input the concatenation of $s$ and $g$. The output of the sub-goal generator is a list of sub-goals $s_1, \ldots, s_n$. Like the goal, a sub-goal $s_i$ is an activation pattern describing the desired external input at the end of some sub-program, which also is the start input for the next sub-program.
After training the sub-goal generator, the list of sub-goals should satisfy the following condition:
$$E(s, s_1) = E(s_1, s_2) = \cdots = E(s_n, g) = 1.$$
How does the sub-goal generator, which initially is a tabula rasa, learn to generate appropriate sub-goals? We take two copies of $E$, and connect them to the sub-goal generator as shown in figure 2. The desired output of each of the copies is $1$. Wherever the output of some copy differs from its desired output, error gradients are propagated through $E$'s copies down into the sub-goal generator. $E$ (as well as its copies, of course) remains unchanged during this procedure. Only the weights of the sub-goal generator change according to
$$\Delta w_S = -\eta_S \frac{\partial}{\partial w_S} \sum_i \bigl(1 - E(s_i, s_{i+1})\bigr)^2,$$
where $\eta_S$ is the learning rate of the sub-goal generator, $w_S$ is its weight vector, and $\Delta w_S$ its increment (here $s_0$ denotes the start state $s$, and $s_{n+1}$ the goal state $g$).
For a given problem the procedure is iterated until
the complete error is zero (corresponding to a solution obtained
by combining the two sub-programs), or until a local minimum
is reached (no solution found).
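The search in sub-goal space can be sketched as follows. For clarity, this sketch optimizes a single sub-goal vector directly through a frozen, hand-defined differentiable stand-in for $E$; propagating the same gradient one step further, into a generator network's weights, is analogous. The closed-form evaluator and all names are illustrative assumptions.

```python
import numpy as np

def E(a, b):
    """Frozen stand-in evaluator: close to 1 iff a sub-program is assumed
    to lead from state a (almost) exactly to state b."""
    return np.exp(-np.sum((a - b) ** 2))

def subgoal_search(start, goal, lr=0.1, steps=500):
    """Gradient descent on (1 - E(start, x))^2 + (1 - E(x, goal))^2
    over the sub-goal vector x, with E held fixed."""
    x = (start + goal) / 2 + 0.1    # initial guess for the sub-goal
    for _ in range(steps):
        e1, e2 = E(start, x), E(x, goal)
        # analytic gradient of the summed squared errors w.r.t. x
        grad = (4 * (1 - e1) * e1 * (x - start)
                + 4 * (1 - e2) * e2 * (x - goal))
        x -= lr * grad
    return x

start = np.zeros(2)
goal = np.ones(2)
sg = subgoal_search(start, goal)
# Both halves of the task should now look more feasible to E than the
# whole task: E(start, sg) and E(sg, goal) exceed E(start, goal).
```

Note that the gradient never asks how any individual action changes the outcome; it only flows through $E$'s compact representation of entire sub-programs, which is the point made below about program names versus action sequences.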
The gradient descent procedure is used for a search in sub-goal space. This works because the effects of programs are made differentiable with respect to program names (= start/goal combinations). This contrasts with the approach of `back-propagation through time', which makes the effects of programs differentiable with respect to whole action sequences (instead of selectively focussing on more abstract, short representations of action sequences).