A simple experiment was conducted to demonstrate sub-goal learning. The programming was done by Rudolf Huber, a computer science student at TUM.
A two-dimensional artificial `world' covering the unit square was constructed. An artificial `animal', controlled by the program executer, was able to move around in this world. The program executer's output was four-dimensional (one output unit for each of the directions `north', `south', `east', and `west'). At a given time step, the activation of each output unit (ranging from 0 to 1) was divided by 20, and the animal's move was computed by adding the four corresponding direction vectors (thus the maximal step size in each direction was 0.05).
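The move computation can be sketched as follows. This is a minimal illustration, not the original code; the direction vectors and function names are assumptions.

```python
import numpy as np

# Assumed unit vectors for the four compass directions.
DIRECTIONS = {
    "north": np.array([0.0, 1.0]),
    "south": np.array([0.0, -1.0]),
    "east":  np.array([1.0, 0.0]),
    "west":  np.array([-1.0, 0.0]),
}

def move(position, activations):
    """activations: dict mapping direction name -> value in [0, 1].
    Each activation is divided by 20, so the maximal step in any
    single direction is 1/20 = 0.05."""
    step = sum((a / 20.0) * DIRECTIONS[d] for d, a in activations.items())
    return position + step

# Full north activation plus half east activation:
pos = move(np.array([0.5, 0.5]),
           {"north": 1.0, "south": 0.0, "east": 0.5, "west": 0.0})
# -> [0.525, 0.55]
```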
In the center of the world there was an obstacle (indicated by the black square in figure 4). If the animal hit the obstacle, it had to stop.
Conventional back-propagation (3-layer feedforward nets with logistic activation functions; no recurrent connections were necessary) was used to train the program executer to move the animal in a straight line from randomly chosen points (starts) to other randomly chosen points (goals), until it hit the obstacle or until the number of time steps exceeded 20. The training procedure was as follows: at each time step of an action sequence, the straight line leading from the current position to the goal was computed, and the corresponding desired output of the executer served as a training signal. Both start and goal states were indicated by the coordinate pairs of the corresponding points. The executer had 20 hidden units and experienced 100000 action sequences during the training phase.
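One way to derive the desired four-unit output from the straight line toward the goal is sketched below. The mapping of the 2D direction onto the opposing unit pairs is an assumption (the paper does not spell it out), and the function name is hypothetical.

```python
import numpy as np

def desired_output(current, goal):
    """Hypothetical training signal: the unit vector from the current
    position toward the goal, split onto the four output units
    (north, south, east, west), each in [0, 1]. Only one unit of each
    opposing pair is active at a time."""
    d = np.asarray(goal, dtype=float) - np.asarray(current, dtype=float)
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return np.zeros(4)  # already at the goal: stand still
    d = d / norm  # unit direction toward the goal
    north, south = max(d[1], 0.0), max(-d[1], 0.0)
    east, west = max(d[0], 0.0), max(-d[0], 0.0)
    return np.array([north, south, east, west])
```

For a goal due east of the current position this yields full activation on the `east' unit and zero on the other three.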
The evaluator (also with 20 hidden units) was trained by randomly selecting start/goal combinations, executing them with the program executer, and observing the result. The evaluator's output was trained to be 1.0 whenever the animal's final position ended up more than 0.1 away from the goal, and to be 0.0 whenever the final position matched the goal; in between, linear interpolation was used. The evaluator experienced 1000000 examples during the training phase. Both the executer and the evaluator were trained with a learning rate of 0.05.
The reason for choosing a comparatively simple environment was to isolate the sub-goal generation process from effects that could be introduced by an adaptive on-line evaluation function (an adaptive critic, say). For our simple environment it was easy to define a prewired evaluation function. (Future research will focus on parallel on-line learning of all components of the system, but, as always, it is preferable to proceed incrementally from small problems to bigger ones.)
Figure 4 shows the `world' and traces of the animal for some of the many programs successfully executed by the program executer. (Due to the imperfect executer, not all of these traces correspond to perfect straight lines.)
In the final phase, the sub-goal generator (20 hidden units) was trained: combinations of start and goal states that did not have a working program associated with them were given to the sub-goal generation process described in the last section. Within about 10 iterations, the sub-goal generator actually found appropriate sub-goals for the given start/goal combinations. See figure 5 for one out of many examples.
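The iterative search can be sketched as gradient descent on the sub-goal coordinates, minimizing the summed evaluator estimates for the two partial programs (start to sub-goal, sub-goal to goal). This is an assumed reading of the process described in the last section: the finite-difference gradient and the toy stand-in evaluator below are illustrative substitutes for back-propagation through the trained evaluator network.

```python
import numpy as np

def find_subgoal(start, goal, evaluator, lr=0.1, iters=10):
    """Gradient descent on the sub-goal coordinates s, minimizing
    evaluator(start, s) + evaluator(s, goal). A central finite
    difference stands in for back-propagating error through the
    trained evaluator network."""
    s = np.array(start, dtype=float)  # initial sub-goal guess
    eps = 1e-4
    for _ in range(iters):
        grad = np.zeros_like(s)
        for i in range(len(s)):
            e = np.zeros_like(s)
            e[i] = eps
            f_plus = evaluator(start, s + e) + evaluator(s + e, goal)
            f_minus = evaluator(start, s - e) + evaluator(s - e, goal)
            grad[i] = (f_plus - f_minus) / (2.0 * eps)
        s -= lr * grad
    return s

# Illustrative stand-in evaluator (not the trained network): predicted
# failure grows with the squared distance the executer must cover.
def toy_evaluator(a, b):
    return float(np.sum((np.asarray(b) - np.asarray(a)) ** 2))

sub = find_subgoal([0.1, 0.1], [0.9, 0.9], toy_evaluator)
# The sub-goal drifts toward the midpoint of start and goal.
```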
With more examples and smaller learning rates, the sub-goal generator soon learned to generate appropriate sub-goals for a whole variety of situations.