Viewing each connection as a separate agent may seem unconventional: standard AI research is accustomed to more complex agents with full-sized knowledge bases, etc. For the purposes of this paper, however, this is irrelevant: SSA does not care about the complexity of its agents.
Again, it should be emphasized that the following preliminary experiment in no way represents a systematic experimental analysis. The only purpose of the current section is to illustrate paragraph (*) in the theoretical section 1.
Basic set-up.
There is a fully recurrent neural net with 5 input, 4 output, and 5 hidden units. The variable activation of the $i$-th unit is denoted $y_i$ (initialized at time 0 with 0.0). $w_{ij}$ denotes the real-valued, randomly initialized weight (a variable) on the connection from unit $j$ to unit $i$.
There is a lifelong sequence of
``cycles''. A cycle involves the following computations:
the activation of
the first (second, third, fourth) input unit is set to 0.0
if the field to the north (south, east, west)
of the animat is blocked or does not exist, and set to 1.0 otherwise.
Each noninput unit $i$ updates its variable activation $y_i$ (initialized at time 0 with 0.0) as follows: $y_i \leftarrow 1.0$ with probability $f(net_i)$, $y_i \leftarrow 0.0$ otherwise, where $f(x) = \frac{1}{1 + e^{-x}}$, and the $net_i$ (initially 0) are sequentially updated as follows: $net_i \leftarrow \sum_j w_{ij} y_j$. If $k$ stands for an output unit, then $k$'s current value is defined as $f(net_k)$. For each output unit $k$, there is an action $a_k$ (possible actions are one-step-north, one-step-south, one-step-east, one-step-west). $a_k$ is selected with a probability proportional to $f(net_k)$. The selected
action gets executed. Whenever the animat hits
the goal, there is constant reinforcement 1.0,
the animat is transferred back to the start position (but the
activations and $net_i$-values of the recurrent net are not reset), and the
activation of the fifth input unit is set to 1.0 (0.0 otherwise).
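To make the cycle concrete, here is a minimal Python sketch of the computations described above. It is an illustration only: the variable names, the unit ordering, the logistic choice for $f$, and the weight initialization interval are assumptions of this sketch, not details of the actual implementation.
\begin{verbatim}
import math, random

N_IN, N_OUT, N_HID = 5, 4, 5
N = N_IN + N_OUT + N_HID             # assumed unit order: inputs, outputs, hidden
ACTIONS = ["north", "south", "east", "west"]   # one action per output unit

act = [0.0] * N                      # variable activations, initialized with 0.0
net = [0.0] * N                      # net inputs, initialized with 0.0
# randomly initialized weight w[i][j] on the connection from unit j to unit i
# (the initialization interval is an assumption of this sketch)
w = [[random.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(N)]

def f(x):                            # logistic squashing function (assumed)
    return 1.0 / (1.0 + math.exp(-x))

def cycle(blocked, goal_just_hit):
    """One cycle: read sensors, update stochastic units, select an action."""
    # input units 1-4: 0.0 if north/south/east/west is blocked or missing, else 1.0
    for d in range(4):
        act[d] = 0.0 if blocked[d] else 1.0
    # input unit 5: 1.0 right after the goal was hit, 0.0 otherwise
    act[4] = 1.0 if goal_just_hit else 0.0
    # each noninput unit i: net_i <- sum_j w_ij y_j, then y_i <- 1 with prob f(net_i)
    for i in range(N_IN, N):
        net[i] = sum(w[i][j] * act[j] for j in range(N))
        act[i] = 1.0 if random.random() < f(net[i]) else 0.0
    # select an action with probability proportional to the output units' values
    vals = [f(net[N_IN + k]) for k in range(N_OUT)]
    r = random.random() * sum(vals)
    for k, v in enumerate(vals):
        r -= v
        if r <= 0.0:
            return ACTIONS[k]
    return ACTIONS[-1]
\end{verbatim}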
Weight stacks.
The current weight of a connection represents its current policy.
For each connection $w_{ij}$, there is a stack. Following the SSA principle (section 1), whenever a weight is modified (see below), the following values are pushed onto $w_{ij}$'s stack:
the current time, the total cumulative reinforcement so far,
and the weight before the modification.
Weight restoration. After the goal is hit, or if the goal has not been hit for 10,000 cycles (in this case the animat is transferred back to the start position), right before the next cycle, each connection sequentially pops entries from its stack and restores the corresponding previous weight values, until its SSC (see section 1) is satisfied.
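For concreteness, the following Python sketch shows one possible implementation of a connection's stack and of this restoration step, under the SSC of section 1 (reinforcement per time step measured since each surviving stack entry must exceed reinforcement per time step measured since the entry below it, or since time 0 for the oldest entry). The names are illustrative, not taken from the actual implementation.
\begin{verbatim}
class Connection:
    """One connection viewed as an SSA agent: a weight plus its stack."""
    def __init__(self, weight):
        self.weight = weight
        self.stack = []   # entries: (push_time, cumulative_reinforcement, old_weight)

def ssa_restore(conn, now, total_reinf):
    """Pop entries and restore old weights until the SSC is satisfied."""
    def rate(t, r):
        # average reinforcement per time step accumulated since (t, r)
        return (total_reinf - r) / max(now - t, 1)

    while conn.stack:
        t_top, r_top, old_w = conn.stack[-1]
        if len(conn.stack) >= 2:
            t_prev, r_prev, _ = conn.stack[-2]
            baseline = rate(t_prev, r_prev)
        else:
            baseline = rate(0, 0.0)       # compare against the whole lifetime
        if rate(t_top, r_top) > baseline:
            break                         # top entry still part of a success story
        conn.stack.pop()                  # the modification did not pay off:
        conn.weight = old_w               # restore the previous weight
\end{verbatim}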
Weight modification.
So far, preliminary experiments were conducted
only with very simple weight
modification processes (corresponding to the PMP
from section 1).
Weight modification is executed after each weight restoration (see above), and works as follows. For each weight $w_{ij}$ do: with small probability (0.05), replace $w_{ij}$ by a randomly chosen value from a fixed interval.
Of course, many alternative weight
modification processes are possible, but this is irrelevant for
the purposes of this paper.
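A corresponding sketch of this simple modification process, reusing the Connection class from the previous sketch: the mutation interval WEIGHT_RANGE is a placeholder, since the actual interval is not reproduced here. Note how every modification first pushes the current time, the cumulative reinforcement, and the old weight onto the stack, as required above.
\begin{verbatim}
import random

MUT_PROB = 0.05                 # probability of mutating a given weight
WEIGHT_RANGE = (-10.0, 10.0)    # placeholder interval for new random weights

def ssa_modify(conn, now, total_reinf):
    """With small probability, replace the weight by a random value,
    first saving the old weight (plus the current time and cumulative
    reinforcement) on the connection's stack."""
    if random.random() < MUT_PROB:
        conn.stack.append((now, total_reinf, conn.weight))
        conn.weight = random.uniform(*WEIGHT_RANGE)
\end{verbatim}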
Measuring time. By definition, each computation involving some weight -- such as pushing, popping, or computing unit activations -- costs one time step. Other computations cost nothing. Still, measured time is of the order of total CPU time.
Where is the changing environment?
Although the animat is occasionally reset to its start position, the recurrent net activations and the $net_i$-values are not:
the past may influence the future.
Apart from this, each single connection's environment
continually changes, simply because
all the other connections in its environment keep changing.
However, all connections
receive the same global reinforcement signal. Hence,
for each connection, the only way to speed up its local
reinforcement intake is to contribute to speeding up global
reinforcement intake. Since no connection can solve the task by itself,
this enforces learning to cooperate.
Results without SSA. Again, for purposes of comparison, the system was first run with different weight initializations and with the weight-changing mechanism switched off. Often, the average number of cycles required to reach the goal exceeded 100,000.
Among the different weight initializations, one was picked that led to an average trial length of 2660 cycles (measured over 20,000,000 cycles). This initialization was used for SSA (see below), because the corresponding average trial length is of the same order of magnitude as the one obtained by using Sutton's original set-up.
Results with SSA.
With multi-weight SSA switched on, after about 10,000 ``trials'' (more precisely: 11,339,740 cycles), the average trial length was down to 89.
This corresponds to a speed-up factor of about 30.
The shortest trial ever took 14 cycles, which is the theoretical
optimum for Sutton's task.
In the end, no weight stack had more than 7 entries (each followed
by faster reinforcement intake than all the previous ones).
Ongoing work. It is intended to replace the random weight modification process by a process that may be strongly influenced by the current weights themselves. Following [30], the idea is to use activation patterns across special output units to address and modify the network's own weights. The recurrent net will then, in principle, be able to evolve its own weight modification strategies, replacing the random mutations by more directed self-mutations.