This section will show that ALS can use experience to significantly reduce the average search time consumed by successive LS calls when the agent faces more and more complex tasks (inductive transfer), and that ALS can be further improved by plugging it into SSA.
Task. Figure 2 shows a maze and 7 different goal positions marked 1, 2, ..., 7. Given a goal, the task is to reach it from the start state. Each goal is further away from the start than all goals with lower numbers. We create 4 different ``goalsets''; goalset $i$ ($i = 1, \ldots, 4$) contains goals $1, 2, \ldots, 3+i$. One simulation consists of 40 ``epochs'' $E_1, E_2, \ldots, E_{40}$. During epochs $E_{10(i-1)+1}$ to $E_{10i}$, all goals in goalset $i$ have to be found in order of their distances to the start.
Finding a goal yields reward 1.0 divided by solution path length (short
paths preferred). There is no negative reward for collisions.
During each epoch, we update the policy's probability matrices whenever a goal is found (for all epochs dealing with goalset $i$ there are $3+i$ updates). For each epoch we store the total number of steps required to find all goals in the corresponding goalset.
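For concreteness, the schedule just described can be summarized by the following sketch (Python-style pseudocode). All identifiers are ours; in particular, search_and_learn merely stands for one (SSA+)ALS call that searches for a goal and then updates the policy's probability matrices. The sketch only mirrors the epoch/goalset structure and the reward rule stated above.

\begin{verbatim}
# Sketch of the experimental schedule (illustrative identifiers only).
# search_and_learn(policy, goal) stands for one (SSA+)ALS call: it returns
# the solution path and the number of time steps consumed, and it updates
# the policy's probability matrices once the goal has been found.

def run_simulation(policy, search_and_learn):
    steps_per_epoch = []
    for epoch in range(1, 41):             # epochs E_1 .. E_40
        i = (epoch - 1) // 10 + 1          # goalset index 1..4 (10 epochs each)
        goalset = range(1, 3 + i + 1)      # goalset i contains goals 1 .. 3+i
        steps_this_epoch = 0
        for goal in goalset:               # goals ordered by distance from start
            path, steps = search_and_learn(policy, goal)
            steps_this_epoch += steps
            # reward for this goal: 1.0 / len(path); no negative reward
            # for collisions
        steps_per_epoch.append(steps_this_epoch)
    return steps_per_epoch
\end{verbatim}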
We compare two variants of incremental learning, METHOD 1 and METHOD 2; METHOD 2 is the ``fully incremental'' variant discussed below.
Comparison. We compare LS by itself, ALS by itself, and SSA+ALS,
for both METHODs 1 and 2.
LS results. Run by itself, without any kind of incremental learning or inductive transfer, LS needs a huge number of time steps to find the most distant goal 7 (see the ``LS (last goal only)'' row of Table 1).
Learning rate influence. To find optimal learning rates minimizing the total number of steps during simulations of ALS and SSA+ALS, we tried all learning rates in $\{0.01, 0.02, \ldots, 0.95\}$. We found that SSA+ALS is fairly learning rate independent: it solves all tasks with all learning rates in acceptable time, whereas for ALS without SSA (and METHOD 2) only small learning rates are feasible; large subranges of the learning rate interval do not work for many goals. Thus, a first type of SSA-generated speed-up lies in the lower expected search time over the range of possible learning rates.
With METHOD 1, ALS performs best with a particular fixed learning rate, and SSA+ALS performs best with a different one plus additional uniform noise (noise tends to improve SSA+ALS's performance a little, but worsens ALS's performance). With METHOD 2, ALS and SSA+ALS again perform best with different fixed learning rates, the latter with added noise.
| Algorithm           | METHOD | Goalset 1 | Goalset 2 | Goalset 3 | Goalset 4 |
|---------------------|--------|-----------|-----------|-----------|-----------|
| LS (last goal only) |        | 4.3       | 1,014     | 9,505     | 17,295    |
| LS                  |        | 8.7       | 1,024     | 10,530    | 27,820    |
| ALS                 | 1      | 12.9      | 382       | 553       | 650       |
| SSA + ALS           | 1      | 12.2      | 237       | 331       | 405       |
| ALS                 | 2      | 13.0      | 487       | 192       | 289       |
| SSA + ALS           | 2      | 11.5      | 345       | 85        | 230       |
For METHODs 1 and 2 and all goalsets $i$ ($i = 1, \ldots, 4$), Table 1 lists the numbers of steps required by LS, ALS, and SSA+ALS to find all of goalset $i$'s goals during epoch $E_{10(i-1)+1}$, the epoch in which the agent encounters the goalset's goal positions for the first time.
ALS versus LS. ALS performs much better than LS on goalsets 2, 3, and 4. ALS does not help to improve performance on goalset 1, though (epoch $E_1$), because there are many easily discoverable programs solving the first few goals.
SSA+ALS versus ALS. SSA+ALS always outperforms ALS by itself. For optimal learning rates, the speed-up for METHOD 1 ranges from 6% to 67%, and for METHOD 2 from 13% to 126% (see Table 1). Recall, however, that there are many learning rates for which ALS by itself completely fails, while SSA+ALS does not: SSA+ALS is more robust.
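The quoted ranges can be recomputed from Table 1. The short check below assumes that ``speed-up'' means the ratio of ALS's step count to SSA+ALS's step count minus one; this formula is our assumption, chosen because it reproduces the METHOD 1 endpoints exactly.

\begin{verbatim}
# Recompute the per-goalset speed-ups of SSA+ALS over plain ALS from Table 1,
# assuming speed-up = steps_ALS / steps_SSA_ALS - 1.
table1 = {
    ("ALS", 1):       [12.9, 382, 553, 650],
    ("SSA + ALS", 1): [12.2, 237, 331, 405],
    ("ALS", 2):       [13.0, 487, 192, 289],
    ("SSA + ALS", 2): [11.5, 345, 85, 230],
}
for method in (1, 2):
    speedups = [a / s - 1.0 for a, s in zip(table1[("ALS", method)],
                                            table1[("SSA + ALS", method)])]
    print(f"METHOD {method}:", ", ".join(f"{x:.0%}" for x in speedups))
# METHOD 1: 6%, 61%, 67%, 60%    (range 6% .. 67%)
# METHOD 2: 13%, 41%, 126%, 26%  (range 13% .. 126%)
\end{verbatim}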
Example of bias shifts undone. For optimal learning rates, the biggest speed-up occurs for goalset 3 under METHOD 2 (192 versus 85 in Table 1). Here SSA decreases search costs
dramatically: after goal 5 is found, the policy ``overfits''
in the sense that it is too much biased towards problem 5's optimal
(lowest complexity) solution: (1) Repeat step forward until blocked,
rotate left. (2) Jump (1,11). (3) Repeat step forward until blocked,
rotate right. (4) Repeat step forward until blocked, rotate right.
Problem 6's optimal solution can be obtained from this by replacing
the final instruction
by (4) Jump (3,3). This is a significant change, though, since it affects 3 probability distributions, and therefore requires time.
Problem 5, however, can also be solved
by replacing its lowest complexity solution's
final instruction by (4) Jump (3,1).
This increases complexity but makes learning problem 6 easier, because
less change is required. After problem 5 has been solved using
the lowest complexity solution, SSA eventually suspects
``overfitting'' because too much computation time goes by without
sufficient new rewards. Before discovering goal 6, SSA undoes apparently
harmful probability shifts until the SSC is satisfied again. This makes
Jump instructions more likely and speeds up the search for
a solution to problem 6.
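To make the undoing mechanism concrete, here is a schematic rendering of SSA's stack-based backtracking. It assumes that every policy modification is pushed onto a stack together with the time and cumulative reward at the moment it was made, and it paraphrases the success-story criterion (SSC) as requiring the reward per time measured since each still-valid modification (and since the start) to increase strictly from oldest to newest. The names are ours, and the code is a sketch rather than the original implementation.

\begin{verbatim}
# Schematic sketch of SSA backtracking at a checkpoint (illustrative names).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Modification:
    time: float               # time at which the policy modification was made
    reward: float             # cumulative reward at that time
    undo: Callable[[], None]  # restores the overwritten probability distributions

def ssc_holds(stack: List[Modification], now: float, total_reward: float) -> bool:
    # Reward/time ratios measured since the start and since each still-valid
    # modification must increase strictly from oldest to newest (paraphrased SSC).
    ratios = [total_reward / now] + [
        (total_reward - m.reward) / (now - m.time) for m in stack]
    return all(a < b for a, b in zip(ratios, ratios[1:]))

def ssa_checkpoint(stack: List[Modification], now: float, total_reward: float) -> None:
    # Undo the most recent, apparently harmful probability shifts until the
    # SSC is satisfied again (assumes now > 0 and now > m.time for every m).
    while stack and not ssc_holds(stack, now, total_reward):
        stack.pop().undo()
\end{verbatim}

In the example above, the modifications that overfit the policy to problem 5's lowest complexity solution would be popped off this stack before goal 6 is found, which is what makes the Jump instructions likely again.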
METHOD 1 versus METHOD 2. METHOD 2 works much better than METHOD 1 on goalsets 3 and 4, but not as well on goalset 2 (for goalset 1 both methods are identical; the differences in performance are explained by the different learning rates, which were optimized for the total task). Why? Optimizing
a policy for goals 1--4 will not necessarily help to speed up discovery
of goal 5, but instead cause a harmful bias shift by overtraining the
probability matrices. METHOD 1, however, can extract enough useful
knowledge from the first 4 goals to decrease search costs for goal 5.
More SSA benefits. Table 2 lists the number of steps consumed during the final epoch $E_{10i}$ of each goalset $i$ (the results of LS by itself are identical to those in Table 1). Using SSA typically improves the final result, and never worsens it. Speed-up factors range from 0 to 560%.
For all goalsets, Table 3 lists the total number of steps consumed during all epochs of one simulation, the total number of steps for those epochs ($E_1$, $E_{11}$, $E_{21}$, $E_{31}$) in which new goalsets are introduced, and the total number of steps required for the final epochs ($E_{10}$, $E_{20}$, $E_{30}$, $E_{40}$). SSA always
improves the results. For the total number of steps -- which is an
almost linear function of the time consumed during the simulation -- the
SSA-generated speed-up is 60% for METHOD 1 and 108% for METHOD 2 (the
``fully incremental'' method). Although METHOD 2 speeds up performance
during each goalset's first epoch (ignoring the costs that occurred before
introduction of this goalset),
final results are better without inter-goalset learning.
This is not so surprising: by using policies optimized for previous
goalsets, we generate bias
shifts for speeding up discovery of new, acceptable solutions, without
necessarily making optimal solutions of future tasks more likely
(due to ``evolutionary ballast'' from previous solutions).
LS by itself needs by far the most steps for finding all goals in goalset 4; recall that most of them are spent on finding goal 7 alone. Using inductive transfer, however, we obtain large speed-up factors. METHOD 1 with SSA+ALS improves performance by a factor in excess of 40 (see the results of SSA+ALS for the first epoch of goalset 4 in Table 1). Figure 3(A) plots performance against
epoch numbers. Each time the goalset changes, initial search costs
are large (reflected by sharp peaks). Soon, however, both methods
incorporate experience into the policy. We see that SSA keeps initial
search costs significantly lower.
The safety net effect. Figure 3(B) plots the average probability of programs computing solutions against the epoch number. With METHOD 1, SSA+ALS tends to keep the probabilities lower than ALS by itself: high program probabilities are not always beneficial. With METHOD 2, SSA undoes many policy modifications when goalsets change, thus keeping the policy flexible and reducing initial search costs.
Effectively, SSA is controlling the prior on the search space such that
overall average search time is reduced, given a particular task sequence.
For METHOD 1, after the final epoch $E_{40}$ the number of still valid modifications of policy components (probability distributions) is 377 for ALS, but only 61 for SSA+ALS (therefore, 61 is SSA+ALS's total final stack size). For METHOD 2, the corresponding numbers are 996 and 63. We see that SSA keeps only about 16% and 6%, respectively, of all modifications. The discarded modifications are deemed unworthy because they have not been observed to trigger lifelong reward speed-ups.
Clearly, SSA prevents ALS from overdoing its policy modifications
(``safety net effect''). This is SSA's simple, basic purpose: undo
certain learning algorithms' policy changes and bias shifts once they
start looking harmful in terms of long-term reward/time ratios.
It should be clear that the SSA+ALS implementation is just one of many possible SSA applications -- we may plug many alternative learning algorithms into the basic cycle.
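To illustrate that last point, the basic cycle can be written so that the plugged-in learner is arbitrary. The sketch below reuses Modification and ssa_checkpoint from the earlier sketch; interact_and_learn may wrap ALS or any other algorithm that modifies the policy and can undo its own modification. Names and structure are again purely illustrative.

\begin{verbatim}
# Generic SSA cycle around an arbitrary plug-in learner (illustrative sketch;
# reuses Modification and ssa_checkpoint from the earlier sketch).
# interact_and_learn(policy) -> (reward, time_spent, undo): one block in which
# the plugged-in learner (e.g. ALS) searches, collects reward, and finally
# modifies the policy; undo() restores the policy's previous state.

def ssa_cycle(policy, interact_and_learn, n_blocks):
    stack, now, total_reward = [], 0.0, 0.0
    for _ in range(n_blocks):
        reward, time_spent, undo = interact_and_learn(policy)
        now += time_spent
        total_reward += reward
        ssa_checkpoint(stack, now, total_reward)  # evaluate earlier modifications
        stack.append(Modification(time=now, reward=total_reward, undo=undo))
    return stack   # still-valid modifications ("the stack"); the most recent
                   # one is evaluated at the next checkpoint
\end{verbatim}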