The currently surprising module wants to repeat similar surprises with higher probability in the future. The other wants to avoid further surprises by learning not to agree on similar computation sequences (implicitly learning what the other already knows). And it wants to be ``creative'' in the sense that it wants to generate new surprises for the other module instead. In principle, both can learn by executing subsequences of instructions that include LIs. How can we ensure that each module indeed improves by accelerating its reward intake?
In this chapter
I will use the IS paradigm
[35,34]
to deal with both modules'
complex spatio-temporal credit assignment problem.
This does not necessarily mean that IS is the best way of doing so.
Other RL paradigms may be appropriate, too --
this chapter's basic ideas are independent of
the choice of RL method.
IS seems attractive, however, because: (1) It does not explicitly distinguish
between learning algorithms and other instructions, or between
learning, metalearning, metametalearning, etc.
(2) It properly takes into account
that the success of each module modification recursively depends on the
success of all later modifications for which it is setting the stage.
(3) Its objective takes into account the entire
time consumed by lifelong learning itself. (4) It
is designed for quite general non-Markovian credit
assignment problems in lifelong learning situations -- see
[35,34,31] for
recent IS applications.
Following [34], the remainder of this section will describe one module's IS-based learning process; the other module's is analogous.
Checkpoints.
The module's entire lifetime can be partitioned into
time intervals separated by special times called checkpoints.
Checkpoints are computed dynamically during the learner's life by certain
instructions in the instruction set, executed according to the modules themselves.
The module's $i$-th checkpoint is denoted $c_i$.
Checkpoints obey the following rules:
(1) $\forall i:\; 0 < c_i < T$, where $T$ denotes the module's (unknown) lifetime.
(2) $\forall i < j:\; c_i < c_j$.
(3) Except for the first, checkpoints may not occur
before at least one LI has executed at least one module modification
since the previous checkpoint.
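To make the rules concrete, the following is a minimal sketch (in Python, with illustrative names and data layout that are not taken from the implementation in the appendix) of a check of rules (1)-(3) for one module, given recorded checkpoint times and module-modification times.
\begin{verbatim}
def checkpoints_valid(checkpoints, lifetime, modification_times):
    """checkpoints: list of checkpoint times c_1, c_2, ... in order of creation;
    lifetime: the module's lifetime T;
    modification_times: times at which LIs executed module modifications."""
    for i, c in enumerate(checkpoints):
        # Rule (1): every checkpoint lies strictly between 0 and T.
        if not (0 < c < lifetime):
            return False
        # Rule (2): checkpoints occur in strictly increasing order.
        if i > 0 and c <= checkpoints[i - 1]:
            return False
        # Rule (3): except for the first checkpoint, at least one module
        # modification must have occurred since the previous checkpoint.
        if i > 0 and not any(checkpoints[i - 1] <= m < c
                             for m in modification_times):
            return False
    return True
\end{verbatim}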
Sequences of module modifications.
SLM$_i$ denotes the sequence of module modifications
(SLM) computed by LIs between checkpoints $c_i$ and $c_{i+1}$.
Since LI execution probabilities depend on both modules,
the modules can in fact modify the way they modify themselves -- they
can devise their own probabilistic learning algorithms.
Goal.
At some checkpoint $t$, the module's goal is to
generate module modifications that
will accelerate long-term reward intake: it wants
to let the value of $\frac{R(T) - R(t)}{T - t}$ exceed
the current average reward intake $\frac{R(t)}{t}$, where $R(\tau)$ denotes the module's cumulative reward up to time $\tau$.
Success-story criterion (SSC).
The module maintains a time-varying set $V$ of past
checkpoints that have led to
long-term reward accelerations. Initially $V$ is empty.
Let $v_k$ denote the $k$-th element of $V$ in ascending order.
SSC is satisfied at time $t$ if either $V$ is empty (trivial case) or if
\begin{equation}
\frac{R(t)}{t} < \frac{R(t) - R(v_1)}{t - v_1} < \frac{R(t) - R(v_2)}{t - v_2} < \ldots < \frac{R(t) - R(v_{|V|})}{t - v_{|V|}}.
\end{equation}
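To illustrate the criterion, here is a minimal sketch of an SSC test, assuming the set $V$ is kept as an ascending list and that a function R returns the module's cumulative reward up to a given time; the names are illustrative and not part of the implementation described in the appendix.
\begin{verbatim}
def ssc_satisfied(V, t, R):
    """V: ascending list of currently valid checkpoints v_1 < v_2 < ...;
    t: current time (t > 0); R(x): cumulative reward up to time x."""
    if not V:                  # trivial case: an empty V satisfies SSC
        return True
    rates = [R(t) / t]         # reward per time over the entire life so far
    rates += [(R(t) - R(v)) / (t - v) for v in V]
    # Inequality (1): the reward rates must be strictly increasing.
    return all(a < b for a, b in zip(rates, rates[1:]))
\end{verbatim}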
Success-story algorithm (SSA).
At every checkpoint of the module we invoke the success-story algorithm (SSA):

1. WHILE SSC is not satisfied: undo all module modifications made since the most recent checkpoint in $V$, and remove that checkpoint from $V$.

2. Add the current checkpoint to $V$.
``Undoing'' a modification means restoring the module as it was before the modification --
this requires storing past values of module components on a stack prior to modification.
(Components of the module and elements of $V$ can be stored
on the same stack -- see the appendix.)
Thus each module modification that survived SSA is part of a
bias shift generated after a checkpoint marking a lifelong
reward speed-up: since $v_k$ there has been more reward per
time than since $v_{k-1}$, for $k = 2, \ldots, |V|$, and since $v_1$ there has been more reward per time than during the entire life so far.
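The following sketch puts SSA and the stack-based undo together under simplifying assumptions: the module's policy is represented as a dictionary of components, and each stack block records only the old values overwritten by the corresponding SLM (the appendix instead stores policy components and elements of $V$ on a single shared stack within an assembler-like setting).
\begin{verbatim}
def ssa_call(t, R, policy, stack):
    """One SSA call at checkpoint time t.
    R(x): cumulative reward up to time x;
    policy: dict mapping the module's component names to their current values;
    stack: list of blocks (checkpoint_time, saved_old_values), one block per
    surviving SLM, oldest at the bottom; the checkpoint times stored in the
    blocks play the role of the set V (ascending from bottom to top)."""

    def ssc_holds():
        V = [block[0] for block in stack]
        if not V:
            return True
        rates = [R(t) / t] + [(R(t) - R(v)) / (t - v) for v in V]
        return all(a < b for a, b in zip(rates, rates[1:]))

    # 1. WHILE SSC is not satisfied: undo all module modifications made since
    #    the most recent checkpoint in V, and remove that checkpoint from V.
    while not ssc_holds():
        _, saved_old_values = stack.pop()
        policy.update(saved_old_values)  # restore the overwritten components

    # 2. Add the current checkpoint to V: open a new, initially empty block
    #    that later module modifications will fill with saved old values.
    stack.append((t, {}))
\end{verbatim}
In this sketch, an LI that modifies a component would first record the component's current value in the block on top of the stack (unless already recorded for the current SLM) and only then overwrite it, so that popping blocks in reverse order restores the module exactly as it was at the corresponding checkpoint.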
Timing SSA calls.
Between two checkpoints the module is temporarily
protected from SSA evaluations.
Since the way of setting checkpoints
depends on the module itself, the module can learn
to influence when it gets evaluated.
This evaluation-timing ability
is important in dealing with unknown reward delays.
SSA's generalization assumption.
At the end of each SSA call,
until the beginning of the next one,
the only temporary generalization
assumption for inductive inference is:
module modifications that survived all previous SSA calls
will remain useful. In the absence of empirical evidence to the
contrary, each surviving SLM
is assumed to have set the stage for
later successful SLMs.
Since life is one-way (time is never reset),
during each SSA call the system has
to generalize from
a single experience concerning
the utility of
module modifications executed after any given
previous point in time: the average reward per
time since then.
Implementing SSA.
Using stack-based backtracking methods such as those described
in [35,34] and
the appendix, one can guarantee that SSC will be satisfied
right after each new SLM-start, despite interference
from the other module and from the environment.
Although inequality (1) contains $|V|+1$ fractions,
SSA can be implemented efficiently: only
the two SLMs on top of the stack need to be considered at
any given time in an SSA call (see details in the appendix).
A single SSA call, however, may undo
many SLMs if necessary.
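The sketch below illustrates that efficiency claim, extending the blocks of the previous sketch so that each also stores the cumulative reward observed when its checkpoint was created (an assumption for illustration). The undo loop then compares only the reward rate since the topmost checkpoint with the rate since the checkpoint below it, or with the lifelong rate if a single block remains; why this local comparison suffices to restore SSC is shown in [35,34] and the appendix.
\begin{verbatim}
def ssa_call_efficient(t, R_t, policy, stack):
    """Efficient variant of one SSA call at time t.
    R_t: cumulative reward observed at time t;
    policy: dict of module components;
    stack: list of blocks (checkpoint_time, reward_at_checkpoint,
    saved_old_values), oldest at the bottom."""

    def rate_since(v, R_v):
        return (R_t - R_v) / (t - v)

    while stack:
        v_top, R_top, saved_old_values = stack[-1]
        if len(stack) >= 2:
            v_below, R_below, _ = stack[-2]
            reference = rate_since(v_below, R_below)  # rate since checkpoint below
        else:
            reference = R_t / t                       # lifelong average reward rate
        if rate_since(v_top, R_top) > reference:
            break                        # topmost checkpoint still marks a speed-up
        stack.pop()                      # undo this SLM ...
        policy.update(saved_old_values)  # ... by restoring the old component values

    # Add the current checkpoint by opening a new block, as before.
    stack.append((t, R_t, {}))
\end{verbatim}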
What has been described for one module analogously holds for the other.
The appendix describes a particular implementation
(the one used for the experiments)
based on an assembler-like programming language
similar to those used in
[34,26].