Here we will briefly review basic principles [Schmidhuber, Zhao, & Schraudolph, 1997a].
An agent lives in environment $E$ from time 0 to unknown time $T$. Life is one-way: even if it is decomposable into numerous consecutive ``trials'', time will never be reset. The agent has a policy POL (a set of modifiable parameters) and possibly an internal state $S$ (e.g., for short-term memory that can be modified by actions). Both $S$ and POL are variable dynamic data structures influencing probabilities of actions to be executed by the agent. Between time 0 and $T$, the agent repeats the following cycle over and over again (${\cal A}$ denotes a set of possible actions):

REPEAT: select and execute some action $a \in {\cal A}$ with conditional probability $P(a \mid E, S, POL)$.

Action $a$ will consume time and may change $E$, $S$, and POL.
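For concreteness, here is a minimal sketch of this cycle in Python. The representation of POL as a dictionary of action weights and the two toy actions are assumptions made purely for illustration; they are not part of the formal setup.

```python
import random

# Minimal sketch of the agent's lifelong REPEAT cycle (illustrative only).
# Assumption: POL maps each action to an unnormalized selection weight, and
# executing an action returns the amount of time it consumed.

def life(actions, POL, S, E, T):
    """Run the agent from time 0 until (here, known) time T."""
    t = 0.0
    while t < T:
        # select and execute some a in A with probability proportional to POL[a]
        a = random.choices(actions, weights=[POL[act] for act in actions], k=1)[0]
        dt = a(POL, S, E)   # executing a consumes time; it may change E, S, and POL
        t += dt
    return t

if __name__ == "__main__":
    # two dummy actions that only consume time
    noop = lambda POL, S, E: 1.0
    look = lambda POL, S, E: 2.0
    actions = [noop, look]
    life(actions, POL={noop: 0.7, look: 0.3}, S={}, E={}, T=10.0)
```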
Learning algorithms (LAs).
${\cal A}_{learn} \subseteq {\cal A}$ is the set of actions that can modify POL. ${\cal A}_{learn}$'s members are called learning algorithms (LAs).
Previous work on metalearning [Schmidhuber, Zhao, & Schraudolph, 1997a; Schmidhuber, Zhao, & Wiering, 1997b]
left it to the system itself to compute execution
probabilities of LAs by constructing and
executing POL-modifying
algorithms. In this paper we will focus
on a simpler
case where all LAs are externally triggered procedures.
Formally this means that the conditional probability of executing an LA is 1 if $S$ is in a given ``policy modification state'' determined by the user, and 0 otherwise. For instance, in some of the illustrative
experiments to be presented later, SHC's mutation
process will be the only LA. Alternatively we
may use LAs that are in fact GP-like crossover
strategies, or a wide variety of other policy-modifying
algorithms employed in traditional DS methods.
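As an illustration, the following sketch shows what such an externally triggered LA might look like: an SHC-style mutation of a real-valued policy parameter that also records the overwritten value, so the modification can later be undone. The dictionary representation of POL and the Gaussian perturbation scale are assumptions made for this example only.

```python
import random

# Sketch of one externally triggered LA: an SHC-like mutation of POL.
# Assumptions: POL is a dict of real-valued parameters; undo_stack collects
# (parameter, old_value) records so the modification can later be undone.

def mutate(POL, undo_stack, sigma=0.1):
    """Perturb one randomly chosen POL component and remember its old value."""
    key = random.choice(list(POL))
    old_value = POL[key]
    undo_stack.append((key, old_value))      # needed for later backtracking
    POL[key] = old_value + random.gauss(0.0, sigma)
```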
Checkpoints.
The entire lifetime of the agent can be partitioned into
time intervals separated by special times called checkpoints.
In general, checkpoints are computed dynamically during the agent's life
by certain actions in
executed according to
POL itself [Schmidhuber, Zhao, & Schraudolph, 1997a].
In this paper,
however, for simplicity all checkpoints will be set in advance.
The agent's $i$-th checkpoint is denoted $t_i$.
Checkpoints obey the following rules:
(1) $\forall i: 0 < t_i < T$.
(2) $\forall i < j: t_i < t_j$.
(3) Except for the first, checkpoints may not occur
before at least one POL-modification has been computed
by some LA since the previous checkpoint.
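A trivial pre-set schedule satisfying these rules could look as follows; the lifetime and the spacing of 100 time units are arbitrary values chosen only for illustration.

```python
# Pre-set checkpoints: strictly increasing and strictly inside (0, T), with the
# convention (used later) that at least one LA runs between consecutive
# checkpoints. T and the spacing are arbitrary choices for this sketch.
T = 10_000.0
checkpoints = [100.0 * (i + 1) for i in range(99)]   # t_1 < t_2 < ... < T
assert all(0.0 < a < b < T for a, b in zip(checkpoints, checkpoints[1:]))
```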
Sequences of POL-modifications (SPMs).
SPM$_i$ denotes the sequence of POL-modifications computed by LAs in between checkpoints $t_i$ and $t_{i+1}$. Since LA execution probabilities may
depend on POL, POL may in principle
influence the way it modifies itself, which is interesting
for things such as metalearning.
In this paper's experiments, however, we will focus on
the case where exactly one LA (a simple mutation process
like the one used in stochastic hill-climbing) is invoked in
between two subsequent checkpoints.
Reward and goal.
Occasionally $E$ provides real-valued reward. The cumulative reward obtained by the agent between time 0 and time $t > 0$ is denoted by $R(t)$, where $R(0) = 0$.
Typically, in large (partially observable) environments,
maximizing cumulative expected reinforcement within
a limited life-time would be too ambitious a goal
for any method. Instead, designers of direct policy search methods are content with procedures that can be expected to find better and better policies. But what exactly does ``better'' mean in our context? Our agent's obvious goal at checkpoint $t$ is to generate POL-modifications accelerating reward intake: it wants to let $\frac{R(T) - R(t)}{T - t}$ exceed the current average speed of reward intake. But to determine this speed it needs a previous point $t' < t$ to compute $\frac{R(t) - R(t')}{t - t'}$. How can $t'$ be specified in a general yet reasonable way?
Or, to rephrase the central question of this paper:
if life involves unknown temporal delays between action
sequences and their observable effects and/or consists
of many successive ``trials''
with nondeterministic, uncertain outcomes,
how long should a trial last, and
how many trials should the agent look back into time to
evaluate its current performance?
The success-story algorithm (to be introduced next) addresses this question in a way that is prepared for the worst case: once a trial of a new policy change has begun, it will never end unless the policy change itself gets undone (by a simple backtracking mechanism that ensures trial starts mark long-term reward accelerations).
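To make the reward-speed comparison concrete before the algorithm is introduced, here is a small numeric illustration; the numbers are invented for this example only.

```latex
% Invented numbers, purely for illustration of the reward-speed comparison.
% Suppose the agent is at checkpoint t = 1500 with cumulative reward R(t) = 100,
% and consider a candidate earlier point t' = 1000 with R(t') = 40. Then
\frac{R(t) - R(t')}{t - t'} = \frac{100 - 40}{1500 - 1000} = 0.12
\qquad \text{versus} \qquad
\frac{R(t)}{t} = \frac{100}{1500} \approx 0.067 .
% Measured against this particular t', the recent policy modifications look like
% a reward acceleration; a different choice of t' could reverse the verdict,
% which is exactly why the choice of reference points matters.
```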
Enforcing success stories.
Let $V$ denote the agent's time-varying set of past checkpoints that have been followed by long-term reward accelerations. Initially $V$ is empty; $v_k$ denotes the $k$-th element of $V$ in ascending order. The success-story criterion (SSC) is satisfied at time $t$ if either $V$ is empty (trivial case) or if

$$\frac{R(t)}{t} \;<\; \frac{R(t) - R(v_1)}{t - v_1} \;<\; \frac{R(t) - R(v_2)}{t - v_2} \;<\; \ldots \;<\; \frac{R(t) - R(v_{|V|})}{t - v_{|V|}}. \qquad (1)$$

That is, each checkpoint in $V$ must mark the beginning of a long-term reward acceleration measured up to the current time $t$. The success-story algorithm (SSA) is invoked at every checkpoint and enforces SSC as follows:

1. WHILE SSC is not satisfied: Undo all POL modifications made since the most recent checkpoint in $V$, and remove that checkpoint from $V$.

2. Add the current checkpoint to $V$.
``Undoing'' a modification means restoring the preceding POL --
this requires storing past values of
POL components on a stack prior to modification.
Thus each POL modification that survived SSA is part of a bias shift
generated after a checkpoint marking a lifelong reward speed-up:
the remaining checkpoints in $V$ and the remaining
policy modifications represent a ``success story.''
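The following sketch shows one way such a stack-based implementation might look in Python, assuming pre-set checkpoints, a dictionary-valued POL, and cumulative reward $R(t)$ available at each checkpoint. The class and method names are illustrative, not taken from the cited work.

```python
class SSA:
    """Sketch of the success-story algorithm with stack-based undo.

    Each stack entry corresponds to one element v_k of V and stores
    (v_k, R(v_k), undo records for the POL-modifications computed after v_k).
    Modifications made before the first element of V are treated as permanent
    in this simplified sketch.
    """

    def __init__(self):
        self.stack = []   # entries: [v_k, R(v_k), [(param, old_value), ...]]

    def record_modification(self, param, old_value):
        """Called by an LA whenever it changes POL[param]."""
        if self.stack:
            self.stack[-1][2].append((param, old_value))

    def ssc_satisfied(self, t, R_t):
        """Check inequality (1): reward/time ratios must strictly increase
        along V; trivially true if V is empty."""
        prev_ratio = R_t / t                    # average reward speed since birth
        for v_k, R_vk, _ in self.stack:         # ascending order of v_k
            ratio = (R_t - R_vk) / (t - v_k)
            if ratio <= prev_ratio:
                return False
            prev_ratio = ratio
        return True

    def at_checkpoint(self, t, R_t, POL):
        """Invoke SSA at checkpoint t with cumulative reward R(t)."""
        # 1. WHILE SSC is not satisfied: undo all POL-modifications made since
        #    the most recent checkpoint in V and remove that checkpoint from V.
        while not self.ssc_satisfied(t, R_t):
            _, _, undo_records = self.stack.pop()
            for param, old_value in reversed(undo_records):
                POL[param] = old_value
        # 2. Add the current checkpoint to V (its modification list fills up
        #    as subsequent LA calls are recorded).
        self.stack.append([t, R_t, []])
```

For simplicity the check above walks the whole stack; as noted under ``Implementing SSA'' below, an efficient implementation only needs to consider the two topmost still valid entries at each invocation.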
Trials and ``effective'' trials.
All checkpoints in $V$ represent starts of yet unfinished ``effective'' trials
(as opposed to externally defined trials with prewired
starts and ends).
No effective trial ever ends unless SSA restores the policy
to its state before the trial started.
The older a surviving SPM, the more time will have passed to collect statistics concerning its long-term consequences, and the more stable it will be if it is indeed useful and not just there by chance.
Since trials of still valid policy modifications never end, they embody a principled way of dealing with unknown reward delays. No matter what the maximal causal delay in a given environment is, eventually the system's effective trials will encompass it.
Metalearning? An interesting example of long delays between actions and effects is provided by metalearning (learning a learning algorithm or credit assignment algorithm) [Lenat, 1983; Schmidhuber, 1987; Schmidhuber, Zhao, & Schraudolph, 1997a]. For instance, suppose some ``metalearning action sequence'' changes a learner's credit assignment strategy. To evaluate whether the change is good or bad, we apparently need something like a ``meta-trial'' subsuming several lower-level ``normal'' trials in which instances of additional policy changes produced by the modified learning strategy somehow get evaluated, to collect evidence concerning the quality of the modified learning strategy itself. Due to their very nature such meta-trials will typically consume a lot of time. SSA, however, addresses this issue in a natural and straightforward way. There is no explicit difference between trials and meta-trials. But new effective trials do get opened within previously started, yet unfinished effective trials. What does this mean? It means that earlier SPMs automatically get evaluated as to whether they set the stage for later ``good'' SPMs. For instance, SSA will eventually discard an early SPM that changed the policy in a way that increased the probability of certain later SPMs causing a waste of time on evaluations of useless additional policy changes. That is, SSA automatically measures (in terms of reward/time ratios affected by learning and testing processes) the impact of early learning on later learning: SSA prefers SPMs making ``good'' future SPMs more likely. Given action sets that allow for composition of general credit assignment strategies from simple LAs, SSA will prefer probabilistic learning algorithms leading to better probabilistic learning algorithms. And it will end meta-trials as soon as they violate the constraints imposed by the success-story criterion, just as it does with ``normal'' trials.
Implementing SSA.
SSA guarantees
that SSC will be satisfied after each checkpoint,
even in partially observable, stochastic environments
with unknown delays. (Of course, later SSA invocations
using additional, previously unavailable, delayed information
may undo policy changes
that survived the current checkpoint.)
Although inequality (1) contains $|V|$ fractions,
SSA can be implemented efficiently: only
the two most recent still valid sequences of modifications
(those ``on top of the stack'') need to be considered at a given
time in an SSA invocation.
A single SSA invocation, however, may invalidate
many SPMs if necessary.
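To show how the pieces above interact, here is a hypothetical driver that reuses the SSA class and the mutation idea sketched earlier. The toy environment, which pays reward at a rate equal to a single policy parameter, is an invented assumption; it merely makes reward accelerations (and their absence) observable.

```python
import random

def run(ssa_cls, num_checkpoints=20, spacing=10.0):
    """Toy run: one mutation between consecutive, pre-set checkpoints; SSA is
    invoked at every checkpoint and undoes changes violating SSC."""
    POL = {"x": 0.0}          # single real-valued policy parameter
    ssa = ssa_cls()
    R, t = 0.0, 0.0
    for i in range(1, num_checkpoints + 1):
        # exactly one LA invocation (an SHC-like mutation) since the last checkpoint
        old_value = POL["x"]
        ssa.record_modification("x", old_value)
        POL["x"] = old_value + random.gauss(0.0, 0.5)
        # act until the next checkpoint; reward arrives at rate POL["x"]
        checkpoint = spacing * i
        R += (checkpoint - t) * POL["x"]
        t = checkpoint
        ssa.at_checkpoint(t, R, POL)   # enforce the success-story criterion
    return POL, R

if __name__ == "__main__":
    run(SSA)   # SSA class from the sketch above
```

In this toy setting SSA behaves like a stochastic hill-climber on the single parameter: mutations followed by a higher reward speed survive as part of the success story, while the rest are eventually backtracked.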