Here we will briefly review basic principles [Schmidhuber, Zhao, SchraudolphSchmidhuber et al.1997a]. An agent lives in environment from time 0 to unknown time . Life is one-way: even if it is decomposable into numerous consecutive ``trials'', time will never be reset. The agent has a policy POL (a set of modifiable parameters) and possibly an internal state (e.g., for short-term memory that can be modified by actions). Both and POL are variable dynamic data structures influencing probabilities of actions to be executed by the agent. Between time 0 and , the agent repeats the following cycle over and over again ( denotes a set of possible actions):
select and execute some with conditional probability
Action will consume time and may change , , and POL.
Learning algorithms (LAs). is the set of actions that can modify POL. 's members are called learning algorithms (LAs). Previous work on metalearning [Schmidhuber, Zhao, SchraudolphSchmidhuber et al.1997a,Schmidhuber, Zhao, WieringSchmidhuber et al.1997b] left it to the system itself to compute execution probabilities of LAs by constructing and executing POL-modifying algorithms. In this paper we will focus on a simpler case where all LAs are externally triggered procedures. Formally this means that if is in a given ``policy modification state'' determined by the user, and otherwise. For instance, in some of the illustrative experiments to be presented later, SHC's mutation process will be the only LA. Alternatively we may use LAs that are in fact GP-like crossover strategies, or a wide variety of other policy-modifying algorithms employed in traditional DS methods.
Checkpoints. The entire lifetime of the agent can be partitioned into time intervals separated by special times called checkpoints. In general, checkpoints are computed dynamically during the agent's life by certain actions in executed according to POL itself [Schmidhuber, Zhao, SchraudolphSchmidhuber et al.1997a]. In this paper, however, for simplicity all checkpoints will be set in advance. The agent's -th checkpoint is denoted . Checkpoints obey the following rules: (1) . (2) (3) Except for the first, checkpoints may not occur before at least one POL-modification has been computed by some LA since the previous checkpoint.
Sequences of POL-modifications (SPMs). SPM denotes the sequence of POL-modifications computed by LAs in between and . Since LA execution probabilities may depend on POL, POL may in principle influence the way it modifies itself, which is interesting for things such as metalearning. In this paper's experiments, however, we will focus on the case where exactly one LA (a simple mutation process like the one used in stochastic hill-climbing) is invoked in between two subsequent checkpoints.
Reward and goal. Occasionally provides real-valued reward. The cumulative reward obtained by the agent between time 0 and time is denoted , where . Typically, in large (partially observable) environments, maximizing cumulative expected reinforcement within a limited life-time would be too ambitious a goal for any method. Instead designers of direct policy search methods are content with methods that can be expected to find better and better policies. But what exactly does ``better'' mean in our context? Our agent's obvious goal at checkpoint is to generate POL-modifications accelerating reward intake: it wants to let exceed the current average speed of reward intake. But to determine this speed it needs a previous point to compute . How can be specified in a general yet reasonable way? Or, to rephrase the central question of this paper: if life involves unknown temporal delays between action sequences and their observable effects and/or consists of many successive ``trials'' with nondeterministic, uncertain outcomes, how long should a trial last, and how many trials should the agent look back into time to evaluate its current performance?
The success-story algorithm (to be introduced next) addresses this question in a way that is prepared for the worst case: once a trial of a new policy change has begun it will never end unless the policy change itself gets undone (by a simple backtracking mechanism which ensures that trial starts mark long-term reward accelerations).
Enforcing success stories.
Let denote the agent's time-varying set of past checkpoints that have
been followed by long-term reward accelerations. Initially is empty.
denotes the -th element of in ascending order.
The success-story criterion (SSC)
is satisfied at time if either is empty (trivial case) or if
1. WHILE SSC is not satisfied: Undo all POL modifications made since the most recent checkpoint in , and remove that checkpoint from .
2. Add the current checkpoint to .
``Undoing'' a modification means restoring the preceding POL -- this requires storing past values of POL components on a stack prior to modification. Thus each POL modification that survived SSA is part of a bias shift generated after a checkpoint marking a lifelong reward speed-up: the remaining checkpoints in and the remaining policy modifications represent a ``success story.''
Trials and ``effective'' trials. All checkpoints in represent starts of yet unfinished ``effective'' trials (as opposed to externally defined trials with prewired starts and ends). No effective trial ever ends unless SSA restores the policy to its state before the trial started. The older some surviving SPM, the more time will have passed to collect statistics concerning its long-term consequences, and the more stable it will be if it is indeed useful and not just there by chance.
Since trials of still valid policy modifications never end, they embody a principled way of dealing with unknown reward delays. No matter what's the maximal causal delay in a given environment, eventually the system's effective trials will encompass it.
Metalearning? An interesting example of long delays between actions and effects is provided by metalearning (learning a learning algorithm or credit assignment algorithm) [LenatLenat1983,SchmidhuberSchmidhuber1987,Schmidhuber, Zhao, SchraudolphSchmidhuber et al.1997a]. For instance, suppose some ``metalearning action sequence" changes a learner's credit assignment strategy. To evaluate whether the change is good or bad, apparently we need something like a ``meta-trial" subsuming several lower-level ``normal" trials in which instances of additional policy changes produced by the modified learning strategy somehow get evaluated, to collect evidence concerning the quality of the modified learning strategy itself. Due to their very nature such meta-trials will typically consume a lot of time. SSA, however, addresses this issue in a natural and straight-forward way. There is no explicit difference between trials and meta-trials. But new effective trials do get opened within previously started, yet unfinished effective trials. What does this mean? It means that earlier SPMs automatically get evaluated as to whether they set the stage for later ``good'' SPMs. For instance, SSA will eventually discard an early SPM that changed the policy in a way that increased the probability of certain later SPMs causing a waste of time on evaluations of useless additional policy changes. That is, SSA automatically measures (in terms of reward/time ratios affected by learning and testing processes) the impact of early learning on later learning: SSA prefers SPMs making ``good'' future SPMs more likely. Given action sets that allow for composition of general credit assignment strategies from simple LAs, SSA will prefer probabilistic learning algorithms leading to better probabilistic learning algorithms. And it will end meta-trials as soon as they violate the constraints imposed by the success-story criterion, just like it does with ``normal" trials.
Implementing SSA. SSA guarantees that SSC will be satisfied after each checkpoint, even in partially observable, stochastic environments with unknown delays. (Of course, later SSA invocations using additional, previously unavailable, delayed information may undo policy changes that survived the current checkpoint.) Although inequality (1) contains fractions, SSA can be implemented efficiently: only the two most recent still valid sequences of modifications (those ``on top of the stack'') need to be considered at a given time in an SSA invocation. A single SSA invocation, however, may invalidate many SPMs if necessary.