Basic set-up.
At a given time, the variable matrix introduced above represents the system's current policy. Each call of the procedure Adapt (invoked by ALS) modifies the policy. Let us consider the complete sequence of such calls spanning the entire system life, which starts at time 0 and ends at some point in the future (time flows in one direction -- there are no resets to 0). By definition, the $i$-th call occurs at time $t_i$, is denoted Adapt$_i$, and generates a policy modification denoted by $M_i$.
In between two calls, a certain amount of time
is consumed by Levin search (details about how time is
measured will follow in the section on experiments).
Goal. Whenever ALS as above finds a solution, the system receives a constant positive reward.
The goal is to receive as much reward as quickly as possible,
by generating policy changes that minimize the computation
time required by future calls of Levin search.
Let us denote the sum of all reinforcements between time 0 and time $t$ by $R(t)$.
Reinforcement/time ratios.
Right before each call of Adapt, EIRA (see details below)
essentially
invalidates those policy modifications
that are not consistent with
the so-called reinforcement acceleration criterion (RAC).
To define RAC, we first introduce a measure indicating how useful Adapt$_i$ has been until the current time $t > t_i$ -- we simply compute the reinforcement/time ratio $Q(i, t)$:
\[ Q(i, t) = \frac{R(t) - R(t_i)}{t - t_i}. \]
RAC is satisfied at time $t$ if
(a) $Q(k, t) > \frac{R(t)}{t}$, where $M_k$ denotes the earliest policy modification that is still valid, and
(b) for each pair of policy modifications $M_i$, $M_j$ such that both are still valid and $M_j$ is the most recent valid modification generated earlier than $M_i$: $Q(i, t) > Q(j, t)$.
Obviously, RAC only holds if the history of still valid policy modifications represents a history of long-term reinforcement accelerations -- each still valid modification has to be followed by more average reinforcement per time than all the previous ones. Note that the success of some Adapt call depends on the success of all later Adapt calls, for which it is ``setting the stage''! This represents an essential difference from previous performance criteria.
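As a small illustration with made-up numbers (not taken from the experiments): suppose the still valid modifications are $M_1$ and $M_2$, created at $t_1 = 100$ and $t_2 = 300$, and at the current time $t = 400$ we have $R(100) = 1$, $R(300) = 3$, and $R(400) = 5$. Then
\[
\frac{R(t)}{t} = \frac{5}{400} = 0.0125, \qquad
Q(1, t) = \frac{5 - 1}{400 - 100} \approx 0.0133, \qquad
Q(2, t) = \frac{5 - 3}{400 - 300} = 0.02,
\]
so $\frac{R(t)}{t} < Q(1, t) < Q(2, t)$ and RAC is satisfied: both modifications survive. Had the reinforcement obtained after $t_2$ been smaller, so that $Q(2, t) \le Q(1, t)$, EIRA would invalidate $M_2$ (and then re-check $M_1$) right before the next call of Adapt.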
EIRA uses a stack to store information about policy modifications
computed by calls of Adapt. Right before
Adapt$_i$ is executed,
EIRA restores (if necessary) previous policies such that
RAC holds. EIRA is based on two processes:
(1) Pushing.
At time $t_i$, EIRA pushes the following information on the stack: $t_i$, $R(t_i)$, and the previous values of those columns of the policy matrix (each representing a probability distribution) changed by Adapt$_i$ (this information may be needed for later restoring the old policy, as it was before $M_i$ was generated).
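A minimal Python sketch of this pushing step, assuming the policy is stored as a mapping from column indices to probability vectors; all names here (push_modification, changed_columns, and so on) are illustrative and not taken from the original implementation:

    # Each stack entry remembers (t_i, R(t_i), {column index: previous probability vector}).
    def push_modification(stack, t_i, r_i, policy, changed_columns):
        """Save what is needed to later undo policy modification M_i.
        Call this before Adapt_i overwrites the columns, so that policy[c]
        still holds the old probability vectors."""
        saved = {c: list(policy[c]) for c in changed_columns}
        stack.append((t_i, r_i, saved))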
(2) Popping. Right before each call of Adapt, while none of the following conditions (1-3) holds, EIRA pops probability vectors off the stack and invalidates the corresponding policy modifications by restoring the previous policies. Here $t$ denotes the current time and $M_i$ the most recent still valid policy modification (the one on top of the stack):
(1) $Q(i, t) > Q(j, t)$, where $M_i$ and $M_j$ are still valid, and $M_j$ is the most recent valid policy modification generated earlier than $M_i$.
(2) $Q(i, t) > \frac{R(t)}{t}$, where $M_i$ is the only valid policy modification.
(3) The stack is empty.
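Continuing the sketch above (same hypothetical stack entries of the form (t_i, R(t_i), saved columns)), the popping process could be expressed as follows; the three exit points correspond to conditions (1)-(3):

    def reinforcement_time_ratio(t_i, r_i, t, r_total):
        """Q(i, t) = (R(t) - R(t_i)) / (t - t_i)."""
        return (r_total - r_i) / (t - t_i)

    def pop_until_rac(stack, t, r_total, policy):
        """Right before the next call of Adapt: invalidate the most recent
        policy modifications until one of the three stopping conditions holds."""
        while stack:                                   # condition (3): stop once the stack is empty
            t_top, r_top, saved = stack[-1]            # most recent still valid modification M_i
            q_top = reinforcement_time_ratio(t_top, r_top, t, r_total)
            if len(stack) > 1:
                t_prev, r_prev, _ = stack[-2]          # most recent valid modification generated earlier
                if q_top > reinforcement_time_ratio(t_prev, r_prev, t, r_total):
                    return                             # condition (1) holds
            elif q_top > r_total / t:
                return                                 # condition (2): M_i is the only valid modification
            stack.pop()                                # invalidate M_i ...
            for c, column in saved.items():
                policy[c] = list(column)               # ... by restoring the previous probability columns

Note that although a single call may pop many entries, each individual check only involves the two most recent still valid modifications, in line with the efficiency remark in the concluding paragraph.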
Theoretical soundness. Using induction, it can be shown that this backtracking procedure ensures that RAC holds after each popping process [Schmidhuber, 1995b].
At any given time, EIRA's straightforward generalization assumption is: modifications that survived the most recent popping process will remain useful. In general environments, what else could be assumed? Note that at any given time in system life, we have only a single ``training example'' for evaluating the current long-term usefulness of any given previous Adapt call, namely the average reinforcement per time since it occurred. During the next popping process, however, EIRA will reevaluate the ``usefulness so far'' of still valid modifications.
To conclude: EIRA again and again implicitly evaluates each still valid policy modification as to whether it has been followed by long-term performance improvement (perhaps because the modification set the stage for later useful modifications). If there is evidence to the contrary, EIRA invalidates policy modifications until RAC is fulfilled again. EIRA's stack-based backtracking is efficient in the sense that only the two most recent still valid modifications have to be considered at a given time (although a single popping process may invalidate many modifications).