Previous work on reinforcement learning (e.g., [16,2,49,51]) requires strong assumptions about the environment. In many realistic settings, however, these assumptions do not hold. In particular, any event/action/experiment occurring early in the life of a learning system may influence events/actions/experiments at any later time. To address these important but previously mostly ignored issues, this paper describes a novel, general framework for single-life reinforcement learning with limited computational resources in rather general environments. The focus is on theoretical considerations. However, experiments in sections 3 and 4 will serve to illustrate the basic concepts.
Scenario.
Consider a learning system executing a lifelong
action sequence in an unknown environment.
For now, it won't even be necessary
to introduce a detailed formal model of the environment.
Different system actions may require different amounts of execution
time (as in scenarios studied in [25,4] and in references given therein).
Occasionally the environment provides real-valued ``reinforcement''.
The sum of all reinforcements obtained between ``system birth'' (at time $0$) and time $t$ is denoted by $R(t)$. Throughout its lifetime, the system's goal is to maximize $R(T)$, the cumulative reinforcement at (initially unknown) ``death'' $T$. There is only one life. Time flows in one direction (no resets to zero).
Related, but less general scenarios were studied in
papers on ``bandit problems''
(e.g., [3,9] and references therein),
which also require the wise use of limited resources to perform experiments.
See also [10] and references therein for
work on finding search space elements with optimal expected utility,
subject to the resource constraint of spending only a feasible amount of time
on finding such an element.
Policy. The system's current policy is embodied by modifiable parameters of an arbitrary algorithm mapping environmental inputs and internal states to output actions and new internal states. For example, parameters may be bits of machine code, or weights of a neural net. The number of parameters need not be fixed. For now, we won't need a detailed formal model of the policy.
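For concreteness, the scenario can be sketched as follows. The sketch below is purely illustrative: the class and function names are not part of the framework, and the toy environment is an arbitrary placeholder.

\begin{verbatim}
import random

class Policy:
    """Parameterized mapping from (input, internal state) to
    (action, new internal state); parameters could be program
    bits, neural net weights, etc."""
    def __init__(self, params):
        self.params = params
    def act(self, observation, state):
        # Placeholder: a real policy would use observation, state, and params.
        action = random.choice(range(len(self.params)))
        return action, state

class ToyEnvironment:
    """Actions consume different amounts of execution time and
    occasionally yield real-valued reinforcement."""
    def step(self, action):
        duration = 1.0 + random.random()                 # time used by the action
        reinforcement = random.choice([0.0, 0.0, 1.0])   # occasional reward
        observation = random.random()
        return observation, reinforcement, duration

def run_single_life(policy, env, T):
    """One life from birth (time 0) to 'death' T; returns R(T)."""
    t, R, state, obs = 0.0, 0.0, None, 0.0
    while t < T:                  # time flows in one direction, no resets
        action, state = policy.act(obs, state)
        obs, r, dt = env.step(action)
        R += r                    # R(t): sum of all reinforcements so far
        t += dt
    return R

# Example: run_single_life(Policy(params=[0, 1, 2]), ToyEnvironment(), T=100.0)
\end{verbatim}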
Policy modification processes (PMPs).
Policy parameters are occasionally modified
by finite ``policy modification processes'' (PMPs),
sometimes also called ``learning processes''.
Different PMPs may take different amounts of time.
The $i$-th PMP in system life is denoted PMP$_i$, starts at time $s_i$, ends at $e_i$ ($s_i < e_i < T$), and computes a policy modification denoted by $M_i$. Later we will require that a non-zero time-interval passes between $e_i$ and $s_{i+1}$, the beginning of PMP$_{i+1}$.
While PMP$_i$ is running, the system's lifelong interaction sequence with the environment may continue, and there may be reinforcement signals, too. In fact, PMP$_i$ may use environmental feedback to compute modification $M_i$ -- for instance, by executing a known reinforcement learning algorithm.
Later I will even take advantage of the possibility that PMPs may be generated and executed according to information embedded in the policy itself -- this is of interest in the context of ``self-modifying'' policies that ``learn how to learn how to learn ...''. For the moment, however, I do not care for what exactly happens while PMP$_i$ is running: in what follows, I won't need a formal model of the PMP details.
The following paragraphs are only concerned with the question: how do we measure modification $M_i$'s influence on the remaining system life? In particular, how do we measure whether $M_i$ successfully set the stage for later $M_j$, $j > i$?
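The bookkeeping attached to each PMP can be pictured as a small record. The following sketch is purely illustrative; the field names are not part of the formal framework.

\begin{verbatim}
from dataclasses import dataclass
from typing import Any

@dataclass
class PMPRecord:
    index: int          # i: position in the lifelong sequence of PMPs
    start: float        # s_i: time at which PMP_i started
    end: float          # e_i: time at which PMP_i completed (s_i < e_i)
    modification: Any   # M_i: the policy modification it computed

def well_ordered(pmps):
    """A non-zero time interval must pass between e_i and s_{i+1}."""
    return all(a.end < b.start for a, b in zip(pmps, pmps[1:]))
\end{verbatim}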
The problem.
In general environments, events/actions/experiments
occurring early in system life may
influence events/actions/experiments at
any later time.
In particular, PMP$_i$ may affect the environmental conditions for PMP$_j$, $j > i$.
This is not addressed by
existing algorithms for adaptive control and reinforcement learning
(see, e.g., [16,2,49,51]),
and not even by naive, inefficient, but more general and supposedly
infallible exhaustive search
among all possible policies, as will be seen next.
What's wrong with exhaustive search?
Apart from the fact that exhaustive search is not considered
practical even for moderate search spaces, it also suffers
from another, more fundamental problem. Let $n$ be the number of possible policies. For the sake of the argument, suppose that $n$ is small enough to allow for systematic, sequential generate-and-test of all $n$ policies within the system's lifetime.
Suppose that after all policies have been
generated (and evaluated for some time),
during the remainder of system life, we keep
the policy whose test brought about maximal
reinforcement during the time it was tested.
In the general ``on-line'' situation considered in this paper,
this may be the wrong thing to do:
for instance, each policy test may change the
environment in a way that changes the
preconditions for policies considered earlier.
A policy discarded earlier may suddenly be the ``best'' policy
(but will never be considered again).
Similar things can be said about almost every other
search algorithm or learning algorithm
-- exhaustive search is just a
convenient, very simple, but representative example.
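To make the failure mode explicit, here is a sketch of the naive procedure just described (names are illustrative); the comments mark where the reasoning breaks down in the on-line setting.

\begin{verbatim}
def naive_exhaustive_search(policies, evaluate):
    """Test each policy once, then keep the one that scored best
    during its own test interval."""
    best_policy, best_score = None, float("-inf")
    for p in policies:
        score = evaluate(p)   # each test may irreversibly change the
                              # environment, so earlier scores were obtained
                              # under preconditions that may no longer hold
        if score > best_score:
            best_policy, best_score = p, score
    return best_policy        # a policy discarded earlier may by now be the
                              # "best" one, but it is never reconsidered
\end{verbatim}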
How to measure performance improvements? Obviously, the performance criterion used by naive exhaustive search (and other, less general search and learning algorithms) is not appropriate for the general set-up considered in this paper. We first have to ask: what is a reasonable performance criterion for such general (but typical) situations? More precisely: if we do not make any assumptions about the environment, can we still establish a sensible criterion according to which, at a certain time, (1) the system's performance throughout its previous life always kept improving or at least never got worse, and which (2) can be guaranteed to be achieved by the system? Indeed, such a criterion can be defined, as will be seen below. First we will need additional concepts.
Reinforcement/time ratios based on single observations.
At a given time $t$ in system life, we essentially have only one single ``training example'' to evaluate the current long-term usefulness of any given previous PMP, namely the average reinforcement per time since that learning process occurred. Suppose PMP$_i$ started execution at time $s_i$ and completed itself at time $e_i$. For each $i$ and each $t > s_i$, the reinforcement/time ratio $r_i(t)$ is defined as
\[
r_i(t) = \frac{R(t) - R(s_i)}{t - s_i}.
\]
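The ratio can be computed directly from the definition. In the sketch below, R is assumed to be a function returning the cumulative reinforcement up to a given time; the helper name is illustrative.

\begin{verbatim}
def reinforcement_time_ratio(R, s_i, t):
    """r_i(t) = (R(t) - R(s_i)) / (t - s_i), defined for t > s_i."""
    assert t > s_i
    return (R(t) - R(s_i)) / (t - s_i)

# Toy example: with R(t) = 0.1 * t, the ratio equals 0.1 for every s_i:
# reinforcement_time_ratio(lambda t: 0.1 * t, s_i=10.0, t=50.0)  ->  0.1
\end{verbatim}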
Why measure time starting from the beginning of PMP$_i$, instead of starting from its end? Later we will see that the first evaluations of PMP$_i$'s performance will be delayed at least until after its end. While PMP$_i$ is still running, it is in a ``grace period'' which may be used to collect delayed reinforcement to justify $M_i$ -- this is important when reinforcement signals occur long after policy modifications took place. Indeed, later I will make use of the fact that a ``self-modifying'' policy may determine the runtimes of its own PMPs, thus in principle being able to learn how long to wait for delayed reinforcement.
Currently valid modifications.
After $e_i$, PMP$_i$'s effects on the policy will stay in existence until (a) being overwritten by a later PMP$_j$, $j > i$, or until (b) being invalidated by the ``success-story algorithm'' (SSA), the method to be described below. To define valid modifications, only (b) is relevant: after $e_i$, the policy modification $M_i$ generated by PMP$_i$ will remain valid as long as SSA (see below) does not discard $M_i$ (by restoring the previous policy from right before PMP$_i$ started).
Success-story criterion (SSC).
At time $t$, SSC is satisfied if for each PMP$_i$ that computed a still valid modification $M_i$ of the system's policy,
(a) $r_i(t) > \frac{R(t)}{t}$, and
(b) for all $j > i$ such that PMP$_j$ computed a still valid $M_j$: $r_j(t) > r_i(t)$.
In words: SSC is satisfied if the beginning of each completed PMP that computed a still valid modification has been followed by long-term reinforcement acceleration -- measured up until the current time. Note that the success of some PMP depends on the success of all later PMPs, for which it is ``setting the stage''. This represents an essential difference from previous performance criteria.
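SSC can be checked mechanically from the still valid modifications. The sketch below assumes each valid modification is represented by the pair $(s_i, R(s_i))$, listed in the order in which the modifications were computed; names are illustrative.

\begin{verbatim}
def ssc_holds(valid, R_t, t):
    """valid: list of (s_i, R(s_i)) for still valid modifications, oldest
    first; R_t = R(t). Returns True iff SSC is satisfied at time t."""
    ratios = [(R_t - R_s) / (t - s) for (s, R_s) in valid]   # r_i(t)
    baseline = R_t / t                                       # R(t)/t
    # SSC: R(t)/t < r_i1(t) < r_i2(t) < ... over the valid modifications.
    return all(a < b for a, b in zip([baseline] + ratios, ratios))
\end{verbatim}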
The method to achieve SSC is called ``success-story algorithm'' (SSA). SSA uses a stack to trace and occasionally invalidate certain policy modifications. Details are next.
Success-story algorithm (SSA). SSA uses an initially empty stack to store information about currently valid policy changes computed by PMPs. Occasionally, at times called ``checkpoints'', this information is used to restore previous policies, such that SSC holds (later we will see that a ``self-modifying'' policy can in principle learn to set checkpoints at appropriate times). SSA is based on two complementary kinds of processes:
(1) Pushing.
Recall that PMP$_i$ starts at time $s_i$ and ends at $e_i$. Let $I_i$ denote a record consisting of the values $s_i$ and $R(s_i)$ (which will be needed to compute $r_i(t)$ values at later times $t$), plus the information necessary to restore the old policy (before modification by PMP$_i$). $I_i$ is pushed onto the stack (this will consume time, of course). By definition, PMP$_i$ is not finished until the entire pushing process is completed. From now on, the modification $M_i$ will remain valid as long as $I_i$ remains on the stack.
(2) Popping. At certain times $t$ called ``checkpoints'' (see below), do:
WHILE none of the following conditions (1)-(3) holds, DO: pop the topmost record $I_i$ off the stack, and invalidate the corresponding modification $M_i$, thus restoring the corresponding previous policy.
(1) $r_i(t) > r_j(t)$, where $j < i$, and $M_i$ and $M_j$ are the two most recent currently valid modifications (if existing),
(2) $r_i(t) > \frac{R(t)}{t}$, where $M_i$ is the only currently valid modification (if existing),
(3) the stack is empty.
Theorem 1. Whenever popping is finished, SSA will have achieved SSC.
Proof sketch. Induction over the stack contents after popping.
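For concreteness, the pushing and popping operations can be sketched as follows, assuming each record $I_i$ is stored as the triple $(s_i, R(s_i), \mbox{undo data})$ and that the caller supplies a routine for restoring a previous policy. All names are illustrative.

\begin{verbatim}
class SSAStack:
    """Minimal sketch of SSA's stack-based bookkeeping."""
    def __init__(self):
        self.stack = []                    # valid modifications, oldest first

    def push(self, s_i, R_s_i, undo):
        """Pushing: done at the end of PMP_i; I_i = (s_i, R(s_i), undo data)."""
        self.stack.append((s_i, R_s_i, undo))

    def _ratio(self, record, t, R_t):
        s_i, R_s_i, _ = record
        return (R_t - R_s_i) / (t - s_i)   # r_i(t)

    def checkpoint(self, t, R_t, restore_policy):
        """Popping: invalidate topmost modifications until one of the
        exit conditions (1)-(3) holds."""
        while self.stack:                  # otherwise condition (3): empty stack
            if len(self.stack) >= 2:
                if self._ratio(self.stack[-1], t, R_t) > \
                   self._ratio(self.stack[-2], t, R_t):          # condition (1)
                    break
            elif self._ratio(self.stack[-1], t, R_t) > R_t / t:  # condition (2)
                break
            _, _, undo = self.stack.pop()  # invalidate M_i ...
            restore_policy(undo)           # ... and restore the previous policy
\end{verbatim}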
Note that SSA does not care for the nature of the PMPs -- for all completed PMP$_i$ (based, e.g., on conventional reinforcement learning algorithms), at each ``checkpoint'', SSA will get rid of $M_i$ if $s_i$ was not followed by long-term reinforcement speed-up (note that before countermanding $M_i$, SSA will already have countermanded all $M_j$, $j > i$). No modification $M_i$ is guaranteed to remain valid forever.
Generalization assumption. After popping, until the next checkpoint, SSA's straightforward generalization assumption is: modifications that survived the most recent popping process (because they appeared to contribute to a speed-up of average reinforcement intake) will remain useful. In general environments, which other generalization assumption would make sense? Recall that since life is one-way (time is never reset), at each checkpoint the system has to generalize from a single experience concerning the usefulness of any given previous learning process: the average reinforcement per time since that learning process occurred. Learning from singular experiences contrasts with other, less realistic kinds of reinforcement learning, which, in one way or another, require the assumption that it is possible to collect sufficient statistics from repeatable training sequences.
To summarize:
SSA again and again (at each checkpoint)
implicitly evaluates each
still valid policy modification as to whether it has been
followed by long-term performance improvement (perhaps
because the modification set the stage for later useful
modifications). If there is evidence
to the contrary, SSA invalidates policy modifications until
SSC is fulfilled again. SSA's stack-based backtracking
is efficient in the sense that only the two most recent
(``topmost'')
still valid modifications need to be considered at a given
time (although a single popping process may invalidate
many modifications).
SSA is a general framework -- you can use your favorite learning algorithm ${\cal A}$ to generate the PMPs. This makes sense especially in situations where the applicability of ${\cal A}$ is questionable because the environment does not satisfy the preconditions that would make ${\cal A}$ sound. SSA can at least guarantee that those of ${\cal A}$'s policy modifications that appear to have negative long-term effects are countermanded.
Note that a trivial way of satisfying SSC is to never make a policy modification. But any system that cannot let its modification probabilities vanish entirely will occasionally execute policy modifications, and keep those consistent with SSC. In this sense, it cannot help but get better, if the environment does indeed provide a chance to improve performance, given the set of possible actions.
Due to the generality of the approach, no reasonable statements can be made about improvement speed, which indeed depends strongly on the nature of the environment and the choice of the action set. This limitation is shared by almost all other reinforcement learning approaches, though.
Costs of environment-independence. A price to pay for environment-independence is this: the beginning of the time interval considered to measure the current reinforcement/time ratio is marked by the beginning of the PMP that computed the most recent valid policy modification. The length of this time interval may vary unpredictably, because its beginning may wander back in time due to policy restorations by SSA. In arbitrary environments, SSA cannot guarantee a speed-up of reinforcement per fixed time interval (no algorithm can).
Benefits of environment-independence. Next we will see that SSA provides a novel basis for addressing two notoriously difficult issues in machine learning: (*) Multi-agent learning, (**) ``Learning how to learn''.
(*) Multi-agent learning. Consider the case where there are multiple, interacting, SSA-learning agents. For each agent, the others are part of the changing (possibly complex) environment. This is the main reason why previous approaches to multi-agent learning are heuristic by nature. Not so with SSA, however: since SSA is environment-independent, each agent will still be able to satisfy its SSC after each checkpoint. In cases where all agents try to speed up the same reinforcement signals, and where no agent can speed up reinforcement intake by itself, this automatically enforces ``learning to cooperate'' [35,36,40]. Section 4 will illustrate this with an application of a system consisting of multiple agents, where each agent is in fact just a connection in a neural net.
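The following sketch illustrates this setting. It assumes each agent keeps its own stack of $(s_i, R(s_i), \mbox{undo data})$ records, that all agents observe the same global reinforcement stream $R(t)$, and that checkpoints are shared; all names are illustrative.

\begin{verbatim}
def agent_checkpoint(stack, t, R_t, restore_policy):
    """One agent's SSA popping step over its own stack (oldest record first)."""
    ratio = lambda rec: (R_t - rec[1]) / (t - rec[0])    # r_i(t)
    while stack:
        if len(stack) >= 2 and ratio(stack[-1]) > ratio(stack[-2]):
            break                                        # condition (1)
        if len(stack) == 1 and ratio(stack[-1]) > R_t / t:
            break                                        # condition (2)
        restore_policy(stack.pop()[2])                   # else: invalidate

def shared_checkpoint(agent_stacks, restore_fns, t, R_t):
    """Each agent backtracks independently; for every agent, the other
    agents are simply part of its (changing) environment."""
    for stack, restore in zip(agent_stacks, restore_fns):
        agent_checkpoint(stack, t, R_t, restore)
\end{verbatim}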
(**) Learning how to learn / Incremental self-improvement.
With SSA, the success of PMP$_i$ recursively depends on the success of later PMP$_j$, $j > i$: the cumulative reinforcement collected after PMP$_i$ includes the cumulative reinforcement collected after PMP$_j$, $j > i$. In particular, performance improvements include those improvements that make future additional improvements more likely: policy modification $M_i$ can prove its long term usefulness by setting the stage for additional, useful modifications $M_j$, $j > i$, etc. Now recall that SSA does
not care for the nature of the PMPs -- they
may depend on the system policy itself.
If we allow a system's policy to change itself by computing
and executing its own PMPs (this is easy to
implement, e.g. by using instructions of an
assembler-like, ``self-referential''
programming language as system actions -- see section 2.2),
then SSA will keep only
those self-modifications followed by reinforcement speed-ups,
in particular those leading to ``better'' future
self-modifications, etc., recursively:
embedding the learning
mechanism within the system policy
immediately and naturally leads to ``incremental
self-improvement'' and ``learning how to learn how to learn ...''
-- there is no circularity involved, and
the approach remains sound.
Outline of remainder of paper. The main contribution of this paper is given by what has been said above. The remainder of this paper mainly serves to illustrate the theory. In section 2, I will exemplify the ideas in paragraph (**) on incremental self-improvement and describe a particular, concrete, working, ``evolutionary'' system that implements them. Section 3 will then apply this system to various tasks, including a variant of Sutton's maze task [47]. One difference from Sutton's original task is that the policy's environment continually changes because of actions generated by the system itself. Section 4 will then exemplify the ideas in paragraph (*) on multi-agent learning and also describe a particular, concrete, working, ``evolutionary'' system that implements them. With this system, each SSA-learning agent is in fact just a connection in a fully recurrent neural net. Each connection's policy is represented by its current weight. Connections do have a very limited policy space in comparison to typical, more complex agents used in standard AI -- but this is irrelevant: SSA does not care for the complexity of the agents. A by-product of this research is a general reinforcement learning algorithm for recurrent nets. Again, an application to a variant of Sutton's maze task will serve to illustrate the operation of the system. In this application, the environment of each connection's policy continually changes, because the policies of all the other connections keep changing.
Note. This paper is based on [32]. In the meantime we have published several additional papers on SSA, e.g., [50,53,40,42].