OOPS-Based Reinforcement Learning

At any given time, a reinforcement learner [27] will
try to find a *policy* (a strategy for future decision
making) that maximizes its expected future reward.
In many traditional reinforcement learning (RL) applications,
the policy that works best in a given set of training trials
will also be optimal in future test trials [51].
Sometimes, however, it won't. To see the difference between
searching (the topic of the previous sections) and
reinforcement learning (RL), consider an agent and two boxes.
In the *n*-th trial the agent may open and collect the content of
exactly one box. The left box will contain
100*n* Swiss Francs, the right box 2^*n* Swiss Francs,
but the agent does not know this
in advance. During the first 9 trials
the optimal policy is *"open left box."* This is what a good
searcher should find, given the outcomes of the first 9 trials.
But this policy will be suboptimal in trial 10.
A good reinforcement learner, however, should extract the
underlying regularity in the reward generation process
and predict the future reward, picking the right box in
trial 10, without having seen it yet.
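The gap between the two strategies can be made concrete in a small simulation. The payoff functions below are illustrative assumptions consistent with the trial-9/trial-10 switch described above: 100*n* Swiss Francs in the left box and 2^*n* in the right box in trial *n*.

```python
# Toy two-box example. Payoffs are assumptions: 100*n Swiss Francs in the
# left box and 2**n in the right box in trial n, chosen so that the left
# box pays more in each of trials 1-9 but less from trial 10 on.

def left_payoff(n):
    return 100 * n   # grows linearly with the trial number

def right_payoff(n):
    return 2 ** n    # grows exponentially with the trial number

# A searcher that merely imitates the empirically best policy of the
# first 9 trials keeps opening the left box:
searcher_choice = ("left"
                   if sum(left_payoff(n) for n in range(1, 10))
                   >= sum(right_payoff(n) for n in range(1, 10))
                   else "right")

# A learner that extracts the regularity in the reward process can
# predict the payoffs of trial 10 before seeing it:
learner_choice = "left" if left_payoff(10) >= right_payoff(10) else "right"

print(searcher_choice)  # -> left  (900 vs. 512 in trial 9)
print(learner_choice)   # -> right (1000 vs. 1024 in trial 10)
```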

The first general, asymptotically optimal reinforcement learner is the recent AIXI model [22,24] (Section 7). It is valid for a very broad class of environments whose reactions to action sequences (control signals) are sampled from arbitrary computable probability distributions. This means that AIXI is far more general than traditional RL approaches. However, while AIXI clarifies the theoretical limits of RL, it is not practically feasible, just like HSEARCH is not. From a pragmatic point of view, what we are really interested in is a reinforcement learner that makes optimal use of given, limited computational resources. In what follows, we will outline how to use OOPS-like bias-optimal methods as components of universal yet feasible reinforcement learners.

We need two OOPS modules. The first is
called the predictor or world model. The second is
an action searcher using the world model. The life of the entire system
should consist of a sequence of *cycles* 1, 2, ...
At each cycle, a limited amount of computation time will
be available to each module.
For simplicity we assume that during each cycle the system
may take exactly one action;
generalizations to actions consuming several cycles are straightforward.
At any given cycle, the system executes the following procedure:

- For a time interval fixed in advance,
the predictor is first trained
in bias-optimal fashion to find a better world model,
that is, a program that predicts the inputs from the environment
(including the rewards, if there are any), given a history of
previous observations and actions.
So the *n*-th task (*n* = 1, 2, ...)
of the first OOPS module is to find (if possible) a
better predictor than the best found so far.
- Once the current cycle's time for predictor improvement is used up,
the current world model (prediction program) found by
the first OOPS module will be used by
the second module, again in bias-optimal fashion,
to search for a future action sequence that maximizes
the predicted cumulative reward (up to some time limit).
That is, the *n*-th task (*n* = 1, 2, ...)
of the second OOPS module will be to find
a control program that computes a control sequence of actions,
to be fed into the program representing the current world model
(whose input predictions are successively fed back to itself in
the obvious manner),
such that this control sequence leads to higher predicted
reward than the one generated by the best control program
found so far.
- Once the current cycle's time for control program search
is used up, we will execute the current action of the best control
program found in the previous step. Now we are ready for the next cycle.
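The cycle above can be sketched as a loop. In the sketch below, `improve_predictor` and `improve_controller` are hypothetical placeholders for the two bias-optimal OOPS modules, reduced here to trivial stand-ins so the sketch runs; a real implementation would search program space within each time budget.

```python
# Minimal runnable sketch of the three-step cycle, with trivial stand-ins
# (NOT real OOPS modules) for the predictor and the action searcher.

def improve_predictor(model, history, budget):
    # Placeholder world model: remember the last reward seen per action.
    table = dict(model)
    for action, reward in history:
        table[action] = reward
    return table

def improve_controller(model, budget, actions=("left", "right")):
    # Placeholder action search: pick the action with the highest
    # predicted reward under the current world model.
    return max(actions, key=lambda a: model.get(a, 0))

def run_cycles(env_step, num_cycles):
    history, model = [], {}
    for cycle in range(1, num_cycles + 1):
        # Step 1: fixed time budget for improving the world model.
        model = improve_predictor(model, history, budget=None)
        # Step 2: search for a control program maximizing predicted reward.
        action = improve_controller(model, budget=None)
        # Step 3: execute the current action, observe the real reward.
        reward = env_step(cycle, action)
        history.append((action, reward))
    return history
```

Here `env_step(cycle, action)` is an assumed environment interface returning the reward for the chosen action in the given cycle.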

At any given time, up to which temporal horizon should the predictor try to predict? In the AIXI case, the proper way of treating the temporal horizon is not to discount it exponentially, as done in most traditional work on reinforcement learning, but to let the future horizon grow in proportion to the learner's lifetime so far [24]. It remains to be seen whether this insight carries over to OOPS-based RL. In particular, is it possible to prove that certain OOPS-RL variants are a near-bias-optimal way of spending a given amount of computation time on RL problems? Or should we instead combine OOPS and Hutter's time-bounded AIXI model? Certain important problems remain open.
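To make the two horizon treatments concrete, here is a hedged sketch; the reward stream and the constants `gamma` and `c` are illustrative assumptions, not values from the text.

```python
# Two ways of weighting future rewards at current time t.

def discounted_value(rewards, t, gamma=0.95):
    # Traditional RL: exponential discounting. The effective horizon is
    # roughly 1/(1 - gamma), independent of how long the learner has
    # already lived.
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def growing_horizon_value(rewards, t, c=1):
    # AIXI-style: undiscounted sum up to a horizon proportional to the
    # learner's lifetime so far (here: the next c*t steps).
    return sum(rewards[t : t + c * t])
```

With a constant reward stream, the first value saturates near 1/(1 - gamma) no matter how old the learner is, while the second keeps growing with the learner's lifetime.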