At any given time, a reinforcement learner [24] will
try to find a policy (a strategy for future decision
making) that maximizes its expected
future reward.
In many traditional reinforcement learning (RL) applications,
the policy that works best in a given set of training trials
will also be optimal in future test trials [55].
Sometimes, however, it won't. To see the difference between
direct policy search (the topic of the previous sections) and
future-oriented reinforcement learning, consider an agent and two boxes.
In the $i$-th trial the agent may open and collect the content of
exactly one box. The left box will contain $100i$ Swiss Francs, the
right box $2^i$ Swiss Francs, but the agent does not know this
in advance. During the first 9 trials
the optimal policy is ``open left box.'' This is what a good
direct searcher should find, given the outcomes of the first 9 trials,
although the policy suggested by the first 9 trials
will turn out to be suboptimal in trial 10.
A good reinforcement learner, however, should extract the
underlying regularity in the reward generation process
and predict the future reward, picking the right box in
trial 10, without having seen it yet.
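To make the contrast concrete, the following small Python sketch simulates the two-box setting with the payoffs $100i$ and $2^i$ from the example above; the point is only that the fixed policy ranked best by the outcomes of trials 1--9 differs from the choice a learner of the underlying regularity makes in trial 10.
\begin{verbatim}
# Toy simulation of the two-box example above.  A fixed policy chosen
# purely from the outcomes of trials 1-9 ("direct search") differs from
# the choice made by a learner that has extracted the reward rule.

def left(i):          # left box pays 100*i Swiss Francs in trial i
    return 100 * i

def right(i):         # right box pays 2**i Swiss Francs in trial i
    return 2 ** i

# Direct policy search: keep whichever fixed policy earned more so far.
left_total  = sum(left(i)  for i in range(1, 10))    # 4500
right_total = sum(right(i) for i in range(1, 10))    # 1022
direct_choice = "left" if left_total > right_total else "right"

# A learner that has inferred the reward-generating rule compares the
# *predicted* payoffs of the still unseen trial 10 instead.
model_choice = "left" if left(10) > right(10) else "right"   # 1000 < 1024

print(direct_choice)   # left  -- suboptimal in trial 10
print(model_choice)    # right -- exploits the underlying regularity
\end{verbatim}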
The first general, asymptotically optimal reinforcement learner is the recent AIXI model [19,21]. It is valid for a very broad class of environments whose reactions to action sequences (control signals) are sampled from arbitrary computable, or even limit-computable, probability distributions [57]. This means that AIXI is far more general than traditional RL approaches. However, while AIXI clarifies the theoretical limits of RL, it is not practically feasible, just like HSEARCH is not. From a pragmatic point of view, what we are really interested in is a reinforcement learner that makes optimal use of given, limited computational resources. In what follows, we will outline how to use OOPS-like bias-optimal methods as components of universal yet feasible reinforcement learners.
We need two OOPS modules. The first is called the predictor or world model. The second is an action searcher using the world model. The life of the entire system should consist of a sequence of cycles 1, 2, ... At each cycle, a limited amount of computation time will be available to each module. For simplicity we assume that during each cycle the system may take exactly one action. Generalizations to actions consuming several cycles are straightforward though. At any given cycle, the system executes the following procedure: first, the predictor module spends its share of time searching, in bias-optimal fashion, for a program that predicts future inputs and rewards from the history of previous observations and actions better than the best world model found so far; then the action searcher spends its share of time searching, again in bias-optimal fashion, for a control program whose action sequence, fed into the current world model, yields higher predicted cumulative reward than the best control program found so far; finally, the system executes the current action of the best control program and proceeds to the next cycle.
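The following self-contained Python sketch illustrates only the cycle structure just described; the tiny two-hypothesis model space and the reuse of the two-box environment are illustrative assumptions standing in for the general bias-optimal program search that the actual OOPS modules would perform.
\begin{verbatim}
# Illustrative skeleton of one system cycle: a predictor module improves
# the world model, an action searcher uses that model to choose the next
# action, and exactly one action is executed.  The two-hypothesis model
# space below is a stand-in for OOPS's bias-optimal program search.

MODELS = {                 # candidate world models: (trial, action) -> reward
    "constant": lambda i, a: 1.0,
    "boxes":    lambda i, a: 100 * i if a == "left" else 2 ** i,
}

def improve_world_model(history):
    # Predictor module: keep the candidate that best fits the history.
    def error(m):
        return sum((m(i, a) - r) ** 2 for i, a, r in history)
    return min(MODELS.values(), key=error)

def search_action(model, trial):
    # Action searcher: pick the action with the highest predicted reward.
    return max(("left", "right"), key=lambda a: model(trial, a))

def true_reward(trial, action):  # the environment (two-box example above)
    return 100 * trial if action == "left" else 2 ** trial

history = []
for trial in range(1, 11):       # cycles 1..10, one action per cycle
    model  = improve_world_model(history) if history else MODELS["constant"]
    action = search_action(model, trial)
    reward = true_reward(trial, action)
    history.append((trial, action, reward))

print(history[-1])   # in trial 10 the learned model opens the right box
\end{verbatim}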
At any given time, up to which temporal horizon should the predictor try to predict? In the AIXI case, the proper way of treating the temporal horizon is not to discount future rewards exponentially, as is done in most traditional work on reinforcement learning, but to let the future horizon grow in proportion to the learner's lifetime so far [21]. It remains to be seen whether this insight carries over to OOPS-RL.
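To make the contrast explicit, the two value criteria can be written side by side as follows, where $r_k$ denotes the reward received at step $k$ and $0<\gamma<1$ is a fixed discount factor; the particular schedule $h(t)=2t$ is merely one illustrative choice whose lookahead $h(t)-t=t$ is proportional to the lifetime $t$, not a prescription from [21].
\[
V_\gamma(t) = E\Big[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\Big]
\quad\mbox{(traditional exponential discounting)},
\]
\[
V_{h}(t) = E\Big[\sum_{k=t}^{h(t)} r_{k}\Big], \qquad h(t)=2t
\quad\mbox{(undiscounted, horizon growing with lifetime)}.
\]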
Despite the bias-optimality properties of OOPS for certain ordered task sequences, OOPS-RL is not necessarily the best way of spending limited time in general reinforcement learning situations. It is possible, however, to use OOPS as a sub-module of the recent, truly optimal, reinforcement learning Gödel machine mentioned in the next section.