Incremental learning. With many typical learning situations in the real world, however, there is more informative feedback. For instance, ``supervised'' gradient-based neural net algorithms like back-prop [Werbos, 1974,LeCun, 1985,Parker, 1985,Rumelhart et al., 1986] make use of information provided by error signals (distances between actual network outputs and target values). Unlike universal search, these algorithms incrementally adjust network weights in an iterative manner: solution candidates found in previous trials serve as a basis for additional improvements. Reinforcement learning algorithms [Watkins, 1989,Dayan and Sejnowski, 1994,Barto et al., 1983,Williams, 1988,Schmidhuber, 1989] receive less informative environmental feedback than supervised learning algorithms (see Barto (1989) for an overview), but they are designed to work in an incremental fashion as well. For instance, they tend to make use of the information provided by the magnitude of the rewards, and by the amount of time between rewarding events. Again, ``good'' solutions build the basis for ``better'' solutions. The same is true for simple hill-climbing and for ``evolutionary'' and ``genetic'' algorithms (GAs).
Previous suggestions. The original universal search procedure as formulated by Levin is not designed for incremental learning situations. Suggestions for extending universal search to deal with incremental learning were made in [Solomonoff, 1986,Paul and Solomonoff, 1991], and also in [Schmidhuber, 1994] -- the first paper presenting simulations based on incremental extensions.
Problems with previous suggestions. In realistic, unknown environments, each event / action / trial / learning process occurring in the life of a learning system may affect performance and learning processes at any later time. This fact is not properly taken into account by previous methods, neither by those mentioned above, nor by existing reinforcement learning algorithms, and not even by naive, inefficient, but supposedly infallible exhaustive search: in general, sequential and systematic generate-and-test of all policy candidates will change the environment, thus changing the conditions for policy candidates considered earlier. In fact, a reasonable performance criterion for such general (but typical) situations was missing.
Performance criterion for incremental learning. Recent work [Schmidhuber, 1996] defined such a criterion: the ``reward acceleration criterion'' (RAC). Although not essential for the message of the current paper, a brief review of the main concepts will follow.
Suppose a reinforcement learner's life in an unknown environment lasts from time 0 to unknown time . Occasionally provides real-valued reward. The cumulative reward obtained between time 0 and time is denoted by (where ). The learner's goal at time is to use experience to maximize .
Between time 0 and , the learner repeats the following cycle over and over again ( denotes a set of possible actions): Select and execute with probability , where the internal state is a vector of variable, real-valued components, and the modifiable policy is a variable, conditional probability distribution on the possible actions, given current . may change , , (environmental inputs are represented by certain components of ). Actions that modify are called learning algorithms. Since their execution probablities also depend on , essentially controls the way it modifies itself (``self-reference'').
Some actions group learning algorithms and other actions into sequences called ``policy modification processes'' (PMP). The -th PMP in system life is denoted PMP, starts at time , ends at , , and computes a modification denoted .
Certain actions trigger an evaluation of the system's
performance so far. Such evaluations may cause
policy modifications to be undone (by restoring
the previous policy -- in practical implementations,
this requires to store previous values of modified
policy components on a stack).
At a given time in the learner's life,
let denot the set of those previous whose
corresponding have not
been undone yet. If is not empty,
denote the -th valid time, ordered according to size.
RAC is satisfied if either is empty (trivial case) or if
Using stack-based backtracking methods such as those in [Schmidhuber, 1996,Wiering and Schmidhuber, 1996,Zhao and Schmidhuber, 1996,Schmidhuber et al., 1996], one can guarantee that RAC is satisfied each time a new PMP starts (of course, the time consumed by backtracking itself is taken into account by RAC). Moreover, in typical environments, RAC can be satisfied in a non-trivial way. Since the success of a policy modification recursively depends on the success of later modifications for which it is setting the stage, the framework provides a sound basis for ``learning how to learn.'' It is quite general: a wide variety of learning algorithms may be included in the action set . Numerous experiments in the papers mentioned above show its practical feasibility. For instance, in [Wiering and Schmidhuber, 1996], includes an extension of Levin search for generating the PMPs. This leads to significant performance speed-ups in experiments with sequences of more and more complex tasks.