Fundamental transfer limitations. Inductive transfer of knowledge from one task solution to the next (e.g., Caruana et al. 1995; Pratt and Jennings 1996) requires the solutions to share mutual algorithmic information. Since almost all sequences of solutions to well-defined problems are incompressible and have maximal Kolmogorov complexity \cite{Solomonoff:64,Kolmogorov:65,Chaitin:69,LiVitanyi:93}, arbitrary task solutions almost never share mutual information. This implies that inductive transfer and ``generalization'' are almost always impossible -- see, e.g., Schmidhuber (1997a); for related results see Wolpert (1996). From a practical point of view, however, even the presence of mutual information is no guarantee of successful transfer, because concepts such as Kolmogorov complexity and algorithmic information ignore the time consumed by learning algorithms that compute a new task's solution from previous ones. Yet in typical machine learning applications it is precisely the learning time that we want to minimize.
Reward acceleration. Given the observations above, all attempts at successful transfer must be limited to task sequences of a particularly friendly kind. In the context of reinforcement learning (RL) we will focus on task sequences that allow for speeding up the learner's long-term average reward intake. Fortunately, in our own highly atypical and regular universe such task sequences abound. For instance, often we encounter situations where high reward for some problem's solution can be achieved more quickly by first learning easier but related tasks yielding less reward.
Our learner's single life lasts from time 0 to unknown time $T$ (time is not reset in case of new learning trials). Each modification of its policy corresponds to a shift of inductive bias (Utgoff, 1986). By definition, ``good'' bias shifts are those that help to accelerate long-term average reward intake. The learner's method for generating good bias shifts must take into account: (1) Bias shifts occurring early in the learner's life generally influence the probabilities of later bias shifts. (2) ``Learning'' (modifying the policy) and policy tests will consume part of the learner's limited life-time.
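To make ``accelerating long-term average reward intake'' concrete, the following notational sketch may help (the symbols $t_i$ and $R(\cdot)$ are introduced here only for illustration; the formal setup follows in section 2). Let $t_1 < t_2 < \dots < t_k \leq t$ denote the times of the still-valid policy modifications at current time $t$, and let $R(t)$ denote the cumulative reward obtained by time $t$. A lifelong history of reward accelerations then amounts to the chain of inequalities
\[
\frac{R(t)}{t} \;<\; \frac{R(t)-R(t_1)}{t-t_1} \;<\; \frac{R(t)-R(t_2)}{t-t_2} \;<\; \dots \;<\; \frac{R(t)-R(t_k)}{t-t_k},
\]
that is, the average reward per time measured since each surviving bias shift strictly exceeds the average measured since the previous one (and since the start of life).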
Previous RL approaches. To deal with issues (1) and (2), what can we learn from traditional RL approaches? Convergence theorems for existing RL algorithms such as Q-learning \cite{WatkinsDayan:92} require infinite sampling size as well as strong (usually Markovian) assumptions about the environment, e.g., \cite{Sutton:88,WatkinsDayan:92,Williams:92}. They are of great theoretical interest but less relevant to our realistic limited-life case. For instance, there is no proof that Q-learning will converge within given finite time, not even in Markovian environments. Furthermore, previous RL approaches do not include the computation time consumed by learning and policy tests in their objective function, and they do not explicitly measure the long-term effects of early learning on later learning.
Basic ideas (see details in section 2). To address issues (1) and (2), we treat learning algorithms just like other time-consuming actions. Their probabilities of being executed at a given time may depend on the learner's current internal state and policy. Their only distinguishing feature is that they may also modify the policy. In case of policy changes or bias shifts, the information necessary to restore the old policy is pushed on a stack. At any given time in the learner's life there is only one single training example for estimating the long-term usefulness of any previous bias shift -- namely, the reward per time since then. This includes all the reward collected after later bias shifts for which it may have set the stage, thus providing a simple measure of earlier learning's usefulness for later learning. Occasionally the ``success-story algorithm'' (SSA) uses backtracking to undo those policy modifications that have not been empirically observed to trigger long-term reward accelerations (measured up until the current SSA call). For instance, certain bias shifts may have been too specifically tailored to previous tasks (``overfitting'') and may be harmful for future inductive transfer. The bias shifts that survive SSA represent a lifelong success history. Until the next SSA call, they will form the basis for additional bias shifts and get another chance to justify their existence.
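The backtracking step can be pictured with the following rough Python sketch (the function and field names are our own invention for illustration; the actual SSA details appear in section 2). Each stack entry records when a policy modification happened, the cumulative reward at that time, and a procedure for undoing it; SSA pops and undoes entries from the top until the surviving checkpoints exhibit strictly increasing reward per time:

```python
def ssc_holds(stack, total_reward, now):
    """Success-story criterion (sketch): reward per time measured since
    each surviving checkpoint must strictly exceed the reward per time
    measured since the previous one (and since the start of life at 0)."""
    rates = [total_reward / now] + [
        (total_reward - e['R']) / (now - e['t']) for e in stack
    ]
    return all(a < b for a, b in zip(rates, rates[1:]))

def ssa(stack, total_reward, now):
    """One SSA call: pop and undo policy modifications from the top of
    the stack until the success-story criterion holds for the remainder.

    stack entries are dicts with keys
      't'    -- time of the policy modification (checkpoint)
      'R'    -- cumulative reward collected by that time
      'undo' -- callable restoring the pre-modification policy
    """
    while stack and not ssc_holds(stack, total_reward, now):
        stack.pop()['undo']()  # backtrack: restore the older policy
    return stack
```

Note that undoing must proceed strictly top-down: restoring an older policy only makes sense once all later modifications built on top of it have been removed first, which is exactly what the stack discipline guarantees.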
Due to unknown reward delays, there is no a priori good way of triggering SSA calls. In principle, however, it is possible to build policies that can learn to trigger SSA calls. Since learning algorithms are actions and can be combined (according to the policy) to form more complex learning algorithms, SSA also allows for embedding the learning strategy within the policy itself. There is no pre-wired difference between ``learning'', ``metalearning'', ``metametalearning'' etc.
Outline of remainder. Section 2 will describe the learner's basic cycle of operations and SSA details. It will explain how lifelong histories of reward accelerations can be enforced despite possible interference from parallel internal or external processes. Sections 3 and 4 will present two concrete implementations and inductive transfer experiments with complex, partially observable environments (POEs). Some of our POEs are larger and more complex than those considered in most previous POE work.