1. Computing self-modifications. As in sections 1-4, the system's actions, which are initially highly random, are primitive instructions of a Turing-machine-equivalent programming language, which allows arbitrary (learning) algorithms to be implemented. Action subsequences represent either (1) ``normal'' interactions with the environment or (2) ``self-modification sequences''. Self-modification sequences can compute arbitrary modifications of the probabilities of future action subsequences, including future self-modification sequences: the learning system is able to modify itself in a universal way. There is no explicit difference between ``learning'', ``meta-learning'', and other kinds of information processing.
2. Life is one-way. Each action of the learning system (including probability-modifying actions executed by self-modification sequences) is viewed as a singular event in the history of the system's life. Unrealistic concepts such as ``exactly repeatable training iterations'', ``boundaries between trials'', ``epochs'', etc. are thrown overboard. In general, the environment cannot be reset. There is only one lifelong training episode. Learning is inductive inference from non-repeatable experiences.
3. Evaluations of self-modification sequences. The system has a time-varying utility value, which is the average payoff per unit of time since system start-up. Each completed self-modification sequence also has a time-varying utility value: the average payoff per unit of time, measured from the moment the sequence began execution. Unlike in previous systems, evaluations of utility take into account all the computation time required for learning, including the time required for evaluating utility.
4. Recursive definition of ``useful'' self-modification sequences. The system keeps track of probability modifications computed by completed self-modification sequences that it considers useful. Usefulness is defined recursively. If there are no previous useful self-modification sequences (e.g. at system start-up), a completed self-modification sequence is considered useful only for as long as its utility value exceeds the system's utility value. More recent completed self-modification sequences are considered useful for as long as they have higher utility values than all preceding self-modification sequences currently considered useful.
Essentially, the system only keeps modifications to its probability values that originated from useful self-modification sequences.
5. Acceleration of payoff intake / Theoretical soundness. It can be shown that over time, the system tends to make better and better use of its computational resources. In fact, it can be shown that it accelerates payoff (reinforcement) intake in the long run in the following sense: at every time step in the life of the system (except during execution of self-modification sequences), all (self-computed) valid modifications to its strategy have been followed by faster average payoff intake than all previous valid modifications (and system start-up itself). Perhaps somewhat surprisingly, the nature of the environment does not matter (for instance, the interface to the possibly non-deterministic environment does not have to be Markovian).
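The utility values of item 3 and the recursive usefulness test of item 4 can be illustrated with a minimal sketch. It is not the implementation of (Schmidhuber, 1994b); the names `Checkpoint`, `utility`, and `useful_sequences` are hypothetical, and the sketch only decides which completed self-modification sequences currently count as useful, without modelling how discarded probability modifications would be undone. Because utility is payoff divided by elapsed time, any time spent on learning or on the evaluation itself is charged automatically through `now`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one completed self-modification sequence."""
    name: str
    start_time: float    # time at which the sequence began execution
    start_payoff: float  # cumulative payoff collected up to start_time


def utility(now: float, total_payoff: float,
            start_time: float = 0.0, start_payoff: float = 0.0) -> float:
    """Average payoff per unit of time since start_time.

    With the defaults this is the system's own utility (since start-up).
    Assumes now > start_time.
    """
    return (total_payoff - start_payoff) / (now - start_time)


def useful_sequences(history: List[Checkpoint], now: float,
                     total_payoff: float) -> List[Checkpoint]:
    """Apply the recursive usefulness definition, oldest sequence first."""
    # Baseline: the system's utility value since start-up.
    best = utility(now, total_payoff)
    useful = []
    for cp in history:
        u = utility(now, total_payoff, cp.start_time, cp.start_payoff)
        # Useful only if it beats every preceding sequence still
        # considered useful (and hence the system's utility as well).
        if u > best:
            useful.append(cp)
            best = u
    return useful


# Illustrative numbers only (assumed, not taken from any experiment).
history = [
    Checkpoint("A", start_time=10.0, start_payoff=2.0),
    Checkpoint("B", start_time=20.0, start_payoff=9.0),
    Checkpoint("C", start_time=30.0, start_payoff=10.0),
]
kept = useful_sequences(history, now=40.0, total_payoff=20.0)
print([cp.name for cp in kept])  # prints ['A', 'C']
```

In this example the system's utility at time 40 is 0.5, sequence A has utility 0.6, B has 0.55, and C has 1.0. B fails to exceed A and is therefore not considered useful, while the retained sequences show strictly increasing utility, which is the ordering described in item 5. Since all utility values vary with time, the test would have to be repeated whenever the system re-evaluates itself.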
Unlike the non-incremental approach described in sections 2-4 of this paper, incremental self-improvement appears to be a promising way of dealing with lifelong incremental learning. A system based on incremental self-improvement has already been implemented and tested on simple toy tasks (Schmidhuber, 1994b). As expected, the experimental results are consistent with the theoretical predictions.