**1.**
**Computing self-modifications.** As in sections 1-4,
the system's initially highly random actions are
primitive instructions of a Turing-machine-equivalent
programming language,
which allows arbitrary
(learning) algorithms to be implemented. Action subsequences
represent either (1) "normal" interactions with the environment,
or (2) "self-modification sequences".
Self-modification sequences
can compute arbitrary
modifications of the probabilities of future action subsequences,
including future self-modification sequences:
the learning system can modify itself in a universal way.
There is no explicit difference between "learning",
"meta-learning", and other kinds of information processing.
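The self-referential setup above can be illustrated by a toy sketch (entirely our own; the instruction names and the single `INC_PROB` modifier are hypothetical, not the paper's instruction set): a probability table over primitive instructions in which one of the instructions rewrites the table itself, so the system can change the probabilities of its own future actions, including future self-modifications.

```python
import random

# Toy sketch (hypothetical instruction set, not the paper's actual one):
# the policy is a probability table over primitive instructions, and one
# of those instructions ("INC_PROB") modifies the table itself.

INSTRUCTIONS = ["LEFT", "RIGHT", "INC_PROB"]  # hypothetical primitives

class SelfModifyingPolicy:
    def __init__(self):
        # uniform initial distribution over instructions
        self.probs = {i: 1.0 / len(INSTRUCTIONS) for i in INSTRUCTIONS}

    def sample(self):
        # draw the next primitive instruction from the current table
        r, acc = random.random(), 0.0
        for instr, p in self.probs.items():
            acc += p
            if r < acc:
                return instr
        return INSTRUCTIONS[-1]

    def execute(self, instr, target=None, delta=0.0):
        # "INC_PROB" is itself drawn from the same distribution: executing it
        # changes the probabilities of future actions, including the
        # probability of future self-modifications.
        if instr == "INC_PROB" and target is not None:
            self.probs[target] += delta
            total = sum(self.probs.values())
            for k in self.probs:  # renormalize
                self.probs[k] /= total

policy = SelfModifyingPolicy()
policy.execute("INC_PROB", target="LEFT", delta=0.5)
```

Since `INC_PROB` is sampled like any other instruction, there is no built-in boundary between "learning" and "meta-learning" in this sketch either.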

**2.**
**Life is one-way.**
Each action of the learning system
(including probability-modifying actions executed by
self-modification sequences)
is viewed as a singular event in the history of system life.
Unrealistic concepts
such as "exactly repeatable training iterations",
"boundaries between trials",
"epochs", etc.
are thrown overboard.
In general, the environment cannot be reset.
Life is one-way.
There is only *one* lifelong training episode.
Learning is inductive inference from non-repeatable experiences.

**3.**
**Evaluations of self-modification sequences.**
The system maintains a time-varying
utility value: the average payoff per
unit time since system start-up.
Each completed self-modification sequence also has a
time-varying utility value:
the average payoff
per unit time since the sequence began execution.
*Unlike with previous systems,
evaluations of utility take into account
all the computation time required for learning,
including the time required for evaluating utility.*
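The two utility values can be written down directly. A minimal sketch (the variable names are our own notation, not the paper's): both are ratios of accumulated payoff to elapsed time, measured over different windows, and the clock keeps running during all computation, including the utility evaluations themselves.

```python
# System utility   = total payoff / time since system start-up.
# Sequence utility = payoff accumulated since the sequence began execution,
#                    divided by the time elapsed since then.
# All time counts, including time spent learning and evaluating utility.

def system_utility(total_payoff, t):
    """Average payoff per unit time since system start-up."""
    return total_payoff / t

def sequence_utility(total_payoff, t, payoff_at_start, t_start):
    """Average payoff per unit time since the sequence began execution."""
    return (total_payoff - payoff_at_start) / (t - t_start)

# Example: 12 payoff units over 100 time steps in total,
# 5 of them in the 20 steps since a self-modification sequence began.
u_sys = system_utility(12.0, 100)               # 0.12
u_seq = sequence_utility(12.0, 100, 7.0, 80)    # 0.25
```

Here the sequence outperforms the system-wide average, which matters for the usefulness test in point 4.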

**4.**
**Recursive definition of "useful" self-modification sequences.**
The system keeps track of probability modifications computed
by completed self-modification sequences that it considers *useful*.
*Usefulness* is defined recursively.
If there are no previous *useful* self-modification sequences
(e.g. at system start-up), a completed self-modification sequence
is considered
*useful* only for as long as its utility value exceeds the system's
utility value.
More recent completed self-modification sequences are
considered *useful* for as long as they have
higher utility values than all preceding self-modification sequences
currently considered *useful*.

Essentially, the system keeps only those modifications to its probability
values that originated from *useful* self-modification sequences.
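This recursive bookkeeping can be sketched as a stack discipline (a simplification of our own; we assume each completed sequence is summarized only by the total payoff and time at which it began): sequences still considered *useful* sit on a stack with strictly increasing utilities, and a revision pops any sequence whose utility, measured up to the present, no longer exceeds that of all earlier useful sequences.

```python
# Hedged sketch of the recursive "usefulness" check. Each stack entry is
# (payoff_at_start, time_at_start); entry 0 represents system start-up,
# so comparing against it compares against the system's own utility.

def utilities(stack, total_payoff, t):
    """Utility of each stacked sequence, measured up to the present."""
    return [(total_payoff - r0) / (t - t0) for (r0, t0) in stack]

def revise(stack, total_payoff, t):
    """Pop sequences whose utility no longer exceeds that of all
    earlier sequences still considered useful."""
    while len(stack) > 1:
        us = utilities(stack, total_payoff, t)
        if us[-1] > max(us[:-1]):
            break
        # in the real system, popping would also undo the probability
        # modifications this sequence computed
        stack.pop()
    return stack

# start-up, then two completed self-modification sequences
stack = [(0.0, 0.0), (2.0, 50.0), (6.0, 80.0)]
revise(stack, total_payoff=12.0, t=100.0)   # all entries survive
```

With `total_payoff=12.0` the utilities are 0.12, 0.20, and 0.30, so each sequence still beats all its useful predecessors; had payoff intake slowed instead, the most recent sequence would be popped and its modifications discarded.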

**5.**
**Acceleration of payoff intake / Theoretical soundness.**
It can be shown that over time, the system tends to make better
and better use of its computational resources. In fact, it can
be shown that it accelerates payoff (reinforcement) intake in the
long run in the following sense:
at every time step in the life of the system (except during execution
of self-modification sequences),
each (self-computed) still-valid modification of its strategy
has been followed by faster average payoff intake than
all previous valid modifications (and than system start-up itself).
Perhaps somewhat surprisingly, the nature of the environment does
not matter (for instance,
the interface to the possibly non-deterministic
environment does not have to be Markovian).

Unlike the non-incremental approach of sections 2-4 of this paper, incremental self-improvement appears to be a promising way of dealing with lifelong incremental learning. A system based on incremental self-improvement has already been implemented and tested on simple toy tasks (Schmidhuber, 1994b). As expected, the experimental results are consistent with the theoretical predictions.
