Outline.
The ``evolutionary''
system in this section (see also [32])
implements the ideas from section 1,
in particular those in paragraph (**)
on ``incremental self-improvement''.
To improve/speed up its own (initially very
dumb, highly random) learning strategy, the system policy
makes use of an assembler-like programming language
suitable to modify the policy itself. The policy actually is a set of
conditional, modifiable probability distributions (initially maximum
entropy distributions) on a set of assembler-like instructions.
These distributions are used to compute the probability of each
assembler-like instruction being executed next (the system
executes a lifelong sequence of such instructions).
The PMPs from section 1 are
self-delimiting ``sequences of self-modifications'' (SSMs). SSMs are
special instruction subsequences (generated according to the current policy)
that define their own beginning and their own end
(by executing special instructions).
With the help of special ``self-referential'' instructions, SSMs
can compute almost arbitrary policy modifications (which actually
are sequences of probability modifications).
Policy modifications can be computed only by SSMs.
SSMs affect the probabilities of future SSMs, which affect
the probabilities of future SSMs, etc.
These recursive effects are automatically taken care of by SSA:
following the SSA principle, whenever an SSM computes a
policy modification,
information required to restore original probability distributions
and to compute reinforcement/time ratios for
currently valid modifications is pushed on a stack.
After each instruction executed outside an SSM,
the system runs a popping process (see section 1).
It can be guaranteed that the system,
after each action executed outside an SSM,
will achieve SSC (``learning how to learn'').
In principle, the system can learn to deal with arbitrary
reward delays by determining the lengths of its SSMs -- SSMs define
checkpoints (see section 1) by themselves.
The system is ``evolutionary'' in the sense that it lets only those (initially highly random) policy modifications survive that were followed by long-term reinforcement acceleration. The system has been implemented and tested on (non-Markovian) toy tasks in changing environments. The experiments illustrate the theoretical predictions: the system computes action subsequences leading to faster and faster reinforcement intake (section 3). Due to the generality of the approach, however, no reasonable statements can be made about improvement speed, which indeed highly depends on the nature of the environment and the choice of initial primitive instructions.
Storage.
The system's storage is a single array of cells.
Each cell has an integer address in the interval $[Min, Max]$,
where $Max$ is a positive integer and $Min$ is a negative integer.
$Cell_i$ denotes the cell with address $i$.
The variable contents of $Cell_i$ are denoted by $c_i$, and are
of type integer as well ($Min \leq c_i \leq Max$).
Special addresses, $InputStart$, $InputEnd$, $RegisterStart$, and $ProgramStart$,
are used to further divide storage into segments:
$Min < InputStart < InputEnd < RegisterStart < ProgramStart < Max$.
The input area is the set of input cells $Cell_i$ with $InputStart \leq i \leq InputEnd$.
The register area is the set of register cells $Cell_i$ with $RegisterStart \leq i < ProgramStart$.
``Registers'' are convenient for indirect-addressing purposes.
The program area is the set of program cells $Cell_i$ with $ProgramStart \leq i \leq Max$.
Integer sequences in the program area are interpreted as executable code.
The work area is the set of work cells $Cell_i$ with $Min \leq i < ProgramStart$.
Instructions executed in the program area may read from and write to the work area.
Both register area and input area are subsets of the work area.
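To make the layout concrete, the storage can be pictured as one integer array plus a handful of boundary constants. The following C sketch is purely illustrative; the identifiers and the particular boundary values are invented here and are not taken from the original implementation.

```c
#include <stdio.h>

/* Hypothetical segment boundaries; the text only requires
   Min < InputStart < InputEnd < RegisterStart < ProgramStart < Max. */
enum {
    MIN_ADDR       = -200,
    INPUT_START    = -190,
    INPUT_END      = -171,
    REGISTER_START =  -20,
    PROGRAM_START  =    0,
    MAX_ADDR       =  199,
    NCELLS         = MAX_ADDR - MIN_ADDR + 1
};

static int storage[NCELLS];                 /* one integer per cell          */

/* cell(i) maps an address in [Min, Max] to the corresponding array slot. */
static int *cell(int addr) { return &storage[addr - MIN_ADDR]; }

int main(void)
{
    *cell(REGISTER_START) = 42;             /* write the first register cell */
    printf("c(RegisterStart) = %d\n", *cell(REGISTER_START));
    printf("input area   : [%d, %d]\n", INPUT_START, INPUT_END);
    printf("work area    : [%d, %d]\n", MIN_ADDR, PROGRAM_START - 1);
    printf("program area : [%d, %d]\n", PROGRAM_START, MAX_ADDR);
    return 0;
}
```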
Environmental inputs. At every time step, new inputs from the environment may be written into the input cells.
Primitives.
The assembler-like programming language is partly inspired by
the one in [33,38].
There are $n_{ops}$ primitive instructions.
Each ``primitive'' is represented by a unique number in the set
$\{0, 1, \ldots, n_{ops} - 1\}$ (due to the code being written in C).
The primitive with number $j$ is denoted by $a_j$.
Primitives may have from zero to three arguments, each of which
takes a value from the set of possible program cell contents (see below).
The semantics of the primitives and their corresponding
arguments are given in Table 1.
The non-self-referential (``normal'') primitives include actions for
comparisons and conditional jumps, for copying storage contents,
for initializing certain storage cells with small integers, and
for adding, subtracting, multiplying, and dividing.
They also include
output actions for modifying
the environment, and input actions for perceiving
environmental states.
The ``self-referential'' primitives will be described in detail below.
Primitive and argument probabilities.
For each cell $Cell_i$ in the program area,
there is a discrete probability distribution $P_i$
over the set of possible cell contents.
There is a variable InstructionPointer (IP)
which always points to one of the program cells (initially to the first one).
If IP $= i$ and $i \leq Max - 3$, then $P_i(j)$ denotes the probability of selecting
primitive $a_j$ as the next instruction.
The restriction $i \leq Max - 3$ is needed to leave
room for the instruction's possible arguments should it require any.
Once $a_j$ is selected: $c_i \leftarrow j$.
If $a_j$ has a first argument, then $P_{i+1}(k)$ is
the probability of $k$ being chosen as its actual value,
for each possible cell content $k$.
Once some $k$ is selected: $c_{i+1} \leftarrow k$.
Analogously, if $a_j$ has a second argument, then $P_{i+2}(l)$ is
the probability of $l$ being chosen as its actual value,
for each possible cell content $l$.
Once some $l$ is selected: $c_{i+2} \leftarrow l$.
And finally, if $a_j$ has a third argument, then $P_{i+3}(m)$ is
the probability of $m$ being chosen as its actual value,
for each possible cell content $m$.
Once some $m$ is selected: $c_{i+3} \leftarrow m$.
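One way to picture the per-cell distributions: a matrix with one row per program cell and one column per possible cell value, from which the next primitive or argument is drawn by roulette-wheel selection. The sketch below is a minimal illustration under assumed sizes (NPROG program cells, NVALS possible values); none of the names or numbers come from the paper.

```c
#include <stdio.h>
#include <stdlib.h>

#define NPROG 200     /* number of program cells (assumed)                 */
#define NVALS  16     /* number of possible program cell values (assumed)  */

/* P[i][j]: probability that program cell i will next be set to value j. */
static double P[NPROG][NVALS];

/* Roulette-wheel draw of a value for program cell i according to P[i]. */
static int sample_value(int i)
{
    double r = (double)rand() / ((double)RAND_MAX + 1.0);
    double acc = 0.0;
    for (int j = 0; j < NVALS; j++) {
        acc += P[i][j];
        if (r < acc)
            return j;
    }
    return NVALS - 1;                       /* guard against rounding error */
}

int main(void)
{
    /* Maximum entropy initialization: every value equally likely. */
    for (int i = 0; i < NPROG; i++)
        for (int j = 0; j < NVALS; j++)
            P[i][j] = 1.0 / NVALS;

    printf("sampled value for cell 0: %d\n", sample_value(0));
    return 0;
}
```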
Double-indexed addressing. Although program cells can take on only a few different values, the use of double-indexed addressing allows for addressing the entire storage: a parameter in a program cell may point to a work cell with a small address. The content of this work cell is not limited in this way and may point to any address in storage. See [11,53,42,40,39], however, for alternative implementations without double-indexed addressing.
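As a concrete illustration of double-indexed addressing: a small argument found in a program cell names a register, and the register's (unrestricted) content supplies the actual storage address. The helper below is a hypothetical sketch reusing the invented layout constants from the storage sketch above; a real implementation would also have to validate the final address.

```c
#include <stdio.h>

enum { MIN_ADDR = -200, REGISTER_START = -20, PROGRAM_START = 0,
       MAX_ADDR = 199, NCELLS = MAX_ADDR - MIN_ADDR + 1 };

static int storage[NCELLS];
static int *cell(int addr) { return &storage[addr - MIN_ADDR]; }

/* Double-indexed read: `arg` (a small value taken from a program cell)
   selects a register; the register's content is the address actually read. */
static int read_indirect(int arg)
{
    int reg_addr = REGISTER_START + arg;    /* small address in the work area */
    int target   = *cell(reg_addr);         /* may point anywhere in storage  */
    return *cell(target);
}

int main(void)
{
    *cell(-100)               = 7;          /* some work cell holds a 7       */
    *cell(REGISTER_START + 3) = -100;       /* register 3 points at that cell */
    printf("read_indirect(3) = %d\n", read_indirect(3));   /* prints 7 */
    return 0;
}
```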
Current policy.
The set of all current $P_i(j)$-values defines the system's current policy.
Instruction cycle.
A single step of the interpreter works as follows:
if IP points to program cell $Cell_i$,
a primitive and the corresponding arguments
are chosen randomly according to the current
probability distributions, as already described.
They are sequentially written onto the program area, starting
from $Cell_i$. Syntax checks are performed. Rules for
syntactic correctness are given in the caption of Table 1.
If syntactically
correct,
the instruction gets executed.
This may result in modifications of
IP and/or
environment and/or storage.
If the instruction did not itself change the value
of IP (as, e.g., a jump instruction would),
IP is set to the address of the cell following the last
argument of the current instruction.
If the instruction is syntactically
incorrect, IP is reset to the first
program cell.
This instruction cycle
represents the basic operation of the system.
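The instruction cycle just described can be summarized by the following loop skeleton. Everything in it (the primitive table, the stand-in syntax check, the placeholder sampler, the wrap-around when IP runs out of room) is an assumption made for the sake of a compilable sketch, not the paper's code.

```c
#include <stdio.h>
#include <stdlib.h>

#define PROGRAM_START 0
#define MAX_ADDR      199
#define NOPS          12            /* number of primitives (assumed)        */

typedef struct { int nargs; void (*exec)(int, int, int); } Primitive;

static Primitive primitive[NOPS];   /* semantics would follow Table 1        */
static int c[MAX_ADDR + 1];         /* program-area cell contents            */
static int IP = PROGRAM_START;      /* instruction pointer                   */
static int ip_changed;              /* set to 1 by jump-like primitives      */

/* Placeholders for the roulette-wheel draw over P[i] and the Table 1 rules. */
static int sample_value(int i) { (void)i; return rand() % NOPS; }
static int syntactically_correct(int op, const int *args) { (void)op; (void)args; return 1; }

static void noop(int a, int b, int d) { (void)a; (void)b; (void)d; }

static void instruction_cycle_step(void)
{
    if (IP > MAX_ADDR - 3)                  /* no room left for op + 3 args:  */
        IP = PROGRAM_START;                 /* wrap around (detail assumed)   */
    int i = IP;
    int op = sample_value(i);               /* choose the primitive number    */
    c[i] = op;
    int nargs = primitive[op].nargs;
    int args[3] = {0, 0, 0};
    for (int k = 0; k < nargs; k++)         /* choose and store the arguments */
        args[k] = c[i + 1 + k] = sample_value(i + 1 + k);

    if (syntactically_correct(op, args)) {
        ip_changed = 0;
        primitive[op].exec(args[0], args[1], args[2]);
        if (!ip_changed)
            IP = i + 1 + nargs;             /* cell after the last argument   */
    } else {
        IP = PROGRAM_START;                 /* reset on a syntax error        */
    }
}

int main(void)
{
    for (int k = 0; k < NOPS; k++) {        /* dummy primitive table          */
        primitive[k].nargs = k % 4;
        primitive[k].exec  = noop;
    }
    for (int step = 0; step < 5; step++)
        instruction_cycle_step();
    printf("IP after 5 steps: %d\n", IP);
    return 0;
}
```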
System life.
At time step 0,
storage is initialized
with zeros.
The probability distributions of all program cells are
initialized with maximum entropy distributions [44].
That is, all $P_i(j)$ values are initialized to the
same value, so that there is no
bias for a particular value in any cell.
After initialization,
the instruction cycle is repeated over and over again
until system death at time $T$.
Recall that $T$ does not have to be known in advance.
Storage as part of the environment. After initialization at time 0, the storage cells are never re-initialized: the storage has to be viewed as part of the total environment of the system policy. The storage is like an external notebook. Notes written into the notebook may influence future actions. Notes may represent useful memories, or may have harmful long-term effects.
Self-referential primitives.
Two special primitives, DecP and IncP,
may be used to address and
modify the current probability distribution
of any program cell (see Table 1).
With the action DecP,
the $P_i(j)$ value for a particular cell/value pair
can be decreased by some factor (see Table 1).
The probabilities for that cell are then normalized.
Likewise, with the action IncP,
the complement ($1 - P_i(j)$) of the $P_i(j)$ value for
a particular cell $i$ and value $j$
can be decreased by such a factor
(and the cell probabilities are again renormalized).
DecP and IncP
have no effect
if the indirectly addressed cell contents
(see Table 1) are not an integer between 1 and 99,
or if the corresponding probability modification would lead to at
least one $P_i(j)$-value below MinP
(a small positive constant).
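The effect of DecP and IncP on a single cell's distribution can be sketched as follows. Only the decrease / renormalize / MinP-guard logic described above is intended to be faithful; the function names, the value of MINP, and the way the factor is passed in are assumptions.

```c
#include <stdio.h>

#define NVALS 16
#define MINP  0.004      /* small positive lower bound (hypothetical value) */

/* DecP: decrease P[j] by `factor` (0 < factor < 1) and renormalize the cell.
   Returns 0 and leaves P untouched if any probability would drop below MINP. */
static int dec_p(double *P, int j, double factor)
{
    double q[NVALS], sum = 0.0;
    for (int k = 0; k < NVALS; k++) q[k] = P[k];
    q[j] *= factor;
    for (int k = 0; k < NVALS; k++) sum += q[k];
    for (int k = 0; k < NVALS; k++) {
        q[k] /= sum;                       /* renormalize the whole cell     */
        if (q[k] < MINP) return 0;         /* would violate the MinP bound   */
    }
    for (int k = 0; k < NVALS; k++) P[k] = q[k];
    return 1;
}

/* IncP: decrease the complement (1 - P[j]) by `factor`; scaling the other
   entries by the same factor keeps the distribution normalized.            */
static int inc_p(double *P, int j, double factor)
{
    double q[NVALS];
    for (int k = 0; k < NVALS; k++)
        q[k] = (k == j) ? 1.0 - factor * (1.0 - P[j]) : factor * P[k];
    for (int k = 0; k < NVALS; k++)
        if (q[k] < MINP) return 0;         /* no effect if MinP is violated  */
    for (int k = 0; k < NVALS; k++) P[k] = q[k];
    return 1;
}

int main(void)
{
    double P[NVALS];
    for (int k = 0; k < NVALS; k++) P[k] = 1.0 / NVALS;
    inc_p(P, 3, 0.5);
    printf("P[3] after IncP: %.3f\n", P[3]);   /* roughly 0.53 */
    return 0;
}
```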
The primitive GetP can be used to write scaled versions of current probability values into work cells. GetP is potentially useful for purposes of introspection.
There is also another self-referential instruction, EndSelfMod(), that helps to group a sequence of probability-modifying instructions and other instructions into self-delimiting self-modification sequences (see below).
[Table 1: the primitive instructions, their arguments, and their semantics, including the rules for syntactic correctness (not reproduced here).]
Self-delimiting sequences of self-modifications (SSMs).
They correspond to the PMPs from section 1.
The first probability-modifying instruction (IncP or DecP)
executed after an EndSelfMod() action or after system
``birth'' begins an SSM.
The SSM ends itself by executing EndSelfMod().
Some of the (initially highly random)
action subsequences executed during system life
will indeed be SSMs.
They can compute almost arbitrary
sequences of modifications of $P_i(j)$ values,
resulting in almost arbitrary modifications of
context-dependent probabilities of future action subsequences,
including future SSMs.
However, due to MinP being positive,
the probability of
selecting and executing a particular instruction
at a particular time cannot entirely vanish.
In agreement with section 1, the
$k$-th SSM is called PMP$_k$.
Keeping track with a stack.
Following the SSA principle,
whenever the system uses DecP or IncP to
modify one of its probability distributions,
the following values are pushed onto a stack $S$:
the current time, the total reinforcement so far, the address of the modified
program cell, its old probability distribution
right before the current modification
(represented by one real value per possible cell content),
and a pointer to the stack entry
corresponding to the first probability modification computed
by the current SSM.
This information
is needed by the SSA popping process to be described later.
More formally:
The $k$-th entry of $S$, $k \geq 0$, is denoted $S_k$.
$S_k$ consists of the following variables:
$S_k.t$, $S_k.R$, $S_k.address$, $S_k.OldP$
(a vector of real-valued variables, one per possible cell content), and $S_k.ptr$.
The variable $sp$ points to the current topmost stack entry
($sp = 0$ at system startup).
The zeroth stack entry, which cannot be popped,
is initialized as follows:
$S_0.t = 0$; $S_0.R = 0$; $S_0.ptr = 0$. The remaining values are undefined.
If some SSM modifies some probability distribution $P_i$
at time $t'$ (by using DecP or IncP),
$sp$ is incremented, and the new topmost entry is set as follows:
$S_{sp}.t = t'$;
$S_{sp}.R = R(t')$, the total reinforcement received so far;
$S_{sp}.address = i$;
$S_{sp}.OldP$ is set to $P_i$ as it was right before the modification
(represented by one real value per possible cell content).
If $t'$ marks the beginning of an SSM
then $S_{sp}.ptr = sp$.
Otherwise (in between beginning and end of an SSM),
$S_{sp}.ptr = S_{sp-1}.ptr$.
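In code, a stack entry and the push performed for each DecP/IncP might look as follows; the field names mirror the notation above ($S_k.t$, $S_k.R$, etc.) but, like the sizes, they are invented for this sketch.

```c
#include <string.h>

#define NVALS     16                /* possible cell contents (assumed)       */
#define STACKSIZE 10000

typedef struct {
    double t;                       /* time of the modification               */
    double R;                       /* total reinforcement received by then   */
    int    address;                 /* index of the modified program cell     */
    double OldP[NVALS];             /* the cell's distribution just before it */
    int    ptr;                     /* entry of the current SSM's first push  */
} StackEntry;

static StackEntry S[STACKSIZE];
static int sp = 0;                  /* S[0] is the unpoppable base entry      */
static int ssm_running = 0;         /* nonzero while an SSM is open           */

/* Called whenever DecP/IncP actually modifies P[i] at time t, with total
   reinforcement R so far and old_p holding the distribution before the change. */
static void push_modification(double t, double R, int i, const double *old_p)
{
    sp++;
    S[sp].t = t;
    S[sp].R = R;
    S[sp].address = i;
    memcpy(S[sp].OldP, old_p, sizeof S[sp].OldP);
    if (!ssm_running) {             /* this push opens a new SSM              */
        S[sp].ptr = sp;
        ssm_running = 1;            /* cleared again by EndSelfMod()          */
    } else {
        S[sp].ptr = S[sp - 1].ptr;
    }
}

int main(void)
{
    double uniform[NVALS];
    for (int k = 0; k < NVALS; k++) uniform[k] = 1.0 / NVALS;
    S[0].t = 0.0; S[0].R = 0.0; S[0].ptr = 0;   /* base entry, never popped   */
    push_modification(5.0, 0.0, 17, uniform);
    return 0;
}
```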
SSA popping processes.
Popping occurs right before the first pushing process of each SSM,
and also after each instruction except when an
SSM is running (these are the possible additional checkpoints referred
to in section 1):
``old'' probability distributions
are successively popped and restored,
until SSC (see section 1) is satisfied.
Of course, the computation time required for
pushing and
popping probability distributions etc. is taken
into account when reinforcement/time ratios are computed.
More formally ($t$ denotes the current time, and $R(t)$ the total reinforcement
received up to time $t$):
While $sp > 0$ and
\[
\frac{R(t) - S_{a}.R}{t - S_{a}.t} \;\leq\; \frac{R(t) - S_{b}.R}{t - S_{b}.t},
\qquad \mbox{where } a := S_{sp}.ptr, \; b := S_{a-1}.ptr,
\]
do: restore the old probability distributions $S_{sp}.OldP, S_{sp-1}.OldP, \ldots, S_{a}.OldP$
in the corresponding program cells, and set $sp \leftarrow a - 1$.
Note: $t$ increases during execution of the while loop.
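Under the reconstruction above, the popping process looks roughly like the loop below: as long as the reinforcement/time ratio since the start of the topmost SSM on the stack does not beat the ratio since the start of the previous surviving SSM, the entire topmost SSM is undone. The exact comparison, the identifiers, and the toy history in main are my reading of the SSA principle from section 1, not code from the paper; in the real system the current time keeps advancing while the loop runs.

```c
#include <stdio.h>
#include <string.h>

#define NVALS 16

typedef struct {
    double t, R;           /* time and cumulative reinforcement at the push */
    int    address;        /* modified program cell                         */
    double OldP[NVALS];    /* distribution before the modification          */
    int    ptr;            /* entry of the owning SSM's first modification  */
} StackEntry;

static StackEntry S[1024];
static int    sp = 0;
static double P[4][NVALS];          /* a few program-cell distributions      */

static double ratio(int k, double t, double R)
{
    return (R - S[k].R) / (t - S[k].t);   /* reinforcement/time since S[k]   */
}

/* One SSA popping pass at current time t with cumulative reinforcement R.
   (Here t is passed in once; in the real system it keeps increasing while
   the loop runs, because restoring distributions costs time.)              */
static void ssa_pop(double t, double R)
{
    while (sp > 0) {
        int a = S[sp].ptr;              /* start of the topmost SSM          */
        int b = S[a - 1].ptr;           /* start of the previous valid SSM   */
        if (ratio(a, t, R) > ratio(b, t, R))
            break;                      /* SSC satisfied: stop popping       */
        while (sp >= a) {               /* undo the whole topmost SSM        */
            memcpy(P[S[sp].address], S[sp].OldP, sizeof S[sp].OldP);
            sp--;
        }
    }
}

int main(void)
{
    /* Base entry: system birth at t = 0 with zero reinforcement. */
    S[0].t = 0.0; S[0].R = 0.0; S[0].ptr = 0;

    /* Toy history: SSM 1 began at t = 10 (R = 2), SSM 2 at t = 20 (R = 3). */
    sp = 2;
    S[1].t = 10.0; S[1].R = 2.0; S[1].address = 0; S[1].ptr = 1;
    S[2].t = 20.0; S[2].R = 3.0; S[2].address = 1; S[2].ptr = 2;

    ssa_pop(30.0, 4.0);   /* ratios: SSM2 0.1, SSM1 0.1, birth 0.133 -> pop both */
    printf("entries left after popping: %d\n", sp);
    return 0;
}
```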
Lemma 1. According to the SSA principle (section 1), after each instruction (except during the execution of an SSM), an SSA popping process ensures that the beginning of each completed SSM that computed still valid probability modifications has been followed by faster reinforcement intake than the beginnings of all previous such SSMs.
Proof sketch. Induction over the stack contents after popping, in analogy to the proof of Theorem 1, section 1.
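For reference, the SSC instance the lemma refers to can be spelled out as follows (paraphrasing section 1): if $v_1 < v_2 < \ldots < v_k$ denote the begin times of the SSMs whose probability modifications are still valid at the current time $t$, and $R(t)$ denotes the total reinforcement received up to time $t$, then SSC demands
\[
\frac{R(t)}{t} \;<\; \frac{R(t)-R(v_1)}{t-v_1} \;<\; \frac{R(t)-R(v_2)}{t-v_2} \;<\; \cdots \;<\; \frac{R(t)-R(v_k)}{t-v_k} .
\]
Each surviving SSM must thus have been followed by a higher average reinforcement rate than all earlier surviving SSMs and than system birth itself.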
Arbitrary reward delays.
The EndSelfMod() instruction allows the system to
delay SSA's evaluations of probability modifications arbitrarily
(the expectation
of the delay remains finite, however, due to
MinP being positive).
These checkpoint setting capabilities are important, for two reasons:
(1) In general, reinforcement events will be separated by long (unknown)
time lags. Hence,
novel probability modifications are not necessarily bad if they
do not lead to immediate reinforcement. In principle, the system itself
can learn how much time to spend on waiting for first reinforcement
events.
(2) Two successive modifications
of two particular probability distributions may turn
out to be beneficial in combination,
while each by itself may be harmful.
For this reason,
the system is allowed to compute arbitrary sequences of
probability modifications, before facing SSA's evaluations.
Delaying evaluations does cost time, though, which is taken into account by SSA. In the long run, the system is encouraged to create useful SSMs of the appropriate size.
Accelerating reinforcement intake. Lemma 1 shows: the scheme keeps only modifications caused by SSMs followed by faster and faster reinforcement intake, thus satisfying SSC from section 1. Due to non-vanishing probabilities of arbitrary action sequences (recall that MinP is positive), the system will eventually discover a way to improve its current performance, if there is any.
Learning how to learn. Performance improvements include improvements that make future improvements more likely: SSMs can prove their long term usefulness by setting the stage for additional, useful SSMs, which potentially include SSMs executing known (and not yet known) learning algorithms. The success of an SSM recursively depends on the success of all later SSMs. SSA automatically takes care of this, thus recursively encouraging ``learning how to learn how to learn ...''. This represents an essential difference to previous approaches to ``continual'' learning, see [24].
Improvement speed?
Due to the generality of the approach, no reasonable
statements can be made about improvement speed, which
indeed highly depends on the nature of the environment and the choice
of initial primitive instructions. This lack is shared
by almost all other reinforcement learning approaches, though.
Note, however, that unlike previous, less general
systems, the novel system in principle can exploit almost arbitrary
environmental regularities
[14,5,45,21]
(if there are any) to speed up
performance improvement,
simply because it can run almost arbitrary learning
algorithms. For instance, in principle,
unlike previous evolutionary and genetic algorithms
[23,43,13,12,15],
the system can learn to focus its modifications
on interfaces between useful ``subprograms''
(``divide and conquer''), instead of mutating the
subprograms themselves
(if this proves to be beneficial in a given environment), thus
creating a higher-level, more abstract search space
(directed mutations as opposed to random mutations).
Just as evolution ``discovered''
that having the ``genetic crossover operator'' was a ``good thing'',
the system is potentially able to discover that
various more directed search strategies are ``good things''.
There is just no feasible way of predicting
the precise nature of such speed-ups.
How many environments are regular? The last paragraph mentioned that the novel system in principle can exploit almost arbitrary environmental regularities, if there are any. One might ask: how likely is it that a given environment contains regularities at all? How many environments do indeed allow for successful generalization from previous experience? A recent debate in the machine-learning community highlighted a fact that appears discouraging at first glance: in general, generalization cannot be expected, inductive inference is impossible, and nothing can be learned. See, e.g., [8,26,52,31,38,37]. Paraphrasing from a previous argument [38,37]: let the task be to learn some relation between finite bitstrings and finite bitstrings. Somehow, a training set is chosen. In almost all cases, the shortest algorithm computing a (non-overlapping) test set essentially has the same size as the whole test set. This is because most computable objects are irregular and incompressible [14,5]. The shortest algorithm computing the test set, given the training set, isn't any shorter. In other words, the relative algorithmic complexity of the test set, given the training set, is maximal, and the mutual algorithmic information between test set and training set is zero (ignoring an additive constant independent of the problem -- see [14,5,45,21]). Therefore, in almost all cases, (1) knowledge of the training set does not provide any clues about the test set, (2) there is no hope for generalization, and (3) inductive inference does not make any sense. Similarly, almost all possible environments are ``hostile'' in the sense that they don't provide regularities exploitable by any learning algorithm. This may seem discouraging.
Atypical real world. Apparently, however, generalization and inductive inference do make sense in the real world! One reason for this may be that the real world is run by a short algorithm. See, e.g., [37]. Anyway, problems that humans consider to be typical are atypical when compared to the general set of all well-defined problems. Otherwise, things like ``learning by analogy'', ``learning by chunking'', ``incremental learning'', ``continual learning'', ``learning from invariances'', ``learning by knowledge transfer'' etc. would not be possible, and experience with previous problems could not help to sensibly adjust the prior distribution of solution candidates in the search space for a new problem (shift of inductive bias, e.g. [48]).
To repeat: an interesting feature of incremental self-improvement is that its theoretical potential for exploiting environmental regularities, if there are any, exceeds the corresponding potential of previous learning systems.