Next: Experiment 3: Function Approximation Up: Implementation 2: Incremental Self-Improvement Previous: Implementation 2: Incremental Self-Improvement

## Policy and Program Execution

Storage / Instructions. The learner uses an assembler-like programming language similar to, but not quite as general as, the one in [#!Schmidhuber:95kol!#]. It has m addressable work cells with addresses ranging from 0 to m-1. The variable, real-valued contents of the work cell with address i are denoted c_i. Processes in the external environment occasionally write inputs into certain work cells. There also are n addressable program cells with addresses ranging from 0 to n-1. The variable, integer-valued contents of the program cell with address i are denoted d_i. An internal variable Instruction Pointer (IP) with range {0, ..., n-1} always points to one of the program cells (initially to the first one). There also is a fixed set I = {0, 1, ..., n_ops - 1} of integer values, which sometimes represent instructions, and sometimes represent arguments, depending on the position of IP. IP and work cells together represent the system's internal state (see section 2). For each value j in I, there is an assembler-like instruction b_j with n_j integer-valued parameters. In the following incomplete list of instructions to be used in experiment 3, the symbols x1, x2, x3 stand for parameters that may take on integer values between 0 and n_ops - 1 (later we will encounter additional instructions):

b_1:
Add(x1, x2, x3) : c_{x3} <- c_{x1} + c_{x2} (add the contents of work cell x1 and work cell x2, write the result into work cell x3).

b_2:
Sub(x1, x2, x3) : c_{x3} <- c_{x1} - c_{x2}.

b_3:
Mul(x1, x2, x3) : c_{x3} <- c_{x1} * c_{x2}.

b_4:
Mov(x1, x2) : c_{x2} <- c_{x1}.

b_5:
JumpHome: IP <- 0 (jump back to 1st program cell).
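The storage model and arithmetic instructions above can be sketched in Python. The cell counts and function names below are illustrative assumptions, not part of the original formulation:

```python
# Hypothetical sketch of the storage model: m real-valued work cells
# and a handful of the arithmetic instructions operating on them.
# Sizes are toy values chosen for illustration.
m = 8
work = [0.0] * m                  # work cell contents c_0 .. c_{m-1}

def add(x1, x2, x3): work[x3] = work[x1] + work[x2]   # Add(x1, x2, x3)
def sub(x1, x2, x3): work[x3] = work[x1] - work[x2]   # Sub(x1, x2, x3)
def mul(x1, x2, x3): work[x3] = work[x1] * work[x2]   # Mul(x1, x2, x3)
def mov(x1, x2):     work[x2] = work[x1]              # Mov(x1, x2)

work[0], work[1] = 2.0, 3.0
add(0, 1, 2)      # c_2 <- c_0 + c_1
mul(2, 2, 3)      # c_3 <- c_2 * c_2
```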

Instruction probabilities / Current policy. For each program cell i there is a variable probability distribution P_i on I. For every possible j in I, P_ij specifies for cell i the conditional probability that, when pointed to by IP, its contents will be set to j. The set of all current P_ij values defines a probability matrix P with columns P_i. P is called the learner's current policy. At the beginning of the learner's life, all P_ij are equal (maximum entropy initialization). If IP = i, the contents of program cell i, namely d_i, will be interpreted as an instruction (such as Add or Mul), and the contents of the cells that immediately follow will be interpreted as that instruction's arguments, to be selected according to the corresponding P-values. See Figure 4.
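A minimal sketch of the policy matrix, under assumed toy sizes: maximum entropy initialization makes every column P_i uniform over I, and a cell's contents are drawn according to its column:

```python
import random

# Minimal sketch (hypothetical sizes): the policy is a matrix P whose
# column P[i] is a distribution over the instruction set I for program
# cell i.  Maximum-entropy initialization makes all entries equal.
n_ops, n = 6, 32                          # |I| and number of program cells
P = [[1.0 / n_ops] * n_ops for _ in range(n)]

def sample(i, rng=random):
    """Draw the contents of program cell i according to column P[i]."""
    r, acc = rng.random(), 0.0
    for j, pij in enumerate(P[i]):
        acc += pij
        if r < acc:
            return j
    return n_ops - 1
```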

Self-modifications. To obtain a learner that can explicitly modify its own policy (by running its own learning strategies), we introduce a special self-modification instruction IncProb not yet mentioned above:

b_6:
IncProb(x1, x2, x3) : Increase P_ij by gamma percent, where i = x1 * n_ops + x2 and j = x3 (this construction allows for addressing a broad range of program cells), and renormalize P_i (but prevent P-values from falling below a minimal value epsilon, to avoid near-determinism). Parameters x1, x2, x3 may take on integer values between 0 and n_ops - 1. In the experiments, we will use fixed values of gamma and epsilon.

In conjunction with other primitives, IncProb may be used in instruction sequences that compute directed policy modifications. Calls of IncProb represent the only way of modifying the policy.
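Under assumed values of n_ops, gamma, and epsilon (the paper's actual constants are not reproduced here), IncProb might be sketched as:

```python
# Hedged sketch of IncProb(x1, x2, x3): increase P_ij by gamma percent,
# where i = x1 * n_ops + x2 and j = x3, then renormalize column P[i]
# while keeping every entry at or above a floor epsilon.  The constants
# below are illustrative assumptions, not the paper's values.
n_ops = 4
gamma, eps = 15.0, 0.001
P = [[1.0 / n_ops] * n_ops for _ in range(n_ops * n_ops)]

def inc_prob(x1, x2, x3):
    i, j = x1 * n_ops + x2, x3          # addresses up to n_ops^2 - 1 cells
    col = P[i]
    col[j] *= 1.0 + gamma / 100.0       # raise P_ij by gamma percent
    total = sum(col)
    col[:] = [max(p / total, eps) for p in col]   # renormalize with floor
    total = sum(col)
    col[:] = [p / total for p in col]   # renormalize again after flooring

inc_prob(0, 2, 1)                       # bump P_ij for i = 2, j = 1
```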

Self-delimiting self-modification sequences (SMSs). SMSs are subsequences of the lifelong action sequence. The first IncProb after the learner's "birth" or after each SSA call (see section 2) begins an SMS. The SMS ends by executing another primitive not yet mentioned:

b_7:
EndSelfMod(x1). Temporarily disable IncProb, by preventing future IncProb instructions from causing any probability modifications, until x1 + 1 additional non-zero reward signals have been received -- this will satisfy the EVALUATION CRITERION in the basic cycle (section 2).
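The EndSelfMod bookkeeping might be sketched as follows; the class and method names are hypothetical, and the counter mirrors the variable set in step 1.3 below (first parameter plus one):

```python
# Illustrative sketch of the EndSelfMod(x1) counter: self-modifications
# are disabled until x1 + 1 additional non-zero reward signals arrive,
# at which point the EVALUATION CRITERION of the basic cycle is met.
class SelfModState:
    def __init__(self):
        self.nr = 0                  # pending reward count; 0 = inactive

    def end_self_mod(self, x1):
        self.nr = x1 + 1             # first parameter of EndSelfMod plus one

    def inc_prob_enabled(self):
        return self.nr == 0          # IncProb disabled while rewards pending

    def observe_reward(self, r):
        """Return True once the EVALUATION CRITERION is reached."""
        if self.nr > 0 and r != 0:
            self.nr -= 1
            return self.nr == 0
        return False
```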

Some of the (initially highly random) action subsequences executed during system life will indeed be SMSs. Depending on the nature of the other instructions, SMSs can compute almost arbitrary sequences of modifications of P_ij values. This may result in almost arbitrary modifications of context-dependent probabilities of future action subsequences, including future SMSs. Policy changes can be generated only by SMSs. SMSs build the basis for "metalearning": SMSs are generated according to the policy, and may change the policy. Hence, the policy can essentially change itself, and also the way it changes itself, and so on.

SMSs can influence the timing of backtracking processes, because they can influence the times at which the EVALUATION CRITERION will be met. Thus SMSs can temporarily protect the learner from performance evaluations and policy restorations.

Plugging SMSs into SSA. We replace step 1 in the basic cycle (see section 2) by the following procedure:

1.
REPEAT the following UNTIL the EVALUATION CRITERION is satisfied or the Boolean variable MODIFICATION-CRITERION (initially FALSE) is TRUE:

1.1.
Randomly generate an integer j in I according to matrix column P_IP (the distribution of the program cell pointed to by IP, which is initially 0 at system birth). Set program cell contents d_IP <- j. Translate j into the corresponding current instruction b_j. Look up the number n_j of cells required to store b_j's parameters. If IP > n - n_j - 1, reset IP to 0 and go to step 1. Otherwise generate instruction arguments for the n_j cells immediately following IP according to their probability distributions P_{IP+1}, ..., P_{IP+n_j}, and set IP to IP + n_j + 1.

1.2.
IF b_j is a learning instruction (such as IncProb) and not currently disabled by a previous EndSelfMod instruction, THEN set MODIFICATION-CRITERION to TRUE, exit the current REPEAT loop, and go to step 2 of the basic cycle.

1.3.
Execute b_j. IF b_j is EndSelfMod and the topmost entry in the stack S is not a "tag", THEN set the integer variable nr equal to the first parameter of b_j plus one (this will influence the time at which the EVALUATION CRITERION will be reached).

1.4.
IF there is a new environmental input, THEN let it modify the contents of certain work cells.

1.5.
IF nr > 0 and non-zero reward occurred during the current cycle, THEN decrement nr. IF nr is zero, THEN set EVALUATION CRITERION to TRUE.
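The decoding step 1.1 of this loop might be sketched as follows, with toy sizes and an assumed opcode table; the guard against an out-of-range IP is a convenience of this sketch, not part of the original cycle:

```python
import random

# Hedged sketch of step 1.1 of the modified basic cycle, under assumed
# sizes and a toy opcode table.  NPARAMS[j] plays the role of n_j, the
# number of argument cells instruction b_j requires.
rng = random.Random(0)
n_ops, n = 5, 16                           # |I| and number of program cells
NPARAMS = {0: 3, 1: 3, 2: 2, 3: 0, 4: 1}   # e.g. Add, Sub, Mov, JumpHome, EndSelfMod
P = [[1.0 / n_ops] * n_ops for _ in range(n)]
prog = [0] * n                             # program cell contents d_0 .. d_{n-1}
ip = 0                                     # instruction pointer, 0 at birth

def sample(i):
    """Draw cell i's contents according to policy column P[i]."""
    return rng.choices(range(n_ops), weights=P[i])[0]

def step():
    """One pass of step 1.1: decode one instruction and advance IP."""
    global ip
    if ip >= n:                   # safety guard for this sketch only
        ip = 0
    j = sample(ip)                # draw the contents of cell IP
    prog[ip] = j
    nj = NPARAMS[j]
    if ip > n - nj - 1:           # not enough room for the arguments:
        ip = 0                    # reset IP, restart the cycle
        return None
    for k in range(1, nj + 1):    # draw the n_j argument cells
        prog[ip + k] = sample(ip + k)
    args = prog[ip + 1: ip + nj + 1]
    ip += nj + 1
    return j, args

for _ in range(20):
    step()
```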

We also change step 3 in the SSA cycle as follows:

3.
IF MODIFICATION-CRITERION is TRUE, THEN push copies of those columns P_i to be modified by b_j (from step 1.2) onto the stack S, and execute b_j.
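The changed step 3 can be sketched as follows; names, sizes, and the restore helper are illustrative assumptions about how SSA's backtracking could undo a self-modification:

```python
# Sketch of the changed step 3: before a self-modification executes,
# copies of the affected policy columns are pushed onto the stack S so
# that SSA can later restore them.  Constants are illustrative.
n_ops = 4
P = [[1.0 / n_ops] * n_ops for _ in range(n_ops * n_ops)]
S = []                              # SSA's stack of saved columns

def execute_inc_prob(x1, x2, x3, gamma=15.0):
    i, j = x1 * n_ops + x2, x3
    S.append((i, list(P[i])))       # push a copy of column P[i] onto S
    P[i][j] *= 1.0 + gamma / 100.0  # then execute the modification
    total = sum(P[i])
    P[i][:] = [p / total for p in P[i]]

def restore_last():
    """Pop one saved column off S (used when SSA backtracks)."""
    i, old = S.pop()
    P[i][:] = old

execute_inc_prob(0, 1, 2)           # modify column P[1] ...
restore_last()                      # ... and undo the modification
```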

Juergen Schmidhuber 2003-02-25
