next up previous
Next: COMMENTS Up: DISCOVERING NEURAL NETS WITH Previous: History spotlights / Selected


Levin's universal search algorithm was considered of interest for theoretical purposes (see, e.g., [Allender, 1992] and [Li and Vitányi, 1993]). However, it seems that nobody implemented it for experimental applications, perhaps in fear of the ominous ``constant factor'' which may be large. To my knowledge, general universal search was implemented for the first time during the project that led to this paper. See also Heil's 1995 MSc thesis at Technische Universität München. In what follows, however, I will focus on the implementation of a slightly different probabilistic algorithm (also based on Levin complexity and strongly inspired by universal search). The experimental results obtained with the probabilistic algorithm (see section 4) are very similar to those obtained by the original universal search procedure.

Overview. The method described in this section searches and finds algorithms that compute solutions to a given problem specified by possibly a quite (very) limited ``training data.'' The goal is to discover solutions with high generalization performance on ``test data'' unavailable during the search phase. Towards this purpose, the probabilistic search algorithm randomly generates programs written in a general assembler-like programming language based on sequences of integers. Programs may influence their own storage size and runtime. Each program computes a solution candidate which is tested on the training data. The probability of generating a program $p$ and an upper bound $t_{max}$ for its runtime essentially equals the quotient of the probability of guessing $p$, and $t_{max}$. This implies that candidates with low Levin complexity are preferred over candidates with high Levin complexity. To measure generalization performance, candidates fitting the training data are evaluated on test data. In the experiments (section 4), solution candidates will be weight matrices for a neural net supposed to solve certain generalization tasks that are difficult for -- or impossible to solve by -- conventional neural net algorithms.

``Universal'' programming language. First we need a ``universal'' set of primitive instructions (called ``primitives'') that may be composed to form arbitrary algorithms for computing arbitrary (partial recursive) functions. The only limitation we are willing to accept is on the storage size of our machine.

It is easy to devise ``universal'' sets of primitives. Which ones do we prefer? Informally, there is one general constraint to obey: whatever is computable on the used hardware, should be computable just as efficiently (up to a small constant factor) by a program composed from primitives. For instance, on a typical serial digital machine we would like to use primitives exploiting the fast storage addressing mechanisms. We would not want to limit ourselves to the simulation of, say, a slow one tape Turing machine. Likewise, on a machine with many parallel processors we would like to use a set of primitives allowing for programs with maximal parallelism. In what follows, the focus will be on conventional digital machines. Here is a description of the set-up used in the experiments to be described later:

Storage. Programs are sequences of integers. They are stored in the storage, consisting of a single array of cells. Each cell has an integer address in the interval $[-s_w, s_p]$. Both $s_w$ and $s_p$ are positive integers. The program tape is the set of cells with addresses in $[0, s_p]$. The work tape is the set of cells with addresses in $[-s_w, -1]$. Cells with non-negative addresses belong to the program tape. Cells with negative addresses belong to the work tape. The contents of the cell with address $i$ is denoted by $c_i \in [-maxint, maxint]$, and is of type integer as well (in the implemented version, maxint equals 10000). During execution of a program, the used portion of the program tape may increase. The used portion of the work tape may increase or decrease. At any time step, the variable $Max$ ( $-1 \leq Max \leq s_p$) denotes the topmost address of the used storage. The variable $Min$ ( $-s_w \leq Min \leq 0$) denotes its smallest address. At any given time, legal addresses are in the dynamic range $[Min, OracleAddress]$, where $OracleAddress = Max + 1$, by definition. At any given time, the integer sequence on the program tape (up to address $Max$) is called the current program. If $Max = -1$ then the current program is ``empty.''

Instructions. At any given time, the variable InstructionPointer may equal the address of one of the cells, whose contents may be interpretable as an instruction. There are $n_{ops}$ different possible instructions (in the implemented version, $n_{ops} = 13$). Each instruction is uniquely represented by an instruction number from the set $\{0, ...,n_{ops}-1\}$. An instruction may have up to three arguments (all of integer type), or none. Arguments are stored in the addresses following the address of the instruction. For each argument of each instruction, there is a legal argument range (a set of integer values the argument is allowed to take on). Within certain limits, legal argument ranges can be dynamically modified by programs, as will be seen shortly.

Initialization, time limits, time probability. In the beginning of the execution of a ``program'' or ``run,'' the variables OracleAddress, InstructionPointer, Min, and CurrentRuntime are all set to zero. The variable CurrentTimeLimit is used to define an upper bound for the runtime of the current program. To obtain a probabilistic variant of universal search, CurrentTimeLimit is chosen randomly as follows: elements from the set $\{0,1\}$ are drawn with equal probability until the first ``1'' is drawn. Let $n_t$ denote the number of trials. CurrentTimeLimit is set to $UnitTime \times 2^{n_t}$, where $UnitTime$ equals 16 time steps (each program will be allowed to execute at least 16 instructions - but it may choose to halt earlier). If CurrentTimeLimit exceeds $MaxTimeLimit$, then it is replaced by MaxTimeLimit $=2^{24} = 16,777,216$. The time probability of the current program is defined by $max(\frac{1}{2^{n_t}}, \frac{UnitTime}{MaxTimeLimit})$. Short runtimes are more likely than long runtimes.

Instruction cycle and oracles. A single step of the program interpreter works as follows: if the InstructionPointer equals $OracleAddress$ ($= Max +1$), then this is interpreted as the request for an oracle. A primitive and the corresponding arguments are chosen randomly from the set of legal options (to be described below). They are sequentially written onto the program tape, starting from $OracleAddress$. $Max$ and $OracleAddress$ are increased accordingly, to reflect the growth of used program tape. Then the new primitive gets executed (except when growth beyond $s_p$ halts the program). If there is no oracle request and if the InstructionPointer equals $i$, then if the content $c_i \in [0, n_{ops}-1]$, the corresponding number of arguments $n_i$ and the corresponding legal argument ranges are looked up and checked against the contents of the $n_i$ addresses following the current address. If the instruction is ``syntactically correct,'' it gets executed. Otherwise the current program is halted. If the executed primitive did not change the value of the InstructionPointer (e.g., by causing a jump), the InstructionPointer is set to point to the address following the address of (the last argument of) the current instruction. If an instruction was executed, CurrentRuntime is incremented. If the CurrentTimeLimit is reached, the program is halted.

Runs, programs, and space probability. After initialization, the instruction cycle is repeated until a halt situation is encountered. The space probability of a program is defined as the product of the probabilities of all parameters and primitives requested and executed during its runtime. Essentially, the space probability is the probability of guessing the executed content of the program tape.

Probabilistic search. Programs are generated randomly and executed as described above, and their results are evaluated until some problem-specific performance criterion is met. Obviously, results with low Levin complexity are preferred over results with high Levin complexity. (In an alternative implementation, the original universal search algorithm was used to systematically generate all solution candidates in order of their Levin complexities. See also Heil's MSc thesis at Technische Universität München (1995), and [Wiering and Schmidhuber, 1996]).

Used primitives. The instruction numbers and the semantics of the primitives used in the experiments are listed below. An expression of the form ``$address$ $i$'' denotes the value (interpreted as an address) found in the $i$th cell following the one containing the current instruction (indirect addressing is used throughout). The following list assumes syntactical correctness of the instructions. Rules for legal argument ranges and syntactical correctness will be given shortly.

Jumpleq(address1, address2, address3). If the contents of address1 is less than or equal to the contents of address2, the InstructionPointer is set equal to address3.

Output(...). A primitive for interaction with an external environment. It corresponds to the TM action of ``writing the output tape'' (see section 2). In the experiments, ``output'' will be called ``WriteWeight.'' It will be used to generate weights for a neural network. Variants of it will be specified where needed.

Jump(address1). The InstructionPointer is set equal to address1.

Stop(). Halt the current program.

Add(address1, address2, address3). The contents of address1 is added to the contents of address2, the result is written into address3.

GetInput(address1, address2). Another primitive for interaction with an external environment. It requires $n_I$ separate ``input fields'' that may be modified by the environment (in our experiments, $n_I$ will equal 20). GetInput reads the current value of the $i$th input field into address2, where $i$ is the value found in address1. In conjunction with primitives changing the environmental state, GetInput provides an opportunity for exploiting the computing resources of the ``outside world.'' In the applications below, however, GetInput will be pretty useless - all the input fields will remain zero all the time.

Move(address1, address2). The contents of address1 is copied to address2.

Allocate(address1). The size of the work tape is increased by the value found in address1, the new cells are initialized with zeros. $Min$ is updated accordingly (growth beyond $-s_w$ halts the program). No variable can be written before enough space has been Allocated on the work tape. As will be explained below, Allocate is essential for self-sizing programs.

Increment(address1). The contents of address1 is incremented.

Decrement(address1). The contents of address1 is decremented.

Subtract(address1, address2, address3). The contents of address1 is subtracted from the contents of address2, the result is written into address3.

Multiply(address1, address2, address3). The contents of address1 is multiplied by the contents of address2, the result is written into address3.

Free(address1). The size of the work tape is decreased by the value found in address1. $Min$ is updated accordingly. This primitive complements Allocate.

Rules for legal argument ranges and syntactical correctness. Jumps may lead to any address in the dynamic range $[Min, Max+1]$ (recall that $Max + 1$ always equals the current value of $OracleAddress$). Operations that read the contents of certain cells (like add, move, jumpleq etc.) may read only from addresses in $[Min, Max]$. Operations that change the contents of certain cells may write only into work tape addresses in $[Min, -1]$. Thus, the program tape is ``read/execute'' only, except for random writes requested by moves of the InstructionPointer to $OracleAddress$. This makes reruns easy, as will be seen below. The work tape is ``read/write/execute.'' Results of arithmetic operations leading to underflow or overflow are replaced by $-maxint$ or $maxint$, respectively. No more than 5 work tape cells may be Allocated or Freed at a time.

next up previous
Next: COMMENTS Up: DISCOVERING NEURAL NETS WITH Previous: History spotlights / Selected
Juergen Schmidhuber 2003-02-12

Back to Optimal Universal Search page
Back to Program Evolution page
Back to Algorithmic Information page
Back to Speed Prior page