Given are $n$ disks of different sizes, stacked in decreasing size on the first of three pegs. Moving one peg's top disk at a time to the top of another (possibly empty) peg, but never placing a larger disk onto a smaller one, transfer all disks to the third peg. Remarkably, the fastest way of solving this famous problem requires $2^n - 1$ moves.
Untrained humans find it hard to solve larger instances. Anderson applied traditional reinforcement learning methods and was able to solve instances up to $n = 3$, solvable within at most 7 moves. Langley used learning production systems and was able to solve Hanoi instances up to $n = 5$, solvable within at most 31 moves. Traditional nonlearning planning procedures systematically explore all possible move combinations; they also fail on Hanoi problem instances with larger $n$, due to the exploding search space (Jana Koehler, IBM Research, personal communication, 2002). OOPS, however, searches in program space instead of raw solution space. Therefore, in principle it should be able to solve arbitrary instances by discovering the problem's elegant recursive solution: given $n$ and three pegs $S$ (source peg), $A$ (auxiliary peg), $D$ (destination peg), define procedure HANOI$(S, A, D, n)$: if $n = 0$, exit; call HANOI$(S, D, A, n-1)$; move the top disk from $S$ to $D$; call HANOI$(A, S, D, n-1)$.
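The elegant recursive solution can be sketched in a few lines of Python (an illustrative stand-alone version of the standard Hanoi recursion, not the OOPS-internal stack-based representation; the function and variable names are our own):

```python
def hanoi(n, source, auxiliary, destination, moves):
    """Append to `moves` the steps that transfer n disks
    from source to destination, using auxiliary as helper."""
    if n == 0:
        return
    hanoi(n - 1, source, destination, auxiliary, moves)  # clear the top n-1 disks
    moves.append((source, destination))                  # move the largest disk
    hanoi(n - 1, auxiliary, source, destination, moves)  # re-stack the n-1 disks

moves = []
hanoi(3, 1, 2, 3, moves)
print(len(moves))  # 2**3 - 1 = 7
```

Replaying the generated move list against explicit peg stacks confirms that no move ever places a larger disk on a smaller one.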
The $n$-th task is to solve all Hanoi instances up to instance $n$. We represent the dynamic environment for task $n$ on the $n$-th task tape, allocating for each peg enough addresses to store its current disk positions and a pointer to its top disk (0 if there is none). We represent the pegs $S$, $A$, $D$ by the numbers 1, 2, 3, respectively. That is, given an instance of size $n$, we push onto ds the three peg numbers and the value $n$. By doing so we insert substantial, nontrivial prior knowledge about the problem size and about the fact that it is useful to represent each peg by a symbol.
We add three instructions to the 68 instructions of our FORTH-like programming language: mvdsk() assumes that $S$, $A$, $D$ are represented by the first three elements on ds above the current base pointer, and moves a disk from peg $S$ to peg $D$. Instruction xSA() exchanges the representations of $S$ and $A$; xAD() exchanges those of $A$ and $D$ (combinations may create arbitrary peg patterns). Illegal moves cause the current program prefix to halt. Overall success is easily verifiable, since our objective is achieved once the first two pegs are empty.
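A minimal Python emulation of this environment may clarify the three task-specific instructions (a sketch under assumed semantics: the actual FORTH-like interpreter, with its task tapes and base pointers, is richer, and the class and method names are ours):

```python
class HanoiEnv:
    """Illustrative Hanoi task environment: three pegs plus the three
    task-specific instructions; an illegal move halts the current prefix."""

    class Halt(Exception):
        pass

    def __init__(self, n):
        # Pegs hold disk sizes, largest at the bottom; the slots
        # S, A, D initially refer to pegs 1, 2, 3.
        self.pegs = {1: list(range(n, 0, -1)), 2: [], 3: []}
        self.S, self.A, self.D = 1, 2, 3

    def mvdsk(self):
        src, dst = self.pegs[self.S], self.pegs[self.D]
        if not src or (dst and dst[-1] < src[-1]):
            raise self.Halt("illegal move")  # current program prefix halts
        dst.append(src.pop())

    def xSA(self):  # exchange the representations of S and A
        self.S, self.A = self.A, self.S

    def xAD(self):  # exchange the representations of A and D
        self.A, self.D = self.D, self.A

    def solved(self):
        # Success is easy to verify: the first two pegs are empty.
        return not self.pegs[1] and not self.pegs[2]
```

For instance, on a 1-disk instance a single mvdsk() solves the task, while moving a larger disk onto a smaller one raises Halt, mirroring how an illegal move terminates a tested prefix.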
Within reasonable time (a week) on an off-the-shelf personal computer (1.5 GHz), the system was not able to solve instances involving more than 3 disks. This gives us a welcome opportunity to demonstrate its incremental learning abilities: we first trained it on an additional, easier task, to teach it something about recursion, hoping that this would help to solve the Hanoi problem as well. For this purpose we used a seemingly unrelated symmetry problem based on the context-free language $\{1^n 2^n\}$: given input $n$ on the data stack ds, the goal is to place symbols on the auxiliary stack Ds such that its topmost $2n$ elements are $n$ 1's followed by $n$ 2's. We add two more instructions to the initial programming language: instruction 1toD() pushes 1 onto Ds; instruction 2toD() pushes 2. Now we have a total of five task-specific instructions (including those for Hanoi), with instruction numbers 1, 2, 3, 4, 5, for 1toD, 2toD, mvdsk, xSA, xAD, respectively.
So we first boost (Section 3) the ``instruction number pushers'' c1, c2 for the first training phase, where the $n$-th task is to solve all symmetry problem instances up to instance $n$. Then we undo the symmetry-specific boosts of c1, c2 and instead boost the Hanoi-specific ``instruction number pushers'' c3, c4, c5 for the subsequent training phase, where the $n$-th task (again for $n = 1, \ldots, 30$) is to solve all Hanoi instances up to instance $n$.
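Boosting amounts to a reversible bias shift: probability mass is temporarily concentrated on the favored instructions. One simple way to realize such a shift (a sketch under assumptions of our own; Section 3 specifies the actual mechanism) is to multiply the favored probabilities and renormalize:

```python
def boost(probs, favored, factor=10.0):
    """Multiply the probabilities of the favored instructions by
    `factor`, then renormalize so the distribution sums to 1."""
    boosted = {k: p * (factor if k in favored else 1.0)
               for k, p in probs.items()}
    total = sum(boosted.values())
    return {k: p / total for k, p in boosted.items()}

# Uniform prior over the 5 task-specific pushers; boost c1, c2
# for the symmetry phase:
prior = {c: 0.2 for c in ["c1", "c2", "c3", "c4", "c5"]}
symmetry_bias = boost(prior, {"c1", "c2"})
```

Undoing the boost simply means restoring the stored prior before boosting c3, c4, c5 for the Hanoi phase.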
Results. Within roughly 0.3 days, OOPS found and froze code solving the symmetry problem. Within 2 more days it also found a universal Hanoi solver, exploiting the benefits of incremental learning ignored by nonincremental HSEARCH and LSEARCH. It is instructive to study the sequence of intermediate solutions. In what follows we transform the integer sequences discovered by OOPS back into readable programs (to fully understand them, however, one needs to know all side effects, and which instruction has which number).
For the symmetry problem, within less than a second, OOPS found silly but working code for $n = 1$. Within less than 1 hour it had solved the 2nd, 3rd, 4th, and 5th instances, always simply prolonging the previous code without changing the start address. The code found so far was inelegant: (defnp 2toD grt c2 c2 endnp boostq delD delD bsf 2toD fromD delD delD delD fromD bsf by2 bsf by2 fromD delD delD fromD cpnb bsf). But it does solve all of the first 5 instances. Finally, after 0.3 days, OOPS had created and tested a new, elegant, recursive program (no prolongation of the previous one) with a new, increased start address, solving all instances up to 6: (defnp c1 calltp c2 endnp). That is, it was cheaper to solve all instances up to 6 by discovering this new program and applying it to all instances so far than to prolong the old code on instance 6 only. In fact, the program turns out to be a universal symmetry problem solver. On the stack, it constructs a 1-argument procedure that returns nothing if its input argument is 0, and otherwise calls the instruction 1toD whose code is 1, then calls itself with a decremented input argument, then calls 2toD whose code is 2, then returns. Using this program, within an additional 20 milliseconds, OOPS had also solved the remaining 24 symmetry tasks up to $n = 30$.
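The behavior of the discovered procedure can be mirrored in a few lines of Python (a hedged sketch: the list `Ds` stands in for the auxiliary stack, and the appends stand in for the instructions 1toD and 2toD):

```python
def solve_symmetry(n, Ds):
    """Mirror of the procedure built by (defnp c1 calltp c2 endnp):
    return if the argument is 0; otherwise call 1toD, recurse with
    a decremented argument, then call 2toD."""
    if n == 0:
        return
    Ds.append(1)               # 1toD: push a 1 onto Ds
    solve_symmetry(n - 1, Ds)  # call itself with decremented argument
    Ds.append(2)               # 2toD: push a 2 onto Ds

Ds = []
solve_symmetry(3, Ds)
print(Ds)  # [1, 1, 1, 2, 2, 2]
```

The recursion nests the 2's around the inner calls, so the pushed sequence is always $n$ 1's followed by $n$ 2's, for every $n$ — which is why the single frozen program handles all remaining instances.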
Then OOPS switched to the Hanoi problem. 1 ms later it had found trivial code for $n = 1$: (mvdsk). After a day or so it had found fresh yet bizarre code (with a new start address) for $n = 2$: (c4 c3 cpn c4 by2 c3 by2 exec). Finally, after 3 days it had found fresh code (again with a new start address) for $n = 3$: (c3 dec boostq defnp c4 calltp c3 c5 calltp endnp). This already is an optimal universal Hanoi solver. Therefore, within 1 additional day, OOPS was able to solve the remaining 27 tasks for $n$ up to 30, reusing the same program again and again. Recall that the optimal solution for instance $n$ takes $2^n - 1$ mvdsk operations, and that for each mvdsk several other instructions need to be executed as well!
The final Hanoi solution profits from the earlier recursive solution to the symmetry problem. How? The prefix (c3 dec boostq) (probability 0.003) temporarily rewrites the search procedure (this illustrates the benefits of metasearching!) by exploiting previous code: instruction c3 pushes 3; dec decrements this; boostq takes the result 2 as an argument and thus boosts the probabilities of all components of the 2nd frozen program, which happens to be the previously found universal symmetry solver. This leads to an online bias shift that greatly increases the probability that defnp, calltp, and endnp will appear in the suffix of the online-generated program. These instructions in turn are helpful for building (on the data stack ds) the doubly recursive procedure generated by the suffix (defnp c4 calltp c3 c5 calltp endnp), which essentially constructs a 4-argument procedure that returns nothing if its input argument is 0, and otherwise decrements the top input argument, calls the instruction xSA whose code is 4, then calls itself, then calls mvdsk whose code is 3 and xAD whose code is 5, then calls itself again, then returns (compare the standard Hanoi solution).
The total probability of the final solution, given the previous codes, is roughly 1000 times the probability of the essential Hanoi code (defnp c4 calltp c3 c5 calltp endnp) given nothing at all, which explains why the latter was not quickly found without the help of an easier task. So in this particular setup the incremental training on the simple recursion for the symmetry problem indeed provided useful bias for the more complex Hanoi recursion, speeding up the search by a factor of roughly 1000.
The entire 4-day search tested 93,994,568,009 prefixes, corresponding to 345,450,362,522 instructions costing 678,634,413,962 time steps (some instructions cost more than 1 step; in particular, those copying long strings or increasing the probabilities of more than one instruction). The search time of an optimal solver is a natural measure of initial bias. Clearly, most tested prefixes are short -- they either halt or get interrupted soon. Still, some programs do run for a long time; the longest measured runtime exceeded 30 billion steps. The stacks of recursive invocations of Try for storage management (Section 2.1), however, collectively never held more than 20,000 elements.
Different initial bias will yield different results. For example, we could set to zero the initial probabilities of most of the 73 initial instructions (most are unnecessary for our two problem classes) and then solve all tasks more quickly, at the expense of obtaining a non-universal initial programming language. The point of this experimental section, however, is not to find the most reasonable initial bias for particular problems, but to illustrate the general functionality of the first general near-bias-optimal incremental learner. In ongoing research we are equipping OOPS with neural network primitives and applying it to robotics. Since OOPS will scale to larger problems in essentially unbeatable fashion, the substantial hardware speed-up expected over the next 30 years appears promising.