Given are $n$ disks of $n$ different sizes, stacked in decreasing size on the first of three pegs. Moving some peg's top disk to the top of another (possibly empty) peg, one disk at a time, but never a larger disk onto a smaller, transfer all disks to the third peg. Remarkably, the fastest way of solving this famous problem requires $2^n - 1$ moves.
Untrained humans find it hard to solve instances with more than a few disks.
Anderson [1] applied traditional reinforcement learning methods and was able to solve instances up to $n = 3$, solvable within at most 7 moves.
Langley [5] used learning production systems and was able to solve Hanoi instances up to $n = 5$, solvable within at most 31 moves.
Traditional nonlearning planning procedures
systematically explore all possible move combinations.
They also fail to solve larger Hanoi problem instances, due to the exploding search space (Jana Koehler, IBM Research, personal communication, 2002).
OOPS, however, searches in program space instead of raw solution space. Therefore, in principle it should be able to solve arbitrary instances by discovering the problem's elegant recursive solution: given $n$ and three pegs $S, A, D$ (source peg, auxiliary peg, destination peg), define procedure HANOI$(S, A, D, n)$: if $n = 0$, exit; call HANOI$(S, D, A, n-1)$; move the top disk from $S$ to $D$; call HANOI$(A, S, D, n-1)$.
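The recursive procedure just described can be sketched in plain Python (a functional rendering of the standard algorithm, not the paper's FORTH-like instruction sequence; peg labels and the argument order are illustrative):

```python
# Standard recursive Tower of Hanoi solution.
def hanoi(n, src, aux, dst, moves):
    """Move n disks from src to dst using aux; record (from_peg, to_peg) pairs."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)  # move n-1 disks: source -> auxiliary
    moves.append((src, dst))            # move largest disk: source -> destination
    hanoi(n - 1, aux, src, dst, moves)  # move n-1 disks: auxiliary -> destination

moves = []
hanoi(3, 1, 2, 3, moves)  # emits 2^3 - 1 = 7 moves
```

For $n$ disks this emits exactly $2^n - 1$ moves, matching the optimality bound above.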
The $n$-th task is to solve all Hanoi instances up to instance $n$. We represent the dynamic environment for task $n$ on the $n$-th task tape, allocating for each peg enough addresses to store its current disk positions and a pointer to its top disk (0 if there isn't any).
We represent pegs $S, A, D$ by numbers 1, 2, 3, respectively. That is, given an instance of size $n$, we push onto ds the three peg symbols and the value $n$. By doing so we insert substantial, nontrivial
prior knowledge about problem size and the fact
that it is useful to represent each peg by a symbol.
We add three instructions to the 68 instructions of our
FORTH-like programming language:
mvdsk() assumes that $S, A, D$ are represented by the first three elements on ds above the current base pointer, and moves a disk from peg $S$ to peg $D$.
Instruction xSA() exchanges the representations of $S$ and $A$, and xAD() exchanges those of $A$ and $D$ (combinations may create arbitrary peg patterns).
Illegal moves cause the current program prefix to halt.
Overall success is easily verifiable since
our objective is achieved once the first two pegs are empty.
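A minimal Python model of this task environment may help fix ideas; the peg encoding, method names, and exception-based halting below are our illustrative assumptions, not the paper's exact tape layout:

```python
# Toy model of the Hanoi task environment and its three task-specific
# instructions (layout and names are assumptions for illustration).
class HanoiEnv:
    def __init__(self, n):
        self.pegs = {1: list(range(n, 0, -1)), 2: [], 3: []}  # disk n at bottom
        self.ds = [1, 2, 3]  # ds[0]=S, ds[1]=A, ds[2]=D above the base pointer

    def mvdsk(self):
        s, d = self.ds[0], self.ds[2]
        if not self.pegs[s] or (self.pegs[d] and self.pegs[d][-1] < self.pegs[s][-1]):
            raise RuntimeError("illegal move: halt current program prefix")
        self.pegs[d].append(self.pegs[s].pop())

    def xSA(self):  # exchange the representations of S and A
        self.ds[0], self.ds[1] = self.ds[1], self.ds[0]

    def xAD(self):  # exchange the representations of A and D
        self.ds[1], self.ds[2] = self.ds[2], self.ds[1]

    def solved(self):  # success once the first two pegs are empty
        return not self.pegs[1] and not self.pegs[2]
```

Here an illegal move raises an exception, standing in for the paper's convention that such a move halts the current program prefix.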
Within reasonable time (a week) on an off-the-shelf
personal computer (1.5 GHz) the system was not able to solve
instances involving more than 3 disks.
This gives us a welcome opportunity to demonstrate
its incremental learning abilities: we
first trained it on an additional, easier task, to
teach it something about recursion, hoping that this would help
to solve the Hanoi problem as well.
For this purpose we used a seemingly unrelated
symmetry problem based on the context-free language $\{1^n 2^n\}$: given input $n$ on the data stack ds, the goal is to place symbols on the auxiliary stack Ds such that the $2n$ topmost elements are $n$ 1's followed by $n$ 2's.
We add two more instructions to the initial programming language:
instruction 1toD() pushes 1 onto Ds, instruction 2toD() pushes 2.
Now we have a total of five task-specific instructions (including those
for Hanoi), with instruction numbers 1, 2, 3, 4, 5, for
1toD, 2toD, mvdsk, xSA, xAD, respectively.
So we first boost
(Section 3)
instructions c1, c2
for the first training phase, where the $n$-th task $(n = 1, \ldots, 30)$ is to solve all symmetry problem instances up to instance $n$.
Then we undo the symmetry-specific boosts of c1, c2 and instead boost the Hanoi-specific ``instruction number pushers'' c3, c4, c5 for the subsequent training phase, where the $n$-th task (again $n = 1, \ldots, 30$) is to solve all Hanoi instances up to instance $n$.
Results. Within roughly 0.3 days, OOPS found and froze code solving the symmetry problem. Within 2 more days it also found a universal Hanoi solver, exploiting the benefits of incremental learning ignored by nonincremental HSEARCH and LSEARCH. It is instructive to study the sequence of intermediate solutions. In what follows we will transform integer sequences discovered by OOPS back into readable programs (to fully understand them, however, one needs to know all side effects, and which instruction has got which number).
For the symmetry problem, within less than a second,
OOPS found silly but working code for the first instance $n = 1$. Within less than 1 hour it had solved the 2nd, 3rd, 4th, and 5th instances, always simply prolonging the previous code without changing the start address. The code found so far was inelegant:
(defnp 2toD grt c2 c2 endnp boostq delD delD
bsf 2toD fromD delD delD delD fromD
bsf by2 bsf by2 fromD delD delD fromD cpnb bsf).
But it does solve all of the first 5 instances.
Finally, after 0.3 days, OOPS had
created and tested a new, elegant, recursive
program (no prolongation of the previous one)
with a new, increased start address, solving all instances up to 6:
(defnp c1 calltp c2 endnp).
That is, it was cheaper to solve all instances up to 6 by discovering
and applying this new program to all instances so far, than just prolonging
old code on instance 6 only.
In fact, the program turns out to be a universal symmetry problem solver.
On the stack, it
constructs a 1-argument procedure that returns nothing
if its input argument is 0,
otherwise calls the instruction 1toD whose code is 1, then calls
itself with a decremented input argument, then calls 2toD whose code is 2,
then returns. Using this program,
within an additional 20 milliseconds,
OOPS had also solved the remaining 24 symmetry tasks up to $n = 30$.
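The recursive symmetry solver described above can be mirrored in Python (a functional stand-in for the stack-based procedure; modeling Ds as a list is our assumption):

```python
# Functional stand-in for the recursive symmetry solver
# (defnp c1 calltp c2 endnp); Ds is modeled as a list.
def sym(n, Ds):
    if n == 0:
        return        # return nothing if the input argument is 0
    Ds.append(1)      # 1toD: push 1 onto Ds
    sym(n - 1, Ds)    # call itself with a decremented argument
    Ds.append(2)      # 2toD: push 2 onto Ds

Ds = []
sym(3, Ds)  # Ds is now [1, 1, 1, 2, 2, 2]: n 1's followed by n 2's
```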
Then OOPS switched to the Hanoi problem.
1 ms later it had found trivial code for $n = 1$: (mvdsk). After a day or so it had found fresh yet bizarre code (with a new start address) for $n = 2$: (c4 c3 cpn c4 by2 c3 by2 exec).
Finally, after 3 days it had found fresh code (again with a new start address) for $n = 3$:
(c3 dec boostq defnp c4 calltp c3 c5 calltp endnp).
This already is an optimal universal Hanoi solver.
Therefore, within 1 additional day
OOPS was able to solve
the remaining 27 tasks for $n$ up to 30, reusing
the same program
again and again.
Recall that the optimal solution for $n = 30$ takes $2^{30} - 1$ (more than a billion) mvdsk operations, and that for each mvdsk several other instructions need to be executed as well!
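As a quick sanity check on the scale involved, the optimal move count $2^n - 1$ can be computed directly:

```python
# The optimal Tower of Hanoi solution needs 2^n - 1 disk moves.
def optimal_moves(n):
    return 2 ** n - 1

print(optimal_moves(30))  # 1073741823, i.e. over a billion mvdsk operations
```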
The final Hanoi solution profits from the earlier recursive solution to the symmetry problem. How? The prefix (c3 dec boostq) (probability 0.003) temporarily rewrites the search procedure (this illustrates the benefits of metasearching!) by exploiting previous code: instruction c3 pushes 3; dec decrements this; boostq takes the result 2 as an argument and thus boosts the probabilities of all components of the 2nd frozen program, which happens to be the previously found universal symmetry solver. This leads to an online bias shift that greatly increases the probability that defnp, calltp, and endnp will appear in the suffix of the online-generated program. These instructions in turn are helpful for building (on the data stack ds) the double-recursive procedure generated by the suffix (defnp c4 calltp c3 c5 calltp endnp), which essentially constructs a 4-argument procedure that returns nothing if its input argument is 0, and otherwise decrements the top input argument, calls the instruction xSA whose code is 4, then calls itself, then calls mvdsk whose code is 3, then calls xAD whose code is 5, then calls itself again, then returns (compare the standard Hanoi solution).
The total probability of the final solution, given the previously frozen codes, is roughly three orders of magnitude larger than the probability of the essential Hanoi code (defnp c4 calltp c3 c5 calltp endnp) given nothing, which explains why the latter was not quickly found without the help of an easier task.
So in this particular setup the incremental training due to
the simple recursion for the symmetry problem
indeed provided useful training for the more complex Hanoi recursion,
speeding up the search by a factor of roughly 1000.
The entire 4-day search tested 93,994,568,009 prefixes corresponding to 345,450,362,522 instructions, costing 678,634,413,962 time steps (some instructions cost more than 1 step, in particular those making copies of strings, whose cost grows with string length, or those increasing the probabilities of more than one instruction).
Search time of an optimal solver is a natural measure
of initial bias.
Clearly, most tested prefixes are short -- they either halt or get interrupted soon.
Still, some programs do run for a long time; the longest measured
runtime exceeded 30 billion steps.
The stacks
of recursive invocations of
Try for storage management (Section 2.1)
collectively never held more than 20,000 elements though.
Different initial bias will yield different results.
E.g., we could set to zero the initial probabilities of most of
the 73 initial instructions (most are unnecessary for our
two problem classes), and then solve all
tasks
more quickly
(at the expense of obtaining a non-universal
initial programming language).
The point of this experimental section, however, is not to find the most
reasonable initial bias for particular problems, but to illustrate
the general functionality of the first general near-bias-optimal
incremental learner.
In ongoing research we are equipping OOPS with
neural network primitives and are applying it to robotics.
Since OOPS will scale to larger problems
in essentially unbeatable fashion,
the large hardware speed-up expected over the next 30 years appears promising.