The final 10-token Hanoi solution demonstrates the benefits of
incremental learning: it greatly profits from
rewriting the search procedure with the help of information
conveyed by the earlier recursive solution to the
$1^n2^n$-problem. How?
The prefix (c3 dec boostq) (probability 0.003)
prepares the foundations:
instruction c3 pushes 3; dec
decrements this; boostq takes the result 2 as an argument
(interpreted as an address)
and thus boosts the probabilities of all components of the 2nd
frozen program, which happens to be the previously found universal
$1^n2^n$-solver. This causes an online bias shift
on the space of possible suffixes:
it greatly increases the probability that
defnp, calltp, and endnp will appear
in the remainder of the online-generated program.
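The mechanics of this bias shift can be sketched in a few lines of Python. This is a toy model, not OOPS's actual implementation: the function name boostq_toy, the flat dictionary of instruction probabilities, and the boost factor are all illustrative assumptions.

```python
def boostq_toy(frozen, probs, addr, factor=10.0):
    # Toy model of boostq: multiply the probabilities of all instructions
    # occurring in the addr-th frozen program by `factor`, then renormalize
    # so the distribution over instructions sums to 1 again.
    for token in set(frozen[addr - 1]):   # addr read as a 1-based address
        probs[token] *= factor
    total = sum(probs.values())
    for token in probs:
        probs[token] /= total

# Boosting the 2nd frozen program (here standing in for the 1^n 2^n
# solver) raises the chance that its instructions appear in suffixes.
probs = {name: 1.0 / 6 for name in
         ("defnp", "calltp", "endnp", "c1", "c2", "mvdsk")}
frozen = [("c3", "dec"), ("defnp", "c1", "calltp", "c2", "endnp")]
boostq_toy(frozen, probs, 2)
```

After the call, the components of the boosted program (defnp, c1, calltp, c2, endnp) are far more probable than the unboosted mvdsk, while the distribution still sums to one.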
These instructions in turn are helpful for building
(on the data stack ds)
the double-recursive procedure
generated by the suffix
(defnp c4 calltp c3 c5 calltp endnp), which essentially
constructs a 4-argument procedure that returns nothing if its input argument is 0,
otherwise decrements the top input argument, calls the instruction
xAD whose code is 4,
then calls itself on a copy of the top 4 arguments,
then calls mvdsk whose code is 5,
then calls xSA whose code is 3,
then calls itself on another copy of the top 4 arguments,
then makes yet another (unnecessary) argument copy, then returns
(compare the standard Hanoi solution).
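In conventional notation, the procedure built on ds corresponds to the standard double-recursive Hanoi solution, sketched below in Python. The peg exchanges performed by xAD and xSA are rendered here as argument reordering, and the explicit moves list stands in for mvdsk's effect; this is an assumed rendering of the structure described above, not OOPS's stack-based code.

```python
def hanoi(s, a, d, n, moves):
    # Standard double-recursive Hanoi, mirroring the structure above:
    # test for zero, decrement, first self-call with auxiliary and
    # destination pegs exchanged (cf. xAD), one disk move (cf. mvdsk),
    # second self-call with source and auxiliary exchanged (cf. xSA).
    if n == 0:
        return
    hanoi(s, d, a, n - 1, moves)  # move n-1 disks from s to a
    moves.append((s, d))          # move the largest remaining disk s -> d
    hanoi(a, s, d, n - 1, moves)  # move n-1 disks from a to d

moves = []
hanoi("S", "A", "D", 3, moves)
print(moves)  # 2**3 - 1 == 7 moves; first and last are both ("S", "D")
```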
The total probability of the final solution, given the previous
codes, is calculated as the product of the conditional probabilities
of its tokens: given the boosts of
c3, c4, c5, by2, dec, boostq,
the prefix (c3 dec boostq) obtains probability roughly 0.003;
since this prefix further boosts defnp, c1, calltp, c2, endnp,
the suffix (defnp c4 calltp c3 c5 calltp endnp) obtains a
correspondingly increased conditional probability; the product of the
two yields the probability of the complete 10-symbol code.
On the other hand, the probability of the essential Hanoi-specific suffix
(defnp c4 calltp c3 c5 calltp endnp), given just the initial boosts,
is roughly three orders of magnitude smaller,
which explains why it was not
quickly found without the help of the solution to an easier problem set.
(Without any initial boosts its probability would actually have
been similar.)
This would correspond to a search time of several years,
as opposed to a few days.
So in this particular setup the simple recursion for the
$1^n2^n$-problem
indeed provided useful incremental training for the more complex Hanoi recursion,
speeding up the search by a factor of 1000 or so.
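Since OOPS allocates search time to a candidate program in proportion to its probability, a probability ratio translates directly into a search-time ratio. The numbers below are purely illustrative stand-ins (the paper's exact values are not reproduced here); only the three-orders-of-magnitude ratio matters.

```python
# Bias-optimal search spends time on a program in proportion to its
# probability, so expected search time scales like 1/p.
p_boosted = 3e-5     # hypothetical suffix probability after the prefix's boosts
p_unboosted = 3e-8   # hypothetical suffix probability given only initial boosts
speedup = p_boosted / p_unboosted
print(speedup)       # a factor of 1000: years of search shrink to days
```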
On the other hand, the search for the universal solver for all
$1^n2^n$-problems
did not profit at all from solutions to earlier solved tasks,
although some individual instances did.
Note that the time spent by the final 10-token Hanoi solver on increasing the probabilities of certain instructions and on constructing executable code on the data stack (less than 50 time steps) quickly becomes negligible as the Hanoi instance size grows. In this particular application, most time is spent on executing the code, not on constructing it.
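The imbalance between construction cost and execution cost is easy to quantify: an optimal solution for n disks requires 2^n - 1 mvdsk operations, so the fewer-than-50 construction steps quoted above vanish as n grows. A quick check:

```python
construction_steps = 50      # generous bound on code-construction time (see text)
for n in (10, 20, 30):
    moves = 2 ** n - 1       # minimal number of mvdsk operations for n disks
    print(n, moves, moves // construction_steps)
```

At n = 30 the move count already exceeds a billion, dwarfing construction by seven orders of magnitude.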
Once the universal Hanoi solver was discovered, why
did the solution of the remaining Hanoi tasks
substantially increase the total time (by roughly 25 %)?
Because the sheer runtime of the discovered, frozen,
near-optimal program on the remaining tasks was already comparable to
the previously consumed search time for this program, due to the very
nature of the Hanoi task:
Recall that a solution for $n=30$
takes more than a billion mvdsk
operations, and that for each mvdsk several other instructions need to be
executed as well. Note that experiments with traditional reinforcement
learners [24] rarely involve problems whose
solution sizes exceed a few thousand steps.
Note also that we could continue to solve Hanoi tasks with
substantially more disks.
The execution time required to solve such instances with an optimal solver
greatly exceeds the search time required for finding the solver itself.
In that regime it does not matter much whether OOPS
already starts with a prewired Hanoi solver or
first has to discover one, since the initial search time for the
solver becomes negligible anyway.
Of course, different initial bias can yield dramatically different results.
For example, using hindsight we could set to zero the probabilities
of all 73 initial instructions (most are unnecessary for the 30
Hanoi tasks) except for the 7 instructions used
by the Hanoi-solving suffix, then make those 7 instructions equally likely,
and thus obtain a comparatively high Hanoi solver
probability of
$(1/7)^7 \approx 1.2 \times 10^{-6}$.
This would allow for finding the solution to the 10-disk Hanoi problem
within less than an hour, without having to learn easier tasks first
(at the expense of obtaining a nonuniversal
initial programming language).
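The hindsight probability is simple arithmetic, assuming (for illustration) that exactly the 7-token suffix must be generated and that each token is drawn uniformly from the allowed instruction set; the uniform-over-73 figure is shown only for contrast:

```python
p_hindsight = (1 / 7) ** 7   # 7 equally likely instructions, 7-token suffix
p_uniform = (1 / 73) ** 7    # uniform over all 73 initial instructions
print(p_hindsight)           # about 1.2e-06
print(p_uniform)             # about 9e-14
```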
The point of this experimental section, however, is not
to find the most
reasonable initial bias for particular problems, but to illustrate
the basic functionality of the first general, near-bias-optimal,
incremental learner.