Self-generated reduction of the number of probability modifications.
In the beginning, the system computed many probability modifications,
but later it preferred to decrease the number of probability
modifications per time interval. Early in the run, there were about
350,000 probability modifications per fixed interval of time steps;
later, about 40,000; towards system death, only about 20,000.
Most of the useful SSMs computed either
one or two probability modifications.
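To make this unit concrete: a single probability modification
replaces one program cell's discrete distribution by a slightly
sharpened one, keeping a backup so that the change can later be
undone. A minimal Python sketch under assumed representations (the
function name, the list encoding, and the factor gamma are
illustrative assumptions, not the actual implementation):

\begin{verbatim}
def modify_cell(dist, j, gamma=1.5):
    # One probability modification: multiply entry j of a cell's
    # distribution by gamma and renormalize. The backup of the old
    # distribution is what a stack entry would record, so that the
    # modification can later be undone if not "useful".
    backup = list(dist)
    new = [p * (gamma if i == j else 1.0) for i, p in enumerate(dist)]
    total = sum(new)
    return [p / total for p in new], backup

new_dist, backup = modify_cell([0.25, 0.25, 0.25, 0.25], j=2)
# new_dist == [0.2222..., 0.2222..., 0.3333..., 0.2222...]
\end{verbatim}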
Speed-up of payoff intake. After an initial phase of the run, the
system already behaved much more deterministically than in the
beginning. Average payoff per payoff event had increased from 1.4 to
15.8 (the optimal value being 30.0), and the stack had 70 entries.
These entries corresponded to 66 modifications of single cell
probability distributions, computed by 45 SSMs, each more
``useful'' than all the previous ones.
Storage already looked very messy. For instance, almost all cells in
the work area were filled with integers (some of them large) quite
different from their initial values.
Recall that the storage is never re-initialized and has to be
viewed as part of the policy environment.
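The notion of each SSM being more ``useful'' than all the previous
ones is the success-story criterion: at every checkpoint, a stack
entry survives only if the payoff per time step measured since its
modification exceeds the payoff per time step since the previous
surviving modification (and, for the oldest entry, the rate since
system start). A minimal sketch under assumed bookkeeping; the entry
layout and the restore callback are illustrative, not the actual
implementation:

\begin{verbatim}
def ssa_checkpoint(stack, t, R, restore):
    # Pop entries until the success-story criterion holds. Each entry
    # is (t_i, R_i, backup): the time of the modification, the
    # cumulative payoff at that time, and the data needed to undo it.
    # Assumes 0 < t_i < t for every entry; the oldest entry sits at
    # stack[0], the most recent one at stack[-1].
    def criterion_holds(entries):
        rates = [R / t] + [(R - R_i) / (t - t_i)
                           for (t_i, R_i, _) in entries]
        return all(a < b for a, b in zip(rates, rates[1:]))

    while stack and not criterion_holds(stack):
        _, _, backup = stack.pop()  # most recent modification first
        restore(backup)             # undo its probability modification
    return stack
\end{verbatim}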
First maximal payoff. After 1,436,383 payoff events, the system had
correctly written all 30 variables for the first time and received
the maximal payoff 30.0. Due to remaining non-determinism in the
system, the current average payoff per payoff event (measured shortly
afterwards, at time step 1,500,000,000) was about 21.7.
After 3,000,000 payoff events, the current average payoff per payoff
event was 25.6, but the stack had only 206 entries (corresponding to
174 ``useful'' SSMs). After 5,000,000 payoff events (at ``system
death''), the current average was about 26.0, with an ongoing
tendency to increase. By then, there were 224 stack entries,
corresponding to 192 SSMs, each more ``useful'' than all the previous
ones.
Temporary speed-ups of performance improvement. Performance did not
increase smoothly during the lifetime of the system. Sometimes no
significant improvement took place for a time interval comparable to
the entire learning time so far. Such ``boring'' intervals were
sometimes ended by unexpected sequences of rather quick improvements;
then progress slowed down again. Such temporary speed-ups of
performance improvement indicate useful shifts of inductive bias,
which may later be replaced by the inductive bias created by the next
``breakthrough''.
Evidence of ``learning how to learn''?
A look at the stack entries revealed that many (but far from all)
useful probability modifications focused on a few program cells.
Often, SSMs that directly changed the probabilities of future SSMs
were considered useful. For instance, 9 of the 224 stack entries
present at system death corresponded to ``useful'' probability
modifications of the (self-referential) action of the second program
cell.
Numerous entries corresponded to
``useful'' modifications of the EndSelfMod probability
of various cells.
Such stack entries may be interpreted as results
of ``adjusting the prior on the space of solution candidates''
or ``fine-tuning search space structure'' or ``learning
to create directed mutations'' or ``learning how to learn''.
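This self-referential case can be made concrete with a sketch: each
cell holds a distribution over a tiny instruction set, and the
self-modification primitive may target any cell's probabilities,
including those governing self-modification and EndSelfMod
themselves. The instruction names, the uniform choice of target, and
the factor 1.5 are illustrative assumptions, not the actual
instruction set:

\begin{verbatim}
import random

ACTIONS = ("Work", "SelfMod", "EndSelfMod")  # hypothetical names

def run_ssm(policy, rng=random):
    # One self-modification sequence (SSM): execute cells until
    # EndSelfMod is drawn. Each SelfMod rescales one entry of some
    # cell's distribution -- possibly the probability of SelfMod or
    # EndSelfMod itself, i.e. an SSM changing the probabilities of
    # future SSMs. The backups would become stack entries.
    backups, cell = [], 0
    while True:
        action = rng.choices(ACTIONS, weights=policy[cell])[0]
        if action == "EndSelfMod":
            return backups          # SSM ends; a checkpoint may follow
        if action == "SelfMod":
            target = rng.randrange(len(policy))
            slot = rng.randrange(len(ACTIONS))
            backups.append((target, list(policy[target])))
            policy[target][slot] *= 1.5
            s = sum(policy[target])
            policy[target] = [p / s for p in policy[target]]
        cell = (cell + 1) % len(policy)

policy = [[0.90, 0.05, 0.05] for _ in range(4)]
stack_entries = run_ssm(policy)
\end{verbatim}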