Details.
In addition to the 17 general primitives from Table 1 (not counting
input/output primitives),
there are two problem-specific primitives.
Each has two integer arguments.
The first primitive writes the contents of the storage cell
indirectly addressed by its first argument
into the variable indirectly addressed by its second argument.
The second primitive writes the contents of the variable
indirectly addressed by its first argument
into the storage cell indirectly addressed by its second argument.
See Table 2.
Between two payoff events, each variable may be written only once
(additional write operations have no effect).
Write and read operations outside the valid ranges
halt the current run.
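To make the indirect addressing, write-once, and range-checking rules above concrete, here is a minimal Python sketch. The function names, the number of storage cells, and the exact indirect-addressing convention are illustrative assumptions rather than the actual settings of Table 2.
\begin{verbatim}
# Illustrative sketch, not the original implementation.
N_STORAGE = 30   # assumed number of storage cells
N_VARS    = 30   # number of external variables (see point (3) below)

storage   = [0] * N_STORAGE      # "external notebook", never reset
variables = [0] * N_VARS         # reset around payoff events
written   = [False] * N_VARS     # write-once flags, cleared after each payoff

class Halt(Exception):
    """An access outside the valid ranges halts the current run."""

def deref(arg, size):
    # Indirect addressing (assumed convention): the argument names a
    # storage cell whose content is the actual address.
    if not 0 <= arg < N_STORAGE:
        raise Halt()
    addr = storage[arg]
    if not 0 <= addr < size:
        raise Halt()
    return addr

def storage_to_variable(arg1, arg2):
    # First primitive: copy the storage cell indirectly addressed by arg1
    # into the variable indirectly addressed by arg2.
    src = deref(arg1, N_STORAGE)
    dst = deref(arg2, N_VARS)
    if not written[dst]:         # later writes between payoffs have no effect
        variables[dst] = storage[src]
        written[dst] = True

def variable_to_storage(arg1, arg2):
    # Second primitive: copy the variable indirectly addressed by arg1
    # into the storage cell indirectly addressed by arg2.
    src = deref(arg1, N_VARS)
    dst = deref(arg2, N_STORAGE)
    storage[dst] = variables[src]
\end{verbatim}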
Initially, all possible contents of all program cells are
equally probable: the probability distribution of each
program cell is uniform.
Parameters for storage size etc. are: , , , , , .
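For illustration, a uniform initialization of the program cell distributions might look as follows; both sizes below are hypothetical placeholders, not the parameter values actually used.
\begin{verbatim}
# Illustrative sketch; both sizes are placeholders, not the actual settings.
n_program_cells = 50     # hypothetical number of program cells
n_values        = 20     # hypothetical number of possible contents per cell

# One distribution per program cell: every possible content starts out
# with the same probability 1 / n_values.
program_distributions = [[1.0 / n_values] * n_values
                         for _ in range(n_program_cells)]
\end{verbatim}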
To inform the system about what is going on,
the following values are written into special input cells
whenever they change:
IP, sp, and the remainder of an integer division
of the current time t by a fixed constant.
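A small sketch of how the special input cells could be kept up to date; the divisor TIME_MOD stands in for the unspecified constant of the integer division.
\begin{verbatim}
# Illustrative sketch; TIME_MOD is an assumed placeholder constant.
TIME_MOD = 10

input_cells = {"IP": None, "sp": None, "t_mod": None}

def update_input_cells(ip, sp, t):
    # Write IP, sp, and the remainder of t / TIME_MOD into the special
    # input cells, but only when a value actually changes.
    for key, value in (("IP", ip), ("sp", sp), ("t_mod", t % TIME_MOD)):
        if input_cells[key] != value:
            input_cells[key] = value
\end{verbatim}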
Measuring time.
By definition,
each computation that requires the consideration
of all
probabilities of some program cell
(such as selecting an instruction,
selecting a parameter, pushing or popping probability
distributions, etc.)
costs one time step.
Other computations do not cost anything. This ensures that
measured time is of the order of total CPU time.
This somewhat inelegant way of measuring time was
introduced because measuring CPU time directly on
our multi-user machine turned
out to be unreliable.
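The accounting rule can be sketched as follows: a global counter advances by one whenever an operation consults the full probability distribution of a program cell, and nothing else is charged. Function names and the list-based distribution representation are illustrative.
\begin{verbatim}
import random

measured_time = 0   # one unit per operation touching a full distribution

def select_from_cell(distribution):
    # Sample an instruction or parameter from one program cell's
    # distribution; consulting the full distribution costs one time step.
    global measured_time
    measured_time += 1
    indices = list(range(len(distribution)))
    return random.choices(indices, weights=distribution, k=1)[0]

def push_distribution(distribution, stack):
    # Save a copy of a cell's distribution (e.g., so that a later
    # self-modification can be undone); this also costs one time step.
    global measured_time
    measured_time += 1
    stack.append(list(distribution))

def pop_distribution(stack):
    # Restore the most recently saved distribution; again one time step.
    global measured_time
    measured_time += 1
    return stack.pop()
\end{verbatim}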
How difficult is this task?
For a number of reasons, the task is non-trivial -- the system does
not appear to have much built-in bias towards the task:
(1) Only one of the 19
primitives (the first of the two problem-specific primitives above)
may affect variable contents at all.
But initially, the system does not even have such
seemingly trivial knowledge -- there is no built-in
idea about which actions may lead to payoff.
Therefore, it has to find out on its own.
(2) The values referred to by the two
arguments of that primitive have to be identical and within the
required ranges to lead to a useful result.
(3) There are 30 different variables
with 30 different values. Only one of them
is correctly re-initialized with its own address
after each payoff event.
(4) The most significant difficulty, however,
is the continually changing policy environment.
Changing policy environment. The fact that the external variables are occasionally reset does not mean that the policy's environment stays fixed. In fact, many actions executed by the system change the contents of storage cells, which are never reset: as mentioned above, storage cells are like an external notebook and have to be viewed as part of the environment -- typically, there are no exactly repeatable trials with identical initial conditions. And the storage cells are not just a negligible nuisance -- they are essential for computing parameters for (e.g., self-modifying) instructions. To achieve significant performance improvement, the environmental conditions have to be changed by actions of the system itself. This is an aspect that no traditional reinforcement learning system, not even naive exhaustive search, can deal with in a sound way.
Comparison. Performance was measured with and
without self-modification capabilities.
In the latter case, the self-modifying primitives
had no effect.
Both versions were run for the same number of
time steps, corresponding to
the same number of payoff events.
Note that the optimal cumulative payoff
can be achieved only by a system
with ``optimal'' prior bias -- starting at birth,
such a system keeps executing optimal actions without
having to learn anything.