The explorer's life in environment $E$ lasts from time 0 (birth) to unknown time $T$ (death).
Two matrices $Left$ and $Right$ of modifiable real values represent the learner's modules.
$Left$'s $i$-th variable, vector-valued column is denoted $Left_i$; its $j$-th real-valued component $Left_{ij}$; similarly for $Right$.
A variable InstructionPointer (IP) with range $\{1, \ldots, n\}$ (the column indices) always points to one of the module pair's columns.
$S$ is the learner's variable internal state; its $i$-th component is denoted $S_i$.
Throughout its life the learner repeats the following basic cycle over and over:
select and execute instruction $b_j \in B$ with probability $P(b_j \mid IP, Left, Right)$ (here $B$ denotes the set of possible instructions),
where the conditional probability of each instruction is jointly determined by the entries of both modules' columns addressed by the current IP.
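To make the basic cycle concrete, here is a minimal Python sketch (not from the paper). The names `Left`, `Right`, `S`, and `IP` mirror the notation above; the sizes, the softmax normalization, and the product rule for combining the two modules' preferences are illustrative assumptions, chosen only to show the idea that an instruction becomes likely only if both modules favour it.

```python
import numpy as np

rng = np.random.default_rng(0)

n_instructions = 8      # |B|, size of the instruction set (assumed)
n_columns = 16          # number of module columns addressable by IP (assumed)

# Two matrices of modifiable real values, one column per possible IP value.
Left = rng.normal(size=(n_instructions, n_columns))
Right = rng.normal(size=(n_instructions, n_columns))

S = np.zeros(10)        # learner's variable internal state
IP = 0                  # InstructionPointer, indexes one column of each module

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def instruction_distribution(ip):
    """Combine both modules' column entries into one distribution over B.

    Assumption: normalized product of per-module softmax distributions,
    so instructions get high probability only when BOTH modules agree."""
    p = softmax(Left[:, ip]) * softmax(Right[:, ip])
    return p / p.sum()

def execute(j):
    """Placeholder: real instructions would modify S, IP, or the modules;
    here we merely advance IP so that the loop below is runnable."""
    global IP
    IP = (IP + 1) % n_columns

def basic_cycle_step():
    """One iteration of the lifelong cycle: select and execute an instruction."""
    j = rng.choice(n_instructions, p=instruction_distribution(IP))
    execute(j)

for _ in range(5):      # a few steps of the lifelong loop
    basic_cycle_step()
```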
Reward. Occasionally the environment $E$ may distribute real-valued external reward equally between both modules.
$R(t)$ denotes each module's cumulative external reward obtained between time 0 and time $t$, where $0 \leq t \leq T$.
In addition there may be occasional surprise rewards triggered by the special instruction Bet!(x,y,c,d).
Suppose the two modules have agreed on executing a Bet! instruction,
given a certain IP value. The probabilities of its four
arguments $x$, $y$, $c$, and $d$ depend on both modules: the arguments are selected
according to the four probability distributions
$P(x \mid \cdot)$, $P(y \mid \cdot)$, $P(c \mid \cdot)$, $P(d \mid \cdot)$ (details in the appendix).
If $c = d$ then no module will be rewarded.
$c > d$ means that $Left$ bets on $S_x = S_y$,
while $Right$ bets on $S_x \neq S_y$.
$c < d$ means that $Left$ bets on $S_x \neq S_y$,
while $Right$ bets on $S_x = S_y$.
Execution of the instruction will result in an immediate test:
which module is wrong, which is right?
The former will be punished (surprise reward $-1$),
the other one will be rewarded (surprise reward $+1$).
Note that in this particular implementation the modules always
bet a fixed amount of 1 -- alternative implementations, however,
may compute real-valued bids via appropriate instruction sequences.
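The following is a minimal sketch of how the outcome of a Bet!(x,y,c,d) could be settled under the rules just described. The encoding of $c$ and $d$ as binary predictions (1 meaning ``bets that $S_x = S_y$'', with $c$ standing for $Left$'s bet and $d$ for $Right$'s) and the function name are illustrative assumptions; the fixed stake of 1 follows the text.

```python
def settle_bet(S, x, y, c, d):
    """Return (surprise reward for Left, surprise reward for Right).

    Assumed encoding: c = Left's prediction, d = Right's prediction,
    where 1 means "S[x] equals S[y]" and 0 means "S[x] differs from S[y]".
    If c == d there is no disagreement and no module is rewarded."""
    if c == d:
        return 0, 0
    outcome = 1 if S[x] == S[y] else 0   # the immediate test
    if c == outcome:
        return +1, -1                    # Left was right, Right is punished
    else:
        return -1, +1                    # Right was right, Left is punished

# Example: Left bets the two state components are equal, Right bets they differ.
print(settle_bet([3.0, 3.0, 7.0], 0, 1, c=1, d=0))   # -> (1, -1)
```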
Interpretation of surprise rewards.
Instructions are likely to be executed only if both modules
collectively assign them high conditional probabilities,
given $Left_{IP}$ and $Right_{IP}$.
In this sense both must ``agree''
on each executed instruction of the lifelong computation process.
In particular, both collectively set the arguments $x, y, c, d$
in case they decide to execute a Bet! instruction. By setting $c \neq d$
they express that their predictions of the Bet! outcome differ.
Hence $Left$ will be rewarded for luring $Right$ into agreeing upon
instruction subsequences (algorithms) that include Bet! instructions
demonstrating that certain calculations yield results different from
what $Right$ expects, and vice versa.
Thus each module is motivated to discover algorithms whose
outcomes surprise the other;
but each also may reduce the probability of
algorithms it does not expect to surprise the other.
The sheer existence of the Bet! instruction motivates each module to act not only to receive external reward but also to obtain surprise reward, by discovering algorithmic regularities the other module still finds surprising -- a type of curiosity.
Trivial and random results.
Why not provide reward if $S_x = S_y$ and $c = d = 1$ (meaning both modules
rightly believe in $S_x = S_y$)?
Because then both would soon focus on this particular way of
making a correct and rewarding prediction, and do nothing else.
Why not provide punishment if $S_x = S_y$ and $c = d = 0$
(meaning that both modules are wrong)?
Because then both modules would soon be discouraged from
making any prediction at all.
In case $c = d = 1$ the truth of $S_x = S_y$ is considered a
well-known, ``trivial'' fact whose confirmation does not deserve reward.
In case $c = d = 0$ the truth of $S_x = S_y$ is
considered a subjectively ``random,'' irregular result.
Surprise rewards can occur only when the two modules' opinions differ.
They reflect one module's disappointed confident expectation,
and the other's justified one,
where, by definition, ``confidence'' translates into ``agreement on the
surprising instruction sequence'' --
no surprise without such confidence.
Examples of learnable regularities.
A Turing-equivalent instruction set
(one that permits construction of a
universal Turing machine) allows
exploitation of arbitrary computable regularities
[11,2,39,14]
to trigger or avoid surprises.
For instance, in partially predictable environments the following
types of regularities may help to reliably generate matches of
computational outcomes. (1) Observation/prediction: selected inputs
computed via the environment (observations obtained through ``active
perception'') may match the outcomes of earlier internal computations
(predictions). (2) Observation/explanation: memorized inputs
computed via the environment may match the outcomes of later internal
computations (explanations). (3) Planning/acting: outcomes
of internal computations (planning processes) may match desirable
inputs later computed via the environment (with the help of
environment-changing instructions). (4) ``Internal'' regularities:
the following computations yield matching results --
subtracting 2 from 14, adding 3, 4, and 5, multiplying 2 by 6.
Or: apparently, the computation of the truth value of
``$n$ is the sum of two primes'' yields the same result
for each even integer $n > 2$
(Goldbach's conjecture; a concrete check appears after this paragraph).
(5) Mixtures of (1-4).
For instance, raw inputs may be too noisy to be precisely predictable.
Still, there may be internally computable, predictable, informative input
transformations [33]. For example, hearing the first
two words of the sentence ``John eats chips'' does not make the word
``chips'' predictable, but at least it is likely that the third word will
represent something edible. Examples (1-5) mainly differ in the degree
to which the environment is involved in the computation processes, and
in the temporal order of the computations.
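To make the ``internal regularity'' of example (4) concrete, here is a small illustrative check (not from the paper, and only over a finite range) that the truth value of ``$n$ is the sum of two primes'' comes out the same for every even $n$ tested; a module that has discovered such a regularity can reliably win bets on the outcome of the corresponding computations.

```python
def is_prime(k):
    """Naive primality test, sufficient for this small illustration."""
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

def is_sum_of_two_primes(n):
    """Truth value of 'n is the sum of two primes'."""
    return any(is_prime(a) and is_prime(n - a) for a in range(2, n // 2 + 1))

# The result is the same (True) for every even n in this range, an
# algorithmic regularity whose repeated confirmation a module could bet on.
assert all(is_sum_of_two_primes(n) for n in range(4, 101, 2))
```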
Curiosity's utility? The two-module system is supposed to solve self-generated tasks in an unsupervised manner. But can the knowledge collected in this way help to solve externally posed tasks? Intuition suggests that the more one knows about the world the easier it will be to maximize external reward. In fact, later I will present an example where a curious system indeed outperforms a non-curious one. This does not reflect a universal law though: in general there is no guarantee that curiosity will not turn out to be harmful (for example, by ``killing the cat'' [40]).
Relative reward weights?
Let $R_{Left}(t)$ and $R_{Right}(t)$ denote $Left$'s and $Right$'s
respective total cumulative rewards obtained
between time 0 and time $t$.
Since every settled Bet! pays $+1$ to one module and $-1$ to the other,
the sum of both modules' surprise rewards always remains zero:
we have $R_{Left}(t) + R_{Right}(t) = 2R(t)$ for all $t$.
If we adopt the traditional hope that exploration will help to accelerate external reward, then zero-sum surprise rewards leave less reason to worry about the relative weighting of surprise rewards versus other rewards than did the surprise rewards of previous approaches, which did not add up to zero.
Enforcing fairness. To avoid situations where one module consistently outperforms the other, the instruction set includes a special LI that copies the currently superior module onto the other (see the appendix for details). This LI (with never-vanishing probability) will occasionally bring both modules on a par with each other.
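The following is a minimal sketch of what executing the copying LI might amount to, assuming ``currently superior'' is measured by total cumulative reward so far and that ties favour $Left$; the function name, the numpy representation of the module matrices, and these details are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def execute_copy_li(Left, Right, total_reward_left, total_reward_right):
    """Copy the currently superior module's matrix onto the other one,
    bringing both modules on a par (a sketch of the special LI above)."""
    if total_reward_left >= total_reward_right:
        Right[...] = Left        # overwrite Right's values in place
    else:
        Left[...] = Right        # overwrite Left's values in place

# Example: Right has collected more reward, so Left becomes a copy of Right.
L = np.zeros((3, 4)); R = np.ones((3, 4))
execute_copy_li(L, R, total_reward_left=2.0, total_reward_right=5.0)
assert (L == R).all()
```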
In principle, each module could learn to outsmart the other by executing
subsequences of instructions that include LIs. But how can we ensure that
each module indeed improves? Note that arithmetic actions affecting
$S$ and jump instructions affecting IP cause a highly non-Markovian
setting and prevent traditional RL algorithms based on dynamic
programming from being applicable. For such reasons I use Incremental
Self-improvement (IS) [35,34] to
deal with both modules' complex spatio-temporal credit assignment problem.