The basic principle can be formulated as follows: Learn a mapping from actions (or action sequences) to the expectation of future performance improvement of the world model. Encourage action sequences where this expectation is high.
One way to do this is the following (section 4 will describe alternatives): Model the reliability of the predictions of the adaptive predictor as described in section 2. At time $t$, spend reinforcement for the model-building control system in proportion to the current change of reliability of the adaptive predictor. The `curiosity goal' of the control system (it might have additional `pre-wired' goals) is to maximize the expectation of the cumulative sum of future positive or negative changes in prediction reliability.
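Before the formal statement below, the following minimal Python sketch illustrates the procedure just described. The objects `predictor`, `confidence`, and `controller`, as well as their method names, are hypothetical placeholders standing for the world model, the reliability model of section 2, and an arbitrary reinforcement learner; they are assumptions for illustration, not part of the original formulation.

```python
def curiosity_step(predictor, confidence, controller, obs, action, next_obs):
    """One time step of the curiosity scheme sketched above (hypothetical interfaces)."""
    # Reliability estimate for the current input before this step's updates.
    reliability_before = confidence.estimate(obs, action)

    # Adapt the world model on the newly observed transition, then adapt the
    # confidence module so that it reflects the predictor's new reliability.
    predictor.update(obs, action, next_obs)
    confidence.update(obs, action, predictor.prediction_error(obs, action, next_obs))

    # Reliability estimate for the same input after the updates.
    reliability_after = confidence.estimate(obs, action)

    # Reinforcement is proportional to the current change of reliability.
    curiosity_reward = reliability_after - reliability_before

    # The controller is reinforced with this signal (plus any pre-wired reward).
    controller.reinforce(obs, action, curiosity_reward)
    return curiosity_reward
```

The controller thus receives positive reinforcement exactly when the most recent model update increased the estimated reliability of the predictor for the current input.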
More formally: The control system's curiosity goal at time $t$ is to maximize
\[
E\left[ \sum_{\tau > t} \gamma^{\tau - t} D(\tau) \right],
\]
where $E$ denotes the expectation operator, $0 < \gamma < 1$ is a discount factor, and $D(\tau)$ is the change in the predictor's reliability at time $\tau$.
For instance, if method 1 or method 3 from section 2 is employed, then $D(t) = C(t) - C(t-1)$, where $C(t)$ is $C$'s response to the current input after having adjusted $C$ at time $t$.
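As a small numerical illustration of this goal, the sketch below computes the sample discounted sum of future reliability changes along a single recorded trajectory; the discount factor and the example values of $D$ are invented for illustration only.

```python
def discounted_curiosity_return(reliability_changes, gamma=0.9):
    """Sample value of sum_{tau > t} gamma**(tau - t) * D(tau),
    given the recorded future reliability changes D(t+1), D(t+2), ..."""
    return sum(gamma ** (k + 1) * d for k, d in enumerate(reliability_changes))

# Example: reliability improves quickly at first, then further changes shrink
# as the predictor has learned what it can about this part of the environment.
future_changes = [0.30, 0.20, 0.10, 0.02, 0.0]
print(discounted_curiosity_return(future_changes))  # approx. 0.518
```

An action sequence that keeps leading the predictor into regions where its reliability still improves yields a larger return than one that revisits already well-modelled regions.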
So far, the discussion has not referred to any particular reinforcement learning algorithm; in principle, every sensible reinforcement learning algorithm ought to be applicable (e.g., [1], [16], [13], [9]). For instance, [6] describes how adaptive critics [1], [15] can be used to build a `curious' model-building control system based on the principle described above. The following subsection focusses on Watkins' recent `Q-learning' method.