**Collecting ML estimates.**
For each state/action pair (or experiment) $(s,a)$, the agent has a counter $c_{sa}$ whose value at time $t$, $c_{sa}(t)$, equals the number of the agent's previous experiences with $(s,a)$. In addition, for each state/action pair $(s,a)$, there are counters $c_{sas'}$, $s' \in S$. The value of $c_{sas'}$ at time $t$, $c_{sas'}(t)$, equals the number of the agent's previous experiences with $(s,a)$ where the next state was $s'$. Note that $\sum_{s' \in S} c_{sas'}(t) = c_{sa}(t)$. At time $t$, if $c_{sa}(t) > 0$, then

$$\hat{p}^{(t)}(s,a,s') = \frac{c_{sas'}(t)}{c_{sa}(t)}$$

denotes the agent's current unbiased estimate of $p(s,a,s')$, the probability that executing $a$ in state $s$ leads to next state $s'$. If $c_{sa}(t) = 0$, then we define (somewhat arbitrarily) $\hat{p}^{(t)}(s,a,s') = 0$. Note that, as a consequence, before the agent has conducted any experiments of the type $(s,a)$, the $\hat{p}^{(t)}(s,a,s')$ do not satisfy the requirements of a probability distribution. For $c_{sa}(t) > 0$, the $\hat{p}^{(t)}(s,a,s')$ build a maximum likelihood model (consistent with the previous experiences of the agent) of the probabilities of the possible next states.
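The counter bookkeeping above can be sketched as follows (class and method names are illustrative, not from the original text):

```python
from collections import defaultdict

class MLModel:
    """Maximum likelihood model of next-state probabilities.

    Maintains, for each state/action pair (s, a), the counters
    c_{sa} and c_{sas'} described above, and derives the estimates
    p_hat(s, a, s') = c_{sas'} / c_{sa}.
    """

    def __init__(self):
        self.c_sa = defaultdict(int)    # c_{sa}: experiences with (s, a)
        self.c_sas = defaultdict(int)   # c_{sas'}: those that led to s'

    def record(self, s, a, s_next):
        """Update counters after observing (s, a) -> s_next."""
        self.c_sa[(s, a)] += 1
        self.c_sas[(s, a, s_next)] += 1

    def p_hat(self, s, a, s_next):
        """ML estimate; defined as 0 before any (s, a) experiment."""
        n = self.c_sa[(s, a)]
        if n == 0:
            return 0.0
        return self.c_sas[(s, a, s_next)] / n
```

Note that the estimates for a pair $(s,a)$ sum to 1 exactly when $c_{sa} > 0$, matching the caveat above.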

**Measuring information gain.**
If the agent performs an experiment by executing action $a$ in state $s$, and the new state is $s'$, then in general $\hat{p}^{(t+1)}(s,a,\cdot)$ will be different from $\hat{p}^{(t)}(s,a,\cdot)$. By observing the outcome of the experiment, the agent has acquired a piece of information. To measure its progress, we compute the information-theoretic difference between what the agent knew before the experiment, at time $t$, and what the agent knew after the experiment, at time $t+1$. One natural way of doing this is to use the Kullback-Leibler distance (or asymmetric divergence) between the probability distributions represented by the $\hat{p}^{(t+1)}(s,a,\cdot)$ and the $\hat{p}^{(t)}(s,a,\cdot)$. We define

$$D(t) = \sum_{s' \in S} \hat{p}^{(t+1)}(s,a,s') \, \log \frac{\hat{p}^{(t+1)}(s,a,s')}{\hat{p}^{(t)}(s,a,s')} \qquad (1)$$

A related (but less informative) measure of progress is the entropy difference of the probability distributions represented by the $\hat{p}^{(t)}(s,a,\cdot)$ and the $\hat{p}^{(t+1)}(s,a,\cdot)$,

$$\Delta H(t) = H\big(\hat{p}^{(t)}(s,a,\cdot)\big) - H\big(\hat{p}^{(t+1)}(s,a,\cdot)\big) \qquad (2)$$

where $H$ denotes the Shannon entropy.
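Both measures are cheap to compute from the before/after estimate vectors of the one affected $(s,a)$ pair. A minimal sketch (function names are illustrative; the handling of the all-zero initial estimates is an assumption, since equation (1) is undefined there):

```python
import math

def kl_gain(p_new, p_old):
    """Kullback-Leibler distance D(p_new || p_old) of equation (1).

    p_new, p_old map each next state s' to p_hat^(t+1)(s, a, s')
    and p_hat^(t)(s, a, s').  Terms with p_new[s'] == 0 contribute
    nothing to the sum.
    """
    d = 0.0
    for s_next, p in p_new.items():
        if p > 0.0:
            if p_old.get(s_next, 0.0) == 0.0:
                # Before the first (s, a) experiment all old estimates
                # are 0, so the divergence is unbounded.
                return math.inf
            d += p * math.log(p / p_old[s_next])
    return d

def entropy(p):
    """Shannon entropy of a next-state distribution."""
    return -sum(q * math.log(q) for q in p.values() if q > 0.0)

def entropy_gain(p_new, p_old):
    """Entropy difference of equation (2)."""
    return entropy(p_old) - entropy(p_new)
```

For example, moving from the uniform estimate $(\frac12, \frac12)$ to $(\frac23, \frac13)$ yields a small positive gain under both measures.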

**In all cases, best policies are found by using $D(t)$ (or $\Delta H(t)$, respectively) as the reinforcement for the Q-Learning algorithm from section 2.** Since an experiment at time $t$ affects only $|S|$ estimates (the $\hat{p}(s_t, a_t, s')$, $s' \in S$, associated with $s_t$ and $a_t$), and since $D(t)$ can always be computed within $O(|S|)$ operations, the algorithm's overall complexity per time step is bounded by $O(|S|)$.
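Putting the pieces together, a single time step of the resulting algorithm can be sketched as follows. The epsilon-greedy selection rule, the learning parameters, and the cap on the infinite initial gain are illustrative choices not fixed by the text above; `model` is any object with the counter interface described in the first section.

```python
import math
import random
from collections import defaultdict

def kl(p_new, p_old):
    """D(p_new || p_old); infinite if p_old has no mass where p_new does."""
    d = 0.0
    for x, p in p_new.items():
        if p > 0.0:
            if p_old[x] == 0.0:
                return math.inf
            d += p * math.log(p / p_old[x])
    return d

def curiosity_step(s, states, actions, step_env, model, Q,
                   alpha=0.1, gamma=0.9, eps=0.1):
    """One agent time step: pick an action, update the ML model,
    and feed the information gain D(t) of equation (1) as the
    reinforcement to a standard Q-Learning update."""
    # Epsilon-greedy action selection over current Q-values
    # (one common choice; the text above does not fix a rule).
    if random.random() < eps:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: Q[(s, b)])

    # Only the estimates of the executed (s, a) pair change.
    p_old = {x: model.p_hat(s, a, x) for x in states}
    s_next = step_env(s, a)          # perform the experiment
    model.record(s, a, s_next)       # update counters c_sa, c_sas'
    p_new = {x: model.p_hat(s, a, x) for x in states}

    # Before the first (s, a) experiment D(t) is unbounded, so we
    # cap it (an assumption, not specified in the text above).
    r = min(kl(p_new, p_old), 1.0)

    # Standard Q-Learning update with r as the reward.
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                          - Q[(s, a)])
    return s_next
```

Each step touches only the $|S|$ estimates of the executed pair and computes the gain in $O(|S|)$, matching the complexity bound above.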
