Next: 3. SIMULATIONS OF RDIA Up: REINFORCEMENT DRIVEN INFORMATION ACQUISITION Previous: 1. INTRODUCTION

## 2. MODEL BUILDING WITH RDIA

Our agent's task is to build a model of the transition probabilities $p_{ij}(k)$ of its environment, where $p_{ij}(k)$ denotes the probability that $S_k$ is the next state, given that the agent executes action $a_j$ in state $S_i$. The problem is studied in isolation from goal-directed reinforcement learning tasks: RDIA embodies a kind of "unsupervised reinforcement learning". It extends recent previous work on "active exploration" (e.g. [9,8,11]). Previous approaches (1) were limited to deterministic environments (they did not address the general problem of learning a model of the statistical properties of a non-deterministic Markovian environment, NME), and (2) were based on ad hoc elements instead of building on concepts from information theory.

Collecting ML estimates. For each state/action pair (or "experiment") $(S_i, a_j)$, the agent has a counter $c_{ij}$ whose value at time $t$, written $c_{ij}(t)$, equals the number of the agent's previous experiences with $(S_i, a_j)$. In addition, for each state/action pair $(S_i, a_j)$, there are $n$ counters $c_{ijk}$, $k = 1, \ldots, n$. The value of $c_{ijk}$ at time $t$, $c_{ijk}(t)$, equals the number of the agent's previous experiences with $(S_i, a_j)$ where the next state was $S_k$. Note that $c_{ij}(t) = \sum_k c_{ijk}(t)$. At time $t$, if $c_{ij}(t) > 0$, then

$$p^*_{ij}(k, t) = \frac{c_{ijk}(t)}{c_{ij}(t)}$$

denotes the agent's current unbiased estimate of $p_{ij}(k)$. If $c_{ij}(t) = 0$, then we define (somewhat arbitrarily) $p^*_{ij}(k, t) = 0$ for all $k$. Note that, as a consequence, before the agent has conducted any experiments of the type $(S_i, a_j)$, the $p^*_{ij}(k, t)$ do not satisfy the requirements of a probability distribution. For $c_{ij}(t) > 0$, the $p^*_{ij}(k, t)$ build a maximum likelihood model (MLM, consistent with the previous experiences of the agent) of the probabilities of the possible next states.
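As an illustrative sketch (not from the original paper; the class and method names are ours), the counters $c_{ij}$, $c_{ijk}$ and the maximum likelihood estimates $p^*_{ij}(k, t)$ can be maintained as follows:

```python
from collections import defaultdict

class TransitionModel:
    """Counter-based maximum likelihood model of transition
    probabilities, as described in the text above."""

    def __init__(self, n_states):
        self.n_states = n_states
        self.c = defaultdict(int)       # c[(i, j)]: visits to pair (S_i, a_j)
        self.c_next = defaultdict(int)  # c_next[(i, j, k)]: times S_k followed

    def update(self, i, j, k):
        """Record one experiment: action a_j in state S_i led to S_k."""
        self.c[(i, j)] += 1
        self.c_next[(i, j, k)] += 1

    def p_star(self, i, j, k):
        """Current ML estimate p*_ij(k); defined as 0 if c_ij(t) = 0."""
        if self.c[(i, j)] == 0:
            return 0.0
        return self.c_next[(i, j, k)] / self.c[(i, j)]
```

For example, after recording $(S_0, a_0) \to S_1$ twice and $(S_0, a_0) \to S_2$ once, `p_star(0, 0, 1)` returns $2/3$, while all estimates for untried pairs remain zero.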

Measuring information gain. If the agent performs an experiment at time $t$ by executing action $a_j$ in state $S_i$, and the new state is $S_k$, then in general $p^*_{ij}(\cdot, t+1)$ will be different from $p^*_{ij}(\cdot, t)$. By observing the outcome of the experiment, the agent has acquired a piece of information. To measure its progress, we compute the information theoretic difference between what the agent knew before the experiment, at time $t$, and what the agent knew after the experiment, at time $t+1$. One natural way of doing this is to use the Kullback-Leibler distance (or asymmetric divergence) between the probability distributions represented by the $p^*_{ij}(k, t)$ and $p^*_{ij}(k, t+1)$. We define

$$D(t) = \sum_{k=1}^{n} p^*_{ij}(k, t+1) \log \frac{p^*_{ij}(k, t+1)}{p^*_{ij}(k, t)} \qquad (1)$$

where $i$ and $j$ refer to the experiment conducted at time $t$, and terms with $p^*_{ij}(k, t) = 0$ or $p^*_{ij}(k, t+1) = 0$ are taken to be zero.
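A minimal sketch of the information gain of equation (1), assuming the convention that terms with a zero estimate contribute nothing (the function name is ours):

```python
import math

def information_gain(p_new, p_old):
    """Kullback-Leibler distance D(p_new || p_old) between successive
    ML estimates over the possible next states.

    Terms where either estimate is zero are skipped, which keeps the
    gain finite before all outcomes have been observed (an assumed
    convention; the paper's exact handling may differ)."""
    gain = 0.0
    for pn, po in zip(p_new, p_old):
        if pn > 0.0 and po > 0.0:
            gain += pn * math.log(pn / po)
    return gain
```

If the estimates do not change, the gain is zero; an outcome that shifts the estimated distribution yields a positive gain.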

A related (but less informative) measure of progress is the entropy difference of the probability distributions represented by the $p^*_{ij}(k, t)$ and $p^*_{ij}(k, t+1)$,

$$\Delta H(t) = H_{ij}(t) - H_{ij}(t+1), \quad \text{where } H_{ij}(t) = -\sum_{k=1}^{n} p^*_{ij}(k, t) \log p^*_{ij}(k, t), \qquad (2)$$

for $c_{ij}(t) > 0$. Again, if $c_{ij}(t) = 0$ (before the agent has conducted any experiments of type $(S_i, a_j)$), the entropy of the corresponding MLM is taken to be zero. In this case, $\Delta H(t)$ will be zero, too. Another simple, related performance measure can be defined piecewise, with one expression for $c_{ij}(t) > 0$ and another for $c_{ij}(t) = 0$. Initial experiments seem to indicate that the particular definition of the progress measure does not make an essential difference.
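The entropy-based measure can be sketched likewise (hypothetical function names; the $0 \log 0 = 0$ convention keeps the sum defined, and an all-zero "distribution" for an untried pair is assigned zero entropy):

```python
import math

def entropy(p):
    """Shannon entropy of a distribution; 0 log 0 is taken as 0, so an
    all-zero vector (untried state/action pair) has entropy 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def entropy_difference(p_old, p_new):
    """Progress measure in the spirit of equation (2): change in entropy
    between successive ML estimates."""
    return entropy(p_old) - entropy(p_new)
```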

In all cases, best policies are found by using the information gain as the reinforcement for the Q-learning algorithm from the introduction. Since an experiment at time $t$ affects only the $n$ estimates $p^*_{ij}(\cdot, t+1)$ associated with $S_i$ and $a_j$, and since the information gain can always be computed within $O(n)$ operations, the algorithm's overall complexity per time step is bounded by $O(n)$.
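Putting the pieces together, the following is a self-contained sketch of RDIA under stated assumptions: counters for the ML model, KL information gain as reward, and a standard one-step tabular Q-learning update. The paper's exact Q-learning variant and parameters are not given in this section, so `alpha`, `gamma`, `epsilon` and the `sample_next` environment interface are ours.

```python
import math
import random

def rdia_q_learning(n_states, n_actions, sample_next, steps=1000,
                    alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Sketch of RDIA: tabular Q-learning whose reward is the KL
    information gain between successive ML estimates.

    `sample_next(i, j)` performs the experiment (S_i, a_j) in the
    environment and returns the index of the next state."""
    rng = random.Random(seed)
    c = [[0] * n_actions for _ in range(n_states)]   # c_ij
    ck = [[[0] * n_states for _ in range(n_actions)]
          for _ in range(n_states)]                  # c_ijk
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Epsilon-greedy action selection on the curiosity Q-values.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda j: Q[s][j])
        # Estimates before the experiment (all zero if pair untried).
        p_old = [ck[s][a][k] / c[s][a] if c[s][a] else 0.0
                 for k in range(n_states)]
        s2 = sample_next(s, a)
        c[s][a] += 1
        ck[s][a][s2] += 1
        p_new = [ck[s][a][k] / c[s][a] for k in range(n_states)]
        # Reward = D(p_new || p_old); zero-estimate terms are skipped.
        r = sum(pn * math.log(pn / po)
                for pn, po in zip(p_new, p_old) if pn > 0.0 and po > 0.0)
        # Standard one-step Q-learning update with intrinsic reward r.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

Because the reward is the model's own information gain, actions whose outcomes are already well predicted stop yielding reward, steering the agent toward still-uncertain experiments.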

Juergen Schmidhuber 2003-02-28
