Next: Bibliography Up: REINFORCEMENT DRIVEN INFORMATION ACQUISITION Previous: 2. MODEL BUILDING WITH

## 3. SIMULATIONS OF RDIA

We compared the performance of the RDIA variants described above to that of conventional random exploration (variants of which are the methods employed by most authors).

Table 1: For random search and two RDIA variants, the evolution of the sum of Kullback-Leibler distances between estimated and true probability distributions is shown, at experiment counts 1, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144. In the beginning, RDIA takes a while to find out where it can expect to learn something. But then it quickly surpasses random search.
[Table values not recoverable from the source; columns: # Experiments, Random Search, RDIA (entropy), RDIA (prob. diff.).]
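The quantity tracked in Table 1 can be sketched as follows. This is a minimal illustration, not the paper's code: `kl` is the standard Kullback-Leibler distance between two discrete distributions, and `model_error` sums it over all state/action pairs of a hypothetical transition model stored as a dict.

```python
import math

def kl(p, q, eps=1e-12):
    """Kullback-Leibler distance KL(p||q) between two discrete distributions
    given as lists of probabilities (natural log, i.e. nats)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def model_error(true_model, est_model):
    """Sum of KL distances between true and estimated next-state
    distributions over all state/action pairs -- the quantity whose
    evolution Table 1 tracks. Both arguments map (state, action) to a
    list of next-state probabilities (an assumed representation)."""
    return sum(kl(true_model[sa], est_model[sa]) for sa in true_model)
```

A perfect model yields error 0; the error of `[1, 0]` estimated as uniform is log 2.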

A small environment. The first test environment consists of states. There are possible actions, and 100 possible experiments. The transition probabilities are:

and otherwise. The only state that allows the agent to acquire a lot of information is . After a while, RDIA (with parameters , , and ) discovers this and establishes a policy that causes the agent to move as quickly as possible to from every other state. Random exploration, however, wastes most of its time on the soon useless (uninformative) examination of the states ... . This can be seen from Table 1, which compares random search with the two RDIA variants that worked best: (1) RDIA based on changes in entropy (equation 2), and (2) RDIA based on weighted probability changes. In the beginning, RDIA takes a while to find out where it can expect to learn something; then it quickly catches on and surpasses random search.
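The entropy-based curiosity reward can be sketched as follows. This is one plausible reading, not the paper's exact equation 2: the `TransitionModel` class, the counter-based estimator, and the uniform estimate for unvisited pairs are assumptions made for illustration.

```python
import math
from collections import defaultdict

class TransitionModel:
    """Counter-based world model: p(s'|s,a) = c(s,a,s') / c(s,a)."""
    def __init__(self, n_states):
        self.n = n_states
        self.counts = defaultdict(lambda: [0] * n_states)

    def dist(self, s, a):
        c = self.counts[(s, a)]
        tot = sum(c)
        if tot == 0:                      # no data yet: assume uniform
            return [1.0 / self.n] * self.n
        return [ci / tot for ci in c]

    def entropy(self, s, a):
        return -sum(p * math.log(p) for p in self.dist(s, a) if p > 0)

    def observe(self, s, a, s_next):
        """Update the counters and return the entropy change of the
        affected distribution as the curiosity reward (the 'RDIA
        (entropy)' variant, sketched)."""
        h_before = self.entropy(s, a)
        self.counts[(s, a)][s_next] += 1
        return abs(self.entropy(s, a) - h_before)
```

States where repeated experiments keep shifting the estimated distribution keep yielding reward; states whose estimates have converged yield almost none, which is what drives the agent toward the informative state.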

A bigger environment. The second test environment consists of states. There are possible actions, and 10000 possible experiments. The transition probabilities are:

and otherwise. The information content of the second environment (the sum of the entropies of the true transition probability distributions associated with all state/action pairs) is 460.517019.
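The reported information content is consistent with 100 maximally uncertain distributions, each uniform over 100 successor states, all other transitions being deterministic (zero entropy): 100 · ln 100 ≈ 460.517019. This decomposition is an assumption made to illustrate the arithmetic (note the natural logarithm), not a statement of the environment's exact structure.

```python
import math

def dist_entropy(p):
    """Shannon entropy of a discrete distribution in nats (natural log,
    which matches the reported value)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical decomposition: 100 state/action pairs with a uniform
# distribution over 100 successor states; all remaining distributions
# are deterministic and contribute zero entropy.
uniform = [1.0 / 100] * 100
total = 100 * dist_entropy(uniform)
print(round(total, 6))   # 460.517019
```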

For random search and for RDIA based on entropy changes (with parameters , , and ), Table 2 shows the number of time steps required to achieve given entropy values. The only state allowing for the acquisition of a lot of information is . RDIA quickly discovers this and establishes a policy that causes the agent to move as quickly as possible to from every other state. Random exploration, in contrast, wastes much of its time on the states ... . Again, for small entropy margins, the advantage of reinforcement driven information acquisition is not as pronounced as in later stages, because Q-learning needs some time to settle on a strategy for performing experiments. As the entropy margin approaches the optimum, however, reinforcement driven information acquisition becomes much faster, by at least an order of magnitude.
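The experiment-selection policy can be sketched as a standard tabular Q-learning update on the curiosity reward. This is a minimal sketch: `env_step`, `info_gain`, and the parameter values `alpha`, `gamma`, `eps` are illustrative placeholders, not the paper's settings.

```python
import random

def rdia_step(Q, s, env_step, info_gain, alpha=0.1, gamma=0.9, eps=0.1):
    """One RDIA interaction step (a sketch).
    Q         : dict mapping state -> list of Q-values, one per action
    env_step  : (s, a) -> next state (the environment's transition)
    info_gain : (s, a, s_next) -> intrinsic reward, e.g. the model's
                entropy change caused by the new observation"""
    n_actions = len(Q[s])
    # epsilon-greedy action selection on the curiosity Q-values
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda i: Q[s][i])
    s_next = env_step(s, a)
    r = info_gain(s, a, s_next)
    # standard Q-learning update, driven by the curiosity reward
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    return s_next
```

Since the informative state keeps producing reward long after the others have been learned, the Q-values come to encode shortest routes to it, which is the policy the text describes.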

Table 2: For random search and for RDIA based on entropy differences, this table shows the number of time steps required to achieve given entropy values (170.0, 370.0, 459.0, 460.0). The optimal value (the true information content of the environment) is 460.517019. As the entropy margin approaches the optimum, RDIA becomes much faster. The entry marked "unknown" was not computed due to limited computation time.
[Experiment counts not recoverable from the source; columns: Goal entropy, # Experiments: Random Search, # Experiments: RDIA.]

Future work. 1. "Exploitation/exploration trade-off". In this paper, exploration was studied in isolation from exploitation. Is there an "optimal" way of combining both? For which kinds of goal-directed learning should RDIA be recommended? It is always possible to design environments where "curiosity" (the drive to explore the world) may "kill the cat", or at least have a negative influence on exploitation performance. This is illustrated by additional experiments presented in [10]: in one environment described there, exploration helps to speed up exploitation, but in a different environment, curiosity slows down exploitation. The "exploitation/exploration trade-off" remains an open problem.

2. Additional experimental comparisons. It will be interesting to compare RDIA to stronger competitors than random exploration, such as Kaelbling's Interval Estimation algorithm [5].

3. Function approximators. It will also be interesting to replace the Q-table by function approximators such as backprop networks. Previous experimental work by various authors indicates that in certain environments this might improve performance, even though the theoretical foundations of combining Q-learning with function approximators are still weak.

Juergen Schmidhuber 2003-02-28
