A small environment. The first test environment consists of states. There are possible actions, and 100 possible experiments. The transition probabilities are:
A bigger environment. The second test environment consists of states. There are possible actions, and 10000 possible experiments. The transition probabilities are:
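To illustrate the structure shared by both test environments, the following is a minimal sketch of a tabulated non-deterministic environment. The state and action counts and the transition probabilities below are placeholders rather than the values used in the experiments, and reading an ``experiment'' as a single (state, action) pair is an assumption.
\begin{verbatim}
import numpy as np

class DiscreteEnvironment:
    """Finite non-deterministic environment: executing action a in state s
    leads to a successor state drawn from the distribution P[s, a, :]."""

    def __init__(self, n_states, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        # Placeholder transition table (random); the actual environments
        # use the specific probabilities given in the original text.
        P = self.rng.random((n_states, n_actions, n_states))
        self.P = P / P.sum(axis=2, keepdims=True)   # normalize each row
        self.n_states, self.n_actions = n_states, n_actions
        self.state = 0

    def step(self, action):
        """Sample the successor of the current state under `action`."""
        self.state = int(self.rng.choice(self.n_states,
                                         p=self.P[self.state, action]))
        return self.state

# If an "experiment" is one (state, action) pair, the two environments offer
# n_states * n_actions = 100 and 10000 experiments, respectively; the split
# into states and actions chosen here is hypothetical.
small_env = DiscreteEnvironment(n_states=20, n_actions=5)
big_env   = DiscreteEnvironment(n_states=100, n_actions=100)
\end{verbatim}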
For random search and for RDIA based on entropy changes (with parameters , , and ), Table 2 shows the number of time steps required to achieve given entropy values.
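As a compact reminder of how these RDIA runs are set up, the sketch below combines a tabular Q-learner with an information-gain reward given by the entropy change of the estimated transition distribution of the executed experiment. The parameter values (learning rate, discount factor, exploration rate) are placeholders rather than the settings behind Table 2, and the exact form of the reward is one plausible reading of ``entropy changes''.
\begin{verbatim}
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (with 0 log 0 := 0)."""
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

class RDIAAgent:
    """Sketch: tabular Q-learning whose reward is the entropy change of the
    estimated transition distribution of the executed (state, action) pair."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        # alpha, gamma, epsilon are placeholders, not the paper's settings.
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = np.zeros((n_states, n_actions))
        # Transition counts, initialized to one so estimates are always defined.
        self.counts = np.ones((n_states, n_actions, n_states))

    def act(self, s, rng):
        """Epsilon-greedy action selection on the curiosity Q-values."""
        if rng.random() < self.epsilon:
            return int(rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, s_next):
        """Update the model counts, compute the entropy-change reward, and
        apply one standard Q-learning step with that reward."""
        p_old = self.counts[s, a] / self.counts[s, a].sum()
        self.counts[s, a, s_next] += 1
        p_new = self.counts[s, a] / self.counts[s, a].sum()
        r = abs(entropy(p_old) - entropy(p_new))    # information-gain reward
        target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])
        return r
\end{verbatim}
A random-search baseline is obtained by replacing the epsilon-greedy choice with a uniformly random action while keeping the same transition counts.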
The only state allowing the acquisition of a lot of information is . RDIA quickly discovers this and establishes a policy that causes the agent to move there as quickly as possible from every other state. Random exploration, in contrast, wastes much of its time on the states ... .
Again, for small entropy margins, the advantage of reinforcement driven information acquisition is not as pronounced as in later stages, because Q-learning needs some time to settle on a strategy for performing experiments. As the entropy margin approaches the optimum, however, reinforcement driven information acquisition becomes much faster, by at least an order of magnitude.
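The quantity reported in Table 2 can be read as a first-passage time. Under the assumption that the entropy in question is the summed entropy of the agent's estimated transition distributions, a hypothetical helper for measuring it with the sketches above might look like this:
\begin{verbatim}
import numpy as np

def model_entropy(agent):
    """Summed entropy of the estimated transition distributions."""
    p = agent.counts / agent.counts.sum(axis=2, keepdims=True)
    return float(-np.sum(np.where(p > 0, p * np.log(p), 0.0)))

def steps_to_entropy(env, agent, target_entropy, max_steps=200000, seed=0):
    """First time step at which the model entropy drops to `target_entropy`
    or below; returns None if this does not happen within `max_steps`."""
    rng = np.random.default_rng(seed)
    s = env.state
    for t in range(1, max_steps + 1):
        a = agent.act(s, rng)
        s_next = env.step(a)
        agent.update(s, a, s_next)
        s = s_next
        if model_entropy(agent) <= target_entropy:
            return t
    return None
\end{verbatim}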
Future work. 1. ``Exploitation/exploration trade-off''. In this paper, exploration was studied in isolation from exploitation. Is there an ``optimal'' way of combining both? For which kinds of goal-directed learning should RDIA be recommended? It is always possible to design environments where ``curiosity'' (the drive to explore the world) may ``kill the cat'', or at least have a negative influence on exploitation performance. This is illustrated by additional experiments presented in [10]: in one environment described therein, exploration speeds up exploitation, while in a different environment curiosity slows it down. The ``exploitation/exploration trade-off'' remains an open problem.
2. Additional experimental comparisons. It will be interesting to compare RDIA to stronger competitors than random exploration, such as Kaelbling's Interval Estimation algorithm [5].
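For orientation, the core idea of Interval Estimation is to act greedily with respect to an upper confidence bound on each action's estimated payoff. The sketch below uses a simple normal-approximation interval for Bernoulli payoffs; the exact interval construction in [5] may differ, so the details here are assumptions.
\begin{verbatim}
import math

def interval_estimation_action(successes, trials, z=1.96):
    """Pick the action whose estimated success probability has the largest
    upper confidence bound (normal approximation; details are assumed)."""
    best_action, best_ub = 0, -1.0
    for a, (s, n) in enumerate(zip(successes, trials)):
        if n == 0:
            return a                  # try untested actions first
        p = s / n
        ub = p + z * math.sqrt(p * (1.0 - p) / n)
        if ub > best_ub:
            best_action, best_ub = a, ub
    return best_action
\end{verbatim}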
3. Function approximators. It will also be interesting to replace the Q-table by function approximators such as backprop networks. Previous experimental work by various authors indicates that in certain environments this might improve performance, even though the theoretical foundations of combining Q-learning with function approximators are still weak.
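As an indication of what such a replacement would amount to, the following is a minimal sketch of a one-hidden-layer backprop network that maps a one-hot state encoding to one Q-value per action; the architecture and step size are placeholders.
\begin{verbatim}
import numpy as np

class QNetwork:
    """Sketch: a one-hidden-layer backprop network replacing the Q-table.
    Input is a one-hot state encoding; output is one Q-value per action."""

    def __init__(self, n_states, n_actions, n_hidden=32, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_states))
        self.W2 = rng.normal(0.0, 0.1, (n_actions, n_hidden))
        self.lr, self.n_states = lr, n_states

    def q_values(self, s):
        """Forward pass; also returns activations needed for backprop."""
        x = np.zeros(self.n_states)
        x[s] = 1.0
        h = np.tanh(self.W1 @ x)
        return self.W2 @ h, (x, h)

    def train_step(self, s, a, target):
        """One gradient step on 0.5 * (target - Q(s, a))^2, e.g. with
        target = r + gamma * max_b Q(s_next, b) from Q-learning."""
        q, (x, h) = self.q_values(s)
        err = target - q[a]                       # TD error for action a
        grad_h = err * self.W2[a] * (1.0 - h ** 2)
        self.W2[a] += self.lr * err * h           # output-layer update
        self.W1 += self.lr * np.outer(grad_h, x)  # hidden-layer update
\end{verbatim}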