Summary. HQ-learning is a novel method for reinforcement learning in partially observable environments. ``Non-Markovian'' tasks are automatically decomposed into subtasks solvable by memoryless policies, without intermediate external reinforcement for ``good'' subgoals. The decomposition is carried out by an ordered sequence of agents, each of which discovers both a local control policy and an appropriate subgoal. At any time step, the only memory is the ``name'' of the currently active agent. Our experiments involve (model-free, deterministic) POMDPs with many more states than most POMDPs found in the literature. The results demonstrate HQ-learning's ability to quickly learn optimal or near-optimal policies.
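The control-transfer idea summarized above can be illustrated with a minimal sketch. It is not the HQ-learning algorithm itself: the policies and subgoals are set by hand rather than learned, and the corridor task, the names `Agent`, `observe`, and `run`, and all parameters are hypothetical choices made for this illustration. The task is a deterministic corridor in which the goal at the right end only counts after the left end has been visited, so the observation alone does not reveal what to do next.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Optional

LEFT, RIGHT = -1, +1
OBSERVATIONS = ("left_wall", "corridor", "right_wall")


@dataclass
class Agent:
    """Hypothetical agent: a memoryless policy plus one subgoal observation."""
    policy: Dict[str, int]    # observation -> action (no internal memory)
    subgoal: Optional[str]    # observing this hands control to the next agent


def observe(pos: int, length: int) -> str:
    """All interior cells look identical, which makes the task non-Markovian."""
    if pos == 0:
        return "left_wall"
    if pos == length - 1:
        return "right_wall"
    return "corridor"


def run(agents: List[Agent], start: int = 2, length: int = 5,
        max_steps: int = 50) -> Optional[int]:
    """Return the number of steps needed to solve the task, or None on failure.

    The hidden state is (pos, visited_left). The only memory on the control
    side is `active`, the index ("name") of the agent currently in charge.
    """
    pos, visited_left, active = start, start == 0, 0
    for step in range(max_steps):
        obs = observe(pos, length)
        # Transfer control once the active agent's subgoal is observed.
        if active < len(agents) - 1 and obs == agents[active].subgoal:
            active += 1
        # Goal: stand at the right end after having visited the left end.
        if pos == length - 1 and visited_left:
            return step
        pos = min(max(pos + agents[active].policy[obs], 0), length - 1)
        visited_left = visited_left or pos == 0
    return None


# Two hand-set agents: first walk left until the wall, then walk right.
go_left = {obs: LEFT for obs in OBSERVATIONS}
go_right = {obs: RIGHT for obs in OBSERVATIONS}
two_agents = [Agent(go_left, "left_wall"), Agent(go_right, None)]
steps_two_agents = run(two_agents)  # 6 steps from the default start

# Every single memoryless policy (2^3 of them) fails on the same task.
single_agent_results = [
    run([Agent(dict(zip(OBSERVATIONS, actions)), None)])
    for actions in product((LEFT, RIGHT), repeat=len(OBSERVATIONS))
]
```

In this toy setting the two-agent sequence solves the task while none of the eight single memoryless policies does, which is the kind of decomposition the summary describes. In HQ-learning proper, both the policies and the subgoals would be discovered by the agents rather than specified in advance.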
Future work. The current HQ version is restricted to learning a single, linearly ordered subgoal sequence. For very complex POMDPs, generalized HQ-architectures based on directed acyclic (or even recurrent) graphs may turn out to be useful. In our view, however, the most challenging problem is exploration: ``destructive'' exploration rules will unlearn good subgoal sequences. How to improve POMDP exploration is still an open question.