The problem. Sensory information is usually insufficient to infer the environment's state (perceptual aliasing, Whitehead 1992). This complicates goal-directed behavior. For instance, suppose your instructions for the way to the station are: ``Follow this road to the traffic light, turn left, follow that road to the next traffic light, turn right, there you are.'' Suppose you reach one of the traffic lights. To do the right thing you need to know whether it is the first or the second. This requires at least one bit of memory -- your current environmental input by itself is not sufficient.
The most widely used reinforcement learning (RL) algorithms, such as Q-learning (Watkins 1989; Watkins and Dayan 1992) and TD(λ) (Sutton 1988), fail if the task requires creating short-term memories of relevant events to disambiguate identical sensory inputs observed in different states. They are limited to Markov decision problems (MDPs): at any given time the probabilities of the possible next states depend only on the current state and action.
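For reference, tabular Q-learning updates a state-action value table from individual transitions. The sketch below uses illustrative state names, learning rate, and discount factor -- these are assumptions for the example, not details of the cited work:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]
Q = defaultdict(float)   # unseen state-action pairs default to value 0.0
# One illustrative transition: in state 0, action "right" yields reward 1.0
# and leads to state 1.
q_update(Q, 0, "right", 1.0, 1, actions)
```

Note that the update conditions only on the current state and action; once aliasing replaces the true state by an ambiguous observation, this is exactly the assumption that breaks.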
Partially observable Markov decision problems (POMDPs, e.g., Littman 1996) are more difficult than simple MDPs: a particular observation may demand different action responses depending on the temporal context. POMDPs are generally considered difficult because of their particularly nasty temporal credit assignment problem: it is usually hard to figure out which observations are relevant, and how they should affect short-term memory contents. Even deterministic finite horizon POMDPs are NP-complete (Littman 1996) -- general and exact algorithms are feasible only for small problems. This explains the recent interest in heuristic methods for finding good but not necessarily optimal solutions, e.g., Schmidhuber (1991c), McCallum (1993), Ring (1994), Cliff and Ross (1994), Jaakkola, Singh and Jordan (1995).
Unfortunately, previous methods do not scale up well (Littman, Cassandra and Kaelbling 1995). This paper presents HQ-learning, a novel approach based on finite state memory implemented in a sequence of agents. HQ does not need a model of the POMDP and appears to scale up more reasonably than other approaches. For alternative approaches to larger scale POMDPs, see also Schmidhuber, Zhao and Wiering (1997b), Wiering and Schmidhuber (1996).
Inspiration. To select the optimal next action it is often not necessary to memorize the entire past (in general, this would be infeasible). A few memories corresponding to important previously achieved subgoals can be sufficient. To see this recall the traffic light scenario. While you are on your way, only a few memories are relevant, such as ``I already passed the first traffic light''. Between two such subgoals a memory-independent, reactive policy (RP) will carry you safely.
Overview. HQ-learning attempts to exploit such situations. Its divide-and-conquer strategy discovers a subgoal sequence decomposing a given POMDP into a sequence of reactive policy problems (RPPs). RPPs can be solved by RPs: all states causing identical inputs require the same optimal action. The only ``critical'' points are those corresponding to transitions from one RP to the next.
To deal with such transitions HQ uses multiple RPP-solving subagents. Each agent's RP is an adaptive mapping from observations to actions. At a given time only one agent can be active, and the system's only type of short-term memory is embodied by a pointer indicating which one. How many bits of information are conveyed by such a limited kind of short-term memory? The answer is: no more than the binary logarithm of the number of agents (the additional information conveyed by the system's RPs and subgoal generators tends to require many more bits, of course).
RPs of different agents are combined in a way learned by the agents themselves. The first active agent uses a subgoal table (its HQ-table) to generate a subgoal for itself (subgoals are represented by desired inputs). It then follows the policy embodied by its Q-function until it achieves its subgoal. Then control is passed to the next agent, and the procedure repeats itself. After the overall goal is achieved or a time limit is exceeded, each agent adjusts both its RP and its subgoal. This is done by two learning rules that interact without explicit communication: (1) Q-table adaptation is based on slight modifications of Q(λ)-learning. (2) HQ-table adaptation is based on tracing successful subgoal sequences by Q(λ)-learning on the higher (subgoal) level. Effectively, subgoal/RP combinations leading to higher rewards become more likely to be chosen.
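This control scheme can be illustrated on the traffic-light example from the opening paragraph. Everything below -- the class names, the toy corridor, and the hand-coded subgoals and policies -- is an illustrative assumption; in HQ-learning the HQ- and Q-tables are learned, not hand-coded:

```python
class Agent:
    """One subagent: a subgoal (a desired observation, as from its HQ-table)
    plus a reactive policy mapping observations to actions (its Q-table)."""
    def __init__(self, subgoal, policy):
        self.subgoal = subgoal
        self.policy = policy

class Corridor:
    """Toy corridor: road, light, road, light, station. Both lights produce
    the identical observation "light" (perceptual aliasing)."""
    LAYOUT = ["road", "light", "road", "light", "station"]
    CORRECT_TURN = {1: "left", 3: "right"}   # required action at each light
    def __init__(self):
        self.pos = 0
    def observe(self):
        return self.LAYOUT[self.pos]
    def step(self, action):
        if self.pos in self.CORRECT_TURN and action != self.CORRECT_TURN[self.pos]:
            return self.observe(), 0.0, True     # wrong turn: episode fails
        self.pos += 1
        done = self.pos == len(self.LAYOUT) - 1  # reached the station
        return self.observe(), (1.0 if done else 0.0), done

def hq_episode(agents, env, max_steps=20):
    """Only one agent is active; the index i is the system's entire
    short-term memory. When the active agent observes its subgoal,
    control passes to the next agent."""
    i, obs = 0, env.observe()
    for _ in range(max_steps):
        if i + 1 < len(agents) and obs == agents[i].subgoal:
            i += 1                               # transfer control
        obs, reward, done = env.step(agents[i].policy[obs])
        if done:
            return reward
    return 0.0

agents = [
    Agent(subgoal="light", policy={"road": "go"}),                   # reach light 1
    Agent(subgoal="light", policy={"light": "left", "road": "go"}),  # reach light 2
    Agent(subgoal=None,    policy={"light": "right"}),               # final turn
]
reward = hq_episode(agents, Corridor())
# A single memoryless policy must map "light" to one fixed action and
# therefore fails at one of the two lights; the agent pointer resolves this.
```

Each agent's policy here is a reactive observation-to-action table, so the only "memory" the system carries between the two identical light observations is which agent currently holds control.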
Although each agent's RP is represented by a memoryless lookup table, the whole system can solve ``non-Markovian'' tasks impossible to learn with single lookup tables. Unlike, e.g., Singh's system (1992) and Lin's hierarchical learning method (1993), ours does not depend on an external teacher who provides a priori information about ``good'' subtasks. Unlike Jaakkola et al.'s method (1995), ours is not limited to finding suboptimal, memoryless, stochastic policies for POMDPs with optimal, memory-based, deterministic solutions.
Outline. Section 2 describes HQ-learning details, including learning rules for both Q- and HQ-tables. Section 3 describes experiments with relatively complex partially observable mazes (up to 960 world states). They demonstrate HQ's ability to decompose POMDPs into several appropriate RPPs. Section 4 reviews related work. Section 5 summarizes HQ's advantages and limitations. Section 6 concludes and lists directions for future research.