Other authors proposed hierarchical reinforcement learning techniques, e.g., Schmidhuber (1991b), Dayan and Hinton (1993), Moore (1993), Tham (1995), Sutton (1995). Their methods, however, have been designed for MDPs. Since the focus of our paper is on POMDPs, this section is limited to a brief summary of previous POMDP approaches with specific advantages and disadvantages.
Recurrent neural networks. Schmidhuber (1991c) uses two interacting, gradient-based recurrent networks. The ``model network'' serves to model (predict) the environment, the other one uses the model net to compute gradients maximizing reinforcement predicted by the model (this extends ideas by Nguyen and B. Widrow 1989; and Jordan and Rumelhart 1990). To our knowledge this work presents the first successful reinforcement learning application to simple non-Markovian tasks (e.g., learning to be a flipflop). Lin (1993) also uses combinations of controllers and recurrent nets. He compares time-delay neural networks (TDNNs) and recurrent neural networks.
Despite their theoretical power, standard recurrent nets run into practical problems in case of long time lags between relevant input events. Although there are recent attempts at overcoming this problem (e.g., Schmidhuber 1992; Hihi and Bengio 1995; Hochreiter and Schmidhuber 1997, and references therein), there are no reinforcement learning applications yet.
Belief vectors. Kaelbling, Littman and Cassandra (1995) hierarchically build policy trees to calculate optimal policies in stochastic, partially observable environments. For each possible environmental state the ``belief vector'' provides the agent's estimate of the probability of currently being in this state. After each observation the belief vector is updated using action/observation models and Bayes' formula. This compresses the history of previous events into a probability distribution. Based on this ``belief state'' an optimal action can be chosen (Sondik 1971). Dynamic programming algorithms are used to compute optimal policies based on the belief states. Problems with this approach are that the nature of the underlying MDP needs to be known, and that it is computationally very expensive. Methods for speeding it up focus on constructing more compact policy trees. For instance, Littman (1996) uses the ``witness algorithm'' to accelerate policy tree construction. Zhang and Liu (1996) propose a scheme which (a) speeds up the dynamic programming updates and (b) uses an oracle providing additional state information to decrease uncertainty.
Boutilier and Poole (1996) use Bayesian networks to represent POMDPs, and use these more compact models to accelerate policy computation. Parr and Russell (1995) use gradient descent methods on a continuous representation of the value function. Their experiments show significant speed-ups on certain small problems.
Littman et al. (1995) compare different POMDP algorithms using belief vectors. They report that ``small POMDPs'' (with less than 10 states and few actions) do not pose a very big problem for most methods. Larger POMDPs (50 to 100 states), however, cause major problems. This indicates that the problems in the current paper (which involve 62 and 960 states) can hardly be solved by such methods. HQ-learning, by contrast, is neither computationally complex nor requires knowledge of the underlying MDP. In absence of prior knowledge this can be a significant advantage. An advantage of the other methods, however, is that they can deal with very noisy perceptions and actions.
A possible HQ extension could use belief vectors to assign selection probabilities to each agent and to weigh their Q-values. In very noise environments this may work better than simple HQ.
Hidden Markov Models. McCallum's utile distinction memory (1993) is an extension of Chrisman's perceptual distinctions approach (1992), which combines Hidden Markov Models (HMMs) and Q-learning. The system is able to solve simple POMDPs (maze tasks with only a few fields) by splitting ``inconsistent'' HMM states whenever the agent fails to predict their utilities (but instead experiences quite different returns from these states). One problem of the approach is that it cannot solve problems in which conjunctions of successive perceptions are useful for predicting reward while independent perceptions are irrelevant. HQ-learning does not have this problem -- it deals with perceptive conjunctions by using multiple agents if necessary.
Memory bits. Littman (1994) uses branch-and-bound heuristics to find suboptimal memoryless policies extremely quickly. To handle mazes for which there is no safe, deterministic, memoryless policy, he replaces each conventional action by two actions, each having the additional effect of switching on or off a ``memory bit''. Good results are obtained with a toy problem. The method does not scale though, due to search space explosion caused by adding memory bits.
Cliff and Ross (1994) use Wilson's (1994) classifier system (ZCS) for POMDPs. There are memory bits which can be set and reset by actions. ZCS is trained by bucket-brigade and genetic algorithms. The system is reported to work well on small problems but to become unstable in case of more than one memory bit. Also, it is usually not able to find optimal deterministic policies. Wilson (1995) recently described a more sophisticated classifier system which uses prediction accuracy for calculating fitness, and a genetic algorithm working in environmental niches. His study shows that this makes the classifiers more general and more accurate. It would be interesting to see how well this system can use memory for solving POMDPs.
One problem with memory bits is that tasks such as those in section 3 require (1) switching on/off memory bits at precisely the right moment, and (2) keeping them switched on/off for long times. During learning and exploration, however, each memory bit will be very unstable and change all the time -- algorithms based on incremental solution refinement will usually have great difficulties in finding out when to set or reset it. Even if the probability of changing a memory bit in response to a particular observation is low it will eventually change if the observation is made frequently. HQ-learning does not have such problems. Its memory is embodied solely by the active agent's number, which is rarely incremented during a trial. This makes it much more stable.
Program evolution with memory cells (MCs). Certain techniques for automatic program synthesis based on evolutionary principles can be used to evolve short-term memorizing programs that read and write MCs during runtime (e.g., Teller, 1994). A recent such method is Probabilistic Incremental Program Evolution (PIPE -- Salustowicz and Schmidhuber, 1997). PIPE iteratively generates successive populations of functional programs according to an adaptive probability distribution over all possible programs. On each iteration it uses the best program to refine the distribution. Thus it stochastically generates better and better programs. An MC-based PIPE variant has been successfully used to solve tasks in partially observable mazes. Unlike the memory bit approach mentioned in the previous paragraph, population-based approaches will not easily unlearn programs that make good use of memory. On serial machines, however, their evaluation tends to be computationally much more expensive than HQ.
Learning control hierarchies. Ring's system (1994) constructs a bottom-up control hierarchy. The lowest level nodes are primitive perceptual and control actions. Nodes at higher levels represent sequences of lower level nodes. To disambiguate inconsistent states, new higher-level nodes are added to incorporate information hidden ``deeper'' in the past, if necessary. The system is able to quickly learn certain non-Markovian maze problems but often is not able to generalize from previous experience without additional learning, even if the optimal policies for old and new task are identical. HQ-learning, however, can reuse the same policy and generalize well from previous to ``similar'' problems.
McCallum's U-tree (1996) is quite similar to Ring's system. It uses prediction suffix trees (see Ron and al. 1994) in which the branches reflect decisions based on current or previous inputs/actions. Q-values are stored in the leaves, which correspond to clusters of instances collected and stored during the entire learning phase. Statistical tests are used to decide whether instances in a cluster correspond to significantly different utility estimates. If so, the cluster is split. McCallum's recent experiments demonstrate the algorithm's ability to learn reasonable policies in large state spaces.
One problem with Ring's and McCallum's approaches is that they depend on the creation of an -th order Markov model, where is the size of the ``time window'' used for sampling observations. Hence for large the approach will suffer from the curse of dimensionality.
Consistent Representations. Whitehead (1992) uses the ``Consistent Representation (CR) Method'' to deal with inconsistent internal states which result from ``perceptual aliasing'' due to ambiguous input information. CR uses an ``identification stage'' to execute perceptual actions which collect the information needed to define a consistent internal state. Once a consistent internal state has been identified, a single action is generated to maximize future discounted reward. Both identifier and controller are adaptive. One limitation of his method is that the system has no means of remembering and using any information other than that immediately perceivable. HQ-learning, however, can profit from remembering previous events for very long time periods.
Levin Search. Wiering and Schmidhuber (1996) use Levin search (LS) through program space (Levin 1973) to discover programs computing solutions for large POMDPs. LS is of interest because of its amazing theoretical properties: for a broad class of search problems, it has the optimal order of computational complexity. For instance, suppose there is an algorithm that solves a certain type of maze task in steps, where is a positive integer representing the problem size. Then LS will solve the same task in at most steps. Wiering and Schmidhuber show that LS may have substantial advantages over other reinforcement learning techniques, provided the algorithmic complexity of the solutions is low.
Success-Story Algorithm. Wiering and Schmidhuber (1996) also extend LS to obtain an incremental method for generalizing from previous experience (``adaptive LS''). To guarantee that the lifelong history of policy changes corresponds to a lifelong history of reinforcement accelerations, they use the success-story algorithm (SSA, e.g., Schmidhuber, Zhao and Schraudolph 1997a; Zhao and Schmidhuber 1996, Schmidhuber, Zhao and Wiering 1997b). This can lead to further significant speed-ups. SSA is actually not LS-specific, but a general approach that allows for plugging in a great variety of learning algorithms. For instance, in additional experiments with a ``self-referential'' system that embeds its policy-modifying method within the policy itself, SSA is able to solve huge POMDPs with more than states (Schmidhuber et al. 1997a). It may be possible to combine SSA with HQ-learning in an advantageous way.
Multiple Q-learners. Like HQ-learning, Humphrys' W-learning (1996) uses multiple Q-learning agents. A major difference is that his agents' skills are prewired -- different agents focus on different input features and receive different rewards. ``Good'' reward functions are found by genetic algorithms. An important goal is to learn which agent to select for which part of the input space. Eight different learning methods implementing cooperative and competitive strategies are tested in a rather complex dynamic environment, and seem to lead to reasonable results.
Digney (1996) describes a nested Q-learning technique based on multiple agents learning independent, reusable skills. To generate quite arbitrary control hierarchies, simple actions and skills can be composed to form more complex skills. Learning rules for selecting skills and for selecting actions are the same, however. This may make it hard to deal with long reinforcement delays. In experiments the system reliably learns to solve a simple maze task. It remains to be seen, however, whether the system can reliably learn to decompose solutions of complex problems into stable skills.