The most challenging open issues in sequential decision making include partial observability of the decision maker's environment, hierarchical and other types of abstract credit assignment, the learning of credit assignment algorithms, and exploration without a priori world models. I will summarize why direct search (DS) in policy space provides a more natural framework for addressing these issues than reinforcement learning (RL) based on value functions and dynamic programming. Then I will point out fundamental drawbacks of traditional DS methods in the case of stochastic environments, stochastic policies, and unknown temporal delays between actions and observable effects. I will discuss a remedy called the success-story algorithm, show how it can outperform traditional DS, and mention a relationship to market models that combine certain aspects of DS and traditional RL.
Policy learning. A learner's modifiable parameters that determine its behavior are called its policy. An algorithm that modifies the policy is called a learning algorithm. In the context of sequential decision making based on reinforcement learning (RL) there are two broad classes of learning algorithms: (1) methods based on dynamic programming (DP) [Bellman, 1961], and (2) direct search (DS) in policy space. DP-based RL (DPRL) learns a value function mapping input/action pairs to expected discounted future reward and uses online variants of DP for constructing rewarding policies [Samuel, 1959; Barto et al., 1983; Sutton, 1988; Watkins, 1989; Watkins & Dayan, 1992; Moore & Atkeson, 1993; Bertsekas & Tsitsiklis, 1996]. DS runs and evaluates policies directly, possibly building new policy candidates from those with the highest evaluations observed so far. DS methods include variants of stochastic hill-climbing (SHC), evolutionary strategies [Rechenberg, 1971; Schwefel, 1974], genetic algorithms (GAs) [Holland, 1975], genetic programming (GP) [Cramer, 1985; Banzhaf et al., 1998], Levin Search [Levin, 1973; Levin, 1984], and adaptive extensions of Levin Search [Solomonoff, 1986; Schmidhuber et al., 1997b].
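To make the distinction concrete, here is a minimal sketch of DS via stochastic hill-climbing. The toy task, function names, and parameter values are illustrative assumptions, not taken from this paper: the policy is a bit string, its "return" is simply the number of 1-bits, and a candidate mutant replaces the current policy only if its directly measured evaluation is at least as good. No value function or DP-style backup is involved.

```python
import random

def fitness(policy):
    # Stand-in for a measured policy evaluation (here: count of 1-bits).
    return sum(policy)

def stochastic_hill_climb(n_bits=20, steps=2000, seed=0):
    """Direct search in policy space by stochastic hill-climbing:
    mutate the current policy, keep the mutant only if its directly
    observed evaluation is at least as good as the best so far."""
    rng = random.Random(seed)
    policy = [rng.randint(0, 1) for _ in range(n_bits)]
    best = fitness(policy)
    for _ in range(steps):
        candidate = policy[:]
        candidate[rng.randrange(n_bits)] ^= 1   # flip one decision (mutation)
        f = fitness(candidate)
        if f >= best:                           # evaluate the policy directly
            policy, best = candidate, f
    return policy, best

policy, score = stochastic_hill_climb()
print(score)
```

On this trivial deterministic task SHC quickly reaches the optimum; the paper's later sections concern exactly the settings (stochastic environments, stochastic policies, delayed effects) where such naive policy evaluation becomes unreliable.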
Outline. DS offers several advantages over DPRL, but also has some drawbacks. I will list advantages first (section 2), then describe an illustrative task unsolvable by DPRL but trivially solvable by DS (section 3), then mention a few theoretical results concerning DS in general search spaces (section 4), then point out a major problem of DS (section 5), and offer a remedy (section 6 and section 7).