Overview. As mentioned above, DS in policy space does not require certain assumptions about the environment necessary for traditional RL. For instance, while the latter is essentially limited to memory-free, reactive mappings from inputs to actions, DS can be used to search among fairly arbitrary, complex, event-memorizing programs (using memory to deal with partial observability), at least in simulated, deterministic worlds. In realistic settings, however, reliable policy evaluations are complicated by (a) unknown delays between action sequences and observable effects, and (b) stochasticity in policy and environment. Given a limited life-time, how much time should a direct policy searcher spend on policy evaluations to obtain reliable statistics? Despite the fundamental nature of this question, it has not received much attention yet. Here I evaluate an efficient approach based on the success-story algorithm (SSA). It provides a radical answer prepared for the worst case: it never stops evaluating any previous policy modification, except those it undoes for lack of empirical evidence that they have contributed to lifelong reward accelerations. I identify SSA's fundamental advantages over traditional DS on problems involving unknown delays and stochasticity.
The problem. In realistic situations DS exhibits several fundamental drawbacks: (1) Often there are unknown temporal delays between causes and effects. In general we cannot be sure that a given trial was long enough to observe all long-term rewards/punishments caused by actions executed during the trial. We do not know in advance how long a trial should take. (2) The policy may be stochastic, i.e., the learner's actions are selected nondeterministically according to probability distributions conditioned on the policy. Stochastic policies are widely used to prevent learners from getting stuck. Results of policy evaluations, however, will then vary from trial to trial. (3) Environment and reward may be stochastic, too. And even if the environment is deterministic it may appear stochastic from an individual learner's perspective, due to partial observability.
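Drawback (2) above is easy to see in a toy simulation (my own illustration, not an experiment from this paper): evaluating one and the same stochastic policy several times in one and the same stochastic environment yields a different return on every trial, so a single evaluation tells us little about the policy's true quality.

```python
import random

def stochastic_policy(rng):
    """A fixed stochastic policy: pick action 0 with prob 0.7, else action 1.
    (Hypothetical toy policy, introduced only for this illustration.)"""
    return 0 if rng.random() < 0.7 else 1

def run_trial(rng, steps=100):
    """One trial in a toy stochastic environment: action 0 pays 1 with
    prob 0.5, action 1 pays 2 with prob 0.2. Returns the total reward."""
    total = 0.0
    for _ in range(steps):
        if stochastic_policy(rng) == 0:
            total += 1.0 if rng.random() < 0.5 else 0.0
        else:
            total += 2.0 if rng.random() < 0.2 else 0.0
    return total

rng = random.Random(0)
returns = [run_trial(rng) for _ in range(5)]
print(returns)  # five evaluations of the *same* policy give different results
```

Neither the policy nor the environment changes between trials, yet the five returns differ; any direct method that ranks policies by single-trial returns will rank unreliably.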
Time is a scarce resource. Hence all direct methods face a central question: to determine whether some policy is really useful in the long run or just appears to be (e.g., because the current trial was too short to encounter a punishing state, or because it was a lucky trial), how much time should the learner spend on its evaluation? In particular, how much time should a single trial with a given policy take, and how many trials are necessary to obtain statistically significant results without wasting too much time? Despite the fundamental nature of these questions, not much work has been done to address them.
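A rough sense of the cost is standard (this sketch is my own illustration, not the paper's method): the standard error of a mean return estimated from n noisy trials shrinks only like 1/sqrt(n), so distinguishing two policies of similar quality can consume many trials.

```python
import math
import random

def mean_and_stderr(returns):
    """Sample mean and standard error of the mean for a list of trial returns."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

# Hypothetical noisy evaluations: true mean return 10, per-trial noise sd 5.
rng = random.Random(1)
noisy = [10.0 + rng.gauss(0.0, 5.0) for _ in range(25)]
mean, se = mean_and_stderr(noisy)
print(round(mean, 2), round(se, 2))  # with sd 5 and n = 25, stderr is near 1
```

Halving that standard error requires four times as many trials; a learner with a limited life-time cannot afford this for every candidate policy, which motivates the approach below.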
Basic idea. There is a radical answer to such questions: evaluate a previous policy change at any stage of the search process by looking at the entire time interval that has gone by since the change occurred -- at any given time, aim to use all the available empirical data concerning long-term policy-dependent rewards! A change is considered ``good'' as long as the average reward per time since its creation exceeds the corresponding ratios for previous ``good'' changes. Changes that eventually turn out to be ``bad'' get undone by an efficient backtracking scheme called the success-story algorithm (SSA). SSA always takes into account the latest information available about long-term effects of changes that have appeared ``good'' so far (``bad'' changes, however, are not considered again). Effectively SSA adjusts trial lengths retrospectively: at any given time, trial starts are determined by the occurrences of the remaining ``good'' changes representing a success story. The longer the time interval that went by since some ``good'' change, the more reliable the evaluation of its true long-term benefits. No trial of a ``good'' change ever ends unless it turns out to be ``bad'' at some point. Thus SSA is prepared for the worst case. For instance, no matter what the maximal time lag between actions and consequences in a given domain is, SSA's effective trials will eventually encompass it.
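The backtracking test can be sketched as follows. This is a minimal reading of the idea above, not the paper's actual implementation: I assume a stack of surviving policy changes, each entry storing its creation time and an `undo` closure, plus a function R(t) giving cumulative reward up to time t. The success-story criterion requires that reward per time strictly increases along the sequence of surviving changes; at a checkpoint, changes are undone from the top of the stack until the criterion holds.

```python
def ssc_holds(times, t, R):
    """Success-story criterion: reward/time ratios strictly increase along
    lifetime start, v1, ..., vk (creation times of surviving changes).
    Assumes t is strictly larger than every entry of `times`."""
    ratios = [R(t) / t] + [(R(t) - R(v)) / (t - v) for v in times]
    return all(a < b for a, b in zip(ratios, ratios[1:]))

def ssa_checkpoint(stack, t, R):
    """Undo the most recent surviving changes until the criterion holds
    (trivially true for an empty stack)."""
    while stack and not ssc_holds([e["time"] for e in stack], t, R):
        stack.pop()["undo"]()  # change looks "bad": undo it, never reconsider

# Hypothetical scenario: reward accrues slowly before time 50, quickly in
# [50, 80), slowly again after 80. The change at time 50 should survive;
# the change at time 80 should be undone.
def R(t):
    r = 0.1 * min(t, 50)
    if t > 50:
        r += 1.0 * (min(t, 80) - 50)
    if t > 80:
        r += 0.2 * (t - 80)
    return r

undone = []
stack = [
    {"time": 50, "undo": lambda: undone.append("change@50")},
    {"time": 80, "undo": lambda: undone.append("change@80")},
]
ssa_checkpoint(stack, t=100, R=R)
print([e["time"] for e in stack], undone)  # change@80 is undone, change@50 survives
```

Note how the surviving entry at time 50 is still being evaluated at time 100: its effective trial never ends, and as t grows, its reward-per-time ratio reflects ever more long-term evidence, including delayed consequences.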
What's next. The next section will explain SSA in detail and show how most traditional direct methods can be augmented by SSA. Then we will experimentally verify SSA's advantages over traditional direct methods in the case of noisy performance evaluations and unknown delays.