- ...
time
- Note that even for the more limited DPRL algorithms, until very recently there
were no theorems guaranteeing finite convergence time [Kearns and Singh, 1999].
- ...

- Instead of using the expression "policy" for the conditional probability
distribution itself, we reserve it for the agent's modifiable data structure
POL (the sketch below illustrates the distinction).
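To make the distinction concrete, here is a minimal sketch (Python, with hypothetical names such as `Agent` and `perturb`; not the paper's actual code) in which POL is the agent's modifiable data structure, while the conditional probability distribution over actions given an input is merely derived from it:

```python
import math
import random

class Agent:
    """Minimal sketch (hypothetical names, not the paper's actual code):
    POL is the agent's modifiable data structure; the conditional probability
    distribution over actions given an input is merely derived from it."""

    def __init__(self, inputs, actions):
        self.actions = actions
        # POL: one modifiable real-valued preference per (input, action) pair.
        self.POL = {(x, a): 0.0 for x in inputs for a in actions}

    def action_distribution(self, x):
        """The conditional probability distribution P(a | x) induced by POL."""
        weights = [math.exp(self.POL[(x, a)]) for a in self.actions]
        total = sum(weights)
        return {a: w / total for a, w in zip(self.actions, weights)}

    def act(self, x):
        """Sample an action from the distribution derived from POL."""
        dist = self.action_distribution(x)
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def perturb(self, sigma=0.1):
        """Direct policy search modifies POL itself; the distribution above
        changes only as a consequence of changing POL."""
        for key in self.POL:
            self.POL[key] += random.gauss(0.0, sigma)
```

In this sketch a search step calls `perturb` on POL, and only thereby changes the induced conditional distribution returned by `action_distribution`.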
- ... 0.
-
The problem resembles another two-armed bandit problem for which there is an
optimal method due to Gittins [Gittins, 1989]. Our unknown reward delays,
however, prevent this method from being applicable: it cannot discover that the
current reward depends not on the current input but on an event in the past.
In addition, Gittins' method needs to discount future rewards relative to
immediate rewards. In any case, this footnote is beside the point of this
paper, whose focus is on direct policy search; Gittins' method is not a direct
one. (The toy example below illustrates the credit-assignment problem that
unknown delays create.)
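As a toy illustration of the first objection (this is not Gittins' algorithm, and the delay and arm values are made up), the simulation below generates the reward at step t from the arm pulled an unknown number of steps earlier; per-arm averages of immediate rewards, the kind of statistics a bandit allocation index is computed from, then no longer distinguish the arms:

```python
import random

# Toy illustration (not Gittins' algorithm): the reward observed at step t is
# generated by the arm pulled DELAY steps earlier, but is naively credited to
# the arm pulled at step t.

DELAY = 3                      # unknown to the learner
TRUE_VALUE = {0: 0.2, 1: 0.8}  # arm 1 is really better

random.seed(0)
history = []
credited = {0: [], 1: []}

for t in range(10000):
    arm = random.randrange(2)          # pull arms uniformly at random
    history.append(arm)
    if t >= DELAY:
        # Reward caused by the arm pulled DELAY steps ago ...
        r = 1.0 if random.random() < TRUE_VALUE[history[t - DELAY]] else 0.0
        # ... but credited to the arm pulled right now.
        credited[arm].append(r)

for arm in (0, 1):
    avg = sum(credited[arm]) / len(credited[arm])
    print(f"arm {arm}: naive immediate-reward average = {avg:.3f}")
# Both averages come out near 0.5, hiding the fact that arm 1 is better.
```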
- ...
take?
- In fact, population-based approaches will suffer even more than simple SHC
from unknown delays and stochasticity, simply because they need to test many
policies, not just one (see the sketch below).
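For concreteness, here is a minimal stochastic hill climbing (SHC) sketch over a real-valued policy parameter vector (Python; `evaluate` and its stand-in objective are hypothetical, not the paper's setup). SHC pays for one noisy policy evaluation per search step; a population-based method would pay the same price once per population member in every generation:

```python
import random

def evaluate(policy, n_rollouts=20):
    """Noisy estimate of a policy's value; each rollout stands in for one
    (stochastic, possibly delayed-reward) interaction with the environment."""
    true_value = -sum(p * p for p in policy)   # hypothetical stand-in objective
    return sum(true_value + random.gauss(0, 1) for _ in range(n_rollouts)) / n_rollouts

def shc(n_params=5, steps=200, sigma=0.3):
    """Simple stochastic hill climbing over the policy parameters."""
    current = [random.gauss(0, 1) for _ in range(n_params)]
    best_est = evaluate(current)
    evaluations = 1
    for _ in range(steps):
        candidate = [p + random.gauss(0, sigma) for p in current]
        cand_est = evaluate(candidate)         # SHC tests ONE new policy per step
        evaluations += 1
        if cand_est > best_est:
            current, best_est = candidate, cand_est
    # A population-based method with population size N would instead need N such
    # noisy evaluations per generation.
    return current, evaluations

if __name__ == "__main__":
    policy, n_eval = shc()
    print("policy found after", n_eval, "policy evaluations")
```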