Some DPRL variants are limited to a very special
kind of exponentially decaying short-term memory. Others simply ignore
memory issues by focusing on suboptimal, memory-free solutions to
problems whose optimal solutions do require some form of short-term
memory [Jaakkola et al., 1995]. Again others can in principle
find optimal solutions even in partially observable environments
(POEs) [Kaelbling et al., 1995; Littman et al., 1995], but they (a) are
practically limited to very small problems [Littman, 1996],
and (b) do require knowledge of a discrete state space model of
the environment. To various degrees, problem (b) also holds for
certain hierarchical RL approaches to memory-based input disambiguation
[Ring, 1991; Ring, 1993; Ring, 1994; McCallum, 1996; Wiering and Schmidhuber, 1998]. Although no
discrete models are necessary for DPRL systems with function approximators
based on recurrent neural networks [Schmidhuber, 1991c; Lin, 1993],
the latter do suffer from a lack of theoretical foundation, perhaps even
more so than the backgammon player.
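
For concreteness, the kind of exponentially decaying short-term memory mentioned at the start of this paragraph can be sketched as follows (the symbols $m_t$, $x_t$, and $\lambda$ are illustrative notation, not taken from the cited systems): the memory at time $t$ is
\[
m_t = \lambda\, m_{t-1} + x_t, \qquad 0 \le \lambda < 1 ,
\]
so an input $x_{t-k}$ contributes to $m_t$ only with weight $\lambda^k$ and is effectively forgotten after roughly $1/(1-\lambda)$ steps, whether or not it is still needed for selecting optimal actions.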
DS, however, does not depend at all on Markovian conditions
or full observability of the environment. While DPRL is
essentially limited to learning reactive policies mapping
current inputs to output actions, DS in principle
can be applied to search
spaces whose elements are general algorithms or programs with
time-varying variables that can be used for memory purposes
[Williams, 1992; Teller, 1994; Schmidhuber, 1995; Wiering and Schmidhuber, 1996; Sałustowicz and Schmidhuber, 1997].
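
To illustrate the distinction schematically (the two-observation toy environment and the names below are hypothetical, not taken from any of the cited systems), a reactive policy must respond identically to identical current inputs, whereas a program with a time-varying variable can use an earlier input to disambiguate the present one:

```python
def reactive_policy(observation):
    # DPRL-style reactive policy: the action depends only on the
    # current input, so identical inputs always yield identical actions.
    return 1 if observation == "cue" else 0

class PolicyProgram:
    # DS-style candidate solution: a program with a time-varying
    # variable that can serve as short-term memory.
    def __init__(self):
        self.memory = 0              # internal state, updated over time

    def act(self, observation):
        if observation == "cue":     # store information for later use
            self.memory = 1
            return 0
        if observation == "ambiguous":
            return self.memory       # an earlier input disambiguates this one
        return 0

# Two identical "ambiguous" inputs: the reactive policy must repeat
# itself, the program with memory need not.
program = PolicyProgram()
for obs in ["ambiguous", "cue", "ambiguous"]:
    print(obs, reactive_policy(obs), program.act(obs))
```

On the two identical "ambiguous" inputs the reactive policy necessarily repeats its action, while the program that stored the intervening cue does not; DS methods can search this richer space of candidate solutions directly.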