Next: LEARNING RULES Up: HQ-Learning Adaptive Behavior 6(2):219-246, Previous: INTRODUCTION

# HQ-LEARNING

POMDP specification. System life is separable into "trials". A trial consists of at most $T_{max}$ discrete time steps $t = 1, 2, \ldots, T$, where $T < T_{max}$ if the agent solves the problem in fewer than $T_{max}$ time steps. A POMDP is specified by a tuple $\langle S, s_1, O, f, A, R, \gamma, P \rangle$, where $S$ is a finite set of environmental states, $s_1 \in S$ is the initial state, $O$ is a finite set of observations, the function $f: S \rightarrow O$ is a many-to-one mapping of states to (ambiguous) observations, $A$ is a finite set of actions, $R: S \times A \rightarrow \mathbb{R}$ maps state-action pairs to scalar reinforcement signals, $\gamma \in [0,1]$ is a discount factor which trades off immediate rewards against future rewards, and $P$ is a state transition function, where $P(s' \mid s_t, a_t)$ denotes the probability of transition to state $s'$ given $s_t$ and $a_t$, where $s_t \in S$ is the environmental state at time $t$, and $a_t \in A$ is the action executed at time $t$. The system's goal is to obtain maximal (discounted) cumulative reinforcement during the trial.
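The tuple above can be sketched in code. The following is a minimal illustration of the formalism with a toy two-state environment; the environment, state names, and reward values are invented for illustration and are not from the paper.

```python
import random

# Sketch of the POMDP tuple <S, s1, O, f, A, R, gamma, P>.
# The toy two-state environment below is purely illustrative.

S = ["s_left", "s_right"]      # finite state set S
s1 = "s_left"                  # initial state s1
O = ["o_ambiguous"]            # observations: both states look alike
A = ["go_left", "go_right"]    # finite action set A
gamma = 0.9                    # discount factor

def f(s):                      # many-to-one observation function f: S -> O
    return "o_ambiguous"

def R(s, a):                   # reward function R: S x A -> reals
    return 1.0 if (s == "s_right" and a == "go_right") else 0.0

def P(s, a):                   # deterministic transition (a degenerate P)
    return "s_right" if a == "go_right" else "s_left"

def run_trial(policy, T_max=10):
    """Roll out one trial; return discounted cumulative reinforcement."""
    s, total = s1, 0.0
    for t in range(T_max):
        a = policy(f(s))
        total += (gamma ** t) * R(s, a)
        s = P(s, a)
    return total

# A policy that always goes right collects reward from t = 1 onward.
ret = run_trial(lambda o: "go_right")
```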

POMDPs as RPP sequences. The optimal policy of any deterministic POMDP with final goal state is decomposable into a finite sequence of optimal policies for appropriate RPPs, along with subgoals determining transitions from one RPP to the next. The trivial decomposition consists of single-state RPPs and the corresponding subgoals. In general, POMDPs whose only decomposition is trivial are hard -- there is no efficient algorithm for solving them. HQ, however, is aimed at situations that require few transitions between RPPs. HQ's architecture implements such transitions by passing control from one RPP-solving subagent to the next.

Architecture. There is an ordered sequence of agents $C_1$, $C_2$, ..., $C_M$, each equipped with a Q-table, an HQ-table, and a control transfer unit, except for $C_M$, which only has a Q-table (see figure 1). Each agent is responsible for learning part of the system's policy. Its Q-table represents its local policy for executing an action given an input. It is given by a matrix of size $|O| \times |A|$, where $|O|$ is the number of different possible observations and $|A|$ the number of possible actions. $Q_i(o, a)$ denotes $C_i$'s Q-value (utility) of action $a$ given observation $o$. The agent's current subgoal is generated with the help of its HQ-table, a vector with $|O|$ elements. For each possible observation there is an HQ-table entry representing its estimated value as a subgoal. $HQ_i(o)$ denotes $C_i$'s HQ-value (utility) of selecting $o$ as its subgoal.
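The per-agent data structures can be sketched as follows. The class name, table sizes, and initialization to zero are illustrative assumptions, not details from the paper.

```python
# Sketch of the HQ architecture's per-agent tables.
# Sizes |O| = 4, |A| = 3 and zero-initialization are illustrative choices.

N_OBS, N_ACT = 4, 3   # |O| and |A|

class Agent:
    """One subagent C_i: a Q-table plus an HQ-table; the last
    agent C_M carries no HQ-table (it needs no subgoal)."""
    def __init__(self, is_last=False):
        # Q-table: |O| x |A| matrix of Q_i(o, a) values
        self.Q = [[0.0] * N_ACT for _ in range(N_OBS)]
        # HQ-table: vector with |O| entries HQ_i(o); None for C_M
        self.HQ = None if is_last else [0.0] * N_OBS

M = 3  # number of agents (illustrative)
agents = [Agent(is_last=(i == M - 1)) for i in range(M)]
```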

The system's current policy is the policy of the currently active agent. If $C_i$ is active at time step $t$, then we will denote this by $\mathit{active}(t) = i$. The variable $\mathit{active}(t)$ represents the only kind of short-term memory in the system.

Architecture limitations. The sequential architecture restricts the POMDP types HQ can solve. To see this, consider the difference between RL goals of (1) achievement and (2) maintenance. The former refer to single-state goals (e.g., find the exit of a maze), the latter to maintaining a desirable state over time (such as keeping a pole balanced). Our current HQ-variant handles achievement goals only. In the case of maintenance goals it will eventually run out of agents -- there must be an explicit final desired state (this restriction may be overcome with different agent topologies, which are beyond the scope of this paper).

Selecting a subgoal. In the beginning $C_1$ is made active. Once $C_i$ is active, its HQ-table is used to select a subgoal for $C_i$. To explore different subgoal sequences we use the Max-Random rule: the subgoal with maximal $HQ_i$-value is selected with probability $p_{max}$, and a random subgoal is selected with probability $1 - p_{max}$. Conflicts between multiple subgoals with maximal $HQ_i$-values are solved by randomly selecting one. $\hat{o}_i$ denotes the subgoal selected by agent $C_i$. This subgoal is only used in the transfer of control as defined below and should not be confused with an observation.
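The Max-Random rule can be sketched as below. The HQ-table values and the $p_{max}$ setting are illustrative assumptions; subgoals are indexed by observation here.

```python
import random

# Sketch of the Max-Random subgoal selection rule.
# The HQ-table values and p_max below are illustrative, not from the paper.

def max_random_select(hq_table, p_max, rng=random):
    """With probability p_max pick a subgoal with maximal HQ-value
    (ties broken randomly); otherwise pick a uniformly random subgoal."""
    if rng.random() < p_max:
        best = max(hq_table)
        candidates = [o for o, v in enumerate(hq_table) if v == best]
        return rng.choice(candidates)
    return rng.randrange(len(hq_table))

hq = [0.1, 0.9, 0.9, 0.2]
choice = max_random_select(hq, p_max=1.0)  # greedy; tie between 1 and 2
```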

Selecting an action. $C_i$'s action choice depends only on the current observation $o_t$. During learning, at time $t$, the active agent $C_i$ will select actions according to the Max-Boltzmann distribution: with probability $p_{max}$ take the action with maximal $Q_i(o_t, a)$-value; with probability $1 - p_{max}$ select an action according to the traditional Boltzmann or Gibbs distribution, where the probability of selecting action $a$ given observation $o_t$ is:

$$P(a \mid o_t) = \frac{e^{Q_i(o_t, a)/T_i}}{\sum_{a' \in A} e^{Q_i(o_t, a')/T_i}}$$

The "temperature" $T_i$ adjusts the degree of randomness involved in agent $C_i$'s action selection in case the Boltzmann rule is used. Conflicts between multiple actions with maximal Q-values are solved by randomly selecting one.
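Max-Boltzmann selection can be sketched as follows. The Q-row, temperature, and $p_{max}$ values are illustrative assumptions for a single observation's row of the Q-table.

```python
import math
import random

# Sketch of Max-Boltzmann action selection for one observation's Q-row.
# The Q-values, temperature, and p_max below are illustrative.

def max_boltzmann_action(q_row, p_max, temperature, rng=random):
    """With probability p_max act greedily (ties broken randomly);
    otherwise sample from the Boltzmann distribution over Q-values."""
    if rng.random() < p_max:
        best = max(q_row)
        return rng.choice([a for a, q in enumerate(q_row) if q == best])
    weights = [math.exp(q / temperature) for q in q_row]
    r, acc = rng.random() * sum(weights), 0.0
    for a, w in enumerate(weights):
        acc += w
        if r <= acc:
            return a
    return len(q_row) - 1  # guard against floating-point round-off

q_row = [0.0, 2.0, 1.0]
greedy = max_boltzmann_action(q_row, p_max=1.0, temperature=1.0)
```

A high temperature makes the Boltzmann branch nearly uniform; a low temperature concentrates it on the maximal Q-value.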

Transfer of control. Control is transferred from one active agent to the next as follows. Each time $C_i$ has executed an action, its control transfer unit checks whether the system has reached the goal. If not, it checks whether $C_i$ has solved its subgoal to decide whether control should be passed on to $C_{i+1}$. We let $t_i$ denote the time at which agent $C_i$ is made active (at system start-up, we set $t_1 = 1$).

IF no goal state reached AND current observation = current subgoal ($o_t = \hat{o}_i$)
AND $i < M$ AND $t < T_{max}$
THEN $\mathit{active}(t+1) = i + 1$ AND $t_{i+1} = t + 1$
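The transfer rule can be sketched as a function of the active-agent pointer. The variable names below are illustrative (agents are 0-indexed here, whereas the text indexes them from 1).

```python
# Sketch of the control transfer rule. Variable names are illustrative;
# agents are 0-indexed here (the text uses 1-based indices).

def transfer_control(active, o_t, subgoals, t, M, T_max, goal_reached,
                     start_times):
    """If the goal is not yet reached, the active agent's subgoal is
    observed, and it is not the last agent, pass control onward."""
    i = active
    if (not goal_reached) and o_t == subgoals[i] \
            and i < M - 1 and t < T_max:
        start_times[i + 1] = t + 1   # t_{i+1} = t + 1
        return i + 1                 # active(t+1) = i + 1
    return i                         # otherwise C_i stays in control

start_times = {0: 1}
nxt = transfer_control(active=0, o_t="door", subgoals=["door", "exit"],
                       t=5, M=2, T_max=100, goal_reached=False,
                       start_times=start_times)
```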


Juergen Schmidhuber 2003-02-24
