**POMDP specification.**
System life is separable into ``trials''.
A trial consists of at most $T_{max}$ discrete time steps
$t = 1, 2, \ldots, T$, where
$T < T_{max}$ if the agent solves the problem
in fewer than $T_{max}$ time steps.
A POMDP is specified by
$\langle S, s_0, O, f, A, R, \gamma, P \rangle$,
where $S$ is a finite
set of environmental states,
$s_0 \in S$ is the initial state,
$O$ is a finite set of observations,
the function $f: S \rightarrow O$
is a many-to-one mapping of
states to (ambiguous) observations,
$A$ is a finite set of actions,
$R: S \times A \rightarrow \mathbb{R}$ maps state-action pairs to scalar
reinforcement signals,
$\gamma \in [0,1]$ is a discount factor which trades off
immediate rewards against future rewards, and
$P$ is a state transition function,
where $P(s' \mid s, a)$
denotes the probability of transition to state $s'$
given $s_t = s$ and $a_t = a$, where $s_t$ is the environmental state at time $t$,
and $a_t$ is the action executed at time $t$.
The system's goal is to obtain maximal
(discounted) cumulative reinforcement during the trial.
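As a concrete illustration, the tuple above can be encoded directly as a data structure. The following is a minimal Python sketch (the class and attribute names are our own hypothetical choices, not part of the specification); `step` samples a successor state from $P$ and returns it together with its observation and the reward:

```python
import random

# Hypothetical encoding of the POMDP tuple <S, s0, O, f, A, R, gamma, P>.
class POMDP:
    def __init__(self, states, s0, observations, f, actions, R, gamma, P):
        self.states = states              # finite set S of environmental states
        self.s0 = s0                      # initial state
        self.observations = observations  # finite set O of observations
        self.f = f                        # many-to-one mapping f: S -> O
        self.actions = actions            # finite set A of actions
        self.R = R                        # reward function: (s, a) -> scalar
        self.gamma = gamma                # discount factor
        self.P = P                        # P[(s, a)] -> {s': probability}

    def step(self, s, a):
        """Sample a successor state; return it with its observation and reward."""
        dist = self.P[(s, a)]
        s_next = random.choices(list(dist), weights=list(dist.values()))[0]
        return s_next, self.f[s_next], self.R[(s, a)]
```

For a deterministic POMDP, each distribution in `P` simply puts probability 1 on a single successor state.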

**POMDPs as RPP sequences.**
The optimal policy of any deterministic POMDP with
final goal state is decomposable into
a finite sequence of optimal policies for appropriate RPPs,
along with subgoals determining transitions from one RPP to the next.
The trivial decomposition consists of single-state RPPs
and the corresponding subgoals.
In general, POMDPs whose only decomposition is the trivial one are
hard -- no efficient algorithm for solving them is known.
HQ, however, is aimed at situations that require few transitions
between RPPs. HQ's architecture implements such transitions by passing
control from one RPP-solving subagent to the next.

**Architecture.**
There is an ordered sequence of $M$ agents
$C_1$,
$C_2$,
...,
$C_M$,
each equipped with
a Q-table, an HQ-table, and a
control transfer unit, except for
$C_M$,
which only has a Q-table (see figure 1).
Each agent is responsible for learning part of the system's policy.
Its Q-table represents its local policy for executing an action given
an input. It is given by
a matrix of size
$|O| \times |A|$, where $|O|$ is the
number of different possible observations
and $|A|$ the number of possible actions.
$Q_i(o,a)$ denotes
$C_i$'s Q-value (utility) of
action $a$ given observation $o$.
The agent's current subgoal is generated with the help of
its HQ-table, a vector with $|O|$ elements.
For each possible observation there
is an HQ-table entry representing its estimated value as a subgoal.
$HQ_i(o)$ denotes
$C_i$'s HQ-value (utility) of
selecting $o$ as its subgoal.

The system's current policy is the policy of the
currently *active* agent.
If $C_i$
is active at time step $t$,
then we will denote this by
$Active(t) = i$.
The variable $Active(t)$ represents the only
kind of short-term memory in the system.
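The per-agent tables can be sketched as plain Python lists (a hypothetical layout, assuming observations and actions are indexed $0, \ldots, |O|-1$ and $0, \ldots, |A|-1$; all names are illustrative):

```python
# Sketch of the per-agent tables described above.
class Agent:
    def __init__(self, n_obs, n_actions):
        # Q-table: |O| x |A| matrix of action utilities, initialized to zero.
        self.Q = [[0.0] * n_actions for _ in range(n_obs)]
        # HQ-table: vector with |O| entries, one subgoal utility per observation.
        self.HQ = [0.0] * n_obs

# An HQ system is an ordered sequence of M agents; only the first M-1
# ever transfer control (the last agent, C_M, has no HQ-table in the paper,
# so its HQ entries simply go unused in this sketch).
M = 4
agents = [Agent(n_obs=8, n_actions=4) for _ in range(M)]
```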

**Architecture limitations.**
The sequential architecture
restricts the POMDP types HQ can solve.
To see this, consider the difference between
RL goals of (1) achievement and (2) maintenance.
The former refer to single-state goals (e.g., finding the exit of a maze),
the latter to maintaining a desirable state over time (such as
keeping a pole balanced).
Our current HQ-variant handles achievement goals only.
In the case of maintenance goals it will eventually
run out of agents -- there
must be an explicit final desired state
(this restriction may be overcome with different agent topologies
beyond the scope of this paper).

**Selecting a subgoal.**
In the beginning $C_1$ is
made active.
Once $C_i$
is active, its HQ-table is used to
select a subgoal for
$C_i$.
To explore different subgoal sequences
we use the Max-Random rule:
the subgoal with maximal $HQ_i$-value is selected with probability
$P_{max}$, a random subgoal is selected with probability $1 - P_{max}$.
Conflicts between
multiple subgoals with maximal $HQ_i$-values
are solved by randomly selecting one.
$\hat{o}_i$ denotes the subgoal selected by agent
$C_i$.
This subgoal is only used in transfer of control as defined below and
should not be confused with an observation.
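A Python sketch of the Max-Random rule (`p_max` and `hq` are illustrative names for $P_{max}$ and the active agent's HQ-table):

```python
import random

# Max-Random subgoal selection: with probability p_max pick a subgoal with
# maximal HQ-value (ties broken at random), otherwise pick one uniformly.
def select_subgoal(hq, p_max, rng=random):
    if rng.random() < p_max:
        best = max(hq)
        return rng.choice([o for o, v in enumerate(hq) if v == best])
    return rng.randrange(len(hq))  # explore: uniform random subgoal
```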

**Selecting an action.**
$C_i$'s action choice depends only
on the current observation $o_t$.
During learning, at time $t$,
the active agent $C_i$
will select actions
according to the Max-Boltzmann distribution:
with probability $P_{max}$
take the action with maximal $Q_i$-value;
with probability $1 - P_{max}$
select an action according to the traditional Boltzmann or Gibbs distribution, where
the probability of selecting action $a$ given observation $o_t$ is:

$$P(a \mid o_t) = \frac{e^{Q_i(o_t, a)/T_i}}{\sum_{a' \in A} e^{Q_i(o_t, a')/T_i}}$$

The ``temperature'' $T_i$ adjusts the degree of randomness involved in agent $C_i$'s action selection in case the Boltzmann rule is used. Conflicts between multiple actions with maximal Q-values are solved by randomly selecting one.
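The Max-Boltzmann rule can be sketched analogously (`q_row` stands for the active agent's Q-table row for the current observation; names are illustrative):

```python
import math
import random

# Max-Boltzmann action selection: exploit with probability p_max (ties broken
# at random), otherwise sample from the Boltzmann distribution at temperature T.
def select_action(q_row, p_max, temperature, rng=random):
    if rng.random() < p_max:
        best = max(q_row)
        return rng.choice([a for a, v in enumerate(q_row) if v == best])
    weights = [math.exp(q / temperature) for q in q_row]
    return rng.choices(range(len(q_row)), weights=weights)[0]
```

Lowering the temperature makes the Boltzmann branch increasingly greedy; raising it approaches uniform random action choice.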

**Transfer of control.**
Control is transferred from one active agent to the next
as follows.
Each time $C_i$
has executed an action,
its control transfer unit
checks whether $C_i$
has reached the goal.
If not, it checks whether $C_i$
has solved its subgoal to decide whether control should be passed on to
$C_{i+1}$.
We let $t_i$ denote the time at which
agent $C_i$
is made active
(at system start-up, we set
$t_1 = 1$).

IF no goal state reached AND current subgoal $\hat{o}_i = o_t$

AND $Active(t) = i$ AND $i < M$

THEN $Active(t+1) \leftarrow i+1$ AND $t_{i+1} \leftarrow t+1$

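Putting the pieces together, one trial with greedy action choice and this control-transfer rule might look as follows (a sketch assuming a deterministic environment given by `env_step` and Q-tables stored as plain $|O| \times |A|$ lists per agent; all names are illustrative):

```python
# One trial of the sequential HQ architecture (greedy actions, for brevity).
def run_trial(q_tables, subgoals, env_step, obs0, goal_obs, t_max):
    i = 0                              # Active(t): index of the active agent
    obs = obs0
    for t in range(t_max):
        if obs == goal_obs:
            return True, t             # final goal state reached
        row = q_tables[i][obs]
        action = row.index(max(row))   # greedy action for current observation
        obs = env_step(obs, action)
        # Transfer of control: subgoal solved and a successor agent exists.
        if obs == subgoals[i] and i < len(q_tables) - 1:
            i += 1                     # pass control to the next agent
    return False, t_max
```

Note that `i` (the active-agent index) is the system's only short-term memory across time steps, matching the architecture described above.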