In the comparatively simple case considered here, the controller $C$ is a standard back-propagation network. There are discrete time steps: each fovea trajectory $p$ involves discrete time steps $1, 2, \ldots, n_p$.
At time step $t$ of trajectory $p$, $C$'s input is the real-valued vector $x_p(t)$, which is determined by sensory perceptions from the artificial `fovea'. $C$'s output at time step $t$ of trajectory $p$ is the vector $c_p(t)$. At each time step, motoric actions like `move fovea left' or `rotate fovea' are based on $c_p(t)$. The actions cause a new input $x_p(t+1)$.
The final desired input $d_p$ of trajectory $p$ is a predefined activation pattern corresponding to the target to be found in a static visual scene. The task is to sequentially generate fovea trajectories such that for each trajectory $p$ the final input $x_p(n_p)$ matches $d_p$.
The final input error $E_p$ at the end of trajectory $p$ (externally interrupted at time step $n_p$) is
$$E_p = \frac{1}{2}\,\bigl(d_p - x_p(n_p)\bigr)^T \bigl(d_p - x_p(n_p)\bigr).$$
Thus $E_p$ is determined by the differences between the desired final inputs and the actual final inputs.
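For concreteness, the following sketch rolls out one fovea trajectory and computes $E_p$. The dimensions, the trajectory length, the linear controller, and the `scene' function standing in for the real fovea/scene response are illustrative assumptions, not part of the method:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, c_dim, n_p = 4, 2, 6          # assumed sizes and trajectory length

W_C = 0.1 * rng.standard_normal((c_dim, x_dim))          # controller weights (linear C for brevity)
A   = 0.1 * rng.standard_normal((x_dim, x_dim + c_dim))  # stand-in for the unknown scene dynamics

def controller(x):                    # C: fovea input x_p(t) -> motor command c_p(t)
    return W_C @ x

def scene(x, c):                      # placeholder for the environment's response (new fovea input)
    return np.tanh(A @ np.concatenate([x, c]))

d = rng.standard_normal(x_dim)        # desired final input d_p (target activation pattern)
x = rng.standard_normal(x_dim)        # initial fovea input x_p(1)
xs, cs = [x], []                      # record actual inputs and actions along the trajectory
for t in range(1, n_p):
    c = controller(x)
    cs.append(c)
    x = scene(x, c)                   # the action causes the new input x_p(t+1)
    xs.append(x)

E_p = 0.5 * np.sum((d - x) ** 2)      # final input error E_p at time step n_p
```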
In order to allow credit assignment to past output actions of the control network, we first train the model network $M$ (another standard back-propagation network) to emulate the visible environmental dynamics. This is done by training $M$ at a given time to predict $C$'s next input, given the previous input and output of $C$.
The following discussion refers to the case where both $C$ and $M$ learn in parallel. In some of the experiments below we use two separate training phases for $M$ and $C$. However, the modifications are straightforward and mainly notational.
$M$'s input vector at time $t$ of trajectory $p$ is the concatenation of $x_p(t)$ and $c_p(t)$. $M$'s real-valued output vector at time $t$ of trajectory $p$ is $\hat{x}_p(t+1) \in \mathbb{R}^m$, where $m$ is the dimension of $x_p(t)$. (Thus $M$ has as many output units as there are input units for $C$.) $\hat{x}_p(t+1)$ is $M$'s prediction of $x_p(t+1)$.
The error of $M$'s prediction at time $t$ of trajectory $p$ is
$$E_M^p(t) = \frac{1}{2}\,\bigl(x_p(t+1) - \hat{x}_p(t+1)\bigr)^T \bigl(x_p(t+1) - \hat{x}_p(t+1)\bigr).$$
$M$'s goal is to minimize $\sum_{p,t} E_M^p(t)$, which is done by conventional back-propagation [17][7][4][9]:
$$\Delta W_M = -\alpha_M \frac{\partial E_M^p(t)}{\partial W_M},$$
where $W_M$ is $M$'s weight vector, $\Delta W_M$ is its increment, and $\alpha_M$ is $M$'s learning rate.
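A minimal sketch of one such training step for $M$, assuming a linear model, illustrative dimensions, and an illustrative learning rate (none of these choices are prescribed by the method):

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, c_dim = 4, 2
W_M = 0.1 * rng.standard_normal((x_dim, x_dim + c_dim))  # model weights (linear M for brevity)
alpha_M = 0.05                                           # assumed learning rate for M

# one recorded transition: previous input x_t, controller output c_t, observed next input x_next
x_t    = rng.standard_normal(x_dim)
c_t    = rng.standard_normal(c_dim)
x_next = rng.standard_normal(x_dim)

m_in   = np.concatenate([x_t, c_t])   # M's input: concatenation of x_p(t) and c_p(t)
x_pred = W_M @ m_in                   # M's output: prediction of x_p(t+1)

err  = x_pred - x_next                # prediction residual
E_M  = 0.5 * err @ err                # E_M^p(t)
grad = np.outer(err, m_in)            # dE_M^p(t)/dW_M for the linear model
W_M -= alpha_M * grad                 # gradient-descent step on M's weights only
```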
$C$'s training phase is more complex than $M$'s. It is assumed that $E_p$ is a differentiable function of $W_C$, where $W_C$ is $C$'s weight vector. To approximate $\frac{\partial E_p}{\partial W_C}$, the final input error is propagated back through the model network into the controller (by the `unfolding in time' procedure described below), and $C$'s weights are changed according to
$$\Delta W_C = -\alpha_C \frac{\partial E_p}{\partial W_C}.$$
Here $\Delta W_C$ is $W_C$'s increment caused by the back-propagation procedure, and $\alpha_C$ is the learning rate of the controller.
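Spelled out with the chain rule (a restatement under the notation above, not an additional assumption of the method), the gradient decomposes into one contribution per action taken along the trajectory:
$$\frac{\partial E_p}{\partial W_C} \;=\; \sum_{t=1}^{n_p-1} \frac{\partial E_p}{\partial c_p(t)}\,\frac{\partial c_p(t)}{\partial W_C}.$$
The factors $\frac{\partial E_p}{\partial c_p(t)}$ involve the unknown environmental dynamics; they are approximated by propagating the difference $d_p - x_p(n_p)$ back through the copies of $M$, which is what the `unfolding in time' procedure below does.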
Note that the differences between the desired final inputs $d_p$ and the actual final inputs $x_p(n_p)$ at the end of each trajectory are used for computing error signals for the controller. We do not use the differences between desired final inputs and predicted final inputs $\hat{x}_p(n_p)$.
To apply the `unfolding in time' algorithm [9][18] to the recurrent combination of $C$ and $M$, do the following.

For all trajectories $p$:
1. During the activation spreading phase of trajectory $p$, create for each time step $t$ of $p$ a copy of $C$ (called $C_t$) and a copy of $M$ (called $M_t$).
2. Construct a large `unfolded' feed-forward back-propagation network consisting of the sub-modules $C_t$ and $M_t$ by doing the following:
2.a) For $t > 1$: replace each input unit of $C_t$ by the unit in $M_{t-1}$ which predicted that unit's activation.
2.b) For all $t$: replace each input unit of $M_t$ whose activation was provided by an output unit of $C_t$ by that output unit.
3. Propagate the difference $d_p - x_p(n_p)$ back through the entire `unfolded' network constructed in step 2. Change each weight of $C$ in proportion to the sum of the partial derivatives computed for the corresponding connection copies in the unfolded network. Do not change the weights of $M$.
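The sketch below spells out steps 1 to 3 for linear $C$ and $M$. The dimensions, learning rate, stand-in trajectory data, and the linear parameterization are assumptions of the sketch; the accumulator grad_W_C corresponds to the per-weight variable discussed in the next paragraph, since all copies $C_t$ share the controller's weights.

```python
import numpy as np

rng = np.random.default_rng(2)
x_dim, c_dim, n_p = 4, 2, 6
alpha_C = 0.05                                            # controller learning rate (assumed)

W_C = 0.1 * rng.standard_normal((c_dim, x_dim))           # controller weights (shared by all copies C_t)
W_M = 0.1 * rng.standard_normal((x_dim, x_dim + c_dim))   # model weights (left unchanged in this phase)

# recorded actual trajectory inputs x_p(1..n_p) and desired final input d_p (stand-in data)
xs = [rng.standard_normal(x_dim) for _ in range(n_p)]
d  = rng.standard_normal(x_dim)

# error signal at M_{n_p-1}'s output units, which predicted x_p(n_p):
# dE_p/dx_p(n_p), i.e. the negated difference d_p - x_p(n_p) from step 3
delta_out = xs[-1] - d
grad_W_C  = np.zeros_like(W_C)            # one accumulator per controller weight

for t in range(n_p - 1, 0, -1):           # walk back through the unfolded network
    # back through copy M_t; step 2.b connected its c-inputs to C_t's output units,
    # while its x-inputs were provided by the environment and are leaves here
    delta_in = W_M.T @ delta_out
    delta_c  = delta_in[x_dim:]           # error arriving at C_t's output units
    # weight-gradient contribution of copy C_t, using the recorded activation x_p(t)
    grad_W_C += np.outer(delta_c, xs[t - 1])
    # back through C_t into its input units, i.e. M_{t-1}'s output units (step 2.a)
    delta_out = W_C.T @ delta_c

W_C -= alpha_C * grad_W_C                 # change C's weights; W_M is not changed
```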
Since the weights remain constant during the activation spreading phase of one trajectory, the practical algorithm used in the experiments does not really create copies of the weights. It is more efficient to introduce one additional variable for each controller weight; this variable is used for accumulating the corresponding sum of weight changes.
During trajectory execution, it is convenient to push the time-varying activations of the units in $C$ and $M$ onto stacks of activations, one for each unit. During the back-propagation phase these activations can be successively popped off for the computation of error signals.
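A minimal sketch of such per-unit activation stacks, using plain Python lists and placeholder activation values (the dimension and the recorded inputs are illustrative):

```python
# one stack per unit; shown here only for C's input units (assumed dimension 4)
x_dim = 4
stacks = [[] for _ in range(x_dim)]

# forward phase: after each time step, push every unit's current activation
for x_t in ([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]):   # two recorded fovea inputs
    for i, v in enumerate(x_t):
        stacks[i].append(v)

# backward phase: pop the activations in reverse temporal order
while stacks[0]:
    x_t = [s.pop() for s in stacks]       # activations needed for this step's error signals
    print(x_t)
```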