In the comparatively simple case considered here,
the controller $C$ is a standard back-propagation network.
Time is discrete: each fovea trajectory $p$ involves discrete
time steps $1, \ldots, t_p$.
At time step $t$ of trajectory $p$, $C$'s input is
the real-valued vector $x^p(t)$,
which is determined by sensory perceptions from
the artificial `fovea'.
$C$'s output at time step $t$ of
trajectory $p$ is the vector
$c^p(t)$.
At each time step, motoric actions like
`move fovea left' or `rotate fovea' are based
on $c^p(t)$. The actions cause a new input $x^p(t+1)$.
The final desired input $d^p$ of trajectory $p$
is a
predefined activation pattern
corresponding to the target to be found in
a static visual scene.
The task is to sequentially
generate fovea trajectories such that for each trajectory $p$
the final input $x^p(t_p)$ matches $d^p$.
The *final input error* at
the end of trajectory $p$ (externally interrupted
at time step $t_p$) is

$$E^p = \frac{1}{2} \left( d^p - x^p(t_p) \right)^T \left( d^p - x^p(t_p) \right).$$

Thus $E^p$ is determined by the *differences between the
desired final inputs and the actual final inputs*.
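
As a minimal sketch (in JAX, with hypothetical names and shapes), the final input error of a single trajectory could be computed like this:

```python
import jax.numpy as jnp

def final_input_error(d_p, x_final):
    """E^p = 1/2 (d^p - x^p(t_p))^T (d^p - x^p(t_p)) for one trajectory p.

    d_p     : desired final fovea input (the target activation pattern)
    x_final : actual fovea input when the trajectory is interrupted at t_p
    """
    diff = d_p - x_final
    return 0.5 * diff @ diff
```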

In order to allow credit assignment to past output actions of the control network $C$, we first train the model network $M$ (another standard back-propagation network) to emulate the visible environmental dynamics. This is done by training $M$ at a given time to predict $C$'s next input, given the previous input and output of $C$. The following discussion refers to the case where both $M$ and $C$ learn in parallel. In some of the experiments below we use two separate training phases for $M$ and $C$; however, the modifications are straightforward and mainly notational.

$M$'s input vector at time $t$ of trajectory $p$ is the
concatenation of $x^p(t)$ and $c^p(t)$.
$M$'s real-valued output vector at time $t$ of trajectory $p$ is $m^p(t)$,
where
$\dim(m^p(t)) = \dim(x^p(t))$. (Here $\dim(v)$ is the
dimension of a vector $v$; $M$ has as many
output units as there are input units for $C$.) $m^p(t)$ is $M$'s
prediction of $x^p(t+1)$.
The error of $M$'s prediction at time $t$ of trajectory $p$ is

$$E_M^p(t) = \frac{1}{2} \left( x^p(t+1) - m^p(t) \right)^T \left( x^p(t+1) - m^p(t) \right).$$
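
A hedged sketch of the above (JAX; the one-hidden-layer architecture and the dictionary layout of `w_m` are assumptions, since the text only requires a standard back-propagation network):

```python
import jax
import jax.numpy as jnp

def m_forward(w_m, x_t, c_t):
    """m^p(t): M's prediction of x^p(t+1), computed from the
    concatenation of C's input x^p(t) and C's output c^p(t)."""
    h = jnp.tanh(jnp.concatenate([x_t, c_t]) @ w_m["W1"] + w_m["b1"])
    return h @ w_m["W2"] + w_m["b2"]  # as many outputs as C has input units

def m_prediction_error(w_m, x_t, c_t, x_next):
    """Prediction error 1/2 ||x^p(t+1) - m^p(t)||^2 at time t of trajectory p."""
    diff = x_next - m_forward(w_m, x_t, c_t)
    return 0.5 * diff @ diff
```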
$M$'s goal is to minimize
$E_M = \sum_{p,t} E_M^p(t)$, which is done
by conventional back-propagation
[17][7][4][9]:

$$\Delta W_M = -\alpha_M \frac{\partial E_M}{\partial W_M}.$$

Here $W_M$ is $M$'s weight vector, $\Delta W_M$ its change caused by the back-propagation procedure, and $\alpha_M$ is $M$'s constant learning rate. (In the experiments described below we will deviate from pure gradient descent by changing $M$'s weights after each time step of each trajectory.)
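
Continuing the sketch, one on-line gradient step for $M$, updating after each time step as the experiments below do rather than after summing over all $p$ and $t$:

```python
def m_step(w_m, x_t, c_t, x_next, alpha_m):
    """One term of Delta W_M = -alpha_M * dE_M/dW_M, applied immediately."""
    g = jax.grad(m_prediction_error)(w_m, x_t, c_t, x_next)
    return jax.tree_util.tree_map(lambda w, gw: w - alpha_m * gw, w_m, g)
```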

$C$'s training phase is more complex than $M$'s.
It is assumed that
$E^p$ is a differentiable function of $W_C$,
where $W_C$ is $C$'s weight vector. To approximate

$$\frac{\partial E^p}{\partial W_C},$$

it is assumed that $M$'s derivatives can stand in for those of the
environment; the controller's weight change then is

$$\Delta W_C = -\alpha_C \frac{\partial E^p}{\partial W_C}.$$

Here $\Delta W_C$
is $W_C$'s
increment caused by the back-propagation procedure,
and $\alpha_C$ is the learning rate of the controller.
Note that
the differences $d^p - x^p(t_p)$ between target inputs and *actual* final inputs
at the end of each trajectory
are used for computing error signals for the controller. We
do *not* use
the differences
between desired final inputs and *predicted* final inputs.
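
For the controller sketches below, assume a network of the same hypothetical one-hidden-layer form (again an assumption, not the paper's specification):

```python
def c_forward(w_c, x_t):
    """c^p(t): C's output at time t, on which the motoric actions are based."""
    h = jnp.tanh(x_t @ w_c["W1"] + w_c["b1"])
    return jnp.tanh(h @ w_c["W2"] + w_c["b2"])
```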

To apply the `unfolding in time' algorithm [9][18] to the recurrent combination of $C$ and $M$, do the following:

*For all trajectories $p$:*

*1. During the activation spreading phase of trajectory $p$,
for each time step $t$ of $p$
create a copy of $C$ (called $C_t$) and a copy of $M$ (called $M_t$).*

*2. Construct a large `unfolded' feed-forward
back-propagation network consisting of the
sub-modules $C_1, M_1, \ldots, C_{t_p}, M_{t_p}$ by doing the following:*

*2.a) For $t = 2, \ldots, t_p$, replace each input unit of $C_t$ by the unit in
$M_{t-1}$ which predicted that input unit's activation.*

*2.b) For $t = 1, \ldots, t_p$:
replace each input unit of $M_t$ whose activation was provided
by an output unit of $C$ by the corresponding output unit of $C_t$.*

*3. Propagate the difference
$d^p - x^p(t_p)$ back
through the entire `unfolded' network constructed in step 2.
Change each weight of $C$ in proportion to the sum of the
partial derivatives
computed for the corresponding connection copies in the unfolded
network. Do not change the weights of $M$.*
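
A sketch of step 3 under the same assumptions: the recorded *actual* activations $x^p(t)$, $c^p(t)$ supply the forward values, $M$ supplies only derivatives, and $M$'s own weight gradients are discarded:

```python
def c_gradient(w_c, w_m, xs, cs, d_p):
    """Backward pass of the unfolding procedure for one trajectory.

    xs[t] and cs[t] are the activations recorded while C acted in the
    real environment (len(xs) == t_p + 1, len(cs) == t_p).  Returns the
    accumulated gradient of E^p with respect to C's weights.
    """
    delta_x = -(d_p - xs[-1])  # dE^p/dx^p(t_p); uses the *actual* final input
    grad_wc = jax.tree_util.tree_map(jnp.zeros_like, w_c)
    for t in reversed(range(len(cs))):
        # Through M_t: turn dE/dx^p(t+1) into dE/dx^p(t) and dE/dc^p(t).
        _, m_vjp = jax.vjp(m_forward, w_m, xs[t], cs[t])
        _dwm, dx, dc = m_vjp(delta_x)  # _dwm is discarded: M is not changed
        # Through C_t: turn dE/dc^p(t) into dE/dW_C and a second dE/dx^p(t).
        _, c_vjp = jax.vjp(c_forward, w_c, xs[t])
        dwc, dx_c = c_vjp(dc)
        grad_wc = jax.tree_util.tree_map(jnp.add, grad_wc, dwc)  # sum over copies
        delta_x = dx + dx_c
    return grad_wc
```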

Since the weights remain constant during the activation spreading phase of one trajectory, the practical algorithm used in the experiments does not really create copies of the weights. It is more efficient to introduce one additional variable for each controller weight: this variable is used for accumulating the corresponding sum of weight changes. During trajectory execution, it is convenient to push the time-varying activations of the units in $C$ and $M$ on stacks of activations, one for each unit. During the back-propagation phase these activations can be successively popped off for the computation of error signals.
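
Putting the hypothetical pieces together for one trajectory (`env_step` stands for the real fovea/scene dynamics, which are unknown to the learner; the recorded lists `xs` and `cs` play the role of the activation stacks):

```python
def run_trajectory(w_c, w_m, x0, d_p, t_p, env_step, alpha_c, alpha_m):
    """Act for t_p steps, train M on-line, then train C by unfolding."""
    xs, cs = [x0], []
    for _ in range(t_p):
        c_t = c_forward(w_c, xs[-1])
        x_next = env_step(xs[-1], c_t)  # real environment, not M's prediction
        w_m = m_step(w_m, xs[-1], c_t, x_next, alpha_m)
        xs.append(x_next)
        cs.append(c_t)
    g = c_gradient(w_c, w_m, xs, cs, d_p)
    w_c = jax.tree_util.tree_map(lambda w, gw: w - alpha_c * gw, w_c, g)
    return w_c, w_m
```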
