
# THE ALGORITHM

The notation will be similar to the notation of [Williams and Peng, 1990]. $U$ is the set of indices $k$ such that at the discrete time step $t$ the quantity $x_k(t)$ is the output of a non-input unit in the network. $I$ is the set of indices $k$ such that $x_k(t)$ is an external input for input unit $k$ at time $t$. $T(t)$ denotes the set of indices $k \in U$ for which there exists a specified target value $d_k(t)$ at time $t$. Each input unit has a directed connection to each non-input unit. Each non-input unit has a directed connection to each non-input unit. The weight of the connection from unit $j$ to unit $i$ is denoted by $w_{ij}$. To distinguish between different 'instances' of $w_{ij}$ at different times, we let $w_{ij}(t)$ denote a variable for the weight of the connection from unit $j$ to unit $i$ at time $t$. This is just for notational convenience: $w_{ij}(t) = w_{ij}$ for all $t$ to be considered. One way to visualize the $w_{ij}(t)$ is to consider them as weights of connections to the $t$-th non-input layer of a feed-forward network constructed by 'unfolding' the recurrent network in time (e.g. [Williams and Peng, 1990]). A training sequence with $n_{time}$ time steps starts at time step 0 and ends at time step $n_{time}$. The algorithm below is of interest if $n_{time} > O(n^2)$, where $n$ denotes the number of non-input units (otherwise it is preferable to use BPTT).

For $k \in U$ we define

$$net_k(t) = \sum_{l \in U \cup I} w_{kl}(t)\, x_l(t-1), \qquad x_k(t) = f_k(net_k(t)), \qquad (1)$$

where $f_k$ is a differentiable (usually semi-linear) function. For all $k \in U$ and for all $t$ we define

$$e_k(t) = \begin{cases} d_k(t) - x_k(t) & \text{if } k \in T(t), \\ 0 & \text{otherwise.} \end{cases}$$

Furthermore we define

$$E(t) = \frac{1}{2} \sum_{k \in U} e_k(t)^2, \qquad E^{total}(t', t) = \sum_{\tau = t'+1}^{t} E(\tau).$$
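The dynamics of equation (1) and the error definitions above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes a logistic (semi-linear) $f_k$ for all units and stores the weights $w_{kl}$ in a single matrix whose first $n$ columns connect the non-input units; all function and variable names are mine.

```python
import numpy as np

def f(net):
    """A semi-linear (logistic) activation function f_k."""
    return 1.0 / (1.0 + np.exp(-net))

def step(W, x_prev, inp):
    """One step of equation (1): net_k(t) = sum_l w_kl x_l(t-1), x_k(t) = f(net_k(t)).

    W      : (n, n + m) weights; columns 0..n-1 connect the n non-input
             units (indices in U), the remaining m columns connect the inputs (I)
    x_prev : x_l(t-1) for l in U
    inp    : x_l(t-1) for l in I (the external input)
    """
    z = np.concatenate([x_prev, inp])   # x_l(t-1) for l in U ∪ I
    net = W @ z                         # net_k(t) for all k in U
    return net, f(net)

def error(x, d, has_target):
    """e_k(t) = d_k(t) - x_k(t) for k in T(t), 0 otherwise; E(t) = 1/2 sum_k e_k(t)^2."""
    e = np.where(has_target, d - x, 0.0)
    return e, 0.5 * float(e @ e)
```

A forward run of the network simply iterates `step` over the sequence, feeding each new external input alongside the previous activations.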

The algorithm is a cross between the BPTT and the RTRL algorithm. The description of the algorithm will be interleaved with its derivation and some comments concerning complexity. The basic idea is: Decompose the calculation of the gradient into blocks, each covering $h$ time steps. For each block perform $n + 1$ BPTT-like passes, one pass for calculating error derivatives, and $n$ passes for calculating derivatives of the net inputs to the $n$ non-input units at the end of each block. Perform RTRL-like calculations for integrating the results of these BPTT-like passes into the results obtained from previous blocks.

The algorithm starts by setting the variable $t_0 \leftarrow 0$. $t_0$ represents the beginning of the current block. Note that at this point $\frac{\partial net_k(t_0)}{\partial w_{ij}} = 0$ for all possible $i, j, k$. The main loop of the algorithm consists of 5 steps.

STEP1: Set $h$ (I recommend: $h = n$).

The quantity $\frac{\partial E^{total}(0, t_0)}{\partial w_{ij}}$ for all $w_{ij}$ is already known, and $\frac{\partial net_k(t_0)}{\partial w_{ij}}$ is known for all appropriate $i, j, k$. There is an efficient way of computing the contribution of $E^{total}(0, t_0)$ to the change in $w_{ij}$:

$$\Delta w_{ij}(t_0) = -\alpha\, \frac{\partial E^{total}(0, t_0)}{\partial w_{ij}},$$

where $\alpha$ is a learning rate constant.

STEP2: Let the network run from time step $t_0$ to time step $t_0 + h$ according to the activation dynamics specified in equation (1). If it turns out that the current training sequence has fewer than $t_0 + h$ time steps (i.e., $n_{time} < t_0 + h$) then set $h \leftarrow n_{time} - t_0$. If $h = 0$ then EXIT.

STEP3: Perform a combination of a BPTT-like phase with an RTRL-like calculation for computing error derivatives, as described next. We write

$$\frac{\partial E^{total}(0, t_0+h)}{\partial w_{ij}} = \frac{\partial E^{total}(0, t_0)}{\partial w_{ij}} + \sum_{k \in U} \frac{\partial E^{total}(t_0, t_0+h)}{\partial net_k(t_0)}\, \frac{\partial net_k(t_0)}{\partial w_{ij}} + \sum_{\tau = t_0+1}^{t_0+h} \frac{\partial E^{total}(t_0, t_0+h)}{\partial w_{ij}(\tau)}. \qquad (2)$$

The first term of (2) is already known. Consider the third term:

$$\frac{\partial E^{total}(t_0, t_0+h)}{\partial w_{ij}(\tau)} = -\,\delta_i(\tau)\, x_j(\tau - 1),$$

where

$$\delta_k(\tau) = -\,\frac{\partial E^{total}(t_0, t_0+h)}{\partial net_k(\tau)}.$$

For a given $\tau$, $\delta_k(\tau)$ can be computed for all $k \in U$ with a single-step BPTT pass of the order of $O(n^2)$ operations:

$$\delta_k(\tau) = f_k'(net_k(\tau)) \left( e_k(\tau) + \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) \right), \qquad \text{where } \delta_l(t_0 + h + 1) := 0 \text{ for all } l \in U.$$

What remains is the computation of the second term of (2) for all $w_{ij}$, which requires $O(n^3)$ operations:

$$\sum_{k \in U} \frac{\partial E^{total}(t_0, t_0+h)}{\partial net_k(t_0)}\, \frac{\partial net_k(t_0)}{\partial w_{ij}} = -\sum_{k \in U} \delta_k(t_0)\, \frac{\partial net_k(t_0)}{\partial w_{ij}},$$

where $\delta_k(t_0) = f_k'(net_k(t_0)) \sum_{l \in U} w_{lk}\, \delta_l(t_0 + 1)$ (the term $e_k(t_0)$ is absent because $E^{total}(t_0, t_0+h)$ covers only time steps after $t_0$), and the factors $\frac{\partial net_k(t_0)}{\partial w_{ij}}$ are known from the previous block.
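STEP3 can be sketched as one backward pass over the stored activations of the block, followed by the merge with the inherited quantities $\frac{\partial net_k(t_0)}{\partial w_{ij}}$. This is an illustrative fragment under the same assumptions as before (logistic $f_k$, one $(n, n+m)$ weight matrix, weight derivatives flattened into vectors of length $n(n+m)$); all names are mine, not the paper's.

```python
import numpy as np

def fprime(net):
    """Derivative of the logistic activation."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def step3(W_rec, nets, zs, es, net_t0, grad_old, p_old):
    """One BPTT pass plus the RTRL-like merge of equation (2).

    W_rec    : (n, n) recurrent weights w_lk among non-input units
    nets     : nets[i] = net(t0+1+i), i = 0..h-1 (stored forward pass)
    zs       : zs[i]   = x(t0+i) concatenated over U ∪ I
    es       : es[i]   = e(t0+1+i)
    net_t0   : net(t0), carried over from the previous block
    grad_old : dE_total(0,t0)/dw_ij, flattened   -- first term of (2)
    p_old    : p_old[k] = dnet_k(t0)/dw_ij, flattened rows
    Returns dE_total(0,t0+h)/dw_ij.
    """
    h = len(nets)
    n = W_rec.shape[0]
    grad = grad_old.copy()
    delta = np.zeros(n)                              # delta(t0+h+1) = 0
    for i in reversed(range(h)):                     # tau = t0+h .. t0+1
        delta = fprime(nets[i]) * (es[i] + W_rec.T @ delta)
        grad -= np.outer(delta, zs[i]).ravel()       # third term of (2)
    delta_t0 = fprime(net_t0) * (W_rec.T @ delta)    # no e(t0) contribution
    grad -= delta_t0 @ p_old                         # second term of (2)
    return grad
```

Each loop iteration is the single-step BPTT pass of cost $O(n^2)$; the final matrix-vector product over `p_old` is the $O(n^3)$ merge.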

STEP4: To compute $\frac{\partial net_k(t_0+h)}{\partial w_{ij}}$ for all possible $i, j, k$, perform $n$ combinations of a BPTT-like phase with an RTRL-like calculation (one such combination for each $k \in U$) for computing

$$\frac{\partial net_k(t_0+h)}{\partial w_{ij}} = \sum_{l \in U} \gamma_l^k(t_0)\, \frac{\partial net_l(t_0)}{\partial w_{ij}} + \sum_{\tau = t_0+1}^{t_0+h} \gamma_i^k(\tau)\, x_j(\tau - 1), \qquad (3)$$

where

$$\gamma_l^k(\tau) = \frac{\partial net_k(t_0+h)}{\partial net_l(\tau)}, \qquad \gamma_l^k(t_0+h) = \begin{cases} 1 & \text{if } l = k, \\ 0 & \text{otherwise.} \end{cases}$$

For a given $k$, a given $\tau$, and for all $l \in U$, the quantity $\gamma_l^k(\tau)$ can be computed with a single-step BPTT-like operation of the order of $O(n^2)$:

$$\gamma_l^k(\tau) = f_l'(net_l(\tau)) \sum_{m \in U} w_{ml}\, \gamma_m^k(\tau + 1).$$

For a given $k$, the computation of (3) for all $w_{ij}$ requires $O(hn^2 + n^3)$ operations. Therefore STEP3 and STEP4 together require $O(hn^3 + n^4)$ operations spread over $h$ time steps. Since $h = O(n)$, $O(n^4)$ computations are spread over $O(n)$ time steps. This implies an average of $O(n^3)$ computations per time step.
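STEP4 can be sketched in the same style: for each $k \in U$ one BPTT-like backward pass propagates $\gamma^k$ through the block and accumulates the two terms of (3). Again this is an illustrative sketch with my own names, under the same assumptions as the earlier fragments.

```python
import numpy as np

def fprime(net):
    """Derivative of the logistic activation."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def step4(W_rec, nets, zs, net_t0, p_old):
    """n BPTT-like passes computing p_new[k] = dnet_k(t0+h)/dw_ij via (3).

    Arguments as in the STEP3 sketch; p_old[k] holds dnet_k(t0)/dw_ij
    flattened over all n*(n+m) weights.
    """
    h = len(nets)
    n = W_rec.shape[0]
    p_new = np.zeros_like(p_old)
    for k in range(n):                                   # one pass per k in U
        gamma = np.zeros(n)
        gamma[k] = 1.0                                   # gamma^k(t0+h) = unit vector
        for i in reversed(range(h)):                     # tau = t0+h .. t0+1
            p_new[k] += np.outer(gamma, zs[i]).ravel()   # in-block term of (3)
            net_prev = nets[i - 1] if i > 0 else net_t0
            gamma = fprime(net_prev) * (W_rec.T @ gamma) # gamma^k(tau-1)
        p_new[k] += gamma @ p_old                        # merge through net(t0)
    return p_new
```

The inner loop is the $O(n^2)$ single-step operation; running it for all $n$ units over $h = n$ steps, plus the final merges, gives the $O(n^4)$ per-block total stated above.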

The final step of the algorithm's main loop is:

STEP5: Set $t_0 \leftarrow t_0 + h$ and go to STEP1.
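Putting STEP1 through STEP5 together, the whole main loop can be sketched as a single function. The sketch below makes the same assumptions as the earlier fragments (logistic $f_k$, one weight matrix, zero initial activations) and omits the weight changes of STEP1, simply accumulating the gradient of $E^{total}(0, n_{time})$ off-line; all names are illustrative, not the paper's.

```python
import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))            # logistic f_k

def fprime(net):
    s = f(net)
    return s * (1.0 - s)

def full_gradient(W, inputs, d, mask, h):
    """Accumulate dE_total(0, n_time)/dw_ij with the block algorithm.

    W      : (n, n + m) weights; the first n columns are the recurrent part
    inputs : (n_time, m) external inputs; inputs[t-1] enters net(t)
    d      : (n_time, n) targets; d[t-1] holds d_k(t)
    mask   : (n_time, n) booleans marking k in T(t)
    h      : block length (h = n is the recommended choice)
    """
    n, nm = W.shape
    W_rec = W[:, :n]
    n_time = inputs.shape[0]
    x = np.zeros(n)                              # initial activations
    grad = np.zeros(n * nm)                      # dE_total(0, t0)/dw_ij
    p = np.zeros((n, n * nm))                    # p[k] = dnet_k(t0)/dw_ij
    net_t0 = np.zeros(n)                         # placeholder; harmless while p == 0
    t0 = 0
    while t0 < n_time:                           # main loop, STEP1 .. STEP5
        hh = min(h, n_time - t0)                 # STEP2: shrink the last block
        nets, zs, es = [], [], []
        for i in range(hh):                      # forward run, equation (1)
            z = np.concatenate([x, inputs[t0 + i]])
            net = W @ z
            x = f(net)
            zs.append(z); nets.append(net)
            es.append(np.where(mask[t0 + i], d[t0 + i] - x, 0.0))
        delta = np.zeros(n)                      # STEP3: BPTT pass for delta
        for i in reversed(range(hh)):
            delta = fprime(nets[i]) * (es[i] + W_rec.T @ delta)
            grad -= np.outer(delta, zs[i]).ravel()      # third term of (2)
        delta_t0 = fprime(net_t0) * (W_rec.T @ delta)   # no e(t0) term
        grad -= delta_t0 @ p                            # second term of (2)
        p_new = np.zeros_like(p)                 # STEP4: n gamma passes
        for k in range(n):
            gamma = np.zeros(n)
            gamma[k] = 1.0                       # gamma^k at the block end
            for i in reversed(range(hh)):
                p_new[k] += np.outer(gamma, zs[i]).ravel()  # in-block term of (3)
                net_prev = nets[i - 1] if i > 0 else net_t0
                gamma = fprime(net_prev) * (W_rec.T @ gamma)
            p_new[k] += gamma @ p                # merge through net(t0)
        p, net_t0, t0 = p_new, nets[-1], t0 + hh # STEP5
    return grad.reshape(W.shape)
```

Only `grad`, `p`, `net_t0`, and one block of activations are stored, so the storage stays fixed regardless of sequence length; an on-line variant would apply $-\alpha \cdot$ `grad` increments at the end of each block.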

The off-line version of the algorithm waits until the end of an episode (which need not be known in advance) before performing weight changes. An on-line version performs weight changes each time STEP4 is completed.

As formulated above, the algorithm needs $O(n^4)$ computations at its peak (for $h = n$), at every $h$-th time step. Nothing prevents us, however, from distributing these computations more evenly over the $h$ time steps. One way of achieving this is to perform one of the $n$ BPTT-like phases of STEP4 at each time step of the next 'block' of $h$ time steps.

Juergen Schmidhuber 2003-02-13
