
# THE ALGORITHM

The notation will be similar to the notation of [Williams and Peng, 1990]. $U$ is the set of indices $k$ such that at the discrete time step $t$ the quantity $x_k(t)$ is the output of a non-input unit in the network. $I$ is the set of indices $k$ such that $x_k(t)$ is an external input for input unit $k$ at time $t$. $T(t)$ denotes the set of indices $k \in U$ for which there exists a specified target value $d_k(t)$ at time $t$. Each input unit has a directed connection to each non-input unit. Each non-input unit has a directed connection to each non-input unit. The weight of the connection from unit $j$ to unit $i$ is denoted by $w_{ij}$. To distinguish between different 'instances' of $w_{ij}$ at different times, we let $w_{ij}(t)$ denote a variable for the weight of the connection from unit $j$ to unit $i$ at time $t$. This is just for notational convenience: $w_{ij}(t) = w_{ij}$ for all $t$ to be considered. One way to visualize the $w_{ij}(t)$ is to consider them as weights of connections to the $t$-th non-input layer of a feed-forward network constructed by 'unfolding' the recurrent network in time (e.g. [Williams and Peng, 1990]). A training sequence with $n_{time}$ time steps starts at time step 0 and ends at time step $n_{time}$. The algorithm below is of interest if $n_{time} > O(n^2)$, where $n$ denotes the number of non-input units (otherwise it is preferable to use BPTT).

For $k \in U$ we define

$$net_k(t) = \sum_{l \in U \cup I} w_{kl}(t)\, x_l(t-1), \qquad x_k(t) = f_k(net_k(t)), \qquad (1)$$

where $f_k$ is a differentiable (usually semi-linear) function. For all $k \in U$ and for all $t$ we define

$$e_k(t) = \begin{cases} d_k(t) - x_k(t) & \text{if } k \in T(t), \\ 0 & \text{otherwise.} \end{cases}$$

Furthermore we define

$$E(t) = \frac{1}{2} \sum_{k \in U} e_k(t)^2, \qquad E^{total}(t', t) = \sum_{\tau = t'+1}^{t} E(\tau).$$
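The dynamics of equation (1) and the error definitions above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes a logistic (semi-linear) $f_k$ for all units and stores the weights $w_{kl}$ in a single matrix whose first $n$ columns connect the non-input units; all function and variable names are mine.

```python
import numpy as np

def f(net):
    """A semi-linear (logistic) activation function f_k."""
    return 1.0 / (1.0 + np.exp(-net))

def step(W, x_prev, inp):
    """One step of equation (1): net_k(t) = sum_l w_kl x_l(t-1), x_k(t) = f(net_k(t)).

    W      : (n, n + m) weights; columns 0..n-1 connect the n non-input
             units (indices in U), the remaining m columns connect the inputs (I)
    x_prev : x_l(t-1) for l in U
    inp    : x_l(t-1) for l in I (the external input)
    """
    z = np.concatenate([x_prev, inp])   # x_l(t-1) for l in U ∪ I
    net = W @ z                         # net_k(t) for all k in U
    return net, f(net)

def error(x, d, has_target):
    """e_k(t) = d_k(t) - x_k(t) for k in T(t), 0 otherwise; E(t) = 1/2 sum_k e_k(t)^2."""
    e = np.where(has_target, d - x, 0.0)
    return e, 0.5 * float(e @ e)
```

A forward run of the network simply iterates `step` over the sequence, feeding each new external input alongside the previous activations.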

The algorithm is a cross between the BPTT and the RTRL algorithm. The description of the algorithm will be interleaved with its derivation and some comments concerning complexity. The basic idea is: Decompose the calculation of the gradient into blocks, each covering $h$ time steps. For each block perform $n + 1$ BPTT-like passes, one pass for calculating error derivatives, and $n$ passes for calculating derivatives of the net inputs to the $n$ non-input units at the end of each block. Perform RTRL-like calculations for integrating the results of these BPTT-like passes into the results obtained from previous blocks.

The algorithm starts by setting the variable $t_0 \leftarrow 0$. $t_0$ represents the beginning of the current block. Note that at this point $\frac{\partial net_k(t_0)}{\partial w_{ij}} = 0$ for all possible $i, j, k$. The main loop of the algorithm consists of 5 steps.

STEP1: Set $h$ (I recommend: $h = n$).

The quantity $\frac{\partial E^{total}(0, t_0)}{\partial w_{ij}}$ for all $w_{ij}$ is already known, and $\frac{\partial net_k(t_0)}{\partial w_{ij}}$ is known for all appropriate $i, j, k$. There is an efficient way of computing the contribution of $E^{total}(0, t_0)$ to the change in $w_{ij}$:

$$\Delta w_{ij}(t_0) = -\alpha\, \frac{\partial E^{total}(0, t_0)}{\partial w_{ij}},$$

where $\alpha$ is a learning rate constant.

STEP2: Let the network run from time step $t_0$ to time step $t_0 + h$ according to the activation dynamics specified in equation (1). If it turns out that the current training sequence has fewer than $t_0 + h$ time steps (i.e., $n_{time} < t_0 + h$) then set $h \leftarrow n_{time} - t_0$. If $h = 0$ then EXIT.

STEP3: Perform a combination of a BPTT-like phase with an RTRL-like calculation for computing error derivatives, as described next. We write

$$\frac{\partial E^{total}(0, t_0+h)}{\partial w_{ij}} = \frac{\partial E^{total}(0, t_0)}{\partial w_{ij}} + \sum_{k \in U} \frac{\partial E^{total}(t_0, t_0+h)}{\partial net_k(t_0)}\, \frac{\partial net_k(t_0)}{\partial w_{ij}} + \sum_{\tau = t_0+1}^{t_0+h} \frac{\partial E^{total}(t_0, t_0+h)}{\partial w_{ij}(\tau)}. \qquad (2)$$

The first term of (2) is already known. Consider the third term:

$$\frac{\partial E^{total}(t_0, t_0+h)}{\partial w_{ij}(\tau)} = -\,\delta_i(\tau)\, x_j(\tau - 1),$$

where

$$\delta_k(\tau) = -\,\frac{\partial E^{total}(t_0, t_0+h)}{\partial net_k(\tau)}.$$

For a given $\tau$, $\delta_k(\tau)$ can be computed for all $k \in U$ with a single-step BPTT pass of the order of $O(n^2)$ operations:

$$\delta_k(\tau) = f_k'(net_k(\tau)) \left( e_k(\tau) + \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) \right), \qquad \text{where } \delta_l(t_0 + h + 1) := 0 \text{ for all } l \in U.$$

What remains is the computation of the second term of (2) for all $w_{ij}$, which requires $O(n^3)$ operations:

$$\sum_{k \in U} \frac{\partial E^{total}(t_0, t_0+h)}{\partial net_k(t_0)}\, \frac{\partial net_k(t_0)}{\partial w_{ij}} = -\sum_{k \in U} \delta_k(t_0)\, \frac{\partial net_k(t_0)}{\partial w_{ij}},$$

where $\delta_k(t_0) = f_k'(net_k(t_0)) \sum_{l \in U} w_{lk}\, \delta_l(t_0 + 1)$ (the term $e_k(t_0)$ is absent because $E^{total}(t_0, t_0+h)$ covers only time steps after $t_0$), and the factors $\frac{\partial net_k(t_0)}{\partial w_{ij}}$ are known from the previous block.
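STEP3 can be sketched as one backward pass over the stored activations of the block, followed by the merge with the inherited quantities $\frac{\partial net_k(t_0)}{\partial w_{ij}}$. This is an illustrative fragment under the same assumptions as before (logistic $f_k$, one $(n, n+m)$ weight matrix, weight derivatives flattened into vectors of length $n(n+m)$); all names are mine, not the paper's.

```python
import numpy as np

def fprime(net):
    """Derivative of the logistic activation."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def step3(W_rec, nets, zs, es, net_t0, grad_old, p_old):
    """One BPTT pass plus the RTRL-like merge of equation (2).

    W_rec    : (n, n) recurrent weights w_lk among non-input units
    nets     : nets[i] = net(t0+1+i), i = 0..h-1 (stored forward pass)
    zs       : zs[i]   = x(t0+i) concatenated over U ∪ I
    es       : es[i]   = e(t0+1+i)
    net_t0   : net(t0), carried over from the previous block
    grad_old : dE_total(0,t0)/dw_ij, flattened   -- first term of (2)
    p_old    : p_old[k] = dnet_k(t0)/dw_ij, flattened rows
    Returns dE_total(0,t0+h)/dw_ij.
    """
    h = len(nets)
    n = W_rec.shape[0]
    grad = grad_old.copy()
    delta = np.zeros(n)                              # delta(t0+h+1) = 0
    for i in reversed(range(h)):                     # tau = t0+h .. t0+1
        delta = fprime(nets[i]) * (es[i] + W_rec.T @ delta)
        grad -= np.outer(delta, zs[i]).ravel()       # third term of (2)
    delta_t0 = fprime(net_t0) * (W_rec.T @ delta)    # no e(t0) contribution
    grad -= delta_t0 @ p_old                         # second term of (2)
    return grad
```

Each loop iteration is the single-step BPTT pass of cost $O(n^2)$; the final matrix-vector product over `p_old` is the $O(n^3)$ merge.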

STEP4: To compute $\frac{\partial net_k(t_0+h)}{\partial w_{ij}}$ for all possible $i, j, k$, perform $n$ combinations of a BPTT-like phase with an RTRL-like calculation (one such combination for each $k \in U$) for computing

$$\frac{\partial net_k(t_0+h)}{\partial w_{ij}} = \sum_{l \in U} \gamma_l^k(t_0)\, \frac{\partial net_l(t_0)}{\partial w_{ij}} + \sum_{\tau = t_0+1}^{t_0+h} \gamma_i^k(\tau)\, x_j(\tau - 1), \qquad (3)$$

where

$$\gamma_l^k(\tau) = \frac{\partial net_k(t_0+h)}{\partial net_l(\tau)}, \qquad \gamma_l^k(t_0+h) = \begin{cases} 1 & \text{if } l = k, \\ 0 & \text{otherwise.} \end{cases}$$

For a given $k$, a given $\tau$, and for all $l \in U$, the quantity $\gamma_l^k(\tau)$ can be computed with a single-step BPTT-like operation of the order of $O(n^2)$:

$$\gamma_l^k(\tau) = f_l'(net_l(\tau)) \sum_{m \in U} w_{ml}\, \gamma_m^k(\tau + 1).$$

For a given $k$, the computation of (3) for all $w_{ij}$ requires $O(hn^2 + n^3)$ operations. Therefore STEP3 and STEP4 together require $O(hn^3 + n^4)$ operations spread over $h$ time steps. Since $h = O(n)$, $O(n^4)$ computations are spread over $O(n)$ time steps. This implies an average of $O(n^3)$ computations per time step.
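STEP4 can be sketched in the same style: for each $k \in U$ one BPTT-like backward pass propagates $\gamma^k$ through the block and accumulates the two terms of (3). Again this is an illustrative sketch with my own names, under the same assumptions as the earlier fragments.

```python
import numpy as np

def fprime(net):
    """Derivative of the logistic activation."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def step4(W_rec, nets, zs, net_t0, p_old):
    """n BPTT-like passes computing p_new[k] = dnet_k(t0+h)/dw_ij via (3).

    Arguments as in the STEP3 sketch; p_old[k] holds dnet_k(t0)/dw_ij
    flattened over all n*(n+m) weights.
    """
    h = len(nets)
    n = W_rec.shape[0]
    p_new = np.zeros_like(p_old)
    for k in range(n):                                   # one pass per k in U
        gamma = np.zeros(n)
        gamma[k] = 1.0                                   # gamma^k(t0+h) = unit vector
        for i in reversed(range(h)):                     # tau = t0+h .. t0+1
            p_new[k] += np.outer(gamma, zs[i]).ravel()   # in-block term of (3)
            net_prev = nets[i - 1] if i > 0 else net_t0
            gamma = fprime(net_prev) * (W_rec.T @ gamma) # gamma^k(tau-1)
        p_new[k] += gamma @ p_old                        # merge through net(t0)
    return p_new
```

The inner loop is the $O(n^2)$ single-step operation; running it for all $n$ units over $h = n$ steps, plus the final merges, gives the $O(n^4)$ per-block total stated above.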

The final step of the algorithm's main loop is:

STEP5: Set $t_0 \leftarrow t_0 + h$ and go to STEP1.
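Putting STEP1 through STEP5 together, the whole main loop can be sketched as a single function. The sketch below makes the same assumptions as the earlier fragments (logistic $f_k$, one weight matrix, zero initial activations) and omits the weight changes of STEP1, simply accumulating the gradient of $E^{total}(0, n_{time})$ off-line; all names are illustrative, not the paper's.

```python
import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))            # logistic f_k

def fprime(net):
    s = f(net)
    return s * (1.0 - s)

def full_gradient(W, inputs, d, mask, h):
    """Accumulate dE_total(0, n_time)/dw_ij with the block algorithm.

    W      : (n, n + m) weights; the first n columns are the recurrent part
    inputs : (n_time, m) external inputs; inputs[t-1] enters net(t)
    d      : (n_time, n) targets; d[t-1] holds d_k(t)
    mask   : (n_time, n) booleans marking k in T(t)
    h      : block length (h = n is the recommended choice)
    """
    n, nm = W.shape
    W_rec = W[:, :n]
    n_time = inputs.shape[0]
    x = np.zeros(n)                              # initial activations
    grad = np.zeros(n * nm)                      # dE_total(0, t0)/dw_ij
    p = np.zeros((n, n * nm))                    # p[k] = dnet_k(t0)/dw_ij
    net_t0 = np.zeros(n)                         # placeholder; harmless while p == 0
    t0 = 0
    while t0 < n_time:                           # main loop, STEP1 .. STEP5
        hh = min(h, n_time - t0)                 # STEP2: shrink the last block
        nets, zs, es = [], [], []
        for i in range(hh):                      # forward run, equation (1)
            z = np.concatenate([x, inputs[t0 + i]])
            net = W @ z
            x = f(net)
            zs.append(z); nets.append(net)
            es.append(np.where(mask[t0 + i], d[t0 + i] - x, 0.0))
        delta = np.zeros(n)                      # STEP3: BPTT pass for delta
        for i in reversed(range(hh)):
            delta = fprime(nets[i]) * (es[i] + W_rec.T @ delta)
            grad -= np.outer(delta, zs[i]).ravel()      # third term of (2)
        delta_t0 = fprime(net_t0) * (W_rec.T @ delta)   # no e(t0) term
        grad -= delta_t0 @ p                            # second term of (2)
        p_new = np.zeros_like(p)                 # STEP4: n gamma passes
        for k in range(n):
            gamma = np.zeros(n)
            gamma[k] = 1.0                       # gamma^k at the block end
            for i in reversed(range(hh)):
                p_new[k] += np.outer(gamma, zs[i]).ravel()  # in-block term of (3)
                net_prev = nets[i - 1] if i > 0 else net_t0
                gamma = fprime(net_prev) * (W_rec.T @ gamma)
            p_new[k] += gamma @ p                # merge through net(t0)
        p, net_t0, t0 = p_new, nets[-1], t0 + hh # STEP5
    return grad.reshape(W.shape)
```

Only `grad`, `p`, `net_t0`, and one block of activations are stored, so the storage stays fixed regardless of sequence length; an on-line variant would apply $-\alpha \cdot$ `grad` increments at the end of each block.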

The off-line version of the algorithm waits until the end of an episode (which need not be known in advance) before performing weight changes. An on-line version performs weight changes each time STEP4 is completed.

As formulated above, the algorithm needs $O(n^4)$ computations at its peak (for $h = n$), at every $h$-th time step. Nothing prevents us, however, from distributing these computations more evenly over the $h$ time steps. One way of achieving this is to perform one of the $n$ BPTT-like phases of STEP4 at each time step of the next 'block' of $h$ time steps.

Juergen Schmidhuber 2003-02-13
