
THE ALGORITHM

The notation is similar to that of [Williams and Peng, 1990]. $U$ is the set of indices $k$ such that at the discrete time step $t$ the quantity $x_k(t)$ is the output of a non-input unit $k$ in the network. $I$ is the set of indices $k$ such that $x_k(t)$ is an external input for input unit $k$ at time $t$. $T(t)$ denotes the set of indices $k \in U$ for which there exists a specified target value $d_k(t)$ at time $t$. Each input unit and each non-input unit has a directed connection to each non-input unit. The weight of the connection from unit $j$ to unit $i$ is denoted by $w_{ij}$. To distinguish between different `instances' of $w_{ij}$ at different times, we let $w_{ij}(t)$ denote a variable for the weight of the connection from unit $j$ to unit $i$ at time $t$. This is just for notational convenience: $w_{ij}(t) = w_{ij}$ for all $t$ to be considered. One way to visualize the $w_{ij}(t)$ is to consider them as weights of connections to the $t$-th non-input layer of a feed-forward network constructed by `unfolding' the recurrent network in time (e.g. [Williams and Peng, 1990]). A training sequence with $s+1$ time steps starts at time step 0 and ends at time step $s$. The algorithm below is of interest if $s \gg n$, where $n$ is the number of non-input units (otherwise it is preferable to use BPTT).

For $k \in U$ we define

\begin{displaymath}
net_k(0)=0,
~~\forall t \geq 0:~~ x_k(t) = f_k(net_k(t)),
~~\forall t \geq 0:~~
net_k(t+1) = \sum_{l \in U \cup I} w_{kl}(t+1)x_l(t),
\end{displaymath} (1)

where $f_k$ is a differentiable (usually semi-linear) function. For all $w_{ij}$ and for all $l \in U, t \geq 0$ we define

\begin{displaymath}
q_{ij}^l(t) = \frac{\partial net_l(t) } {\partial w_{ij}}
= \sum_{\tau =1}^{t} \frac{\partial net_l(t) } {\partial w_{ij}(\tau)}.
\end{displaymath}

Furthermore we define

\begin{displaymath}
e_k(t) = d_k(t) - x_k(t)~~if~k \in T(t)~and~0~otherwise,
\end{displaymath}


\begin{displaymath}
E(t) = \frac{1}{2} \sum_{k \in U} (e_k(t))^2,~~
E^{total}(t',t) = \sum_{\tau = t'+1}^t E(\tau).
\end{displaymath}
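To make these definitions concrete, the following is a minimal numpy sketch of one step of the dynamics (1) and of the error terms; the array layout (non-input units first, then input units), the function names and the choice $f_k = \tanh$ are illustrative assumptions, not prescribed by the text.

\begin{verbatim}
import numpy as np

def forward_step(W, x_prev, ext_in):
    """One step of the dynamics (1): net(t+1) = W [x_U(t); x_I(t)], x(t+1) = f(net(t+1)).

    W      : (n, n+m) weights; rows = non-input units, columns = non-input
             units (U) followed by input units (I).
    x_prev : (n,) outputs x_k(t) of the non-input units at time t.
    ext_in : (m,) external inputs x_k(t), k in I, at time t.
    Returns (net(t+1), x(t+1)) for the non-input units.
    """
    net = W @ np.concatenate([x_prev, ext_in])   # net_k(t+1) = sum_l w_kl(t+1) x_l(t)
    return net, np.tanh(net)                     # assume f_k = tanh for every k

def error_terms(x, d, has_target):
    """e_k(t) = d_k(t) - x_k(t) for k in T(t), else 0;  E(t) = 1/2 sum_k e_k(t)^2."""
    e = np.where(has_target, d - x, 0.0)
    return e, 0.5 * np.sum(e ** 2)
\end{verbatim}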

The algorithm is a cross between the BPTT algorithm and the RTRL algorithm. The description of the algorithm is interleaved with its derivation and with some comments concerning complexity. The basic idea is: decompose the calculation of the gradient into blocks, each covering $O(n)$ time steps. For each block, perform $n+1$ BPTT-like passes: one pass for calculating error derivatives, and $n$ passes for calculating the derivatives of the net inputs to the $n$ non-input units at the end of the block. Then perform $n+1$ RTRL-like calculations for integrating the results of these BPTT-like passes into the results obtained from previous blocks.

The algorithm starts by setting the variable $t_0 \leftarrow 0$. $t_0$ represents the beginning of the current block. Note that for all possible $l,w_{ij}:~
q_{ij}^l(0) = 0,~
\frac{\partial E^{total}(0,0) } {\partial w_{ij}} = 0$. The main loop of the algorithm consists of 5 steps.


STEP1: Set $h \leftarrow O(n)$ (I recommend: $h \leftarrow n$).

The quantity $\frac{\partial E^{total}(0,t_0) }
{\partial w_{ij}}$ is already known for all $w_{ij}$, and $q_{ij}^l(t_0)$ is known for all appropriate $l,i,j$. There is an efficient way of computing the contribution of $E^{total}(0,t_0+h)$ to the weight change $\triangle w_{ij}(t_0+h)$:

\begin{displaymath}
\triangle w_{ij}(t_0+h) = - \alpha \frac{\partial E^{total}(0,t_0+h) } {\partial w_{ij}}
= - \alpha \sum_{\tau = 1}^{t_0+h} \frac{\partial E^{total}(0,t_0+h) }
{\partial w_{ij}(\tau)},
\end{displaymath}

where $\alpha$ is a learning rate constant.


STEP2: Let the network run from time step $t_0$ to time step $t_0+h$ according to the activation dynamics specified in equation (1). If the current training sequence ends before time step $t_0+h$ (i.e., $h > s-t_0$), then set $h \leftarrow s-t_0$. If $h=0$ then EXIT.
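STEP2 only requires recording the net inputs and activations of the current block, since the BPTT-like passes of STEP3 and STEP4 run backwards over these $h$ steps. A sketch of such a forward run, under the same illustrative assumptions as above:

\begin{verbatim}
import numpy as np

def run_block(W, x_t0, ext_inputs):
    """Run the net for h = len(ext_inputs) steps from x(t_0), recording the
    net inputs and activations needed by the BPTT-like passes of STEP3/STEP4.

    ext_inputs[m] is the external input x_I(t_0+m), m = 0..h-1.
    Returns nets of shape (h, n) for net(t_0+1..t_0+h) and
            xs of shape (h+1, n) with xs[0] = x(t_0).
    """
    nets, xs = [], [x_t0]
    for ext_in in ext_inputs:
        net = W @ np.concatenate([xs[-1], ext_in])   # equation (1)
        nets.append(net)
        xs.append(np.tanh(net))                      # f = tanh, as above
    return np.array(nets), np.array(xs)
\end{verbatim}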


STEP3: Perform a combination of a BPTT-like phase with an RTRL-like calculation for computing error derivatives as described next. We write

\begin{displaymath}
\frac{\partial E^{total}(0,t_0+h) } {\partial w_{ij}}
=
\frac{\partial E^{total}(0,t_0) } {\partial w_{ij}}
+
\sum_{\tau = 1}^{t_0+h} \frac{\partial E^{total}(t_0,t_0+h) }
{\partial w_{ij}(\tau)}
\end{displaymath}


\begin{displaymath}
=
\frac{\partial E^{total}(0,t_0) } {\partial w_{ij}}
+
\sum_{\tau = 1}^{t_0} \frac{\partial E^{total}(t_0,t_0+h) } {\partial w_{ij}(\tau)}
+
\sum_{\tau = t_0+1}^{t_0+h} \frac{\partial E^{total}(t_0,t_0+h) }
{\partial w_{ij}(\tau)}
\end{displaymath} (2)

The first term of (2) is already known. Consider the third term:


\begin{displaymath}
\sum_{\tau =t_0 + 1}^{t_0+h} \frac{\partial E^{total}(t_0,t_0+h) } {\partial w_{ij}(\tau)}
=
- \sum_{\tau = t_0+1}^{t_0+h} \delta_i(\tau) x_j(\tau -1)
\end{displaymath}

where

\begin{displaymath}
\delta_i(\tau) =
- \frac{\partial E^{total}(t_0,t_0+h) }
{\partial net_i(\tau)}.
\end{displaymath}

For a given $t_0$, $\delta_i(\tau)$ can be computed for all $i \in U, t_0 \leq \tau \leq t_0+h$ with a single $h$-step BPTT pass requiring on the order of $O(hn^2)$ operations:

\begin{displaymath}
\delta_i(\tau) = f_i'(net_i(\tau))e_i(\tau)
~~if~~\tau = t_0+h
\end{displaymath}


\begin{displaymath}
\delta_i(\tau) = f_i'(net_i(\tau))(e_i(\tau) +
\sum_{l \in U} w_{li}\delta_l(\tau +1))
~~if~~t_0 \leq \tau < t_0+h
\end{displaymath}
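This is an ordinary backward pass over the block. In the sketch below (same assumptions as before), the error at $\tau = t_0$ is passed in as zero, since $E(t_0)$ is already covered by $E^{total}(0,t_0)$ from the previous block; this reading of the boundary case is mine, as is the naming.

\begin{verbatim}
import numpy as np

def backward_deltas(W_UU, nets, errs):
    """BPTT-like pass of STEP3: delta_i(tau) = -dE^total(t_0,t_0+h)/dnet_i(tau)
    for tau = t_0, ..., t_0+h.

    W_UU : (n, n) recurrent weights w_li among the non-input units.
    nets : (h+1, n) net inputs net_i(t_0), ..., net_i(t_0+h) of the current block.
    errs : (h+1, n) errors e_i(t_0), ..., e_i(t_0+h); zero where no target exists.
           errs[0] should be zero, since E(t_0) belongs to the previous block.
    Returns deltas of shape (h+1, n); deltas[m] corresponds to tau = t_0+m.
    """
    fprime = 1.0 - np.tanh(nets) ** 2              # f'(net) for the assumed f = tanh
    deltas = np.zeros_like(nets)
    deltas[-1] = fprime[-1] * errs[-1]             # tau = t_0 + h
    for m in range(nets.shape[0] - 2, -1, -1):     # tau = t_0+h-1 down to t_0
        deltas[m] = fprime[m] * (errs[m] + W_UU.T @ deltas[m + 1])
    return deltas
\end{verbatim}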

What remains is the computation of the second term of (2) for all $w_{ij}$, which requires $O(n^3)$ operations:

\begin{displaymath}
\sum_{\tau =1}^{t_0} \frac{\partial E^{total}(t_0,t_0+h) }
{\partial w_{ij}(\tau)}
=
\sum_{\tau =1}^{t_0} \sum_{k \in U} \frac{\partial E^{total}(t_0,t_0+h) } {\partial net_k(t_0)}
\frac{\partial net_k(t_0) } {\partial w_{ij}(\tau)}
\end{displaymath}


\begin{displaymath}
=
\sum_{k \in U} - \delta_k(t_0)
\sum_{\tau =1}^{t_0} \frac{\partial net_k(t_0) } {\partial w_{ij}(\tau)}
=
- \sum_{k \in U} \delta_k(t_0) q_{ij}^k(t_0).
\end{displaymath}
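Collecting the three terms of (2), the RTRL-like part of STEP3 becomes a single update of the stored gradient. A sketch, where grad_prev, q, deltas and xs_prev stand for $\frac{\partial E^{total}(0,t_0)}{\partial w_{ij}}$, $q_{ij}^k(t_0)$, the $\delta$'s of the current block and the activations $x_j(\tau-1)$ (illustrative names and array layout):

\begin{verbatim}
import numpy as np

def step3_gradient(grad_prev, q, deltas, xs_prev):
    """RTRL-like part of STEP3: dE^total(0, t_0+h)/dw_ij via equation (2).

    grad_prev : (n, n+m) stored first term dE^total(0, t_0)/dw_ij.
    q         : (n, n, n+m) with q[k] = q_ij^k(t_0) (all zeros when t_0 = 0).
    deltas    : (h+1, n) deltas of the current block; deltas[m] = delta(t_0+m).
    xs_prev   : (h, n+m) concatenated activations x_j(tau-1), tau = t_0+1..t_0+h.
    """
    term2 = -np.einsum('k,kij->ij', deltas[0], q)          # - sum_k delta_k(t_0) q_ij^k(t_0)
    term3 = -np.einsum('ti,tj->ij', deltas[1:], xs_prev)   # - sum_tau delta_i(tau) x_j(tau-1)
    return grad_prev + term2 + term3
\end{verbatim}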

STEP4: To compute $q_{ij}^l(t_0+h)$ for all possible $l,i,j$, perform $n$ combinations of a BPTT-like phase with an RTRL-like calculation (one such combination for each $l \in U$), as follows:


\begin{displaymath}
q_{ij}^l(t_0+h)=
\frac{\partial net_l(t_0+h) } {\partial w_{ij}}
= \sum_{\tau=1}^{t_0+h} \frac{\partial net_l(t_0+h) } {\partial w_{ij}(\tau)}
\end{displaymath}


\begin{displaymath}
=
\sum_{\tau=1}^{t_0} \sum_{k \in U} \frac{\partial net_l(t_0+h)}{\partial net_k(t_0)}
\frac{\partial net_k(t_0)}{\partial w_{ij}(\tau)}
+
\sum_{\tau = t_0+1}^{t_0+h} \frac{\partial net_l(t_0+h)}{\partial net_i(\tau)}
\frac{\partial net_i(\tau)}{\partial w_{ij}(\tau)}
\end{displaymath}


\begin{displaymath}
=
\sum_{k \in U} \gamma_{lk}(t_0) q_{ij}^k(t_0)
+
\sum_{\tau = t_0+1}^{t_0+h} \gamma_{li}(\tau) x_j(\tau -1)
\end{displaymath} (3)

where

\begin{displaymath}
\gamma_{lk}(\tau) = \frac{\partial net_l(t_0+h)} {\partial net_k(\tau)}.
\end{displaymath}

For a given $t_0$ and a given $l \in U$, the quantities $\gamma_{li}(\tau)$ for all $i \in U, t_0 \leq \tau \leq t_0+h$ can be computed with a single $h$-step BPTT-like pass requiring on the order of $O(hn^2)$ operations:

\begin{displaymath}
if~~\tau = t_0+h:~~if~~l=i~~then~~\gamma_{li}(\tau) = 1~~else~~
\gamma_{li}(\tau) = 0
\end{displaymath}


\begin{displaymath}
if~~t_0 \leq \tau < t_0+h:~~
\gamma_{li}(\tau) = f_i'(net_i(\tau)) \sum_{a \in U} w_{ai} \gamma_{la}(\tau + 1)
\end{displaymath}
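This is again a backward pass over the block; the $n$ passes (one per $l$) can be written as a single loop over the full $\gamma$ matrix. A sketch under the same assumptions as before:

\begin{verbatim}
import numpy as np

def backward_gammas(W_UU, nets):
    """BPTT-like passes of STEP4: gamma_lk(tau) = dnet_l(t_0+h)/dnet_k(tau)
    for all l, k in U and tau = t_0, ..., t_0+h (all n passes done jointly).

    W_UU : (n, n) recurrent weights w_ai among the non-input units.
    nets : (h+1, n) net inputs net_i(t_0), ..., net_i(t_0+h) of the current block.
    Returns gammas of shape (h+1, n, n); gammas[m, l, k] = gamma_lk(t_0+m).
    """
    steps, n = nets.shape
    fprime = 1.0 - np.tanh(nets) ** 2            # f'(net) for the assumed f = tanh
    gammas = np.zeros((steps, n, n))
    gammas[-1] = np.eye(n)                       # tau = t_0+h: gamma_li = 1 iff l = i
    for m in range(steps - 2, -1, -1):           # tau = t_0+h-1 down to t_0
        # gamma_li(tau) = f_i'(net_i(tau)) sum_a w_ai gamma_la(tau+1)
        gammas[m] = (gammas[m + 1] @ W_UU) * fprime[m]
    return gammas
\end{verbatim}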

For a given $l$, the computation of (3) for all $w_{ij}$ requires $O(n^3+hn^2)$ operations. Therefore STEP3 and STEP4 together require $(n+1)O(hn^2 + n^3)$ operations spread over $h$ time steps. Since $h =O(n)$, $O(n^4)$ computations are spread over $O(n)$ time steps. This implies an average of $O(n^3)$ computations per time step.
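Given the $\gamma$'s, the RTRL-like part of STEP4 is the direct evaluation of (3). A sketch with the same illustrative array layout ($q[k]$ holds $q_{ij}^k$):

\begin{verbatim}
import numpy as np

def step4_update_q(q, gammas, xs_prev):
    """RTRL-like part of STEP4: q_ij^l(t_0+h) via equation (3).

    q       : (n, n, n+m) with q[k] = q_ij^k(t_0) (all zeros when t_0 = 0).
    gammas  : (h+1, n, n) with gammas[m, l, k] = gamma_lk(t_0+m).
    xs_prev : (h, n+m) concatenated activations x_j(tau-1), tau = t_0+1..t_0+h.
    Returns the updated array with entry [l] = q_ij^l(t_0+h).
    """
    term1 = np.einsum('lk,kij->lij', gammas[0], q)          # sum_k gamma_lk(t_0) q_ij^k(t_0)
    term2 = np.einsum('tli,tj->lij', gammas[1:], xs_prev)   # sum_tau gamma_li(tau) x_j(tau-1)
    return term1 + term2
\end{verbatim}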

The final step of the algorithm's main loop is



STEP5: Set $t_0 \leftarrow t_0+h$ and go to STEP1.

The off-line version of the algorithm waits until the end of an episode (which need not be known in advance) before performing weight changes. An on-line version performs weight changes each time STEP4 is completed.
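For orientation, the following sketch shows how the pieces above fit into the off-line version of the main loop (STEP1-STEP5); the helper names are the illustrative ones introduced earlier, and the indexing conventions (inputs at times $0,\dots,s-1$, targets at times $1,\dots,s$) are assumptions of the sketch, not part of the text.

\begin{verbatim}
import numpy as np

def train_sequence(W, inputs, targets, masks, alpha=0.01):
    """Off-line main loop (STEP1-STEP5) for one training sequence, built from the
    sketch helpers run_block, backward_deltas, step3_gradient, backward_gammas
    and step4_update_q introduced above (illustrative names throughout).

    W       : (n, n+m) weights, non-input columns first, input columns last.
    inputs  : (s, m) external inputs; inputs[t] = x_I(t), t = 0..s-1.
    targets : (s, n), masks : (s, n) boolean; row t refers to time step t+1.
    """
    n = W.shape[0]
    grad = np.zeros_like(W)                     # dE^total(0, t_0)/dw_ij, t_0 = 0
    q = np.zeros((n, n, W.shape[1]))            # q_ij^k(0) = 0
    x, net = np.zeros(n), np.zeros(n)           # x(0) = f(net(0)) = 0 for f = tanh
    t0, s = 0, len(inputs)
    while t0 < s:                                                 # exit when h = 0
        h = min(n, s - t0)                                        # STEP1: h = O(n)
        nets_blk, xs_blk = run_block(W, x, inputs[t0:t0 + h])     # STEP2
        nets = np.vstack([net[None], nets_blk])                   # net(t_0 .. t_0+h)
        xs_prev = np.hstack([xs_blk[:-1], inputs[t0:t0 + h]])     # x_j(tau-1)
        e_blk = np.where(masks[t0:t0 + h], targets[t0:t0 + h] - xs_blk[1:], 0.0)
        errs = np.vstack([np.zeros((1, n)), e_blk])               # e(t_0) in previous block
        deltas = backward_deltas(W[:, :n], nets, errs)            # STEP3, BPTT-like part
        grad = step3_gradient(grad, q, deltas, xs_prev)           # STEP3, RTRL-like part
        gammas = backward_gammas(W[:, :n], nets)                  # STEP4, BPTT-like part
        q = step4_update_q(q, gammas, xs_prev)                    # STEP4, RTRL-like part
        x, net, t0 = xs_blk[-1], nets_blk[-1], t0 + h             # STEP5
    return W - alpha * grad                                       # off-line weight change
\end{verbatim}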

As formulated above, the algorithm needs $O(n^4)$ computations at its peak, every $n$-th time step. Nothing prevents us, however, from distributing these $O(n^4)$ computations more evenly over $n$ time steps. One way of achieving this is to perform one of the $n$ BPTT-like phases of STEP 4 at each time step of the next `block' of $n$ time steps.

