next up previous
Next: On-Line Versus Off-Line Learning Up: LEARNING TO CONTROL FAST-WEIGHT Previous: The Task

The Architecture and the Algorithm

The basic idea is to use a slowly learning feed-forward network $S$ (with a set of randomly initialized weights $W_S$) whose input at time $t$ is the vector $x(t)$ and whose output is transformed into immediate weight changes for a second `fast-weight' network $F$. The input to $F$ at time $t$ is also $x(t)$, its $m$-dimensional output is $y(t)$, and its set of weight variables is $W_F$. $F$ serves as a short term memory: At different time steps, the same input event may be processed in different ways depending on the time-varying state of $W_F$.

The standard method for processing temporal sequences is to employ a recurrent net with feedback connections. The feedback connections allow for a short-term memory of information earlier in a sequence. The present work suggests a novel approach to building a short-term memory by employing fast weights that can be set and reset by the 'memory controller' $S$. Fast weights can hold on to information over time because they remain essentially invariant unless they are explicitly modified.

One potential advantage of the method over the more conventional recurrent net algorithms is that it does not necessarily require full-fledged units - experiencing some sort of feedback - for storing temporal information. A single weight may be sufficient. Because there are many more weights than units in most networks, this property represents a potential for storage efficiency. For related reasons, the novel representation of past inputs is well-suited for solving certain problems involving temporary variable binding in a natural manner: $F$'s current input may be viewed as a representation of the addresses of a set of variables; $F$'s current output may be viewed as the representation of the current contents of this set of variables. In contrast with recurrent nets, temporary bindings can be established very naturally by temporary connectivity patterns instead of temporary activation patterns (see section 3.2 for an illustrative experiment).

For initialization reasons we introduce an additional time step 0 at the beginning of an episode. At time step $0$ each weight variable $w_{ab} \in W_F$ of a directed connection from unit $a$ to unit $b$ is set to $\Box w_{ab}(0)$ (a function of $S$'s outputs as described below). At time step $t >0 $, the $w_{ab}(t-1)$ are used to compute the output of $F$ according to the usual activation spreading rules for back-propagation networks (e.g. [Werbos, 1974]). After this, each weight variable $w_{ab} \in W_F$ is altered according to

\begin{displaymath}
w_{ab}(t) = \sigma(w_{ab}(t-1), \Box w_{ab}(t)),
\end{displaymath} (1)

where $\sigma$ (e.g. a sum-and-squash function) is differentiable with respect to all its parameters and where the activations of $S$'s output units (again computed according to the usual activation spreading rules for back-propagation networks) serve to compute $\Box w_{ab}(t)$ by a mechanism specified below. $\Box w_{ab}(t)$ is $S$'s contribution to the modification of $w_{ab}$ at time step $t$.

Equation (1) is essentially identical to Möller and Thrun's equation (1) in [Möller and Thrun, 1990]. Unlike [Möller and Thrun, 1990], however, the current paper derives an exact gradient descent algorithm for time-varying inputs and outputs for this kind of architecture.

For all weights $w_{ij} \in W_S$ (from unit $i$ to unit $j$) we are interested in the increment

\begin{displaymath}
\triangle w_{ij} = - \eta
\frac{\partial \bar{E}}{\partial...
... w_{ab}(t-1)}
\frac{\partial w_{ab}(t-1)}{\partial w_{ij}} .
\end{displaymath} (2)

Here $\eta$ is a constant learning rate. At each time step $t >0 $, the factor

\begin{displaymath}
\delta_{ab}(t)= \frac{\partial E(t)}{\partial w_{ab}(t-1)}
\end{displaymath}

can be computed by conventional back-propagation (e.g. [Werbos, 1974]). For $t >0 $ we obtain the recursion

\begin{displaymath}
\frac{\partial w_{ab}(t)}{\partial w_{ij}} =
\frac{\partial...
...w_{ab}(t)}
\frac{\partial \Box w_{ab}(t)}{\partial w_{ij}} .
\end{displaymath}

We can employ a method similar to the one described in [Robinson and Fallside, 1987] and [Williams and Zipser, 1989]: For each $w_{ab} \in W_F$ and each $w_{ij} \in W_S$ we introduce a variable $p_{ij}^{ab}$ (initialized to zero at the beginning of an episode) which can be updated at each time step $t >0 $:
\begin{displaymath}
p_{ij}^{ab}(t)=
\frac{\partial \sigma(w_{ab}(t-1), \Box w_{...
..._{ab}(t)}
\frac{\partial \Box w_{ab}(t)}{\partial w_{ij}} .
\end{displaymath} (3)

$\frac{\partial \Box w_{ab}(t)}{\partial w_{ij}} $ depends on the interface between $S$ and $F$. With a given interface (two possibilities are given below) an appropriate back-propagation procedure for each $w_{ab} \in W_F$ gives us $\frac{\partial \Box w_{ab}(t)}{\partial w_{ij}} $ for all $w_{ij} \in W_S$. After having updated the $p_{ij}^{ab}$ variables, (2) can be computed using the formula

\begin{displaymath}
\frac{\partial E(t)}{\partial w_{ij}} =
\sum_{w_{ab} \in W_F} \delta_{ab}(t)
p_{ij}^{ab}(t-1).
\end{displaymath}

A simple interface between $S$ and $F$ would provide one output unit $s_{ab} \in S$ for each weight variable $w_{ab} \in W_F$, where

\begin{displaymath}
\Box w_{ab}(t):=s_{ab}(t),
\end{displaymath} (4)

$s_{ab}(t)$ being the output unit's activation at time $t \geq 0$.

A disadvantage of (4) is that the number of output units in $S$ grows in proportion to the number of weights in $F$. An alternative is the following: Provide an output unit in $S$ for each unit in $F$ from which at least one fast weight originates. Call the set of these output units FROM. Provide an output unit in $S$ for each unit in $F$ to which at least one fast weight leads. Call the set of these output units TO. For each weight variable $w_{ab} \in W_F$ we now have a unit $s_a \in $ FROM and a unit $s_b \in$ TO. At time $t$, define $\Box w_{ab}(t):=g(s_a(t),s_b(t))$, where $g$ is differentiable with respect to all its parameters. As a representative example we will focus on the special case of $g$ being the multiplication operator:

\begin{displaymath}
\Box w_{ab}(t):=s_a(t)s_b(t).
\end{displaymath} (5)

Here the fast weights in $F$ are manipulated by the outputs of $S$ in a Hebb-like manner, assuming that $\sigma$ is just a sum-and-squash function as employed in the experiments described below.

One way to interpret the FROM/TO architecture is to view $S$ as a device for creating temorary associations by giving two parameters to the short term memory: The first parameter is an activation pattern over FROM representing a key to a temporary association pair, the second parameter is an activation pattern over TO representing the corresponding entry. Note that both key and entry may involve hidden units.

(4) and (5) differ in the way that error signals are obtained at $S$'s output units: If (4) is employed, then we use conventional back-propagation to compute $\frac{\partial s_{ab}(t)}{\partial w_{ij}} $ in (3). If (5) is employed, note that

\begin{displaymath}
\frac{\partial \Box w_{ab}(t)}{\partial w_{ij}} =
s_b(t) \...
...l w_{ij}} +
s_a(t) \frac{\partial s_b(t)}{\partial w_{ij}} .
\end{displaymath} (6)

Conventional back-propagation can be used to compute $\frac{\partial s_a(t)}{\partial w_{ij}} $ for each output unit $a$ and for all $w_{ij}$. The results can be kept in $\mid W_S \mid * \mid FROM \cup TO \mid$ variables. This makes it easy to solve (6) in a second pass.

The algorithm is local in time, its update-complexity per time step is $O(\mid W_F \mid \mid W_S \mid)$. However, it is not local in space (see [Schmidhuber, 1990b] for a definition of locality in space and time).



Subsections
next up previous
Next: On-Line Versus Off-Line Learning Up: LEARNING TO CONTROL FAST-WEIGHT Previous: The Task
Juergen Schmidhuber 2003-02-13

Back to Recurrent Neural Networks page