Memory cells and gate units: basic ideas. LSTM's basic unit is called a memory cell. Within each memory cell, there is a linear unit with a fixed-weight self-connection (compare Mozer's time constants [12]). This enforces constant, non-exploding, non-vanishing error flow within the memory cell. A multiplicative input gate unit learns to protect the constant error flow within the memory cell from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit learns to protect other units from perturbation by currently irrelevant memory contents stored in the memory cell. The gates learn to open and close access to constant error flow. Why is constant error flow important? For instance, with conventional ``backprop through time'' (BPTT, e.g., [20]) or RTRL (e.g., [15]), error signals ``flowing backwards in time'' tend to vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights. For the first theoretical error flow analysis see [7]. See [3] for a more recent, independent, essentially identical analysis.
LSTM details.
In what follows, $w_{uv}$ denotes the weight on the connection from unit $v$ to unit $u$. $net_u(t)$, $y^u(t)$ are net input and activation of unit $u$ (with activation function $f_u$) at time $t$. For all non-input units that aren't memory cells (e.g. output units), we have $y^u(t) = f_u(net_u(t))$, where $net_u(t) = \sum_v w_{uv} \, y^v(t-1)$.
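For concreteness, here is a minimal sketch (ours, not code from [8]) of this conventional unit update, assuming a logistic activation function $f_u$; the helper name is hypothetical:

\begin{verbatim}
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordinary_unit_update(w_u, y_prev):
    """y^u(t) = f_u(net_u(t)), with net_u(t) = sum_v w_uv * y^v(t-1).

    w_u    : weights from all source units v into unit u, shape (n,)
    y_prev : activations y^v(t-1) of those source units, shape (n,)
    """
    net_u = np.dot(w_u, y_prev)
    return logistic(net_u)
\end{verbatim}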
The $j$-th memory cell is denoted $c_j$.
Each memory cell is built around a central linear unit with
a fixed self-connection
(weight 1.0) and identity function as activation function
(see definition of $s_{c_j}$ below).
In addition to $net_{c_j}(t) = \sum_u w_{c_j u} \, y^u(t-1)$,
$c_j$ also gets input from a special unit $out_j$
(the ``output gate''),
and from another special unit $in_j$
(the ``input gate'').
$in_j$'s activation at time $t$ is denoted by $y^{in_j}(t)$,
$out_j$'s activation at time $t$ is denoted by $y^{out_j}(t)$.
$in_j$, $out_j$ are viewed as ordinary hidden units.
We have
$y^{in_j}(t) = f_{in_j}(net_{in_j}(t))$,
$y^{out_j}(t) = f_{out_j}(net_{out_j}(t))$,
where
$net_{in_j}(t) = \sum_u w_{in_j u} \, y^u(t-1)$,
$net_{out_j}(t) = \sum_u w_{out_j u} \, y^u(t-1)$.
The summation indices $u$ may stand for
input units,
gate units,
memory cells, or even
conventional hidden units if there are any
(see also paragraph on ``network topology'' below).
All these different types of
units may convey useful information about
the current state of the net.
For instance, an input gate (output gate)
may use inputs from other memory cells to
decide whether to store (access)
certain information in its memory cell.
There even may be recurrent self-connections like $w_{c_j c_j}$.
It is up to the user to define the network topology.
At time $t$, $c_j$'s output $y^{c_j}(t)$ is computed
in a sigma-pi-like fashion:
$y^{c_j}(t) = y^{out_j}(t) \, h(s_{c_j}(t))$, where the ``internal state'' $s_{c_j}(t)$ is
\[
s_{c_j}(0) = 0, \qquad
s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) \, g(net_{c_j}(t)) \quad \mbox{for } t > 0.
\]
The differentiable function $g$ squashes $net_{c_j}$; the differentiable function $h$ scales memory cell outputs computed from the internal state $s_{c_j}$.
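For illustration, the following NumPy sketch (ours, not code from [8]) implements one forward step of a single memory cell, under the illustrative assumptions that $f_{in_j}$ and $f_{out_j}$ are logistic and that $g = h = \tanh$; function and variable names are our own:

\begin{verbatim}
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(y_prev, s_prev, w_in, w_out, w_c):
    """One time step of a single memory cell c_j.

    y_prev : activations y^u(t-1) of all source units u, shape (n,)
    s_prev : internal state s_{c_j}(t-1), a scalar (s_{c_j}(0) = 0)
    w_in, w_out, w_c : weights into in_j, out_j and the cell input,
                       each of shape (n,)
    """
    y_in  = logistic(np.dot(w_in,  y_prev))   # input gate  y^{in_j}(t)
    y_out = logistic(np.dot(w_out, y_prev))   # output gate y^{out_j}(t)
    net_c = np.dot(w_c, y_prev)               # cell net input net_{c_j}(t)

    # internal state: fixed self-connection of weight 1.0 plus gated input
    s = s_prev + y_in * np.tanh(net_c)        # s_{c_j}(t)

    # sigma-pi-like output: output gate times scaled internal state
    y_c = y_out * np.tanh(s)                  # y^{c_j}(t)
    return y_c, s
\end{verbatim}

Feeding $y^{c_j}(t)$ back into the pool of source activations for the next step, together with the activations of the other units, yields the recurrent dynamics described above.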
Why gate units?
$in_j$ controls the error flow
to memory cell $c_j$'s input connections $w_{c_j u}$.
$out_j$ controls the error flow
from unit $c_j$'s output connections.
Error signals trapped within a memory cell cannot change -
but different error signals flowing into the cell (at different times)
via its output gate may get superimposed.
The output gate will have to learn which errors
to trap in its memory cell, by appropriately scaling them.
Likewise, the input gate will have to learn when
to release errors.
Gates open and close access to constant error flow.
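To make ``constant error flow'' explicit (this is our one-line restatement, under the truncation used in the learning scheme below, i.e., treating $y^{in_j}(t)$ and $net_{c_j}(t)$ as independent of $s_{c_j}(t-1)$): the internal state's self-connection has fixed weight 1.0, so
\[
\frac{\partial s_{c_j}(t)}{\partial s_{c_j}(t-1)} = 1.0
\quad \Longrightarrow \quad
\frac{\partial s_{c_j}(t)}{\partial s_{c_j}(t-k)}
  = \prod_{m=1}^{k} \frac{\partial s_{c_j}(t-m+1)}{\partial s_{c_j}(t-m)} = 1.0 ,
\]
and an error signal stored in $s_{c_j}$ is neither scaled up nor down across arbitrary time lags $k$, in contrast to the exponential, weight-dependent factors arising with conventional BPTT or RTRL.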
Network topology. There is one input, one hidden, and one output layer. The fully self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as hidden units located in the hidden layer). The hidden layer may also contain ``conventional'' hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in higher layers.
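As an illustration of this connectivity rule (unit counts and names below are arbitrary choices of ours, not from [8]), it can be spelled out as a boolean connection mask:

\begin{verbatim}
import numpy as np

# illustrative unit counts
n_in, n_out = 4, 2                    # input and output units
n_cells, n_gates, n_conv = 3, 6, 2    # memory cells, gates, conventional hidden units
n_hidden = n_cells + n_gates + n_conv

kind = (["in"] * n_in
        + ["cell"] * n_cells + ["gate"] * n_gates + ["conv"] * n_conv
        + ["out"] * n_out)
layer = [0] * n_in + [1] * n_hidden + [2] * n_out
n = len(kind)

# connects[u, v] == True means a directed connection from unit v to unit u
connects = np.zeros((n, n), dtype=bool)
for u in range(n):
    for v in range(n):
        if layer[u] == 1 and layer[v] == 1:
            connects[u, v] = True     # fully self-connected hidden layer
        elif layer[v] < layer[u] and kind[v] != "gate":
            connects[u, v] = True     # non-gate units feed all higher layers
\end{verbatim}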
Memory cell blocks.
$S$ memory cells sharing
one input gate and one
output gate form a
``memory cell block of size $S$''.
They can facilitate information storage.
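A block of size $S$ changes the bookkeeping only slightly; the sketch below (again ours, with the same illustrative choices of $f$, $g$, $h$ as above) shares one input gate and one output gate across $S$ internal states:

\begin{verbatim}
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_block_step(y_prev, s_prev, w_in, w_out, W_c):
    """One time step of a memory cell block of size S.

    y_prev : activations of all source units, shape (n,)
    s_prev : internal states of the S cells, shape (S,)
    w_in, w_out : weights into the shared gates, shape (n,)
    W_c    : weights into the S cell inputs, shape (S, n)
    """
    y_in  = logistic(np.dot(w_in,  y_prev))   # shared input gate
    y_out = logistic(np.dot(w_out, y_prev))   # shared output gate
    net_c = W_c @ y_prev                      # net inputs of all S cells

    s   = s_prev + y_in * np.tanh(net_c)      # S internal states
    y_c = y_out * np.tanh(s)                  # S cell outputs
    return y_c, s
\end{verbatim}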
Learning with excellent computational complexity -- see details in
the appendix of [8].
We use a variant of RTRL which
properly takes into account the altered
(sigma-pi-like)
dynamics caused by input and output gates.
However, to ensure constant error backprop,
like with truncated BPTT
[20],
errors arriving at ``memory cell net inputs''
(for cell $c_j$, this
includes
$net_{c_j}$,
$net_{in_j}$,
$net_{out_j}$)
do not get propagated back further in time (although they
do serve to change the incoming weights).
Only within memory cells are errors propagated
back through previous internal states $s_{c_j}$.
This enforces constant error flow within memory cells.
Thus
only the derivatives $\partial s_{c_j} / \partial w_{uv}$
need to be stored and updated.
Hence, the algorithm is very efficient, and
LSTM's update complexity per time
step is excellent in comparison to
other approaches such as RTRL:
given $n$ units and a fixed number of output units,
LSTM's update complexity per time step is at most
$O(n^2)$,
just like BPTT's.
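To make the bookkeeping behind this bound concrete, the following sketch (our simplification, not the algorithm from the appendix of [8]) carries forward, for one cell, the truncated derivatives of $s_{c_j}$ with respect to the weights feeding the cell input and its input gate; each such derivative is updated in $O(1)$ per step, so the whole network needs only $O(n^2)$ operations per time step:

\begin{verbatim}
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_step_with_derivatives(y_prev, s_prev, ds_dwc, ds_dwin, w_in, w_c):
    """Forward step of one cell plus truncated-RTRL bookkeeping.

    ds_dwc, ds_dwin : stored derivatives d s_{c_j} / d w_{c_j u} and
                      d s_{c_j} / d w_{in_j u}, each of shape (n,)
    Errors arriving at net_{c_j} and net_{in_j} are NOT propagated further
    back in time; the stored derivatives carry all the history that is kept.
    """
    net_in = np.dot(w_in, y_prev)
    net_c  = np.dot(w_c,  y_prev)
    y_in   = logistic(net_in)
    g      = np.tanh(net_c)

    s = s_prev + y_in * g                                  # new internal state

    # truncation: y_prev is treated as independent of the weights
    ds_dwc  = ds_dwc  + y_in * (1.0 - g**2) * y_prev       # d s / d w_{c_j u}
    ds_dwin = ds_dwin + g * y_in * (1.0 - y_in) * y_prev   # d s / d w_{in_j u}

    return s, ds_dwc, ds_dwin
\end{verbatim}

In this scheme, an external error arriving at the cell output at time $t$ is combined with $y^{out_j}(t)$, $h'(s_{c_j}(t))$ and these stored vectors to produce the weight changes; the output gate's weights need no stored derivatives, since errors reach them outside the internal state at each step.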