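The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $q > 1$, formula (2) can be rewritten as

\[
(W_{u^T})^T \, F'(t-1) \prod_{m=2}^{q-1} \left( W F'(t-m) \right) W_v \, f'_v(net_v(t-q)),
\]

where the weight matrix $W$ is defined by $[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, $u$'s incoming weight vector $W_{u^T}$ is defined by $[W_{u^T}]_i := [W]_{ui} = w_{ui}$, and $F'(t-m)$ is the diagonal matrix of first order derivatives defined as: $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f'_i(net_i(t-m))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$-th column and $j$-th row of matrix $A$, and $[x]_i$ is the $i$-th component of vector $x$.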
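As a sanity check on this rewriting, the matrix expression can be compared numerically with a brute-force sum over all paths from unit $u$ to unit $v$. The sketch below is only illustrative: the helper names `matrix_form` and `path_sum`, the random weights, and the random values `d[m][i]` standing in for $f'_i(net_i(t-m))$ are assumptions and not part of the original analysis. The index order of the weights in the path sum is chosen to match the matrix form.

```python
import itertools
import random


def matrix_form(W, d, u, v, q):
    """Evaluate row_u(W) F'(t-1) prod_{m=2}^{q-1} (W F'(t-m)) col_v(W) f'_v(t-q).

    W[i][j] plays the role of w_ij; d[m][i] stands in for f'_i(net_i(t-m)).
    d[0] is unused padding so that m runs from 1 to q.
    """
    n = len(W)
    # Rightmost factor: v's outgoing weight vector scaled by f'_v(net_v(t-q)).
    y = [W[i][v] * d[q][v] for i in range(n)]
    # Apply the factors (W F'(t-m)) from right to left: m = q-1, ..., 2.
    for m in range(q - 1, 1, -1):
        scaled = [d[m][i] * y[i] for i in range(n)]
        y = [sum(W[i][j] * scaled[j] for j in range(n)) for i in range(n)]
    # Apply F'(t-1), then take the inner product with u's incoming weights.
    return sum(W[u][i] * d[1][i] * y[i] for i in range(n))


def path_sum(W, d, u, v, q):
    """Sum the products of weights and derivatives over all paths u -> ... -> v."""
    n = len(W)
    total = 0.0
    for inner in itertools.product(range(n), repeat=q - 1):
        path = (u,) + inner + (v,)  # path[0] = u, path[q] = v
        term = 1.0
        for m in range(1, q + 1):
            term *= d[m][path[m]] * W[path[m - 1]][path[m]]
        total += term
    return total


# Hypothetical small network: 3 units, error propagated back q = 4 steps.
random.seed(0)
n, q, u, v = 3, 4, 0, 2
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
d = [[random.uniform(0, 0.25) for _ in range(n)] for _ in range(q + 1)]
print(matrix_form(W, d, u, v, q), path_sum(W, d, u, v, q))
```

The two printed numbers should agree up to floating-point rounding. The path sum enumerates $n^{q-1}$ paths, so it is practical only for very small $n$ and $q$.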
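Using a matrix norm $\|\cdot\|_A$ compatible with vector norm $\|\cdot\|_x$, we define

\[
f'_{max} := \max_{m=1,\ldots,q} \{ \| F'(t-m) \|_A \}.
\]

For $\max_{i=1,\ldots,n} \{ |x_i| \} \leq \|x\|_x$ we get $|x^T y| \leq n \, \|x\|_x \, \|y\|_x$. Since

\[
|f'_v(net_v(t-q))| \leq \| F'(t-q) \|_A \leq f'_{max},
\]

we obtain the following inequality:

\[
\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right|
\leq n \, (f'_{max})^q \, \|W_v\|_x \, \|W_{u^T}\|_x \, \|W\|_A^{q-2}
\leq n \left( f'_{max} \, \|W\|_A \right)^q.
\]

This inequality results from

\[
\|W_v\|_x = \|W e_v\|_x \leq \|W\|_A \, \|e_v\|_x \leq \|W\|_A
\]

and

\[
\|W_{u^T}\|_x = \|e_u W\|_x \leq \|W\|_A \, \|e_u\|_x \leq \|W\|_A,
\]

where $e_k$ is the unit vector whose components are 0 except for the $k$-th component, which is 1.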
Note that this is a weak, extreme case upper bound: it will be reached only if all $\|F'(t-m)\|_A$ take on maximal values, and if the contributions of all paths across which error flows back from unit $u$ to unit $v$ have the same sign. Large $\|W\|_A$, however, typically results in small values of $\|F'(t-m)\|_A$, as confirmed by experiments (see, e.g., [11]).
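To get a feeling for how loose the bound can be, the following sketch evaluates the error flow factor for random weights and compares it with both sides of the inequality. It assumes one particular compatible pair of norms (maximum absolute row sum for matrices, maximum absolute component for vectors). The function names, the random weights, and the values `d[m][i]` standing in for $f'_i(net_i(t-m))$ are illustrative assumptions, not taken from the original analysis.

```python
import random


def max_row_sum(W):
    """Matrix norm: maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)


def max_abs(x):
    """Vector norm: maximum absolute component."""
    return max(abs(c) for c in x)


def error_flow(W, d, u, v, q):
    """Evaluate row_u(W) F'(t-1) prod_{m=2}^{q-1} (W F'(t-m)) col_v(W) f'_v(t-q)."""
    n = len(W)
    y = [W[i][v] * d[q][v] for i in range(n)]
    for m in range(q - 1, 1, -1):  # factors (W F'(t-m)), right to left
        scaled = [d[m][i] * y[i] for i in range(n)]
        y = [sum(W[i][j] * scaled[j] for j in range(n)) for i in range(n)]
    return sum(W[u][i] * d[1][i] * y[i] for i in range(n))


def upper_bounds(W, d, u, v, q):
    """Return (middle, right) terms of the inequality for the norms above."""
    n = len(W)
    # For a diagonal matrix the max row sum norm is its largest absolute entry.
    f_max = max(max_abs(d[m]) for m in range(1, q + 1))
    norm_W = max_row_sum(W)
    col_v = [W[i][v] for i in range(n)]  # v's outgoing weight vector
    row_u = W[u]                         # u's incoming weight vector
    middle = n * f_max ** q * max_abs(col_v) * max_abs(row_u) * norm_W ** (q - 2)
    right = n * (f_max * norm_W) ** q
    return middle, right


# Hypothetical network: 5 units, mixed-sign weights, q = 6 steps back in time.
random.seed(0)
n, q, u, v = 5, 6, 1, 3
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
d = [[random.uniform(0, 0.25) for _ in range(n)] for _ in range(q + 1)]
middle, right = upper_bounds(W, d, u, v, q)
print(abs(error_flow(W, d, u, v, q)), middle, right)
```

With mixed-sign weights the path contributions partly cancel, so the computed magnitude usually lies well below both bounds. If all weights are equal and positive and all derivatives are equal, the middle term of the inequality is attained exactly, which is the same-sign extreme case described above.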
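For example, with norms

\[
\|W\|_A := \max_r \sum_s |w_{rs}|
\]

and

\[
\|x\|_x := \max_r |x_r|,
\]

we have $f'_{max} = 0.25$ for the logistic sigmoid. We observe that if

\[
|w_{ij}| \leq w_{max} < \frac{4.0}{n} \quad \forall i, j,
\]

then $\|W\|_A \leq n \, w_{max} < 4.0$ will result in exponential decay; by setting $\tau := \frac{n \, w_{max}}{4.0} < 1.0$, we obtain

\[
\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n \, \tau^q.
\]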
We refer to Hochreiter's thesis [11] for more details.
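The decay in this example is easy to tabulate. The sketch below uses hypothetical values ($n = 10$, $w_{max} = 0.3$) chosen only for illustration, and `decay_bound` is a made-up helper name. It assumes the logistic sigmoid case with $f'_{max} = 0.25$ and the norms of the example.

```python
def decay_bound(n, w_max, q):
    """Return n * tau**q with tau = n * w_max / 4.0.

    Assumes logistic sigmoid units (f'_max = 0.25) and |w_ij| <= w_max < 4.0 / n.
    """
    tau = n * w_max / 4.0
    if not 0.0 <= tau < 1.0:
        raise ValueError("requires 0 <= w_max < 4.0 / n")
    return n * tau ** q


# Hypothetical setting: 10 units, weights bounded by 0.3, so tau is about 0.75.
n, w_max = 10, 0.3
for q in (2, 10, 50, 100):
    print(q, decay_bound(n, w_max, q))
```

Because $\tau < 1$, each additional time step multiplies the bound by the same factor $\tau$, so the bound shrinks geometrically in $q$.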
Juergen Schmidhuber
2003-02-19