Below is the algorithm in pseudo code (using fast multiplication as in appendix A.3.2). Comments are marked by ``**''. Note: the pseudo code was omitted from the version for Neural Computation. We recommend not blindly reimplementing the algorithm from the pseudo code, but making a serious effort to understand it by consulting the body of the paper as well. We believe that this will greatly facilitate proper, problem-specific use of the algorithm.
Notation.
In what follows, the variable integers ... and ... stand for units.
Variables ... are reserved for output units only.
... is the number of layers, where the ...th layer is the output layer and the 1st layer is the input layer.
The current pattern consists of input vector ... and target vector ... (see section 2).
... is the component of the input vector corresponding to input unit ... .
... is the component of the output vector corresponding to output unit ... .
... is the real-valued weight on the connection from unit ... to unit ... (see ... in section 2).
... is the net input of unit ... (see equation (34) and the text thereafter).
... is the activation function of unit ... , ... is its first derivative, and ... is its second derivative (see appendix A.2).
... is the activation of the ...-th unit (see appendix A.2).
... is the error of the ...-th output unit.
... is ... (see equation (43)).
... is ... (see equation (44)).
... is ... (see equation (45)).
... is ... , the gradient of the quadratic error.
abs(...) denotes the absolute value of the real number ... .
kron(... , ...) returns 1 if its arguments are equal and 0 otherwise.
sign(...) returns the sign of the real number ... .
... are variables (used to compute the right-hand side of equation (40)).
... (see the sums over ... in (40)).
... (see the denominator in the second line of equation (40)).
... (see the numerator in the second line of equation (40)).
... stands for the number of weights that make a significant contribution to the computation of the current pattern.
... is an approximation of ...'s precision (an approximation, because ... is unknown).
... marks whether or not ... provides a significant contribution to the computation of the current pattern.
... is ... (see equation (40)).
... is ... (see equation (46)).
... is ... (see equation (47)).
... is ... (see equation (49)).
... is ... (see equation (48)).
... is ... , the gradient of the additional error term ... (see equation (50)).
... is the current pattern's quadratic error (see section 2 and appendix A.1).
... is the learning rate for the quadratic error.
... is the learning rate for the additional error term (the following values are used to make ... updates according to Weigend et al., 1991; see also section 5.6).
... is a parameter needed for updating ... .
... is a parameter needed for updating ... ; a typical value is 0.5.
... is the most recent epoch's average error.
... is the current epoch's average error.
... is the exponentially weighted average epoch error.
... is the parameter for the exponentially weighted error -- a typical value is 0.9 or 0.99, depending on training set size.
... is the tolerable error level -- it depends on the task to be solved (see section 2 and appendix A.1, equation (13)).
... is the number of exemplars observed during training.
... is the length of an epoch, measured in the number of presented training patterns.
... is the current number of epochs so far, where ... represents integer division.
... are variables required for normalizing ...'s gradient to the length of ...'s gradient.
... is TRUE if ... was alive for at least one pattern presented during the previous epoch, and FALSE otherwise.
... is TRUE if ... is TRUE and if ... was always TRUE during all epochs since ... was last set alive (otherwise ... is FALSE).
... denotes the assignment operator.
Additional comments.
Speeding up the algorithm.
It makes sense to separate the algorithm into two phases. Phase 1 is conventional backprop; phase 2 is FMS. The backprop phase consists of the forward pass, the first backward pass, and the weight update based on ... (marked in the algorithm). Start with phase 1. Switch to phase 2 if ... . Switch back to phase 1 if ... (the values ... and ... can be changed). Two-phase learning does not depend sensitively on ... (but avoid values that are always too small). Two-phase learning is justified because weights with large ... (and small ...) hardly influence ...'s gradient -- it makes sense to let FMS focus on the low-precision weights and to let backprop focus on the others.
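In executable form, the phase switch can be as small as the following Python sketch. It only illustrates the control flow just described, assuming the switch is driven by comparing an averaged epoch error against two thresholds; the names average_error, switch_to_fms and switch_to_backprop are placeholders chosen here, not symbols from the paper.

def select_phase(current_phase, average_error, switch_to_fms, switch_to_backprop):
    # Phase 1 = conventional backprop, phase 2 = FMS.
    if current_phase == 1 and average_error < switch_to_fms:
        return 2   # error is low enough: let FMS search for a flat minimum
    if current_phase == 2 and average_error > switch_to_backprop:
        return 1   # error has grown again: fall back to plain backprop
    return current_phase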
ALGORITHM.
Initialization.
Set ... or ... (the exponent is the difference between the numbers of significant digits required for maximal and minimal precision).
Set ... to an arbitrary small value.
Set ... and ... to 0.
Initialize ... for all ... .
Initialize ... (typically, ...), and provide a criterion for stopping the learning procedure.
Set ... , for instance, ... or ... or ... .
Set ... to some desired error value after learning.
Set ... to 0.
Set ... .
Set ... to 0.
Set ... TRUE for all ... .
Set ... FALSE for all ... .
Set ... to a large value.
Set ... FALSE for non-existing ... .
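Purely as an illustration, the tunable quantities referred to by the initialization above could be gathered in one structure such as the following; every field name is invented for this sketch and has to be mapped to the paper's own symbols by the reader.

from dataclasses import dataclass

@dataclass
class FMSSettings:
    # Illustrative names only; see the notation section for the intended quantities.
    e_tol: float               # tolerable error level (task dependent, equation (13))
    delta_lambda: float        # increment used when adapting the additional-term learning rate
    cut_factor: float = 0.5    # multiplicative parameter for that adaptation (typical value 0.5)
    alpha: float = 0.9         # weighting of the exponentially averaged epoch error (0.9 or 0.99)
    epoch_length: int = 100    # epoch length in presented training patterns
    eta: float = 0.01          # learning rate for the quadratic error
    lambda_init: float = 1e-6  # learning rate for the additional error term (start small)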
While training not stopped do
begin(while)
select pattern pair ... .
** The following variables may instead be reset immediately after their last use in the previous loop iteration. **
set all components of ... to 0
set all components of ... to FALSE
set ... to 0
** Forward pass. **
for all input units ... do
begin ... end
for ... to ... do
begin
for all units ... in layer ... do
begin
for all units ... in layer ... do
begin
if ( ... ) do
begin ... end
end
end
end
** Compute the error. **
for all output units ... do
begin ... end
** Compute the error gradient **
** First backward pass. **
for all output units ... do
begin
for ... to ... do
begin
for all units ... in layer ... do
begin
(IF ... THEN: for all units ... in layer ... ELSE: ... ) do
begin
if ( ... ) do
begin
set abs ... 1E-5   ** to avoid division overflow **
end
end
end
end
end
** End of conventional backprop (phase 1). **
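Phase 1 is standard backpropagation. As a reference point only -- not a transcription of the pseudo code above -- here is a minimal Python/NumPy version for a fully connected network with one hidden layer, sigmoid units, and the quadratic error; it may help in checking the phase-1 part of one's own implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, W1, W2, eta):
    # Forward pass.
    hidden = sigmoid(W1 @ x)
    output = sigmoid(W2 @ hidden)
    # First backward pass: deltas for the quadratic error E = 0.5 * ||target - output||^2.
    delta_out = (output - target) * output * (1.0 - output)
    delta_hid = (W2.T @ delta_out) * hidden * (1.0 - hidden)
    # Weight update based on E's gradient alone (this is all phase 1 does).
    W2 -= eta * np.outer(delta_out, hidden)
    W1 -= eta * np.outer(delta_hid, x)
    return 0.5 * float(np.sum((target - output) ** 2))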
** compute ... **
** we recommend introducing additional local variables for inner loops, such as ... and ... **
for all output units ... do
begin
for all ... , such that ( ... ) do
begin ... end
end
** some weights are insignificant for computing the current pattern **
for all ... , such that ( ... ) do
begin
if ( ... ) do
begin ... end
end
for all ... , such that ( ... ) do
begin
if ( ... ) do
begin
... TRUE
for all output units ... do
begin ... end
end
else do
begin
... TRUE
end
end
** update variables after having marked the current pattern's insignificant weights **
for all output units ... do
begin
for all ... , such that ( ... AND NOT ... ) do
begin ... end
end
for all output units ... do
begin
for all ... , such that ( ... AND NOT ... ) do
begin
for all output units ... do
begin ... end
end
end
** Forward pass. **
for all output units ... do
begin
for all input units ... do
begin ... end
for ... to ... do
begin
(IF ... THEN: for all units ... in layer ... ELSE: ... ) do
begin
for all units ... in layer ... do
begin
if ( ... AND NOT ... ) do
begin ... end
end
end
end
end
** Second backward pass. **
for all output units ... do
begin
for ... to ... do
begin
for all units ... in layer ... do
begin
(IF ... THEN: for all units ... in layer ... ELSE: ... ) do
begin
if ( ... AND NOT ... ) do
begin ... end
end
end
end
end
** Normalize ...'s gradient to the length of ...'s gradient. **
for all ... , such that ( ... AND NOT ... ) do
begin ... end
** ... **
** End of ...'s gradient computation (phase 2). **
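The normalization step rescales one gradient so that its length matches the other's. Assuming the plain Euclidean length over the weights currently in use, a sketch (with variable names that are not the paper's) is:

import numpy as np

def normalize_to_length(grad_b, grad_e, eps=1e-12):
    # Rescale grad_b so that ||grad_b|| equals ||grad_e||.
    length_e = np.linalg.norm(grad_e)
    length_b = np.linalg.norm(grad_b)
    if length_b < eps:
        return grad_b            # nothing to rescale
    return grad_b * (length_e / length_b)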
** Weight update. **
for all ... , such that ( ... AND NOT ... ) do
begin ... end
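Assuming the weight update simply combines the two gradients, each scaled by its own learning rate (the exact update is given by the equations cited in the notation), the step amounts to:

def weight_update(w, grad_e, grad_b, eta, lam):
    # eta: learning rate for the quadratic error E,
    # lam: learning rate for the additional error term (adapted once per epoch).
    return w - eta * grad_e - lam * grad_b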
** Update learning parameters. **
if ( ... mod ... ) do   ** ``mod'' is the modulo function **
begin
** ... update according to Weigend et al. (1991). **
if ( ... OR ... ) do
begin ... end
else do
begin
if ( ... ) do
begin
if ( ... ) do
begin ... end
end
else do
begin ... end
end
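The adaptation of the learning rate for the additional error term follows Weigend et al. (1991). A common formulation of that schedule is sketched below purely for orientation; the authoritative conditions are those of the pseudo code above. The exponentially weighted average error itself is usually maintained as e_weighted := alpha * e_weighted + (1 - alpha) * e_current, with alpha as in the notation.

def update_lambda(lam, e_current, e_previous, e_weighted, e_tol, delta_lam, cut_factor=0.5):
    if e_current < e_tol or e_current < e_previous:
        return lam + delta_lam      # training improves: strengthen the additional term
    if e_current < e_weighted:
        return lam - delta_lam      # error rose, but is still below the long-term average
    return cut_factor * lam         # error rose above the average: cut the term back hard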
** update which weights are alive: a weight is alive if it was alive (marked by ... ) **
** for at least one pattern presented during the previous epoch. **
for all ... , such that ( ... ) do
begin
if ( ... FALSE) do
begin
... FALSE
end
... FALSE
end
if ( ... mod ... OR ... 2.0 ... ) do
** weights are re-animated if the average error is too large; weights are also **
** re-animated every 100th epoch, to enable faster reduction of the quadratic error **
** (due to weight changes, some previously dead weights may turn out to deserve **
** to live again); one may use values other than 100 and 2.0 **
begin
for all ... such that ... exists do
begin
... TRUE
end
end
end
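The epoch-end bookkeeping can be pictured with two Boolean arrays over the weights, one marking use within the current epoch and one marking whether a weight is still alive. The sketch below is a simplification under that assumption; the constants 100 and 2.0 are the ones mentioned in the comments above and may be changed.

import numpy as np

def end_of_epoch(alive, used_this_epoch, epoch, e_weighted, e_tol,
                 revive_every=100, revive_factor=2.0):
    # A weight stays alive only if it contributed to at least one pattern of the epoch.
    alive &= used_this_epoch
    used_this_epoch[:] = False        # reset the per-epoch markers
    # Re-animate all weights periodically, or when the averaged error grows too large.
    if epoch % revive_every == 0 or e_weighted > revive_factor * e_tol:
        alive[:] = True
    return alive, used_this_epoch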
decide whether to stop learning or not
end(while)