
A.4. PSEUDO CODE OF THE ALGORITHM

Below is the algorithm in pseudo code (using fast multiplication as in appendix A.3.2). Comments are delimited by ** ... **. Note: the pseudo code was omitted from the version for Neural Computation. We recommend not blindly reimplementing the algorithm from the pseudo code, but making a serious effort to understand it by consulting the body of the paper as well. We believe that this will greatly facilitate proper, problem-specific use of the algorithm.

Notation. In what follows, the integer variables i, j, u, v stand for units.
The variable k is reserved for output units only.
L is the number of layers, where the L-th layer is the output layer and the 1st layer is the input layer.
The current pattern consists of input vector x and target vector t (see section 2).
x_i is the component of the input vector corresponding to input unit i.
t_k is the component of the target vector corresponding to output unit k.
w_ij is the real-valued weight on the connection from unit j to unit i (see section 2).
s_i is the net input of unit i (see equation (34) and the text thereafter).
f_i is the activation function of unit i; f'_i is the first derivative and f''_i the second derivative of f_i (see appendix A.2).
y_i is the activation of the i-th unit (see appendix A.2).
e_k is the error of the k-th output unit.
∂y_k/∂w_ij is the derivative of output unit k's activation with respect to w_ij (see equation (43)).
δ^k_i is the backpropagated derivative ∂y_k/∂s_i (see equation (44)).
A further variable holds the quantity defined in equation (45).
abs(x) denotes the absolute value of real x.
kron(i, j) returns 1 if i = j and 0 otherwise.
sign(x) returns the sign of real x.
Auxiliary variables are used to compute the right-hand side of equation (40):
one accumulates the sums over the output units in (40),
one holds the denominator in the second line of equation (40),
and one holds the numerator in the second line of equation (40).

W stands for the number of weights that make a significant contribution to the computation of the current pattern.
Δw_ij is an approximation of w_ij's precision, computed according to equation (40) (only an approximation, because the exact precision is unknown).
insignificant(i,j) marks whether or not w_ij provides a significant contribution to the computation of the current pattern.
Further auxiliary variables hold the quantities defined in equations (46), (47), (48), and (49) (they are used in the two fast-multiplication passes below).
∇B is the gradient of the additional error term B (see equation (50)).
E_q is the current pattern's quadratic error (see section 2 and appendix A.1).
α is the learning rate for the quadratic error.
λ is the learning rate for the additional error term (the following values are used to update λ according to Weigend et al., 1991; see also section 5.6).
Δλ is a parameter needed for updating λ.
φ is a parameter needed for updating λ; a typical value is 0.5.
E_old is the most recent epoch's average error.
E_now is the current epoch's average error.
E_av is the exponentially weighted average epoch error.
γ is the parameter for the exponentially weighted error -- a typical value is 0.9 or 0.99, depending on training set size.
E_tol is the tolerable error level -- it depends on the task to be solved (see section 2 and appendix A.1, equation (13)).
n is the number of exemplars observed during training.
p is the length of an epoch, measured in number of presented training patterns.
n div p is the current number of epochs so far, where div represents integer division.
|∇E_q| and |∇B| are variables required for normalizing B's gradient to the length of E_q's gradient.
used(i,j) is TRUE if w_ij was alive for at least one pattern presented during the previous epoch, and FALSE otherwise.
alive(i,j) is TRUE if used(i,j) is TRUE and if used(i,j) was always TRUE during all epochs since w_ij was set alive for the last time (otherwise alive(i,j) is FALSE).
← denotes the assignment operator.
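
For concreteness, the per-weight bookkeeping above can be held in a few arrays. A minimal Python sketch (the naming and layout are ours, not the paper's; only the semantics follow the notation above):

import numpy as np

class FMSState:
    # Per-weight state for the bookkeeping described above. Entries
    # for non-existing connections simply keep alive = False
    # (cf. the initialization below). n is the total number of units.
    def __init__(self, n, rng):
        self.w = rng.normal(0.0, 0.1, (n, n))                # weights w_ij
        self.delta_w = np.zeros((n, n))                      # precision estimates Δw_ij
        self.insignificant = np.zeros((n, n), dtype=bool)    # per-pattern flag
        self.used = np.zeros((n, n), dtype=bool)             # per-epoch flag
        self.alive = np.zeros((n, n), dtype=bool)            # persistent flag

state = FMSState(n=10, rng=np.random.default_rng(0))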

• For simplicity, the description of the algorithm neglects bias weights and "true units".

• Targets should be scaled to bound the first-order derivatives of the output units -- see the text after equation (36) in A.2 (e.g., for sigmoid output units, scale targets to a subrange strictly inside the sigmoid's activation range; a sketch follows this list).

• Removing more than one weight at a time (setting more than one alive(i,j) to FALSE simultaneously) may cause the error to increase. Removing only one weight at a time leads to smoother performance improvement.

• To prevent accidental weight removal in the case of small training sets, we recommend not using too many near-zero inputs -- weights from such inputs may be evaluated as insignificant despite being significant.

• Likewise, the random weight initialization at the beginning of the learning phase may cause accidental weight removal due to small, random, initial derivatives. This can be prevented by keeping all weights alive for a certain initial time interval.

• Initially, E_tol's value does not yet have a sensible interpretation. One may start with a large E_tol and decrease it as the error decreases.

• For each pattern, there is a minimal Δw_ij (stored in Δw_min). Δw_min represents the weight precision required for significant weights.
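
The target-scaling recommendation above can be implemented in a few lines. A minimal Python sketch (the bounds lo and hi are illustrative choices for a sigmoid active in (0, 1), not values prescribed by the paper):

import numpy as np

def scale_targets(targets, lo=0.2, hi=0.8):
    # Affinely map the targets into an interior subrange of the
    # sigmoid's activation interval, keeping the output units away
    # from saturation and bounding their first-order derivatives.
    # Assumes targets.max() > targets.min().
    t_min, t_max = targets.min(), targets.max()
    return lo + (hi - lo) * (targets - t_min) / (t_max - t_min)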

Speeding up the algorithm. It makes sense to separate the algorithm into two phases. Phase 1 is conventional backprop; phase 2 is FMS. The backprop phase consists of the forward pass, the first backward pass, and the weight update based on ∇E_q (marked in the algorithm). Start with phase 1. Switch to phase 2 once the average error falls below a threshold value; switch back to phase 1 once it exceeds a second, larger threshold (both threshold values can be changed).

Two-phase learning does not sensitively depend on the two threshold values (but avoid values that are always too small). Two-phase learning is justified because weights with large Δw_ij (and correspondingly small output derivatives) hardly influence E_q's gradient -- it makes sense to let FMS focus on low-precision weights, and let backprop focus on the others. A sketch of the switching logic follows.
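
A minimal Python sketch of the phase switch (the threshold names switch_down and switch_up and the use of E_av as the monitored error are our assumptions; the paper only states that two changeable threshold values are compared against the average error):

def choose_phase(phase, e_av, switch_down, switch_up):
    # Run plain backprop (phase 1) until the average error drops
    # below switch_down, then run FMS (phase 2); fall back to
    # backprop if the error climbs above switch_up again.
    if phase == 1 and e_av < switch_down:
        return 2
    if phase == 2 and e_av > switch_up:
        return 1
    return phase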

ALGORITHM.

Initialization. Set the significance factor c to a power of 10 (the exponent is the difference between the numbers of significant digits required for maximal and minimal precision).
Set λ to an arbitrary small value.
Set n and W to 0.
Initialize w_ij randomly for all existing connections.
Initialize the learning rate α (typically, with a small value), and provide a criterion for stopping the learning procedure.
Set γ, for instance, γ ← 0.9 or γ ← 0.99 (depending on training set size), and set the λ-update parameters Δλ and φ.
Set E_tol to some desired error value to be reached by learning.
Set E_now to 0.
Set p, the epoch length.
Set E_av to 0.
Set alive(i,j) ← TRUE for all existing w_ij.
Set used(i,j) ← FALSE for all i, j.
Set E_old to a large value.
Set alive(i,j) ← FALSE for all non-existing w_ij.

While training not stopped do
begin(while)

select pattern pair (x, t).
** The following variables can also be set right after they were used for the last time in the previous loop iteration. **
set all components of ∇B and of the equation-(40) accumulators to 0
set insignificant(i,j) to FALSE for all i, j
set E_q and W to 0, and set Δw_min to a large value

** Forward pass. **

for all input units i do
begin
    y_i ← x_i
end

for l ← 2 to L do
begin
    for all units i in layer l do
    begin
        s_i ← 0
        for all units j in layer l-1 do
        begin
            if (alive(i,j)) do
            begin
                s_i ← s_i + w_ij · y_j
            end
        end
        y_i ← f_i(s_i)
    end
end

** Compute the error. **

for all output units k do
begin
    e_k ← t_k − y_k
    E_q ← E_q + (e_k)^2 / 2
end

** Compute the error gradient. **

** 1st backward pass. **

for all output units k do
begin
    δ^k_k ← f'_k(s_k)
    for l ← L−1 downto 2 do
    begin
        for all units i in layer l do
        begin
            sum ← 0
            (IF l < L−1 THEN: for all units u in layer l+1 ELSE: let u ← k) do
            begin
                if (alive(u,i)) do
                begin
                    sum ← sum + w_ui · δ^k_u
                end
            end
            δ^k_i ← f'_i(s_i) · sum
            if (abs(δ^k_i) < 1E-5) set δ^k_i ← sign(δ^k_i) · 1E-5   ** to avoid division overflow **
        end
    end
end

** End of conventional backprop (phase 1). **

** Compute Δw_ij. **

** We recommend introducing additional local variables for the inner loops (e.g., for repeatedly used products and partial sums). **

for all output units k do
begin
    for all i, j such that (alive(i,j)) do
    begin
        compute ∂y_k/∂w_ij and accumulate the sums over k in equation (40) (see equations (43)-(45))
    end
end

** Some weights are insignificant for computing the current pattern. **

for all i, j such that (alive(i,j)) do
begin
    compute Δw_ij according to equation (40)
    if (Δw_ij < Δw_min) do
    begin
        Δw_min ← Δw_ij
    end
end

for all i, j such that (alive(i,j)) do
begin
    if (Δw_ij ≥ c · Δw_min) do
    begin
        insignificant(i,j) ← TRUE
        for all output units k do
        begin
            remove w_ij's contribution from the sums over k in equation (40)
        end
    end
    else do
    begin
        W ← W + 1
        used(i,j) ← TRUE
    end
end

** Update the equation-(40) variables after having marked the current pattern's insignificant weights. **

for all output units k do
begin
    recompute the k-dependent terms of equation (40)
    for all i, j such that (alive(i,j) AND NOT insignificant(i,j)) do
    begin
        recompute w_ij's contribution (see equations (43)-(45))
    end
end

for all output units k do
begin
    for all i, j such that (alive(i,j) AND NOT insignificant(i,j)) do
    begin
        recompute Δw_ij's numerator and denominator terms (see equation (40))
        for all output units u do
        begin
            recompute the sums over the output units in equation (40)
        end
    end
end

** 2nd forward pass (fast multiplication, see appendix A.3.2). **

for all output units k do
begin
    for all input units i do
    begin
        initialize the forward variables of equation (46) for unit i
    end
    for l ← 2 to L do
    begin
        (IF l < L THEN: for all units i in layer l ELSE: let i ← k) do
        begin
            for all units j in layer l−1 do
            begin
                if (alive(i,j) AND NOT insignificant(i,j)) do
                begin
                    accumulate the quantities of equation (46)
                end
            end
            finish computing the quantity of equation (47) for unit i
        end
    end
end

** 2nd backward pass. **

for all output units k do
begin
    initialize the backward variables of equation (48) for output unit k
    for l ← L−1 downto 2 do
    begin
        for all units i in layer l do
        begin
            initialize unit i's backward variable
            (IF l < L−1 THEN: for all units u in layer l+1 ELSE: let u ← k) do
            begin
                if (alive(u,i) AND NOT insignificant(u,i)) do
                begin
                    accumulate the quantities of equation (48)
                end
            end
            finish computing the quantity of equation (49) for unit i
        end
    end
end

for all i, j such that (alive(i,j) AND NOT insignificant(i,j)) do
begin
    compute ∇B's component for w_ij (see equation (50)), and accumulate |∇E_q| and |∇B|
end

** ∇B is normalized to the length of E_q's gradient. **

** End of B's gradient computation (phase 2). **

** Weight update. **

for all i, j such that (alive(i,j) AND NOT insignificant(i,j)) do
begin
    w_ij ← w_ij − α · ∂E_q/∂w_ij − λ · (|∇E_q| / |∇B|) · ∂B/∂w_ij
end

** Update learning parameters. **

if (n mod p = 0) do   ** "mod" is the modulo function **
begin
    E_old ← E_now; E_now ← the current epoch's average error; E_av ← γ · E_av + (1 − γ) · E_now
    ** Update λ according to Weigend et al. (1991). **
    if (E_now < E_tol OR E_now < E_old) do
    begin
        λ ← λ + Δλ
    end
    else do
    begin
        if (E_now < E_av) do
        begin
            λ ← λ − Δλ
            if (λ < 0) do
            begin
                λ ← 0
            end
        end
        else do
        begin
            λ ← φ · λ   ** cut λ **
        end
    end

** Update the alive flags: a weight stays alive only if it was alive (marked by **
** used(i,j)) for at least one pattern presented during the previous epoch. **
for all i, j such that (w_ij exists) do
begin
    if (used(i,j) = FALSE) do
    begin
        alive(i,j) ← FALSE
    end
    used(i,j) ← FALSE
end
if ((n div p) mod 100 = 0 OR E_now > 2.0 · E_tol) do
** Weights are re-animated if the average error is too large; weights are also **
** re-animated every 100th epoch, to enable faster reduction of the quadratic error **
** (due to weight changes, some previously dead weight may turn out to deserve **
** to live again); one may use values other than 100 and 2.0. **
begin
    for all i, j such that w_ij exists do
    begin
        alive(i,j) ← TRUE
    end
end
end

decide whether to stop learning or not

end(while)
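
For concreteness, here is a small Python rendering of the λ-update schedule reconstructed above (the case structure follows Weigend et al., 1991, as described in section 5.6; variable names are ours, and the exact comparisons should be checked against the paper before reuse):

def update_lambda(lam, d_lam, phi, e_now, e_old, e_tol, e_av):
    # Grow λ while the epoch error is acceptable or falling; shrink
    # it gently when the error rises but stays below its running
    # average e_av; cut it hard (factor phi, e.g. 0.5) otherwise.
    if e_now < e_tol or e_now < e_old:
        lam += d_lam
    elif e_now < e_av:
        lam = max(0.0, lam - d_lam)
    else:
        lam *= phi
    return lam

The caller would first update the running average, e.g. e_av ← γ · e_av + (1 − γ) · e_now, and invoke update_lambda once per epoch.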
