In a realistic application, of course, it is implausible to assume that the errors of all predictors are minimal at all times. After the functions computing the internal representations have been modified, the predictors must be trained for some time to ensure that they can adapt to the new situation.

Each of the predictors, the representational
modules, and the potentially
available auto-associator can be
implemented as a feed-forward back-propagation
network (e.g. Werbos, 1974).
There are two alternating passes: one for minimizing
prediction errors, the other for maximizing the objective function
of the representational modules (that is, for minimizing predictability). Here
is an *off-line version* based on successive
'epochs' (presentations of the whole ensemble of training patterns):

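*PASS 1 (minimizing prediction errors):*

*Repeat for a 'sufficient' number of training epochs:*

*1. For all training patterns:*

*1.1. For all representational modules: Compute the module's output for the current pattern.*

*1.2. For all predictors: Compute the prediction of the corresponding module's output.*

*2. Change each weight of each predictor in proportion to the negative gradient of that predictor's error with respect to the weight, where the factor of proportionality is a positive constant learning rate.*

*PASS 2 (minimizing predictability):*

*2. For all training patterns:*

*2.1. For all representational modules: Compute the module's output for the current pattern.*

*2.2. For all predictors: Compute the prediction of the corresponding module's output.*

*2.3. If an auto-associator is involved, compute its output.*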
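*3. Change each weight of each representational module in proportion to the gradient of the objective function being maximized with respect to the weight (gradient ascent), again using a positive constant learning rate.*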
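The two passes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the objective maximized in PASS 2 is taken to be the summed prediction error, each module and each predictor is a single sigmoid unit without hidden units, and central finite differences stand in for back-propagation to keep the sketch short. All names (`code`, `predict`, `total_error`, `numeric_grad`, `pass1`, `pass2`, `train_offline`) are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def code(W, x):
    # Representational modules: every code unit sees the whole input pattern.
    # Each row of W holds the input weights followed by a bias.
    return [sigmoid(sum(w * xj for w, xj in zip(row[:-1], x)) + row[-1])
            for row in W]

def predict(U, y):
    # Predictor i sees every code unit except unit i itself.
    out = []
    for i, row in enumerate(U):
        others = y[:i] + y[i + 1:]
        out.append(sigmoid(sum(u * o for u, o in zip(row[:-1], others)) + row[-1]))
    return out

def total_error(W, U, patterns):
    # Mean over all patterns of the summed squared prediction errors.
    err = 0.0
    for x in patterns:
        y = code(W, x)
        err += sum((p - t) ** 2 for p, t in zip(predict(U, y), y))
    return err / len(patterns)

def numeric_grad(f, M, eps=1e-5):
    # Central finite differences (assumption: stand-in for back-propagation).
    # f takes no arguments and reads M, which is perturbed in place.
    G = [[0.0] * len(row) for row in M]
    for i, row in enumerate(M):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = f()
            row[j] = old - eps
            down = f()
            row[j] = old
            G[i][j] = (up - down) / (2 * eps)
    return G

def step(M, G, rate):
    for row, grad_row in zip(M, G):
        for j in range(len(row)):
            row[j] += rate * grad_row[j]

def pass1(W, U, patterns, epochs, eta=0.1):
    # PASS 1: predictors descend on the prediction error; W stays fixed.
    for _ in range(epochs):
        step(U, numeric_grad(lambda: total_error(W, U, patterns), U), -eta)

def pass2(W, U, patterns, eta=0.1):
    # PASS 2: representational modules ascend on the same error; U stays fixed.
    step(W, numeric_grad(lambda: total_error(W, U, patterns), W), +eta)

def train_offline(W, U, patterns, rounds=20, pass1_epochs=5):
    # Alternate the two passes, giving the predictors time to adapt each round.
    for _ in range(rounds):
        pass1(W, U, patterns, pass1_epochs)
        pass2(W, U, patterns)
```

Running several PASS 1 epochs per PASS 2 step reflects the remark that the predictors must be given time to adapt after the representations change; the ratio used here is arbitrary.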

The off-line version above is perhaps not as appealing as a more local procedure where computing time is distributed evenly between PASS 2 and PASS 1:

*An on-line version.*
An extreme *on-line* version does not sweep
through the whole training ensemble before changing
weights. Instead it processes the same
single input pattern (randomly chosen according to
the input distribution) in both PASS 1 and PASS 2 and
immediately changes
the weights of all involved networks simultaneously, according to
the contribution of that pattern to the respective objective functions.

Simultaneous updating of the representations and the predictors, however, introduces a potential for instabilities: both the predictors and the representational modules perform gradient descent (or gradient ascent) on functions that change while the other networks learn. Given a particular implementation of the basic principle, experiments are needed to find out how much on-line interaction is permissible. In the toy experiments reported below, on-line learning did not cause major problems.

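It should be noted that if the objective function of the
representational modules is simply the sum of the prediction errors
(section 5), then with a given input pattern
we may compute the gradient of this sum with respect to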
both the predictor weights and the weights of the representation
modules *in a single pass*. After this we may simply perform
gradient *descent* in the predictor weights and gradient
*ascent* in the remaining weights (it is just a matter of
flipping signs). This was actually done in the
experiments.
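A minimal sketch of such a single-pass update follows, assuming two code units fed by two inputs, one sigmoid unit per module and per predictor, and a squared-error sum as the shared objective. The function names and the parameter layout (`params["W"]`, `params["U"]`) are illustrative assumptions, not taken from the experiments.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(params, x):
    # params["W"]: two rows [w_1, w_2, bias] for the representational modules.
    # params["U"]: two rows [u, bias] for the predictors.
    y = [sigmoid(r[0] * x[0] + r[1] * x[1] + r[2]) for r in params["W"]]
    # Predictor 0 sees only code unit 1; predictor 1 sees only code unit 0.
    P = [sigmoid(params["U"][i][0] * y[1 - i] + params["U"][i][1])
         for i in range(2)]
    return y, P

def error(params, x):
    y, P = forward(params, x)
    return sum((P[i] - y[i]) ** 2 for i in range(2))

def gradients(params, x):
    # One forward and one backward sweep give dE/dW and dE/dU together.
    y, P = forward(params, x)
    delta = [2 * (P[i] - y[i]) * P[i] * (1 - P[i]) for i in range(2)]
    gU = [[delta[i] * y[1 - i], delta[i]] for i in range(2)]
    gW = []
    for i in range(2):
        j = 1 - i
        # y[i] is the target of predictor i and the input of predictor j.
        dE_dy = -2 * (P[i] - y[i]) + delta[j] * params["U"][j][0]
        d = dE_dy * y[i] * (1 - y[i])
        gW.append([d * x[0], d * x[1], d])
    return gW, gU

def online_step(params, x, eta=0.1):
    gW, gU = gradients(params, x)
    for i in range(2):
        for k in range(2):
            params["U"][i][k] -= eta * gU[i][k]  # descent: predictor weights
        for k in range(3):
            params["W"][i][k] += eta * gW[i][k]  # ascent: module weights
```

The only difference between the two kinds of update is the sign in `online_step`; the gradient itself is computed once.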

*Local maxima.*
Like all gradient ascent procedures, the method is subject
to the problem of local maxima.
A standard method for dealing with local maxima is to repeat
the above algorithm with different weight initializations
(using a fixed number of training epochs for each repetition)
until a (near-)factorial
code is found. Each repetition corresponds to a local search around
the point in weight space defined by the current weight
initialization.
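The restart strategy can be illustrated on a toy problem. The sketch below does not train any networks: a hypothetical one-dimensional objective with one poor and one better local maximum stands in for the real objective, `ascend` stands in for a fixed number of training epochs, and the threshold `good_enough` stands in for a test of whether the code is (near-)factorial.

```python
import random

def objective(w):
    # Hypothetical toy objective: a poor local maximum near w = -1
    # and a better one near w = +1.
    return -(w * w - 1.0) ** 2 + 0.3 * w

def ascend(w, steps=200, rate=0.05):
    # Plain gradient ascent for a fixed number of steps.
    for _ in range(steps):
        w += rate * (-4.0 * w * (w * w - 1.0) + 0.3)
    return w

def train_with_restarts(good_enough, max_restarts=20, seed=0):
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(max_restarts):
        w = ascend(rng.uniform(-2.0, 2.0))  # fresh random initialization
        score = objective(w)
        if score > best_score:
            best_w, best_score = w, score
        if best_score >= good_enough:       # stop once the result is acceptable
            break
    return best_w, best_score
```

Each call to `ascend` is one local search; which maximum it reaches depends only on the starting point, which is why repeating the search from different initializations helps.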

*Shared hidden units.*
It should be mentioned that some or all of the
representational modules may share hidden units.
The same holds for the predictors.
Predictors sharing hidden units, however, will have to
be updated sequentially: No representational unit
may be used to predict its own activity.
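One way to respect this constraint with a shared hidden layer is sketched below. It is an assumption made for illustration, not a procedure described in the text: the shared network is evaluated once per code unit, and the unit currently being predicted is clamped to zero at the input, so that its own activity cannot leak into its prediction. The weight layout of `V` and `O` is hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def shared_predictions(y, V, O):
    # V: shared hidden weights, one row per hidden unit (n inputs + bias).
    # O: output weights, one row per predicted code unit (h hidden + bias).
    preds = []
    for i in range(len(y)):
        # Sequential evaluation: hide unit i from the shared hidden layer.
        masked = [0.0 if j == i else yj for j, yj in enumerate(y)]
        hidden = [sigmoid(sum(v * m for v, m in zip(row[:-1], masked)) + row[-1])
                  for row in V]
        preds.append(sigmoid(sum(o * h for o, h in zip(O[i][:-1], hidden))
                             + O[i][-1]))
    return preds
```

Because the hidden activations differ for each predicted unit, the predictions cannot all be read off a single forward sweep, which is the cost of sharing hidden units.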
