Parameters and other details.
In all our experiments we associate input data with itself, using an
FMS-trained 3-layer autoassociator (AA).
Unless stated otherwise
we use 700,000 training exemplars,
sigmoid hidden units (HUs),
sigmoid output units,
noninput units with an additional bias input,
ordinary (non-bias) weights initialized randomly in a small interval around zero,
bias hidden weights initialized with -1.0,
and FMS's regularization parameter $\lambda$ with 0.5.
The hidden units' activation functions (AFs) make sparseness easier to recognize,
whereas the output AFs are fairly arbitrary:
linear AFs, or the same AFs as for the HUs, would do as well.
Targets are scaled to an interval well inside the output AF's range, except for Task 2.2.
Target scaling (1) prevents tiny first-order derivatives of output units
(which may cause floating-point overflows)
and (2) allows for proving that
the FMS algorithm makes the Hessian entries of output units
decrease as the corresponding weight precisions
increase (Hochreiter and Schmidhuber 1997a).
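For concreteness, the following is a minimal NumPy sketch of such a 3-layer autoassociator's forward pass (our illustration, not the paper's code): the class name, the unit-interval sigmoids, and the initialization interval [-0.1, 0.1] are assumptions, and the FMS regularizer itself (Hochreiter and Schmidhuber 1997a) is omitted.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoassociator:
    """Sketch of a 3-layer autoassociator: input -> sigmoid HUs ->
    sigmoid outputs. The FMS regularizer and the paper's exact AFs
    and initialization interval are not reproduced here."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Ordinary (non-bias) weights: small random values (assumed interval).
        self.W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in))
        self.W2 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        # Noninput units have an additional bias input;
        # bias hidden weights start at -1.0, as stated above.
        self.b1 = np.full(n_hidden, -1.0)
        self.b2 = np.zeros(n_in)

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)  # code components (HU activations)
        y = sigmoid(self.W2 @ h + self.b2)  # reconstruction of the input x
        return h, y
\end{verbatim}
An FMS-trained AA would add the flatness penalty of Hochreiter and Schmidhuber (1997a) to the reconstruction error before backpropagation.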
Comparison.
In sections 4.3 and 4.4 we compare LOCOCODE to
simple variants of
``independent component analysis'' (ICA, e.g.,
Jutten and Herault 1991,
Cardoso and Souloumiac 1993,
Molgedey and Schuster 1994,
Comon 1994,
Bell and Sejnowski 1995, Amari et al. 1996,
Nadal and Parga 1997)
and ``principal component analysis'' (PCA, e.g., Oja 1989).
ICA is realized by Cardoso's (1993) JADE (Joint Approximate
Diagonalization of Eigen-matrices) algorithm
(we used the Matlab JADE version obtained via FTP from sig.enst.fr).
JADE is based on whitening and subsequent
joint diagonalization of 4th-order cumulant matrices.
For PCA and ICA, 1,000 training exemplars are used in the
case of the bars input fields and 3,000 in the case of the
real-world image input fields.
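To indicate what these baselines involve, here is a minimal NumPy sketch (ours, not the referenced Matlab code) of PCA by eigendecomposition and of the whitening stage with which JADE begins; the subsequent joint diagonalization of the 4th-order cumulant matrices is not reproduced.
\begin{verbatim}
import numpy as np

def pca_code(X, n_components):
    """Project centered data onto its leading principal components.
    X: (n_samples, n_inputs) array of training exemplars."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = eigvecs[:, ::-1][:, :n_components]  # descending eigenvalue order
    return Xc @ top

def whiten(X, n_components):
    """Whitening, JADE's first stage: decorrelate the data and rescale
    it to unit variance. The joint diagonalization of the 4th-order
    cumulant matrices that JADE performs next is omitted here."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    vals = eigvals[::-1][:n_components]
    vecs = eigvecs[:, ::-1][:, :n_components]
    return (Xc @ vecs) / np.sqrt(vals)
\end{verbatim}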
Information content.
To measure the information conveyed by the various codes obtained
in sections 4.3 and 4.4 we train a
standard backprop net on the training set used for code generation.
Its inputs are the code components; its task is
to reconstruct the original input
(for all tasks except ``noisy bars'' the original
input is scaled such that all input
components lie in a common fixed interval).
The net has as many biased sigmoid hidden
units as there are biased sigmoid output units.
We train it for 5,000 epochs, ignoring the risk of overfitting.
The training set consists
of 500 fixed exemplars in the case of the bars input fields and
of 5,000 in the case of the real-world image input fields. The test set
consists of 500 off-training-set exemplars (in the case of the real-world
images we use a separate test image). The average mean squared error (MSE)
on the test set is used to determine the reconstruction error.
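The following sketch illustrates this measurement under stated assumptions: a one-hidden-layer sigmoid net trained by plain batch gradient descent (the learning rate, seed, and unit-interval sigmoids are illustrative choices, not taken from the text, and inputs are assumed scaled to the unit interval), mapping code components back to the original inputs and returning the test-set MSE.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(code_tr, x_tr, code_te, x_te,
                         epochs=5000, lr=0.1, seed=0):
    """Train a 1-hidden-layer sigmoid backprop net to map code
    components back to the original inputs; return test-set MSE."""
    rng = np.random.default_rng(seed)
    n_code, n_in = code_tr.shape[1], x_tr.shape[1]
    n_hid = n_in                      # as many HUs as output units
    W1 = rng.uniform(-0.1, 0.1, (n_code, n_hid)); b1 = np.zeros(n_hid)
    W2 = rng.uniform(-0.1, 0.1, (n_hid, n_in));   b2 = np.zeros(n_in)
    for _ in range(epochs):           # no early stopping: overfitting ignored
        h = sigmoid(code_tr @ W1 + b1)
        y = sigmoid(h @ W2 + b2)
        # backprop of the squared reconstruction error
        dy = (y - x_tr) * y * (1 - y) / len(x_tr)
        dh = (dy @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ dy;       b2 -= lr * dy.sum(axis=0)
        W1 -= lr * code_tr.T @ dh; b1 -= lr * dh.sum(axis=0)
    h = sigmoid(code_te @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    return np.mean((y - x_te) ** 2)   # average test-set MSE
\end{verbatim}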
Coding efficiency -- discrete codes.
Coding efficiency
is measured by the average number of bits needed to code a test-set
input pixel. The code components are scaled to a fixed interval
that is partitioned into 100 equal subintervals, which
yields 100 possible discrete values per component.
Assuming independence of the code components
we estimate the probability of
each discrete code value by Monte Carlo sampling on the training set.
To obtain the bits per pixel (Shannon's optimal value)
on the test set, we divide the sum of the
negative logarithms (base 2) of all discrete code component probabilities
(averaged over the test set)
by the number of input components.
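A minimal sketch of this estimate follows; estimating one histogram per code component, as well as the per-component binning bounds and the guard value for unseen bins, are illustrative choices of ours.
\begin{verbatim}
import numpy as np

def bits_per_pixel(code_tr, code_te, n_in, n_bins=100):
    """Estimate the Shannon-optimal bits per input pixel of a code:
    discretize each component into n_bins values, estimate bin
    probabilities on the training set, then average the negative
    log2 probabilities over the test set and divide by n_in."""
    lo, hi = code_tr.min(axis=0), code_tr.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)

    def discretize(c):
        bins = ((c - lo) / scale * n_bins).astype(int)
        return np.clip(bins, 0, n_bins - 1)

    d_tr, d_te = discretize(code_tr), discretize(code_te)
    total = 0.0
    for j in range(code_tr.shape[1]):            # per code component
        counts = np.bincount(d_tr[:, j], minlength=n_bins)
        p = counts / counts.sum()                # training-set estimate
        p = np.maximum(p, 1e-12)                 # guard unseen bins
        total += -np.log2(p[d_te[:, j]]).mean()  # avg test code length
    return total / n_in
\end{verbatim}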