- Weight decay.
A special error term
(in addition to the standard term enforcing matches between
desired outputs and actual outputs) encourages weights
close to zero.
The idea is that a zero weight costs few bits to specify
and is therefore ``simple'' [Hinton and van Camp, 1993].
Pearlmutter and Hinton were probably the first to propose weight decay,
while Rumelhart was perhaps the first to suggest
its use for reducing overfitting.
Variants of weight decay were successfully applied
by Weigend et al. (1990), Krogh and Hertz
(1992), and others.
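A common instantiation (the exact penalty term and its coefficient differ among the cited papers) augments the standard output error $E_0(w)$ with a quadratic penalty on the weights:
\[
E(w) \;=\; E_0(w) \;+\; \lambda \sum_i w_i^2 ,
\]
where $\lambda > 0$ determines how strongly the weights are drawn towards zero.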
- Soft weight sharing.
Nowlan and Hinton (1992)
introduce
an additional objective function
encouraging groups of weights
with nearly equal values.
The weights are taken to be generated by mixtures of Gaussians.
The fewer Gaussians there are, and the closer a weight lies
to the center of one of them, the higher its probability
and the fewer bits are needed to encode it
(according to classical information theory [Shannon, 1948]).
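As a sketch of the resulting complexity cost (the notation here is ours, not Nowlan and Hinton's): each weight receives a prior given by an adaptive mixture of Gaussians with mixing proportions $\pi_j$, means $\mu_j$, and variances $\sigma_j^2$, and the penalty is the negative log-probability of the weights under this mixture,
\[
p(w_i) \;=\; \sum_j \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2),
\qquad
\Omega(w) \;=\; -\sum_i \log p(w_i).
\]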
- Bayesian strategies for backprop nets.
MacKay evaluates
hyper-parameters (such as weight-decay rates)
with respect to their probabilities of generating
the observed data [MacKay, 1992].
Estimates of the probabilities are computed on the basis
of Gaussian assumptions.
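Schematically (a sketch of the evidence framework, not MacKay's exact notation): a hyper-parameter $\alpha$, such as a weight-decay rate, is scored by the probability it assigns to the observed data $D$,
\[
p(D \mid \alpha) \;=\; \int p(D \mid w)\, p(w \mid \alpha)\, dw ,
\]
where the integral is approximated by fitting a Gaussian around the most probable weights.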
- Optimal Brain Surgeon.
Hassibi and Stork (1993)
use second order information
to obtain ``simple'' nets
by pruning weights whose influence
on the error is minimal, and changing other weights to compensate.
See [LeCun et al., 1991] for a related
approach and
[Vapnik, 1992; Guyon et al., 1992]
for some theoretical analysis.
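In brief (following the usual formulation of Optimal Brain Surgeon, with $H$ the Hessian of the error with respect to the weights and $e_q$ the unit vector selecting weight $q$): the weight with the smallest saliency
\[
L_q \;=\; \frac{w_q^2}{2\,[H^{-1}]_{qq}}
\]
is deleted, and the remaining weights are adjusted by
\[
\delta w \;=\; -\,\frac{w_q}{[H^{-1}]_{qq}}\; H^{-1} e_q
\]
to compensate for the deletion.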
- ``Non-algorithmic'' MDL methods based on Gaussian priors.
To minimize the sum of the description lengths of a neural net and
its errors,
Hinton and van Camp (1993) assume Gaussian weight priors and Gaussian
error distributions, and
minimize the asymmetric divergence (or Kullback-Leibler distance)
between prior and posterior after training.
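Schematically (our notation, not theirs): with a Gaussian prior $P$ over weights and a Gaussian posterior $Q$ adapted during training, the objective combines the expected cost of coding the errors with the divergence between posterior and prior,
\[
\mathcal{L} \;=\; \mathbb{E}_{Q}\!\left[\text{error code length}\right] \;+\; D_{\mathrm{KL}}(Q \,\|\, P),
\]
both terms having closed forms under the Gaussian assumptions.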
- Methods for finding ``flat'' minima.
Hochreiter and Schmidhuber (1994, 1996) use efficient
second order methods to search for large connected
regions of ``acceptable'' error minima.
This corresponds to ``simple'' networks with low-precision weight
vectors, low description length and low expected overfitting.
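The link to description length can be sketched informally (this is an intuition, not Hochreiter and Schmidhuber's precise objective): if the error stays acceptable while weight $w_i$ is perturbed within an interval of width $\delta_i$, then $w_i$ only needs to be specified up to precision $\delta_i$, costing roughly
\[
\log_2 \frac{\Delta_i}{\delta_i}
\]
bits when the weight ranges over an interval of width $\Delta_i$; large connected regions of acceptable minima therefore correspond to short weight descriptions.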
- ``Mutual information networks.''
Deco et al. (1993)
measure network complexity with respect to given data
by measuring the mutual information between
inputs and internal representations extracted by the
hidden units.
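In standard notation (a sketch; Deco et al. specify the actual estimator), the measured quantity is the mutual information between the input $X$ and the internal representation $Y$ extracted by the hidden units,
\[
I(X; Y) \;=\; H(Y) - H(Y \mid X),
\]
i.e., the number of bits of input information retained by the hidden units.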
- Methods for removing redundant information from input data.
If the input data can be compressed, then the networks processing
the data can, in general, be made smaller (and simpler).
From the standpoint of classical information theory,
an optimal compression algorithm is one that builds
a factorial code of the input data -- a code
with statistically independent components, e.g., [Barlow, 1989]
(see Linsker (1988) for related work on the special, linear case).
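Formally, a code with components $y_1, \ldots, y_n$ is factorial if the joint distribution of the components factorizes into the product of their marginals,
\[
p(y_1, \ldots, y_n) \;=\; \prod_{i=1}^{n} p(y_i),
\]
so that no component carries information about any other.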
Various ``neural'' methods for compressing input data are known.
[Schmidhuber, 1992b] describes
a ``neural'' method designed
to generate factorial codes.
[Atick et al., 1992] focuses on visual inputs.
[Schmidhuber, 1992a]
focuses on loss-free sequence compression.
[Becker, 1991] provides numerous additional references.
There are also numerous heuristic
constructive methods, in which network size grows whenever
the training data are underfitted.
MDL approaches in other areas of machine learning include
[Quinlan and Rivest, 1989; Gao and Li, 1989; Milosavljevic and Jurka, 1993; Pednault, 1989].
Among the implemented methods,
neither the neural net approaches nor the others
are general in the sense of Solomonoff, Kolmogorov, and Levin.
All the previous implementations
use measures for ``simplicity'' that lack the
universality and elegance of those based on Kolmogorov complexity and
algorithmic information theory.
Many previous approaches
are based on ad-hoc (usually Gaussian) priors.
The remainder of this paper is mostly devoted to
simulations of the more general method based on
the universal prior, self-sizing programs,
and the probabilistic search algorithm favoring
candidates with low Levin complexity over
candidates with high Levin complexity.
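Recall the standard definition of Levin complexity: for a string $x$,
\[
Kt(x) \;=\; \min_{p}\,\bigl\{\, |p| + \log t(p) \,\bigr\},
\]
where the minimum ranges over programs $p$ that compute $x$ on a universal machine, $|p|$ is the length of $p$, and $t(p)$ its runtime.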
Using toy problems that appear trivial but are actually non-trivial,
it will be demonstrated that
the approach can lead to generalization results
that more traditional neural net algorithms cannot match.
It should be mentioned, however, that
this does not say much about the applicability of the method to
real world tasks.