*Weight decay.* A special error term (in addition to the standard term enforcing matches between desired and actual outputs) encourages weights close to zero. The idea is that a zero weight costs few bits to specify and is therefore ``simple'' [Hinton and van Camp, 1993]. Pearlmutter and Hinton were probably the first to propose weight decay, while Rumelhart was perhaps the first to suggest its use for reducing overfitting. Variants of weight decay were successfully applied by Weigend et al. (1990), Krogh and Hertz (1992), and others (see the first sketch below).

*Soft weight sharing.* Nowlan and Hinton (1992) introduce an additional objective function encouraging groups of weights with nearly equal values. The weights are taken to be generated by a mixture of Gaussians. The fewer the Gaussians, and the closer a weight lies to the center of one of them, the higher its probability, and the fewer bits are needed to encode it (according to classical information theory [Shannon, 1948]); see the second sketch below.

*Bayesian strategies for backprop nets.* MacKay evaluates hyper-parameters (such as weight-decay rates) with respect to their probabilities of generating the observed data [MacKay, 1992]. Estimates of these probabilities are computed on the basis of Gaussian assumptions.

*Optimal Brain Surgeon.* Hassibi and Stork (1993) use second-order information to obtain ``simple'' nets by pruning weights whose influence on the error is minimal, changing the remaining weights to compensate (see the third sketch below). See [LeCun et al., 1991] for a related approach and (Vapnik, 1992; Guyon et al., 1992) for some theoretical analysis.

*``Non-algorithmic'' MDL methods based on Gaussian priors.* To minimize the sum of the description lengths of a neural net and its errors, Hinton and van Camp (1993) assume Gaussian weight priors and Gaussian error distributions, and minimize the asymmetric divergence (or Kullback-Leibler distance) between prior and posterior after training.

*Methods for finding ``flat'' minima.* Hochreiter and Schmidhuber (1994, 1996) use efficient second-order methods to search for large connected regions of ``acceptable'' error minima. Such regions correspond to ``simple'' networks with low-precision weight vectors, low description length, and low expected overfitting.

*``Mutual information networks.''* Deco et al. (1993) quantify network complexity with respect to the given data as the mutual information between the inputs and the internal representations extracted by the hidden units.

*Methods for removing redundant information from input data.* If the input data can be compressed, the networks processing the data can in general be made smaller (and simpler). From the standpoint of classical information theory, an optimal compression algorithm builds a factorial code of the input data -- a code with statistically independent components, e.g., [Barlow, 1989] (see Linsker (1988) for related work on the special, linear case). Various ``neural'' methods for compressing input data are known. [Schmidhuber, 1992b] describes a ``neural'' method designed to generate factorial codes. [Atick et al., 1992] focuses on visual inputs. [Schmidhuber, 1992a] focuses on loss-free sequence compression. [Becker, 1991] provides numerous additional references.
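To make the weight-decay idea concrete, here is a minimal sketch in Python/NumPy. The quadratic penalty, the decay strength `lambda_wd`, and the toy linear model are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def loss_with_decay(w, X, y, lambda_wd=1e-2):
    """Squared error on (X, y) plus a term pushing weights toward zero."""
    residual = X @ w - y
    task_loss = 0.5 * np.mean(residual ** 2)
    decay_term = 0.5 * lambda_wd * np.sum(w ** 2)  # near-zero weights are "cheap"
    return task_loss + decay_term

def grad_with_decay(w, X, y, lambda_wd=1e-2):
    residual = X @ w - y
    return X.T @ residual / len(y) + lambda_wd * w  # the extra term shrinks w each step

# Toy usage: a few steps of gradient descent on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w = rng.normal(size=3)
for _ in range(100):
    w -= 0.1 * grad_with_decay(w, X, y)
```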
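Similarly, a hedged sketch of the soft-weight-sharing penalty: the negative log-likelihood of the weights under a mixture of Gaussians. In the actual method the mixture parameters are adapted together with the weights; the fixed two-component mixture below is an arbitrary illustration.

```python
import numpy as np

def soft_sharing_penalty(w, pis, mus, sigmas):
    """Negative log-likelihood of weights under a Gaussian mixture.

    Weights near a mixture center are "cheap" to encode; in the full
    method, (pis, mus, sigmas) would be learned along with w.
    """
    z = (w[:, None] - mus[None, :]) / sigmas[None, :]       # (num_weights, num_components)
    comp = pis[None, :] * np.exp(-0.5 * z ** 2) / (np.sqrt(2 * np.pi) * sigmas[None, :])
    return -np.sum(np.log(comp.sum(axis=1) + 1e-12))

# Toy usage: weights clustered near 0 and near 0.5 incur a small penalty.
w = np.array([0.01, -0.02, 0.49, 0.51])
pis = np.array([0.5, 0.5]); mus = np.array([0.0, 0.5]); sigmas = np.array([0.1, 0.1])
print(soft_sharing_penalty(w, pis, mus, sigmas))
```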
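Finally, a sketch of a single Optimal-Brain-Surgeon-style pruning step, using the standard second-order formulas (saliency L_q = w_q^2 / (2 [H^{-1}]_{qq}), compensation dw = -(w_q / [H^{-1}]_{qq}) H^{-1} e_q). The tiny inverse Hessian here is made up for illustration.

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """Delete the weight whose removal increases the error least,
    and adjust the remaining weights to compensate.

    H_inv is the inverse Hessian of the error w.r.t. the weights.
    """
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))               # cheapest weight to delete
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # sets w[q] exactly to zero
    return q, w + dw

# Toy usage with a made-up inverse Hessian.
w = np.array([0.8, 0.05, -0.6])
H_inv = np.array([[1.0, 0.1, 0.0],
                  [0.1, 2.0, 0.0],
                  [0.0, 0.0, 1.5]])
q, w_new = obs_prune_step(w, H_inv)   # w_new[q] == 0
```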

There are also numerous heuristic
*constructive methods*, where the network
grows when it underfits the training data.
MDL approaches in other areas of machine learning include
[Quinlan and Rivest, 1989; Gao and Li, 1989; Milosavljevic and Jurka, 1993; Pednault, 1989].
Among the implemented methods,
neither the neural net approaches
nor the others
are general in the sense of Solomonoff, Kolmogorov, and Levin.
All previous implementations
use measures of ``simplicity'' that lack the
universality and elegance of those based on Kolmogorov complexity and
algorithmic information theory.
Many previous approaches
rest on ad hoc (usually Gaussian) priors.

The remainder of this paper is mostly devoted to simulations of the more general method based on the universal prior, self-sizing programs, and the probabilistic search algorithm that favors candidates with low Levin complexity over candidates with high Levin complexity. Using certain seemingly trivial but actually non-trivial toy problems, it will be demonstrated that the approach can lead to generalization results unmatched by more traditional neural net algorithms. It should be mentioned, however, that this does not say much about the method's applicability to real-world tasks.
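To convey the flavor of a search biased toward low Levin complexity, here is a minimal, hypothetical sketch of a Levin-style procedure in Python. The bit-string ``programs'', the toy `run_program` interpreter, and the hidden-target task are all illustrative assumptions, not the paper's actual setup; the point is only the time allocation, which grants a candidate of length L a per-phase budget proportional to 2^(-L), so that ``simple'' (short and fast) solutions are tried first.

```python
from itertools import product

TARGET = '10110'  # hypothetical task: discover this bit string

def run_program(bits, max_steps):
    # Stand-in "interpreter": pretend emitting each bit costs one step;
    # a program that needs more steps than its budget times out (None).
    return bits if len(bits) <= max_steps else None

def solves_task(output):
    return output == TARGET

def levin_search(max_phase=20):
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)   # shorter programs get more time
            for bits in map(''.join, product('01', repeat=length)):
                out = run_program(bits, max_steps=budget)
                if out is not None and solves_task(out):
                    return bits, phase
    return None

print(levin_search())   # finds '10110' once its length fits the phase budget
```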
