Basic idea of flat minima search. Our algorithm finds a large region in weight space with the property that each weight vector from that region leads to similar small error. Such regions are called ``flat minima''. To get an intuitive feeling for why ``flat'' minima are interesting, consider this (see also Wolpert [14]): a ``sharp'' minimum corresponds to weights that have to be specified with high precision, whereas a ``flat'' minimum corresponds to weights many of which can be specified with low precision. In the terminology of the theory of minimum description length (MDL), fewer bits of information are required to pick a ``flat'' minimum (corresponding to a ``simple'', low-complexity network). The MDL principle suggests that low network complexity corresponds to high generalization performance (see, e.g., [4,10]). Unlike Hinton and van Camp's method [3] (see appendix A.3), our approach does not depend on explicitly choosing a ``good'' prior.
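As an illustration only (not part of the algorithm described below), the following sketch probes such flatness numerically on a toy quadratic error surface: it perturbs the weights within a box and checks that the error barely increases, which is what allows the weights to be stored with few bits. The toy loss, the box half-width, and the number of probes are assumptions made for the example.
\begin{verbatim}
# Minimal flatness probe (illustration only, in JAX).
import jax
import jax.numpy as jnp

def loss(w):
    # toy training error with one flat direction (small curvature along w[1])
    return 10.0 * w[0] ** 2 + 0.01 * w[1] ** 2

w_min = jnp.zeros(2)      # a minimum of the toy loss
eps = 0.1                 # half-width of the perturbation box (assumed)

keys = jax.random.split(jax.random.PRNGKey(0), 1000)

def probe(key):
    # perturb every weight within [-eps, eps] and measure the error increase
    dw = jax.random.uniform(key, (2,), minval=-eps, maxval=eps)
    return loss(w_min + dw) - loss(w_min)

worst_increase = jnp.max(jax.vmap(probe)(keys))
print(worst_increase)     # small value => weights tolerate low precision
\end{verbatim}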
Our algorithm finds ``flat'' minima by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using Pearlmutter's and Møller's efficient second-order method [,7], we obtain the same order of complexity as with conventional backprop. The method automatically reduces the number of units, weights, and input lines, as well as the sensitivity of the outputs with respect to the remaining weights and units. Excellent experimental generalization results will be reported in section 4.
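To illustrate why the required second-order information need not cost more than conventional backprop (up to a constant factor), the following sketch computes a Hessian-vector product in Pearlmutter's style via forward-over-reverse differentiation, without ever forming the full Hessian. It is not the paper's exact flat-minimum search procedure; the toy network, data, and function names are assumptions of the example.
\begin{verbatim}
# Pearlmutter-style Hessian-vector product (sketch, in JAX).
import jax
import jax.numpy as jnp

def training_error(w, x, y):
    # toy one-layer net with mean squared error (assumed for illustration)
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

def hessian_vector_product(w, v, x, y):
    # differentiate the gradient along direction v: yields H v
    # at roughly the cost of one extra backprop pass
    grad_fn = lambda w_: jax.grad(training_error)(w_, x, y)
    _, hv = jax.jvp(grad_fn, (w,), (v,))
    return hv

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 5))
y = jnp.ones((32,))
w = jnp.zeros(5)
v = jnp.ones(5)
print(hessian_vector_product(w, v, x, y))
\end{verbatim}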