Most previous algorithms for finding low-complexity networks with high generalization capability are based on different prior assumptions. They can be broadly classified into two categories (but see Schmidhuber (1994a) for an exception):
(1) Assumptions about the prior weight distribution. Hinton and van Camp (1993) and Williams (1994) assume that pushing the posterior weight distribution close to the weight prior leads to ``good'' generalization (see more details below). Weight decay (e.g., Hanson & Pratt, 1989; Krogh & Hertz, 1992) can be derived, e.g., from Gaussian or Laplace weight priors. Nowlan and Hinton (1992) assume that a distribution of networks with many similar weights generated by Gaussian mixtures is ``better'' a priori. MacKay's weight priors (1992b) are implicit in additional penalty terms, which embody the assumptions made. The problem with the approaches above is this: there may not be a ``good'' weight prior for all possible architectures and training sets. With FMS, however, we do not have to select a ``good'' weight prior -- instead we choose a prior over input/output functions. This automatically takes the net architecture and the training set into account.
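For illustration (this is a standard textbook sketch, not taken from any of the papers above), weight decay arises as the maximum a posteriori (MAP) penalty induced by a Gaussian or Laplace weight prior:

```latex
% MAP estimation: minimize the negative log posterior
E_{\mathrm{MAP}}(w) = \underbrace{-\log p(D \mid w)}_{\text{training error}}
                      \; - \; \log p(w) \; + \; \mathrm{const}.
% A Gaussian prior  p(w) \propto \exp\big(-\tfrac{\lambda}{2} \sum_{ij} w_{ij}^2\big)
% contributes the quadratic weight-decay penalty
-\log p(w) = \frac{\lambda}{2} \sum_{ij} w_{ij}^2 + \mathrm{const},
% while a Laplace prior  p(w) \propto \exp\big(-\lambda \sum_{ij} |w_{ij}|\big)
% contributes the absolute-value penalty instead:
-\log p(w) = \lambda \sum_{ij} |w_{ij}| + \mathrm{const}.
```

Here $-\log p(D \mid w)$ plays the role of the conventional error term, and the hyperparameter $\lambda$ is fixed by the width of the chosen prior -- which is precisely the choice the approaches above must get right.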
(2) Prior assumptions about how theoretical results on early stopping and network complexity carry over to practical applications. Such assumptions are implicit in methods based on validation sets (Mosteller & Tukey, 1968; Stone, 1974; Eubank, 1988; Hastie & Tibshirani, 1993), e.g., ``generalized cross validation'' (Craven & Wahba, 1979; Golub et al., 1979), ``final prediction error'' (Akaike, 1970), ``generalized prediction error'' (Moody & Utans, 1994; Moody, 1992). See also Holden (1994), Wang et al. (1994), Amari and Murata (1993), and Vapnik's ``structural risk minimization'' (Guyon et al., 1992; Vapnik, 1992).
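As a toy illustration of the validation-set idea underlying the methods above (generic, not any particular cited method; data and parameters here are made up), one can select model complexity -- here, polynomial degree -- by minimizing held-out error:

```python
import numpy as np

# Hypothetical example: choose model complexity via a held-out validation set.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(60)  # noisy target function

# Split the data: 40 points for fitting, 20 held out for validation.
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def val_error(degree):
    """Fit a polynomial of the given degree and return held-out MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_val)
    return np.mean((pred - y_val) ** 2)

# Pick the complexity (degree) with the lowest validation error.
errors = {d: val_error(d) for d in range(1, 10)}
best_degree = min(errors, key=errors.get)
```

The implicit assumption, as noted above, is that performance on the held-out set is a reliable proxy for generalization to genuinely new data.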
Constructive algorithms / pruning algorithms. Other architecture selection methods are less flexible in the sense that they can be used only either before or after weight adjustment. Examples are ``sequential network construction'' (Fahlman & Lebiere, 1990; Ash, 1989; Moody, 1989), input pruning (Moody, 1992; Refenes et al., 1994), unit pruning (White, 1989; Mozer & Smolensky, 1989; Levin et al., 1994), and weight pruning, e.g., ``optimal brain damage'' (LeCun et al., 1990) and ``optimal brain surgeon'' (Hassibi & Stork, 1993).
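To make the weight-pruning idea concrete, here is a minimal sketch of the saliency measure used by ``optimal brain damage'' (LeCun et al., 1990): under a diagonal Hessian approximation, removing weight $w_i$ increases the error by roughly $\frac{1}{2} H_{ii} w_i^2$. The weight and Hessian values below are made up for illustration:

```python
import numpy as np

# Hypothetical weights and diagonal Hessian entries (d^2 E / d w_i^2).
weights = np.array([0.8, -0.05, 1.2, 0.02, -0.6])
hessian_diag = np.array([0.5, 0.4, 0.1, 2.0, 0.3])

# OBD saliency: predicted error increase when weight i is set to zero.
saliency = 0.5 * hessian_diag * weights ** 2

# Prune the two weights whose removal is predicted to hurt least.
prune_order = np.argsort(saliency)  # least salient first
pruned = weights.copy()
pruned[prune_order[:2]] = 0.0
```

Note that saliency depends on the curvature, not just the weight magnitude: a small weight sitting in a high-curvature direction may survive pruning while a larger weight in a flat direction is removed.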
Hinton and van Camp (1993). They minimize the sum of two terms: the first is the conventional error plus variance, the second is the distance between the posterior and the weight prior. They have to choose a ``good'' weight prior. But, as mentioned above, perhaps there is no ``good'' weight prior for all possible architectures and training sets. With FMS, however, we do not depend on a ``good'' weight prior -- instead we have a prior over input/output functions, thus taking net architecture and training set into account. Furthermore, Hinton and van Camp have to compute variances of weights and unit activations, which (in general) cannot be done using linear approximation. Intuitively speaking, their weight variances are related to our . Our approach, however, does justify linear approximation, as shown in appendix A.2.
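Schematically, their objective can be written in standard variational/description-length notation (which is not necessarily their own): an expected error term plus a Kullback-Leibler term pulling the posterior toward the prior,

```latex
\mathcal{F}(Q) \;=\; \mathbb{E}_{w \sim Q}\!\left[ E(w) \right]
\;+\; D_{\mathrm{KL}}\!\left( Q(w) \,\big\|\, P(w) \right),
\qquad
D_{\mathrm{KL}}\!\left( \mathcal{N}(\mu,\sigma^2) \,\big\|\, \mathcal{N}(0,\sigma_0^2) \right)
\;=\; \log\frac{\sigma_0}{\sigma} \;+\; \frac{\sigma^2 + \mu^2}{2\sigma_0^2} \;-\; \frac{1}{2},
```

where $Q$ is a (factorized Gaussian) posterior over the weights and $P$ is the weight prior; the closed form on the right is the per-weight KL term for Gaussian $Q$ and $P$. Choosing a ``good'' $P$ is exactly the difficulty discussed above.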
Wolpert (1994a). His (purely theoretical) analysis suggests an interesting, different additional error term (taking into account local flatness in all directions): the logarithm of the Jacobian determinant of the functional from weight space to the space of possible nets. This term is small if the net output (based on the current weight vector) is locally flat in weight space, that is, if many neighboring weight vectors lead to the same net function in the space of possible net functions. It is not clear, however, how to derive a practical algorithm (e.g., a pruning algorithm) from this.
Murray and Edwards (1993). They obtain additional error terms consisting of weight squares and second-order derivatives. Unlike our approach, theirs explicitly prefers weights near zero. In addition, their approach appears to require much more computation time (due to the second-order derivatives in the error term).