How to adjust $\lambda$? Given recent trends in neural computing (see, e.g., MacKay, 1992a, 1992b), it may seem like a step backwards that $\lambda$ is adapted using an ad-hoc heuristic from Weigend et al. (1991). However, to determine $\lambda$ in MacKay's style, one would have to compute the Hessian of the cost function. Since our term $B(w, X_0)$ includes first order derivatives, adjusting $\lambda$ would require the computation of third order derivatives. This is impracticable. Also, to optimize the regularizing parameter $\lambda$ (see MacKay, 1992b), we need to compute the function $\int d^L w \, \exp(-\lambda B(w, X_0))$, but it is not obvious how: the ``quick and dirty version'' (MacKay, 1992a) cannot deal with the unknown constant $\epsilon$ in $B(w, X_0)$.
Future work will investigate how to adjust $\lambda$ without too much computational effort. In fact, as will be seen in appendix A.1, the choices of $\lambda$ and $\epsilon$ are correlated: the optimal choice of $\epsilon$ may indeed correspond to the optimal choice of $\lambda$.
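To make the idea of an error-monitoring heuristic concrete, the following is a minimal sketch of a rule of the general kind described by Weigend et al. (1991): the regularization weight is raised while training error is acceptable or falling, and lowered otherwise. The function names, the step size `delta`, the cut factor `cut`, and the smoothing constant `gamma` are illustrative assumptions; this section does not state the exact rule or constants used by FMS.

```python
def update_lambda(lam, err, prev_err, avg_err, target_err,
                  delta=1e-3, cut=0.9):
    """Return an adjusted regularization weight (hypothetical rule).

    lam        : current regularization weight
    err        : current training error
    prev_err   : training error at the previous check
    avg_err    : exponentially smoothed training error
    target_err : error level regarded as good enough
    delta, cut : assumed step size and multiplicative cut factor
    """
    # Error is below target or still falling: regularize a bit harder.
    if err < target_err or err < prev_err:
        return lam + delta
    # Error rose slightly but is still below its running average: back off gently.
    if err < avg_err:
        return max(lam - delta, 0.0)
    # Error rose above its running average: cut the weight sharply.
    return lam * cut


def running_average(avg_err, err, gamma=0.9):
    """Exponentially smoothed error; gamma is an assumed constant."""
    return gamma * avg_err + (1.0 - gamma) * err
```

In a training loop, `running_average` would be called once per check and its result passed to `update_lambda` together with the current and previous errors. Such a rule needs no second or third order derivatives, which is the practical argument made above.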
Generalized boxes? The boxes found by the current version of FMS are axis-aligned. This may cause an underestimate of flat minimum volume, because a flat region that is elongated along a direction not parallel to any weight axis is only partly covered by an axis-aligned box. Although our experiments indicate that box search works very well, it will be interesting to compare alternative approximations of flat minimum volumes.
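The size of this effect can be illustrated with a toy calculation. As a simplifying assumption (not the box definition used by FMS), take the flat region in a two-weight space to be an ellipse $\{w : w^T A w \le 1\}$ with semi-axes `a` and `b`, rotated by an angle `theta`. The largest inscribed rectangle aligned with the ellipse's own axes has area $2ab$, whereas the largest inscribed axis-aligned rectangle has area $2 / (\sqrt{A_{11} A_{22}} + |A_{12}|)$, which follows from requiring the worst corner to lie inside the ellipse and applying the AM-GM inequality. All function names below are hypothetical.

```python
import math


def ellipse_matrix(a, b, theta):
    """Entries (A11, A22, A12) of A for the ellipse w^T A w <= 1.

    a, b are the semi-axes and theta the rotation angle in radians.
    """
    c, s = math.cos(theta), math.sin(theta)
    d1, d2 = 1.0 / a**2, 1.0 / b**2
    a11 = c * c * d1 + s * s * d2
    a22 = s * s * d1 + c * c * d2
    a12 = c * s * (d1 - d2)
    return a11, a22, a12


def largest_axis_aligned_box_area(a11, a22, a12):
    """Area of the largest rectangle with sides parallel to the weight
    axes that fits inside the ellipse (closed form for two dimensions)."""
    return 2.0 / (math.sqrt(a11 * a22) + abs(a12))


def ellipse_aligned_box_area(a, b):
    """Area of the largest inscribed rectangle whose sides are parallel
    to the ellipse's own axes."""
    return 2.0 * a * b


# Elongated region (a=4, b=1) rotated by 45 degrees.
a11, a22, a12 = ellipse_matrix(4.0, 1.0, math.pi / 4)
axis_area = largest_axis_aligned_box_area(a11, a22, a12)
free_area = ellipse_aligned_box_area(4.0, 1.0)
print(axis_area, free_area)
```

In this toy setting the axis-aligned rectangle covers roughly a quarter of the area of the rectangle aligned with the ellipse, and the two coincide when `theta` is zero.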
Multiple initializations? Consider this FMS ``alternative'': run conventional backprop starting with several random initial guesses, and pick the flattest minimum, i.e., the one with the largest volume. This does not work: conventional backprop changes the weights according to steepest descent, so it runs away from flat ranges in weight space! Using an ``FMS committee'' (multiple runs with different initializations), however, would lead to a better approximation of the posterior. This is left for future work.
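A committee of this kind could be combined as sketched below. This is only an illustration under stated assumptions: `train_member` is a hypothetical stand-in for one complete training run (here it merely returns a toy linear predictor whose slope depends on the seed), and the members are averaged with equal weights, which is the simplest choice and not necessarily the one a posterior-based treatment would suggest.

```python
import random


def train_member(seed):
    """Hypothetical stand-in for one training run from a random
    initialization; returns a predictor mapping x to y."""
    rng = random.Random(seed)
    slope = 2.0 + rng.uniform(-0.2, 0.2)  # toy run-to-run variation
    return lambda x: slope * x


def committee_predict(members, x):
    """Equal-weight average of the members' predictions for input x."""
    outputs = [member(x) for member in members]
    return sum(outputs) / len(outputs)


# Five runs that differ only in their random initialization.
members = [train_member(seed) for seed in range(5)]
print(committee_predict(members, 3.0))
```

Replacing `train_member` with a real training routine leaves `committee_predict` unchanged, since it only requires each member to be callable on an input.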
Notes on generalization error. If the prior distribution of targets (see appendix A.1) is uniform (or if the distribution of prior distributions is uniform), no algorithm can obtain a lower expected generalization error than training-error-reducing algorithms (see, e.g., Wolpert, 1994b). Typical target distributions in the real world are not uniform, however: the real world appears to favor problem solutions with low algorithmic complexity (see, e.g., Schmidhuber, 1994a). MacKay (1992a) suggests searching for alternative priors if the generalization error indicates a ``poor regulariser''. He also points out that even with a ``good'' approximation of the non-uniform prior, more probable posterior hypotheses do not necessarily have a lower generalization error. For instance, there may be noise on the test set, or two hypotheses representing the same function may have different posterior values; moreover, the expected generalization error ought to be computed over the whole posterior and not for a single solution. Schmidhuber (1994b) proposes a general, ``self-improving'' system whose entire life is viewed as a single training sequence and which continually attempts to incrementally modify its priors based on experience with previous problems (see also Schmidhuber, 1996). It remains to be seen, however, whether this will lead to practicable algorithms.
Ongoing work on low-complexity coding. FMS can also be useful for unsupervised learning. In recent work, we postulate that a ``generally useful'' code of given input data fulfills three MDL-inspired criteria: (1) It conveys information about the input data. (2) It can be computed from the data by a low-complexity mapping. (3) The data can be computed from the code by a low-complexity mapping. To obtain such codes, we simply train an auto-associator with FMS (after training, codes are represented across the hidden units). In initial experiments, depending on data and architecture, this always led to well-known kinds of codes considered useful in previous work by numerous researchers: we sometimes obtained factorial codes, sometimes local codes, and sometimes sparse codes. In most cases, the codes were of the low-redundancy, binary kind. Initial experiments with a speech data benchmark problem (vowel recognition) already showed the true usefulness of codes obtained by FMS: feeding the codes into standard, supervised, overfitting backprop classifiers, we obtained much better generalization performance than competing approaches.