It was shown that basic concepts from the theory of algorithmic complexity are of interest for machine learning purposes in general and for neural net generalization studies in particular. At least with certain neural net toy problems where it is computationally feasible, search which favors solutions computable by short and fast programs may lead to excellent generalization performance unmatchable by more traditional algorithms. Although the focus of the experiments was on perceptron-like neural nets, the presented methods are general enough to be applied to a wide variety of problems. Much work on ``incremental'' learning in real world applications remains to be done, however.
Bias. The bias towards algorithmic simplicity is a very general one. It is weaker than most kinds of problem specific inductive bias, e.g., [Utgoff, 1986,Haussler, 1988]. If a solution is indeed simple, the bias is justified (it does not require us to know ``the way in which the solution is simple''). If the solution is not simple, the bias towards algorithmic simplicity won't do much damage: even in case of algorithmically complex solutions we cannot lose much if we focus on simple candidates first, before looking at more complex candidates. This is because in general the complex candidates greatly outnumber the simple ones. The few simple ones don't significantly affect total search time of an optimal search algorithm.
When will a general bias towards algorithmic simplicity not only cause no harm but also be useful for problem solving? How many solutions are indeed simple? The next paragraph appears to support the answer ``hardly any.'' But the final part of this section argues that the expression ``hardly any'' actually refers to a worst case that is atypical for real world problems.
In general, generalization is impossible. To be more specific, let the task be to learn some relation between finite bitstrings and finite bitstrings. A training set is chosen randomly. In almost all cases, the shortest algorithm computing a (non-overlapping) test set essentially will have the size of the whole test set (recall from section 2 that most computable objects are incompressible). The shortest algorithm computing the test set, given the training set, won't be any shorter. In other words, the ``mutual algorithmic information'' (e.g., [Chaitin, 1987]) between test set and training set will be zero in almost all cases (ignoring an additive constant independent of the problem). Therefore, in the general case, (1) knowledge of the training set does not provide any clues about the test set, (2) there is no hope for generalization, and (3) obviously there is no reason why a ``simple'' (or any other kind of) solution should be preferred a priori over complex ones (related observations are discussed at length, e.g., in [Dietterich, 1989,Schaffer, 1993,Wolpert, 1993]). This may be viewed as the reason why certain worst-case results of PAC-learning theory (initiated by Valiant, 1984) appear discouraging. Similarly for problem solving in general: a ``problem'' is usually defined by a search space of solution candidates, and a computable criterion for the solution. Most solutions to problems from the set of all possible well-defined problems are algorithmically complex (random, incompressible). Most such problems cannot be efficiently solved (``efficient'' means faster than by exhaustive search), neither by Levin's universal search algorithm, nor by a hypothetical ``optimal'' incremental learning scheme, nor by any other method.
Simple real world. Apparently, however, many typical problems we are confronted with in the ``real world'' are simple! Simple in the sense that their solutions do not require as much information to be specified as most solution candidates. Problems that humans consider to be typical are atypical when compared to the general set of all well-defined problems (see also [Li and Vitányi, 1989]). Indeed, for all ``interesting'' problems, the bias towards algorithmic simplicity seems justified!
This may be a miracle. Or perhaps a consequence of the possibility that our universe is run by a short algorithm (every electron behaves the same way). Or (at least in some cases) just a consequence of the fact that we select only problems we can solve (we would not exist if we could not survive by doing so - but this is an anthropocentric argument). Anyway, our learning machines should try to make use of the enormous amount of algorithmic redundancy in our ``friendly'' universe. The most general way of doing so appears to be to use the tools provided by the theory of algorithmic probability and Kolmogorov complexity.