It was shown that basic concepts from the theory of algorithmic complexity are of interest for machine learning purposes in general and for neural net generalization studies in particular. At least for certain neural net toy problems where it is computationally feasible, search that favors solutions computable by short and fast programs can lead to excellent generalization performance unmatched by more traditional algorithms. Although the focus of the experiments was on perceptron-like neural nets, the presented methods are general enough to be applied to a wide variety of problems. Much work on ``incremental'' learning in real world applications remains to be done, however.

**Bias.**
The bias towards algorithmic simplicity is a very general one.
It is weaker than most kinds of problem-specific
inductive bias, e.g., [Utgoff, 1986, Haussler, 1988].
If a solution is indeed simple,
the bias is justified (it does not require us to know
``the way in which the solution is simple'').
If the solution is *not* simple,
the bias towards algorithmic simplicity
won't do much damage:
even in the case of algorithmically complex solutions,
we cannot lose much if we focus on simple
candidates first, before looking at more complex ones.
This is because in general the complex candidates greatly
outnumber the simple ones. The few simple ones
don't significantly affect total search time of an optimal
search algorithm.
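The counting argument behind this claim can be made concrete with a small sketch (a hypothetical illustration, assuming candidates are encoded as bitstrings):

```python
# Counting argument: if candidate solutions are encoded as bitstrings,
# there are 2**n candidates of length exactly n, but only 2**n - 1
# candidates of *all* shorter lengths combined.  So enumerating every
# simpler candidate first at most doubles the total search effort.

def num_candidates(length):
    """Number of distinct bitstring candidates of exactly this length."""
    return 2 ** length

def num_simpler(length):
    """Number of candidates strictly shorter than this length."""
    return sum(2 ** k for k in range(length))  # equals 2**length - 1

n = 20
print(num_simpler(n))     # 1048575: all simpler candidates combined
print(num_candidates(n))  # 1048576: barely more than one length class
```

This is why trying simple candidates first cannot significantly increase the total search time of an optimal search algorithm, even when the true solution turns out to be complex.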

When will a general bias towards
algorithmic simplicity not only cause no harm
but also be *useful* for problem solving?
How many solutions are indeed simple?
The next paragraph appears to support the
answer ``hardly any.'' But the
final part of this section
argues that the expression ``hardly any''
actually refers to a worst case that is *atypical*
for real world problems.

**In general, generalization is impossible.**
To be more specific, let the task be to learn some relation
between finite bitstrings and finite bitstrings.
A training set is chosen randomly.
In almost all cases, the shortest algorithm computing a (non-overlapping)
test set will essentially have the size of the whole test set
(recall from section 2 that
most computable objects are incompressible).
The shortest algorithm computing the test set, given the
training set, won't be any shorter. In other words,
the ``mutual algorithmic information''
(e.g., [Chaitin, 1987]) between
test set and training set will be zero in almost all cases
(ignoring an additive constant independent of the problem).
Therefore, in the general case,
(1) knowledge of the training set does
not provide any clues about the test set,
(2) there is no hope for generalization,
and (3) obviously there is no reason
why a ``simple'' (or any other kind of)
solution should be preferred *a priori* over complex ones
(related observations are
discussed at length, e.g., in [Dietterich, 1989, Schaffer, 1993, Wolpert, 1993]).
This may be viewed as the reason why
certain worst-case results of PAC-learning theory
(initiated by [Valiant, 1984])
appear discouraging.
Similarly for problem solving in general:
a ``problem'' is usually defined by
a search space of solution candidates, and
a computable criterion for the solution.
Most solutions to problems from the set of
all possible well-defined problems
are algorithmically complex (random, incompressible).
Most such problems cannot be solved efficiently (``efficiently''
meaning faster than by exhaustive search), whether by
Levin's universal search algorithm, by a hypothetical
``optimal'' incremental learning scheme, or by any other method.

**Simple real world.**
Apparently, however,
many *typical* problems
we are confronted with in the ``real world'' are simple!
Simple in the sense that their solutions
do not require as much information
to be specified as most solution candidates.
Problems that humans consider to be *typical*
are *atypical* when compared to the general
set of all well-defined problems
(see also [Li and Vitányi, 1989]).
Indeed, for all ``interesting'' problems, the
bias towards algorithmic simplicity seems justified!

This may be a miracle. Or perhaps a consequence of the possibility that our universe is run by a short algorithm (every electron behaves the same way). Or (at least in some cases) just a consequence of the fact that we select only problems we can solve (we would not exist if we could not survive by doing so, though this is an anthropocentric argument). In any case, our learning machines should try to make use of the enormous amount of algorithmic redundancy in our ``friendly'' universe. The most general way of doing so appears to be to use the tools provided by the theory of algorithmic probability and Kolmogorov complexity.
