Introduction

Remarkably, there is a theoretically *optimal* way of
making predictions based on observations,
rooted in the early work of Solomonoff and Kolmogorov
[59,28].
The approach reflects basic principles of Occam's razor: simple
explanations of data are preferable to complex ones.

The theory of inductive inference quantifies what simplicity really means. Given certain very broad computability assumptions, it provides techniques for making optimally reliable statements about future events, given the past.

Once there is an optimal, formally describable way of predicting the future, we should be able to construct a machine that continually computes and executes action sequences that maximize expected or predicted reward, thus solving an ancient goal of AI research.

For many decades, however, AI researchers
have not paid a lot of attention to the theory of inductive
inference. Why not? There is another reason
besides the fact that most of them have traditionally
ignored theoretical computer science: the theory
has been perceived as being associated with excessive
computational costs. In fact, its most general statements
refer to methods that are optimal (in a certain asymptotic sense)
but incomputable.
So researchers in machine learning and artificial
intelligence have often resorted to alternative methods
that lack a strong theoretical foundation
but at least seem feasible in certain limited contexts.
For example, since the early attempts
at building a ``General Problem Solver''
[36,43]
much work has been done to develop
mostly heuristic machine learning
algorithms that solve new problems based on experience with
previous problems. Many pointers to *learning by chunking,
learning by macros, hierarchical learning, learning by analogy,*
etc. can be found in Mitchell's book [34] and
Kaelbling's survey [27].

Recent years, however, have brought substantial
progress in the field of *computable* and *feasible*
variants of optimal algorithms for prediction, search, inductive
inference, problem solving, decision making, and reinforcement
learning in very general environments.
In what follows, I will focus on results
predominantly from my own lab at IDSIA.

Sections
3,
4,
7
relate Occam's razor and the notion of simplicity to the shortest
algorithms for computing computable objects, and
will concentrate on recent *asymptotic* optimality results for
universal learning machines,
essentially ignoring issues of practical feasibility--compare my
postdoc Hutter's contribution [25] in this volume.

Section 5, however, will focus on our recent
non-traditional simplicity measure
which is *not* based on the shortest but on the *fastest* way
of describing objects, and
Section 6 will use this measure
to derive non-traditional
predictions concerning the future of our universe.

Sections
8,
9,
10
will finally address quite pragmatic issues and ``true'' time-optimality:
given a problem and only so much limited computation time,
what is the best way of spending it on evaluating
solution candidates?
In particular, Section 9 will
present an optimally fast way of incrementally solving
each task in a problem sequence, given a probability distribution
(the *bias*) on programs computing solution candidates.
Bias shifts are computed by program prefixes that
modify the distribution on their suffixes by
reusing successful code for previous tasks
(stored in non-modifiable memory).
No tested program gets more
runtime than its probability times the total search time.
In illustrative experiments, ours becomes the first general system to
*learn* a universal solver for arbitrary disk
*Towers of Hanoi* tasks (minimal solution size ).
It demonstrates the advantages of incremental learning by
profiting from previously solved, simpler tasks involving
samples of a simple context free language.
Section 10 discusses how to use this approach
for building universal reinforcement learners.