Algorithmic complexity theory provides various different but closely related measures for the complexity (or simplicity) of objects. We will focus on the one that appears to be the most useful for machine learning. Informally speaking, the complexity of a computable object is the length of the shortest program that computes it and halts, where the set of possible programs forms a prefix code.
Machine. To make this more precise, consider a Turing machine (TM) with 3 tapes: the program tape, the work tape, and the output tape. All three are finite but may grow indefinitely. For simplicity, but without loss of generality, let us focus on binary tape alphabets . Initially, the work tape and the output tape consist of a single square filled with a zero. The program tape consists of finitely many squares, each filled with a zero or a one. Each tape has a scanning head. Initially, the scanning head of each tape points to its first square. The program tape is ``read only,'' the output tape is ``write only'' and their scanning heads may be shifted only to the right (one square at a time). The work tape is ``read/write'' and its scanning head may move in both directions. Whenever the scanning heads of work tape or output tape shift beyond the current tape boundary, an additional square is appended and filled with a zero. The case of the program tape's scanning head shifting beyond the program tape boundary will be considered later. For the moment we assume that this does never happen. The TM has a finite number of internal states (one of them being the initial state). Its behavior is specified by a function (implemented as a look-up table). maps the current state and the contents of the square above the scanning head of the work tape to a new state and an action. There are 8 actions: shift worktape left/right, write 1/0 on worktape, write 1/0 on output tape and shift its scanning head right, copy contents above scanning head of program tape onto square above scanning head of work tape and shift program tape's scanning head right, and halt.
Self-delimiting programs. Let denote the number of bits in the bitstring . Consider a non-empty bitstring written onto the program tape such that the scanning head points to the first bit of . is a program for some TM , iff reads all bits and halts. In other words, during the (eventually terminating) computation process the head of 's program tape incrementally moves from its start position to the end of , but not any further. One may say that carries in itself the information about when to stop, and about its own length. Obviously, no program can be the prefix of another one.
Compiler theorem. Each TM , mapping bitstrings (written onto the program tape) to outputs (written onto the output tape) computes a partial function ( is undefined if does not halt). It is well known that there is a universal TM with the following property: for every TM there exists a constant prefix such that for all bitstrings . The prefix is the compiler that compiles programs for into equivalent programs for .
In what follows, let denote a (self-delimiting) program.
the Kolmogorov complexity (a.k.a. ``algorithmic complexity,''
``algorithmic information,'' or
occasionally ``Kolmogorov-Chaitin complexity'')
of an arbitrary bitstring is denoted as and is defined as
the length of a shortest program producing on :
Invariance theorem. Due to the compiler theorem, for two universal machines and . Therefore we may choose one particular universal machine and henceforth write .
Machine learning, MDL, and the prior problem.
In machine learning applications, we are often concerned
with the following problem:
given training data , we would like to select
the most probable hypothesis generating the data.
Bayes formula yields
Define , the a priori probability of a bitstring ,
as the probability of guessing a (halting) program
that computes on . Here, the way of guessing is defined
by the following procedure:
initially, the program tape consists of a single square.
Whenever the scanning head of the program tape shifts
to the right, do: (1) Append a new square.
(2) With probability fill it with a 0;
with probability fill it with a 1.
Under different universal priors (based on different universal machines), probabilities of a given string differ by not more than a constant factor independent of the string size, due to the invariance theorem (the constant factor corresponds to the probability of guessing a compiler). Therefore we may drop the index and write instead of . This justifies the name ``universal prior,'' also known as the Solomonoff-Levin distribution.
Algorithmic entropy. is denoted by , the algorithmic entropy of .
Dominance of shortest programs.
It can be shown (the proof is non-trivial) that
Dominance of universal prior. Suppose there are infinitely many enumerable solution candidates (strings): amazingly, dominates all discrete enumerable semimeasures P (including probability distributions) in the following sense: for each there is a constant such that for all strings .
Inductive inference and Occam's razor.
Occam's razor prefers solutions whose minimal descriptions are short over
solutions whose minimal descriptions are longer.
The ``modern'' prefix-based version of
Solomonoff's theory of inductive inference justifies
Occam's razor in the following way.
Suppose the problem is to extrapolate a sequence of symbols
(bits, without loss of generality).
We have already observed a bitstring and would
like to predict the next bit.
Let denote the event `` is followed by symbol ''
Bayes tells us
Since Kolmogorov complexity is incomputable in general, the universal prior is so, too. A popular myth states that this fact renders useless the concepts of Kolmogorov complexity, as far as practical machine learning is concerned. But this is not so, as will be seen next. There we focus on a natural, computable, yet very general extension of Kolmogorov complexity.
Let us slightly extend the notion of a program.
In what follows, a program is
a string on the program tape which can be
scanned completely by .
The same program may lead to different results,
depending on the runtime.
For a given computable string, consider
the logarithm of the probability of guessing a program
that computes it, plus the logarithm of the corresponding runtime.
The Levin complexity of the string
is the minimal possible value of this.
More formally, let scan a program
(a program in the extended sense)
written onto the program tape
before it finishes printing onto the work tape.
Let be the number of steps taken before
is printed. Then
Universal search. Suppose we have got a problem whose solution can be represented as a (bit)string. Levin's universal optimal search algorithm essentially generates and evaluates all strings (solution candidates) in order of their complexity, until a solution is found. This is essentially equivalent to enumerating all programs in order of decreasing probabilities, divided by their runtimes. Each program computes a string that is tested to see whether it is a solution to the given problem. If so, the search is stopped. To get some intuition on universal search, let denote the probability of computing a solution from a halting program on the program tape. For each there is a given runtime . Now suppose the following gambling scenario. You bet on the success of some . The bid is . To minimize your expected losses, you are going to bet on the with maximal . If you fail to hit a solution, you may bet again, but not on a you already bet on. Continuing this procedure, you are going to list halting programs in order of decreasing . Of course, neither nor are usually known in advance. Still, for a broad class of problems, including inversion problems and time-limited optimization problems, universal search can be shown to be optimal with respect to total expected search time, leaving aside a constant factor independent of the problem size. If string can be computed within time steps by a program , and the probability of guessing (as above) is , then within time steps, systematic enumeration according to Levin will generate , run it for time steps, and output . In the experiments below, a probabilistic algorithm strongly inspired by universal search will be used.
Conditional algorithmic complexity.
The complexity measures above are actually
simplifications of slightly more general
Suppose the universal machine starts the
computation process with a non-empty string
already written on its work tape.
Conditional Kolmogorov complexity
is defined as