Despite (or maybe because of) the ambitiousness and potential power of self-improving machines, there has been little work in this vein outside our own labs at IDSIA and TU Munich. Here we will list essential differences between the Gödel machine and our previous approaches to `learning to learn,' `metalearning,' self-improvement, self-optimization, etc.
The most closely related approaches are Hutter's HSEARCH and AIXI(t,l) (Item 3 below). For historical reasons, however, we will first discuss Levin's Universal Search and Hutter's AIXI.
Unlike the fully self-referential Gödel machine, Levin's Universal Search [24,26] has a hardwired, unmodifiable meta-algorithm that cannot improve itself. It is asymptotically optimal for inversion problems whose solutions can be quickly verified in O(n) time (where n is the solution size), but it will always suffer from the same huge constant slowdown factors (typically >> 10^1000) buried in the O()-notation. The self-improvements of a Gödel machine, however, can be more than merely O()-optimal, since its utility function may formalize a stronger type of optimality that does not ignore huge constants just because they are constant--compare the utility function of equation (1).
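The phase-based time allocation of Universal Search can be sketched as follows. This is an illustrative toy version, not Levin's original formulation: the names `run` and `verify`, the bitstring program encoding, and the phase cap are assumptions made for the sketch. In phase i, every candidate program p of length l(p) <= i receives on the order of 2^(i - l(p)) steps, so a solution of description length l is found after roughly 2^l phases, up to the huge constant factors discussed above.

```python
import itertools

def levin_search(run, verify, max_phase=20):
    """Toy sketch of Levin's Universal Search (hypothetical interface).

    run(prog, steps): execute candidate program `prog` (a bitstring)
    for at most `steps` steps, returning its output or None.
    verify(output): quickly check a candidate solution.
    In phase i, a program of length l gets 2**(i - l) steps, so total
    work per phase is on the order of 2**i."""
    for i in range(1, max_phase + 1):
        for l in range(1, i + 1):
            for bits in itertools.product("01", repeat=l):
                prog = "".join(bits)
                out = run(prog, 2 ** (i - l))
                if out is not None and verify(out):
                    return prog, out
    return None
```

As a toy usage, interpreting each bitstring as a binary integer and searching for a square root of 49 finds the length-3 program "111" (i.e. 7) in phase 3.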
Furthermore, the Gödel machine is applicable to general lifelong reinforcement learning (RL) tasks where Universal Search is not asymptotically optimal, and not even applicable, since in RL the evaluation of some behavior's value in principle consumes the learner's entire life. That is, the naive test of whether a program is good or not would consume the entire life: we could test only one program; afterwards life would be over.
Therefore, to achieve their objective, general RL machines
must do things that Universal Search does
not do, such as predicting future tasks and rewards.
This partly motivates Hutter's universal RL machine AIXI, to be discussed next.
Unlike Gödel machines, Hutter's recent AIXI model [16,19] generally needs unlimited computational resources per input update. It combines Solomonoff's universal prediction scheme [52,53] with an expectimax computation. In discrete cycle k = 1, 2, 3, ..., action y(k) results in perception x(k) and reward r(k), both sampled from the unknown (reactive) environmental probability distribution μ. AIXI defines a mixture distribution ξ as a weighted sum of distributions ν ∈ M, where M is any class of distributions that includes the true environment μ. For example, M may be a sum of all computable distributions [52,53], where the sum of the weights does not exceed 1. In cycle k+1, AIXI selects as next action the first in an action sequence maximizing ξ-predicted reward up to some given horizon. Recent work demonstrated AIXI's optimal use of observations as follows. The Bayes-optimal policy p^ξ based on the mixture ξ is self-optimizing in the sense that its average utility value converges asymptotically for all μ ∈ M to the optimal value achieved by the (infeasible) Bayes-optimal policy p^μ which knows μ in advance. The necessary condition that M admits self-optimizing policies is also sufficient. Furthermore, p^ξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν ∈ M and a strictly higher value in at least one.
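The mixture idea can be illustrated with a deliberately tiny sketch. This is not Hutter's AIXI (which maximizes over whole action sequences up to a horizon, over all computable environments): here M is a finite set of hand-written reward models, the horizon is 1, and all names (`xi_mixture_action`, `bayes_update`) are invented for the illustration.

```python
def xi_mixture_action(models, weights, actions):
    """Pick the action with highest xi-predicted one-step reward,
    where xi is the weighted mixture of the models' predictions.
    Each model nu maps an action to its predicted probability of
    reward 1; weights are the prior weights w_nu (summing to <= 1)."""
    def xi_reward(a):
        return sum(w * nu(a) for nu, w in zip(models, weights))
    return max(actions, key=xi_reward)

def bayes_update(models, weights, action, reward):
    """Posterior mixture weights after observing a binary reward:
    each w_nu is multiplied by nu's likelihood, then renormalized.
    This is how the mixture concentrates on the true environment mu,
    the mechanism behind the self-optimizing property above."""
    like = [nu(action) if reward == 1 else 1.0 - nu(action) for nu in models]
    new = [w * p for w, p in zip(weights, like)]
    z = sum(new)
    return [w / z for w in new]
```

With two bandit-style models (one favoring "left", one favoring "right") and prior weights 0.7/0.3, the mixture picks "left"; observing reward 1 for "left" then shifts the posterior further toward the left-favoring model.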
While AIXI clarifies certain
theoretical limits of machine learning, it
is computationally intractable, especially when M
includes all computable
distributions. This drawback
motivated work on the time-bounded, asymptotically optimal
AIXI(t,l) system
and the related HSEARCH, both
to be discussed next.
Now we come to the most closely related previous work; so we will go to extra lengths to point out the main novelties of the Gödel machine.
Hutter's non-self-referential but still O()-optimal `fastest' algorithm for all well-defined problems HSEARCH uses a hardwired brute force proof searcher and ignores the costs of proof search. Assume discrete input/output domains X, Y, a formal problem specification f : X → Y (say, a functional description of how integers are decomposed into their prime factors), and a particular x ∈ X (say, an integer to be factorized). HSEARCH orders all proofs of an appropriate axiomatic system by size to find programs q that for all z ∈ X provably compute f(z) within time bound t_q(z). Simultaneously it spends most of its time on executing the q with the best currently proven time bound t_q(x). It turns out that HSEARCH is as fast as the fastest algorithm that provably computes f(z) for all z ∈ X, save for a constant factor smaller than 1 + ε (arbitrary ε > 0) and an f-specific but x-independent additive constant. This constant may be enormous though.
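The core selection step of HSEARCH can be sketched as follows. This is only an illustration of the "execute the q with the best proven bound" idea: real HSEARCH interleaves proof search with time-shared execution of candidates, and the interface (`proven` as a finished list of program/bound pairs, the name `hsearch_pick`) is an assumption made for the sketch.

```python
def hsearch_pick(proven, x):
    """Run the program with the best currently proven time bound.

    proven: list of (program, bound) pairs as delivered by the proof
    searcher -- each program provably computes f, and bound(x) is its
    proven runtime bound on input x.  HSEARCH spends most of its time
    executing the candidate minimizing this bound."""
    prog, _ = min(proven, key=lambda pb: pb[1](x))
    return prog(x)
```

For example, given two programs for the same f with proven bounds n^2 and n, the scheduler runs the linear-bound one.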
Hutter's AIXI(t,l) is related. In discrete cycle k = 1, 2, 3, ... of AIXI(t,l)'s lifetime, action y(k) results in perception x(k) and reward r(k), where all quantities may depend on the complete history. Using a universal computer such as a Turing machine, AIXI(t,l) needs an initial offline setup phase (prior to interaction with the environment) where it uses a hardwired brute force proof searcher to examine all proofs of length at most L, filtering out those that identify programs (of maximal size l and maximal runtime t per cycle) which not only could interact with the environment but which for all possible interaction histories also correctly predict a lower bound of their own expected future reward. In cycle k, AIXI(t,l) then runs all programs identified in the setup phase (at most 2^l), finds the one with highest self-rating, and executes its corresponding action. The problem-independent setup time (where almost all of the work is done) is O(L · 2^L). The online time per cycle is O(t · 2^l). Both are constant but typically huge.
Advantages and Novelty of the Gödel Machine. There are major differences between the Gödel machine and Hutter's HSEARCH and AIXI(t,l), including:
It is the self-referential aspects
of the Gödel machine that relieve us of much of the burden of careful
algorithm design required for AIXI(t,l) and HSEARCH. They make the Gödel machine both
conceptually simpler and more general.
The Optimal Ordered Problem Solver OOPS [47,44] (used by BIOPS in Section 5.2) extends Universal Search (Item 1). It is a bias-optimal (see Def. 5.1) way of searching for a program that solves each problem in an ordered sequence of problems of a rather general type, continually organizing, managing, and reusing earlier acquired knowledge. Solomonoff recently also proposed related ideas for a scientist's assistant that modifies the probability distribution of Universal Search based on experience.
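The bias-optimal allocation idea behind OOPS can be illustrated with a minimal sketch. This is a toy version of the principle in Def. 5.1, not OOPS itself (which interleaves candidates and incrementally reuses frozen earlier programs); the interface `program(steps)` and the name `oops_allocate` are assumptions of the sketch.

```python
def oops_allocate(candidates, budget):
    """Bias-optimal-style time allocation (illustration only).

    candidates: list of (program, prob) pairs, prob summing to <= 1.
    Each program receives a share of the total budget proportional to
    its prior probability, so no solution candidate is burdened by much
    more than the inverse of its probability.  program(steps) returns a
    solution found within `steps` steps, else None."""
    for program, prob in candidates:
        result = program(int(budget * prob))
        if result is not None:
            return result
    return None
```

For instance, a candidate needing 10 steps with prior 0.5 is found within a budget of 100, while a 200-step candidate with prior 0.4 correctly fails inside its 40-step share.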
Like Universal Search (Item 1), OOPS is not directly applicable to RL problems. A provably optimal RL machine must somehow prove properties of otherwise un-testable behaviors (such as: what is the expected reward of this behavior which one cannot naively test as there is not enough time). That is part of what the Gödel machine does: it tries to greatly cut testing time, replacing naive time-consuming tests by much faster proofs of predictable test outcomes whenever this is possible.
Proof verification itself can be performed very quickly. In particular,
verifying the correctness of
a found proof typically does not consume the remaining life.
Hence the Gödel machine may use OOPS as a bias-optimal proof-searching
submodule (Section 5.2). Since the proofs themselves may concern quite
different, arbitrary notions of optimality (not just
bias-optimality), the Gödel machine is more
general than plain OOPS.
But it is not just an extension of OOPS. Instead of OOPS it may
as well use non-bias-optimal alternative methods to initialize
its proof searcher.
On the other hand, OOPS is not just a precursor of the Gödel machine.
It is a stand-alone, incremental, bias-optimal way of allocating
runtime to programs that reuse previously successful
programs, and is applicable to many traditional
problems, including but not limited to proof search.
A learner's modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves, and so on. The Gödel machine has an SMP.
In previous practical work we used the success-story algorithm (SSA) to force some (stochastic) SMP to trigger better and better self-modifications [37,49,48,50]. During the learner's lifetime, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call--this evaluates the long-term effects of SMP-modifications that set the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they build the basis for additional SMP-modifications. Solely by self-modifications, our SMP/SSA-based learners solved a complex task in a partially observable environment whose state space is far bigger than most found in the literature.
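The backtracking criterion at the heart of SSA can be sketched as follows. This illustrates only the success-story check, not the full algorithm; the checkpoint layout and the names `ssa_call` and `undo` are assumptions of the sketch.

```python
def ssa_call(stack, t, R):
    """Sketch of the success-story criterion used by SSA.

    stack: checkpoints (t_i, R_i, undo) pushed at each still-valid
    self-modification -- its time, the cumulative reward at that time,
    and a callback undoing it.  At an SSA call at time t with
    cumulative reward R (t larger than all checkpoint times),
    modifications are undone until the reward rate since each
    surviving checkpoint strictly exceeds the rate since the previous
    one, i.e. each survivor has accelerated lifelong reward intake."""
    while stack:
        t1, R1, undo = stack[-1]
        t0, R0 = (stack[-2][0], stack[-2][1]) if len(stack) > 1 else (0, 0.0)
        if (R - R1) / (t - t1) > (R - R0) / (t - t0):
            break        # success story intact: every survivor accelerated reward
        stack.pop()
        undo()           # revert the unsuccessful self-modification
    return stack
```

For example, a second modification after which reward intake slowed down is undone at the next SSA call, while an earlier modification that did accelerate reward survives.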
The Gödel machine's training algorithm is theoretically
much more powerful than SSA though.
SSA empirically measures the usefulness of previous
self-modifications, and does not necessarily encourage
provably optimal ones.
Similar drawbacks hold for
Lenat's human-assisted, non-autonomous,
self-modifying learner,
our Meta-Genetic Programming extending
Cramer's Genetic Programming [8,1],
our metalearning economies
extending Holland's machine learning economies,
and gradient-based metalearners
for continuous program spaces of differentiable
recurrent neural networks.
All these methods, however, could be used to seed the Gödel machine
with an initial policy.