Previous work on IS (for example, [34,35]) implicitly focused on goal-oriented exploration -- in principle IS can learn to change its exploration strategy if this turns out to accelerate external reward in the long run. This chapter's IS implementation, however, involves an additional pure exploration component besides the goal-oriented one. Part of the learner receives internal reward for pointing out something another part did not know but thought it knew. The surprised part suffers to the extent the surprising part benefits -- the sum of all internal rewards remains zero. The learner is ``interested'' in ``creative'' computations leading to unexpected results, while simultaneously trying to make formerly surprising things predictable and boring. It does not care much for irregular noise rich in Shannon information. Instead it prefers easily learnable algorithmic regularities, taking into account the costs of gaining information in an RL framework.
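The zero-sum property of the internal reward can be sketched as follows. This is an illustrative toy, not the chapter's actual implementation: the function name `surprise_reward` and the binary-outcome setting are assumptions made here for concreteness. A module is surprised when it committed to a wrong prediction while the other was right; the surprising module's gain exactly equals the surprised module's loss.

```python
# Toy sketch of the zero-sum internal "surprise" reward described above.
# Names and the binary-outcome setting are illustrative assumptions.

def surprise_reward(prediction_left: int, prediction_right: int, outcome: int):
    """Return internal rewards (left, right); they always sum to zero."""
    left_correct = prediction_left == outcome
    right_correct = prediction_right == outcome
    if left_correct and not right_correct:
        return +1.0, -1.0   # left module surprised the right one
    if right_correct and not left_correct:
        return -1.0, +1.0   # right module surprised the left one
    return 0.0, 0.0         # no surprise: both right or both wrong

rl, rr = surprise_reward(prediction_left=1, prediction_right=0, outcome=1)
assert rl + rr == 0.0  # the sum of all internal rewards remains zero
```

Whenever one module profits, the other loses the same amount, so the only way to earn internal reward is to find computations whose outcome the opponent mispredicts.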
From each module's perspective an instruction subsequence is ``novel'' as long as its outcome surprises the other. Since the surprised module will eventually figure out what is going on, there will be incessant pressure to create new novelties. What is the use of such an inquisitive system's computations? Curiosity's long-term justifiability depends on whether knowledge growth will eventually support goal-oriented behavior. When will this be the case? The question is reminiscent of G. H. Hardy's toast on ``pure'' mathematics -- the kind that ``would never be of any use to anyone'' (p. 185). History teaches us, however, that it is hard to decide which math will be useless forever. For instance, old results from ``pure'' number theory are used in today's encryption technology.
In general, however, it will always be possible to design environments where ``curiosity kills the cat,'' or at least has a negative influence on external performance. For instance, as exemplified by simulation 2 of Experiment 2a, curiosity may occasionally prevent discovery of external reward sources. This is reminiscent of the situation in supervised learning, where additional ``regularizer'' terms are often added to the standard error function defining network performance on the training data. They can greatly help to remove redundant free parameters and improve generalization capability on unseen data, but in general this cannot be guaranteed.
This chapter's approach draws inspiration from several sources. For instance, the two-module system is based on two co-evolving modules. Co-evolution of competing strategies, however, is nothing new. See, for example, [7,19] for interesting cases. Also, the idea of improving a learner by letting it play against itself is ancient. See, for example, [20,41]. Even the idea of unsupervised learning through co-evolution of predictors and modules trying to escape the predictions is nothing new -- it has been used extensively in our previous work on unsupervised sensory coding with neural networks [25,36,33,32,37]. Finally, co-evolutionary methods translating mismatches between reality and expectations into reward for ``curious,'' exploring agents are not new either -- see our previous work on ``pure'' RL-based exploration [24,23,40]. So, what is new?
Novel is the idea that both adaptive modules equally influence the probability of each executed instruction/computation. This (1) allows for a straightforward way of making both modules equally powerful (by copying the currently superior one onto the other), and (2) prevents each module from being able to enforce computations that will make the other lose no matter what it tries. For instance, details of white noise on a screen are inherently unpredictable, but neither of the two opponents can exploit this to generate surprises if the other does not ``agree'' to the corresponding experiment. And it will agree only as long as it suspects that there is a regularity in the white noise that the other does not yet know. The precondition of a surprise is that the surprised module has expressed its confidence in a different outcome of the surprising computation sequence by participating in the collective decision process. Intuitively, my adaptive explorer continually wants to discover new, ``creative'' uses of its innate sensorium and computational potential. It wants to focus on those novel things that seem easy to learn, given current knowledge. It wants to ignore (1) previously learned, predictable things, (2) inherently unpredictable ones (such as details of white noise on a screen), and (3) things that are unexpected but not expected to be easily learned (such as the contents of an advanced math textbook beyond the explorer's current level).
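One simple way to realize such a collective decision process -- a hedged sketch under assumptions of my own, not the chapter's implementation -- is to sample each instruction from the normalized product of the two modules' probability distributions. Then either module can effectively veto an experiment (such as inspecting white-noise pixels) by assigning it near-zero probability, so no computation is executed unless both modules ``agree'' to it:

```python
import random

# Illustrative sketch: both modules equally influence each executed
# instruction. The joint probability of instruction i is proportional to
# p_left[i] * p_right[i], so a near-zero probability from either module
# acts as a veto. The function name and product rule are assumptions.

def joint_choice(p_left, p_right, rng=None):
    rng = rng or random.Random(0)
    product = [a * b for a, b in zip(p_left, p_right)]
    total = sum(product)
    if total == 0.0:
        return None  # no instruction both modules agree to execute
    r, acc = rng.random() * total, 0.0
    for i, w in enumerate(product):
        acc += w
        if r < acc:
            return i
    return len(product) - 1

# The "white noise" experiment (index 2) is vetoed by the right module:
p_left  = [0.2, 0.3, 0.5]
p_right = [0.5, 0.5, 0.0]
assert joint_choice(p_left, p_right) in (0, 1)  # instruction 2 never runs
```

Under this product rule neither opponent can force a computation the other refuses, which is exactly the property that makes inherently unpredictable experiments useless as surprise generators.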
Another novel aspect is the general setting. Instead of being limited to Markovian domains and simple reactive strategies, such as the approaches in [23,40], this chapter's setup allows for quite arbitrary domains and computations. This is made possible by the recent IS paradigm [34,35]. There is no essential limit (besides computability) to the nature of the regularities that may be exploited to generate surprises. Neither is there an essential limit to the nature of the learning processes that can make formerly surprising regularities predictable and boring. There may be RL schemes even more general than IS, but this is beyond the scope of this chapter.
Note that this chapter's notion of ``simple regularities'' differs from, e.g., Kolmogorov complexity theory's [11,2,39,14,26]. There an object is called simple relative to current knowledge if the size of the shortest algorithm computing it from the current knowledge is small. The algorithm's computation time is ignored, as are constant factors reflecting Kolmogorov complexity's machine independence. The current chapter, however, takes both into account.
As the explorer's knowledge about its environment and computational abilities expands, it keeps balancing on the thin, dynamically changing line between the subjectively random and the subjectively trivial. Unlike Nake and other authors he cites, I do not suggest a predefined optimal ratio between known and unknown information. Instead, the two cooperating/competing modules dynamically, implicitly determine this ratio as they keep trying to surprise each other.
Recent papers attempt to explain ``beauty'' with the help of complexity theory concepts [27,29]. They argue that something ``beautiful'' need not be ``interesting''. They predict that the ``most beautiful'' object from a set of objects satisfying certain specifications is the one that can be most easily computed from the subjective observer's input coding scheme. Interestingness in the current chapter's sense, however, also takes into account whether the computational result is expected or not. Something that is both ``beautiful'' and already known may be quite boring -- ``beauty'' needs to be unexpected to awaken interest.
Future work. The programming language used in the experiments is designed to allow for fairly arbitrary computations/explorations and learning processes. To make progress towards analyzing ``inquisitive'' explorers, however, one will probably have to study alternative systems with less computational power and less general RL paradigms but more accessible dynamics. On the other hand, it will also be interesting to study a curious learner's performance in the case of more difficult tasks and more powerful primitive instructions with more bias towards solving the task. Note that LIs can be almost anything: neural net algorithms, Bayesian analysis algorithms, etc.
Furthermore, although IS is a rather general RL paradigm, it may be possible to develop more general ones. In that case I would like to combine them with the two-module idea. Promising candidates may be RL schemes based on economy and market models, such as classifier systems and their variants [9,44,45,42,43], or the related ``Prototypical Self-referential Associating Learning Mechanisms'' (PSALMs), the Neural Bucket Brigade, Hayek Machines [1,12], and Collective Intelligences (COINs).
The basic ideas of the present chapter will probably remain unchanged, however: competing agents will agree on algorithmic experiments and bet on their outcomes, the winners profiting from outwitting others.