Most algorithms for reinforcement learning and adaptive control in non-stationary environments can be classified into two major categories.
First, there is the approach of `back-propagation through time and through a frozen model network'. With different degrees of generality it has been pursued by , , , , and  in the case where there is external feedback through a reactive environment.
Second, there is the `adaptive critic' approach, which, again with different degrees of generality, has been pursued by , , , , , and . The `Neural Bucket Brigade Algorithm'  also bears relationships to adaptive critics.
Both the algorithms based on pure gradient descent as well as the `adaptive critic' algorithms have at least one thing in common: They show significant drawbacks when the credit assignment process has to bridge long time gaps between past actions and later consequences.
Both approaches show awkward performance in the case where the learning system already has learned a lot of action sequences in the past. With both approaches, credit assignment proceeds `from time slice to time slice' instead of allowing `jumps through time' on a higher, more abstract level. One could say that both approaches tend to modify `sub-programs' instead of modifying the trigger conditions for sub-programs. They do not have an explicit concept of something like a sub-program. Pure gradient descent methods always consider all past states for credit assignment. Adaptive critics based on `Temporal Differences' (reinforcement comparison methods) or on `Heuristic Dynamic Programming' consider only the most recent states for `handing expectations back into time'. Both methods in general tend to consider the wrong states, they do not selectively focus on relevant points in time. This is a major reason for slow performance in large scale applications where there can be long delays between actions and consequences.
If there is no prior knowledge about typical consequences of certain action sequences, then a learning system cannot be expected to sensibly reduce the number of past states that are `critical' for credit assignment. But if it is assumed that the learning system has already learned to perform well on certain sub-tasks, then an intelligent credit assignment process should make use of available sub-programs to ease the learning of new tasks.
In fact, the learning system incrementally should use information about the starting conditions and the effects of sub-programs to compose more complicated sub-programs in a hierarchical fashion.
Compositional learning  means finding solutions for new tasks by sequentially combining solutions to older tasks. It means learning to `divide and conquer'. It means dividing the task of finding a sequence of actions leading from a current state to a goal state by decomposing the problem into sub-tasks, such that already existing sub-programs for the sub-tasks can be combined. It means to ignore irrelevant details of sub-programs. It means to focus on the interfaces between sub-programs. Compositions of sub-programs may serve as sub-programs for even more complicated tasks.
In this paper it is assumed that the `divide and conquer problem' can be divided and partly conquered by decomposing it into two problems, namely, the `dividing-problem', and the `conquering-problem'.
The `dividing-problem' is the problem of structuring all kinds of sequences of events and/or actions into `parts that belong together'. It is the problem of deciding what a good sub-program is, which its initial conditions are, and when it ends. The `dividing-problem' has to do with unsupervised learning. It has been adressed in , , and .
The `conquering-problem' is to select from many available sub-programs and to combine them in a way that allows to reach a given goal state.
In the sequel the `conquering-problem' will be isolated and studied under the assumption that the `dividing-problem' is already solved. In contrast to previous approaches for credit assignment `from time slice to time slice' the approach described in the next section allows extensive `credit assignment jumps through time'.