In the previous experiments the learner's environment changed because of externally induced task changes. We also conducted experimental case studies in which the tasks change in a less obvious way: through the adaptation of other, simultaneously learning agents.
A complex POE. Schmidhuber et al. (1996) describe two agents A and B living in a partially observable pixel environment with obstacles. They learn to solve a complex task that could not be solved by various TD($\lambda$) Q-learning variants [#!Lin:93!#]. The task requires (1) agent A to find and take a key ``key A''; (2) agent A to go to a door ``door A'' and open it for agent B; (3) agent B to enter through ``door A'', then find and take another key ``key B''; (4) agent B to go to another door ``door B'' and open it (to free the way to the goal); (5) one of the agents to reach the goal. Both agents share the same design. Each is equipped with limited ``active'' sight: by executing certain instructions, it can sense obstacles, its own key, the corresponding door, or the goal, within up to 50 pixels in front of it. The agent can also move forward, turn around, and turn relative to its key, its door, or the goal. It can use memory (embodied by its IP) to disambiguate inputs (unlike Jaakkola et al.'s method (1995), ours is not limited to finding suboptimal stochastic policies for POEs with an optimal solution). Reward is provided only if one of the agents touches the goal. This agent's reward is 5.0; the other's is 3.0 (for its cooperation -- note that asymmetric reward introduces competition).
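The sparse, asymmetric reward structure of this task can be summarized in a short sketch (this is illustrative code, not the authors' implementation; the class and field names are our own assumptions):

```python
# Illustrative sketch of the two-agent key-and-door task's payoff:
# reward is paid only in the trial where one agent touches the goal,
# and it is asymmetric (5.0 for the agent at the goal, 3.0 for its
# cooperating partner), which introduces competition.
from dataclasses import dataclass

@dataclass
class Agent:
    has_key: bool = False       # picked up its own key?
    opened_door: bool = False   # opened its own door?
    at_goal: bool = False       # touching the goal?

def final_rewards(a: Agent, b: Agent) -> tuple:
    """Return (reward_A, reward_B) at the end of a trial."""
    if a.at_goal:
        return 5.0, 3.0
    if b.at_goal:
        return 3.0, 5.0
    return 0.0, 0.0             # no reward at all otherwise
```

Note that stages (1)-(4) earn no reward on their own; both agents must learn the long subtask chain purely from the delayed goal reward.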
In the beginning, the goal is found only once every 300,000 basic cycles. Through self-modifications and SSA, however, within 130,000 trials the average trial length decreases by a factor of 60 (mean of 4 simulations). Both agents learn to cooperate to accelerate reward intake. See [#!Schmidhuber:96meta!#] for details.
Zero sum games. Even certain zero-sum reward tasks allow for achieving success stories. This has been shown in an experiment with three IS-based agents [#!Jieyu:96self!#]: each agent is both predator and prey; it receives reward 1 for catching its prey and reward -1 for being caught. Since all agents learn, each agent's task gets more and more difficult over time. How can an agent then create a non-trivial history of policy modifications, each corresponding to a lifelong reward acceleration? The answer is: each agent collects a lot of negative reward during its life, yet comes up with a history of policy modifications causing less and less negative cumulative long-term reward. The stacks of all agents tend to grow continually as they discover better and better pursuit-evasion strategies.
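The point that success stories survive even under purely negative reward can be made concrete with a minimal sketch of SSA's stack check (our own simplified rendering, not the original implementation; entry layout and names are assumptions): a modification stays on the stack only while the average reward per time since it was made exceeds the average since the previous surviving modification, and less negative averages count as improvements just as well as positive ones.

```python
# Minimal sketch of the Success-Story Algorithm's stack check.
# Each stack entry is (t_mod, r_mod): the time and cumulative reward
# at the moment a policy modification was made.
def ssa_check(stack, t_now, r_now):
    """Pop modifications that do not correspond to a long-term reward
    acceleration; afterwards, reward per time since each surviving
    modification strictly increases from stack bottom to top."""
    def rate(t_mod, r_mod):
        # average reward per time step since the modification;
        # with negative rewards, "less negative" means a higher rate
        return (r_now - r_mod) / (t_now - t_mod)
    while len(stack) >= 2 and rate(*stack[-1]) <= rate(*stack[-2]):
        stack.pop()   # undo: did not accelerate reward intake
    if len(stack) == 1 and rate(*stack[-1]) <= r_now / t_now:
        stack.pop()   # compare oldest modification to lifetime average
    return stack
```

For example, with cumulative reward 100 at time 100, a modification made at time 50 with cumulative reward 90 (rate 0.2 since then) is popped, while an earlier one made at time 10 with cumulative reward 5 (rate about 1.06 since then, above the lifetime average of 1.0) survives.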