LSTM RNNs for model-free value function-based reinforcement learning in POMDPs

Bram Bakker
IDSIA, Lugano, Switzerland
Dept. of Computer Science, University of Amsterdam, The Netherlands

In machine learning methods concerned with temporal processing of sequences of signals, hidden state is a ubiquitous problem. This holds for prediction and recognition tasks as well as for control tasks. In speech recognition, for example, the same acoustic signal at a given time may correspond to parts of different words, and the correct word can only be determined by taking into account a sequence of acoustic signals. In mobile robot control, the same laser range finder profile may correspond to different places in the world, and the correct place, and therefore the best next action, can only be determined by taking into account a sequence of laser range finder profiles.

My work is concerned with reinforcement learning in control problems with such hidden state: an agent must learn a control policy in these environments (often called Partially Observable Markov Decision Processes, or POMDPs) based on delayed scalar reward signals given only when the agent reaches desired goals. I will briefly describe how standard reinforcement learning algorithms deal with delayed reward by estimating value functions over the state space. Then I will describe how this approach can be extended to environments with hidden state, using function approximators with short-term memory, in my case LSTM recurrent neural networks. This is analogous to the use of recurrent neural networks in supervised prediction and recognition learning.

Next I will describe my recent work on exploiting hierarchical structure for these types of problems. The reasoning here is again similar to certain work in supervised prediction learning: in complex problems one may want to decompose the problem into simpler subproblems, each of which can be solved more easily by a specialized system (again, an LSTM recurrent neural network) than the problem as a whole. I will explain how hierarchy can be especially beneficial in problems with hidden state. In my hierarchical reinforcement learning algorithm, higher levels of control learn to select subgoals (which essentially correspond to clusters of observations), and lower-level controllers learn to specialize for different subgoals. I will present experimental results on a test problem, discuss similarities to other work in both supervised prediction learning and reinforcement learning, and point out important current limitations.
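To make the value-function idea concrete, the following is a minimal sketch of a tabular Q-learning update, the standard way value function-based reinforcement learning propagates delayed reward backwards through the state space. This is a generic illustration under assumed names and parameter values (q_learning_update, alpha, gamma), not the specific algorithm used in the work described above.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step.

    Moves Q[s, a] toward the TD target r + gamma * max_a' Q[s_next, a'],
    which is how reward received only at the goal gradually propagates
    back to the value estimates of earlier states.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Illustrative usage: a 5-state, 2-action problem with reward only at the goal.
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=3, a=1, r=1.0, s_next=4)

Iterating this update along trajectories is what allows a purely delayed scalar reward to shape the agent's estimates of how good each state and action is, many steps before the goal is reached.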
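For the partially observable case, the role of the LSTM is to replace the state index with a learned summary of the observation history. The sketch below shows one common way to set this up in PyTorch: an LSTM reads the observation sequence and a linear head outputs Q-value estimates per action. The module name, layer sizes, and training details are illustrative assumptions, not the exact network or procedure from the work described here.

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """LSTM-based Q-value approximator for partially observable problems.

    The LSTM's recurrent state acts as a learned short-term memory, so the
    Q-values can depend on the observation history rather than only on the
    current, possibly ambiguous, observation.
    """

    def __init__(self, obs_dim: int, num_actions: int, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries memory across calls.
        features, hidden = self.lstm(obs_seq, hidden)
        q_values = self.head(features)  # (batch, time, num_actions)
        return q_values, hidden

# Illustrative usage: Q-values for one 10-step observation sequence.
net = RecurrentQNetwork(obs_dim=4, num_actions=3)
obs_seq = torch.randn(1, 10, 4)
q_values, hidden = net(obs_seq)
print(q_values.shape)  # torch.Size([1, 10, 3])

Because the recurrent state is carried across time steps, two identical observations occurring in different contexts can receive different Q-values, which is exactly what the hidden-state problem requires of the function approximator.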