Ronen Brafman

Abstract:

In reinforcement learning, the goal is to learn useful behaviors in unknown environments. The standard environment model in most reinforcement-learning work is the Markov decision process. In this talk I will consider a more general model that also captures multi-agent environments, namely zero-sum stochastic games. I will describe an extremely simple model-based learning algorithm called R-max, the first algorithm to attain near-optimal average reward in polynomial time. Throughout the learning process, the agent maintains a complete, but possibly inaccurate, model of its environment and acts on the optimal policy derived from this model. The model is initialized with a wishful-thinking, or optimistic, approach: every action in every state is assumed to return the maximal possible reward. During execution, the model is updated based on the agent's observations. R-max succeeds in multi-agent environments because adversaries can, at worst, either prevent the agent from learning or prevent it from obtaining high rewards, but not both.
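To make the "optimistic model plus replanning" idea concrete, here is a minimal tabular sketch for the MDP case. All names and defaults below are illustrative, not from the paper, and this uses a discounted value-iteration planner for simplicity (the talk's guarantees are stated for average reward). Unknown state-action pairs are modeled as jumping to a fictitious state that pays the maximal reward forever, which is what drives exploration:

```python
def rmax(env_step, n_states, n_actions, r_max, m=10, gamma=0.95,
         episodes=300, horizon=40):
    """Illustrative tabular R-max sketch (discounted variant).

    env_step(s, a) -> (next_state, reward), with states 0..n_states-1.
    A state-action pair becomes "known" after m visits; until then it is
    modeled optimistically as jumping to a fictitious absorbing state
    (index n_states) that returns r_max on every step.
    """
    H = n_states                                   # fictitious r_max state
    counts = [[0] * n_actions for _ in range(n_states)]
    trans = [[[0] * n_states for _ in range(n_actions)]
             for _ in range(n_states)]
    rsum = [[0.0] * n_actions for _ in range(n_states)]

    def q_value(s, a, V):
        if counts[s][a] >= m:                      # known: empirical model
            n = counts[s][a]
            return rsum[s][a] / n + gamma * sum(
                trans[s][a][s2] / n * V[s2] for s2 in range(n_states))
        return r_max + gamma * V[H]                # unknown: optimistic

    def plan():
        # Value iteration on the current (optimistic) model.
        V = [0.0] * (n_states + 1)
        for _ in range(300):
            newV = V[:]
            newV[H] = r_max + gamma * V[H]         # fictitious self-loop
            for s in range(n_states):
                newV[s] = max(q_value(s, a, V) for a in range(n_actions))
            V = newV
        return [max(range(n_actions), key=lambda a: q_value(s, a, V))
                for s in range(n_states)]

    policy = plan()
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = policy[s]
            s2, r = env_step(s, a)
            if counts[s][a] < m:                   # record until known
                counts[s][a] += 1
                trans[s][a][s2] += 1
                rsum[s][a] += r
                if counts[s][a] == m:              # newly known: replan
                    policy = plan()
            s = s2
    return plan()
```

Because unknown pairs always look at least as good as the best real return, the greedy policy either earns near-optimal reward or visits an unknown pair, and each pair becomes known after only m visits; this is the exploration/exploitation mechanism the abstract refers to.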

As special cases of stochastic games, R-max applies to Markov decision processes and to repeated zero-sum games, improving on past algorithms for these problems by Kearns and Singh, Monderer and Tennenholtz, Banos, and Megiddo. It has a built-in mechanism for resolving the exploration vs. exploitation dilemma, and it provides the first formal justification for the "optimism under uncertainty" bias used in many RL algorithms. Using R-max, we were recently able to give a near-optimal polynomial-time algorithm for learning in common-interest stochastic games under imperfect monitoring.

Joint work with Moshe Tennenholtz from the Technion.