OFFER: Off-Environment Reinforcement Learning

Authors

  • Kamil Ciosek, University of Oxford
  • Shimon Whiteson, University of Oxford

DOI:

https://doi.org/10.1609/aaai.v31i1.10810

Keywords:

Markov Decision Process, Policy Gradient, Variance Reduction, Actor-Critic, REINFORCE

Abstract

Policy gradient methods have been widely applied in reinforcement learning. For reasons of safety and cost, learning is often conducted using a simulator. However, learning in simulation does not traditionally exploit the opportunity to improve learning by adjusting certain environment variables: state features that are randomly determined by the environment in a physical setting but controllable in a simulator. Exploiting environment variables is crucial in domains containing significant rare events (SREs), e.g., unusual wind conditions that can crash a helicopter, which are rarely observed under random sampling but have a considerable impact on expected return. We propose off-environment reinforcement learning (OFFER), which addresses such cases by simultaneously optimising the policy and a proposal distribution over environment variables. We prove that OFFER converges to a locally optimal policy and show experimentally that it learns better and faster than a policy gradient baseline.
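
As a rough illustration of the idea the abstract describes, the sketch below combines an importance-weighted REINFORCE update with a learned proposal distribution over a single environment variable (a rare "gust" of wind). The toy one-step task, the reward values, the specific variance-reduction update for the proposal, and all names and hyperparameters (P_GUST, alpha_theta, alpha_phi, the clipping range) are assumptions made for illustration; this is not the paper's algorithm, domains, or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# True environment: a one-step task with a binary "wind" variable w.
# w = 1 (a gust) is the significant rare event: rare under the real
# environment distribution but catastrophic for the aggressive action.
P_GUST = 0.01  # true probability of the rare event (assumed for illustration)

def reward(action, w):
    """Reward for action 0 (cautious) or 1 (aggressive) given wind w."""
    if w == 1:
        return -200.0 if action == 1 else -1.0
    return 1.0 if action == 1 else 0.0

def policy_probs(theta):
    """Softmax policy over the two actions, parameterised by a single logit."""
    logits = np.array([0.0, theta])
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = 0.0         # policy parameter
phi = 0.0           # proposal parameter: logit of sampling the gust
alpha_theta = 0.05  # policy step size (assumed)
alpha_phi = 0.005   # proposal step size (assumed)

for step in range(50_000):
    q_gust = 1.0 / (1.0 + np.exp(-phi))        # proposal probability of a gust
    w = int(rng.random() < q_gust)             # environment variable ~ proposal
    p_w = P_GUST if w == 1 else 1.0 - P_GUST   # true environment probability
    q_w = q_gust if w == 1 else 1.0 - q_gust
    iw = p_w / q_w                             # importance weight

    probs = policy_probs(theta)
    a = int(rng.random() < probs[1])           # action ~ current policy
    r = reward(a, w)

    # Importance-weighted REINFORCE update for the policy: the weight corrects
    # for sampling w from the proposal instead of the true environment.
    grad_log_pi = a - probs[1]                 # d log pi(a) / d theta
    g = iw * r * grad_log_pi                   # single-sample gradient estimate
    theta += alpha_theta * g

    # Adapt the proposal to reduce the variance of that estimate: take a
    # stochastic descent step on E_q[g^2] (a REINFORCE-style gradient of the
    # second moment with respect to phi).
    grad_log_q = w - q_gust                    # d log q(w) / d phi
    phi += alpha_phi * (g ** 2) * grad_log_q
    phi = float(np.clip(phi, -4.0, 4.0))       # keep the proposal non-degenerate

print("pi(aggressive) =", policy_probs(theta)[1])
print("proposal P(gust) =", 1.0 / (1.0 + np.exp(-phi)))
```

In this toy setting the aggressive action looks attractive until a gust is observed, so a vanilla policy gradient that samples the environment variable at its true 1% rate sees the rare event only occasionally and its gradient estimates are very noisy; oversampling the gust via the proposal and correcting with importance weights keeps the estimate unbiased while lowering its variance, which is the intuition behind optimising the proposal jointly with the policy.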

Published

2017-02-13

How to Cite

Ciosek, K., & Whiteson, S. (2017). OFFER: Off-Environment Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). https://doi.org/10.1609/aaai.v31i1.10810