OFFER: Off-Environment Reinforcement Learning

Kamil Ciosek; Shimon Whiteson

doi:10.1609/aaai.v31i1.10810

Authors

Kamil Ciosek, University of Oxford
Shimon Whiteson, University of Oxford

DOI:

https://doi.org/10.1609/aaai.v31i1.10810

Keywords:

Markov Decision Process, Policy Gradient, Variance Reduction, Actor-Critic, REINFORCE

Abstract

Policy gradient methods have been widely applied in reinforcement learning. For reasons of safety and cost, learning is often conducted using a simulator. However, learning in simulation does not traditionally utilise the opportunity to improve learning by adjusting certain environment variables: state features that are randomly determined by the environment in a physical setting but controllable in a simulator. Exploiting environment variables is crucial in domains containing significant rare events (SREs), e.g., unusual wind conditions that can crash a helicopter, which are rarely observed under random sampling but have a considerable impact on expected return. We propose off-environment reinforcement learning (OFFER), which addresses such cases by simultaneously optimising the policy and a proposal distribution over environment variables. We prove that OFFER converges to a locally optimal policy and show experimentally that it learns better and faster than a policy gradient baseline.
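
To illustrate why sampling environment variables from a proposal distribution can help when rare events dominate the expected return, here is a minimal sketch of importance sampling on a toy problem. It is not the OFFER algorithm: the proposal here is fixed by hand rather than optimised, there is no policy, and every name and number (`P_RARE`, `R_RARE`, `R_NORMAL`, the proposal probability `q_rare`) is a hypothetical value chosen for illustration only.

```python
import random

# Hypothetical toy setting (assumed, not from the paper): a binary environment
# variable is "rare" with probability P_RARE and then yields a large negative return.
P_RARE = 0.01
R_RARE = -100.0
R_NORMAL = 1.0


def true_expected_return():
    """Exact expected return under the true environment distribution."""
    return P_RARE * R_RARE + (1 - P_RARE) * R_NORMAL


def estimate(n, q_rare, rng):
    """Importance-sampling estimate of the expected return.

    The environment variable is drawn from a proposal that makes the rare
    event occur with probability q_rare; each sample is reweighted by the
    ratio of true to proposal probability, which keeps the estimate unbiased.
    Setting q_rare = P_RARE recovers plain Monte Carlo sampling.
    """
    total = 0.0
    for _ in range(n):
        if rng.random() < q_rare:
            weight = P_RARE / q_rare
            ret = R_RARE
        else:
            weight = (1 - P_RARE) / (1 - q_rare)
            ret = R_NORMAL
        total += weight * ret
    return total / n


def per_sample_variance(q_rare):
    """Exact variance of a single weighted sample under the proposal."""
    rare_term = q_rare * (P_RARE / q_rare * R_RARE) ** 2
    normal_term = (1 - q_rare) * ((1 - P_RARE) / (1 - q_rare) * R_NORMAL) ** 2
    return rare_term + normal_term - true_expected_return() ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    print("true value:          ", true_expected_return())
    print("naive estimate:      ", estimate(20000, P_RARE, rng))
    print("proposal estimate:   ", estimate(20000, 0.5, rng))
    print("variance (naive):    ", per_sample_variance(P_RARE))
    print("variance (proposal): ", per_sample_variance(0.5))
```

In this toy setting the per-sample variance falls from roughly 101 under the true distribution to roughly 4 under the hand-picked proposal, while both estimators remain unbiased. OFFER, as the abstract describes it, goes further by optimising the proposal distribution together with the policy instead of fixing it in advance.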
