Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

Authors

  • Haoran Xu, School of Computer Science and Technology, Xidian University, Xi’an, China; JD iCity, JD Technology, Beijing, China; JD Intelligent Cities Research
  • Xianyuan Zhan, Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
  • Xiangyu Zhu, JD iCity, JD Technology, Beijing, China; JD Intelligent Cities Research

DOI:

https://doi.org/10.1609/aaai.v36i8.20855

Keywords:

Machine Learning (ML)

Abstract

We study the problem of safe offline reinforcement learning (RL), in which the goal is to learn a policy that maximizes long-term reward while satisfying safety constraints, given only offline data and no further interaction with the environment. This setting is especially appealing for real-world RL applications in which data collection is costly or dangerous. Enforcing constraint satisfaction is non-trivial, particularly in the offline setting, because a potentially large discrepancy between the policy distribution and the data distribution leads to errors in estimating the value of the safety constraints. We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions. We therefore develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem. Our method admits the use of data generated by mixed behavior policies. We present a theoretical analysis and demonstrate empirically that our approach learns robustly across a variety of benchmark control tasks, outperforming several baselines.
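The abstract describes the method only at a high level. The following is a minimal, hypothetical PyTorch-style sketch in the spirit of a constraints-penalized backup: two critics are trained, one for reward and one for safety cost, and the reward backup is gated on whether the cost critic deems the next action safe. All class, function, and parameter names (and the cost limit value) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Small fully connected network used for both critics.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class ConstraintsPenalizedCritics:
    # Illustrative reward/cost critic pair with a cost-gated reward backup.
    def __init__(self, obs_dim, act_dim, cost_limit=10.0, gamma=0.99):
        self.q_reward = mlp(obs_dim + act_dim, 1)  # Q_r(s, a): expected return
        self.q_cost = mlp(obs_dim + act_dim, 1)    # Q_c(s, a): expected cumulative cost
        self.cost_limit = cost_limit               # safety threshold (assumed value)
        self.gamma = gamma

    def critic_losses(self, s, a, r, c, s_next, a_next):
        # (s, a, r, c, s_next) come from the offline dataset;
        # a_next is assumed to be sampled from the current policy.
        q_r = self.q_reward(torch.cat([s, a], dim=-1))
        q_c = self.q_cost(torch.cat([s, a], dim=-1))
        with torch.no_grad():
            q_r_next = self.q_reward(torch.cat([s_next, a_next], dim=-1))
            q_c_next = self.q_cost(torch.cat([s_next, a_next], dim=-1))
            # Gate the reward backup: the next-state value contributes only when
            # the cost critic judges the next action to satisfy the constraint.
            safe = (q_c_next <= self.cost_limit).float()
            target_r = r + self.gamma * safe * q_r_next
            target_c = c + self.gamma * q_c_next
        loss_reward = ((q_r - target_r) ** 2).mean()
        loss_cost = ((q_c - target_c) ** 2).mean()
        return loss_reward, loss_cost

# Example usage with a random batch of transitions:
obs_dim, act_dim, batch = 8, 2, 32
critics = ConstraintsPenalizedCritics(obs_dim, act_dim)
s, s_next = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
a, a_next = torch.randn(batch, act_dim), torch.randn(batch, act_dim)
r, c = torch.randn(batch, 1), torch.rand(batch, 1)
loss_reward, loss_cost = critics.critic_losses(s, a, r, c, s_next, a_next)

In a full training loop the two losses would be minimized with separate optimizers alongside a policy update; the paper further addresses the distributional-shift issue highlighted in the abstract, which this sketch does not attempt to capture.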

Published

2022-06-28

How to Cite

Xu, H., Zhan, X., & Zhu, X. (2022). Constraints Penalized Q-learning for Safe Offline Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 36(8), 8753-8760. https://doi.org/10.1609/aaai.v36i8.20855

Issue

Vol. 36 No. 8 (2022)

Section

AAAI Technical Track on Machine Learning III