Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization

Authors

  • Yufei Kuang, CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China
  • Miao Lu, CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China
  • Jie Wang, CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
  • Qi Zhou, CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China
  • Bin Li, CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China
  • Houqiang Li, CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

DOI:

https://doi.org/10.1609/aaai.v36i7.20686

Keywords:

Machine Learning (ML)

Abstract

Deep reinforcement learning algorithms can perform poorly in real-world tasks due to the discrepancy between source and target environments. This discrepancy is commonly viewed as a disturbance in transition dynamics. Many existing algorithms learn robust policies by modeling the disturbance and applying it to source environments during training, which usually requires prior knowledge about the disturbance and control over simulators. However, these algorithms can fail in scenarios where the disturbance from target environments is unknown or intractable to model in simulators. To tackle this problem, we propose a novel model-free actor-critic algorithm, state-conservative policy optimization (SCPO), to learn robust policies without modeling the disturbance in advance. Specifically, SCPO reduces the disturbance in transition dynamics to a disturbance in state space and then approximates it by a simple gradient-based regularizer. SCPO is appealing in that it is simple to implement and requires neither additional knowledge about the disturbance nor specially designed simulators. Experiments on several robot control tasks demonstrate that SCPO learns robust policies against disturbances in transition dynamics.
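The core mechanism named in the abstract (approximating the worst-case state disturbance with a gradient-based regularizer) can be illustrated with a short sketch. The Python/PyTorch code below is a hypothetical, minimal rendering of that idea, not the authors' implementation: the MLP networks, the one-gradient-step (FGSM-style) perturbation, and the radius epsilon are all illustrative assumptions.

    # Hypothetical sketch of a gradient-based state-conservative regularizer
    # in the spirit of SCPO as described above; not the authors' reference code.
    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        """Small illustrative network used for both actor and critic."""
        def __init__(self, in_dim, out_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )

        def forward(self, *xs):
            # Concatenate inputs so the critic can take (state, action) pairs.
            return self.net(torch.cat(xs, dim=-1))

    def state_conservative_actor_loss(actor, critic, states, epsilon=0.01):
        """One-gradient-step (FGSM-style) approximation of the worst-case
        state perturbation, followed by the usual actor objective."""
        states = states.detach().requires_grad_(True)
        # Standard deterministic actor objective: maximize Q(s, pi(s)).
        loss = -critic(states, actor(states)).mean()
        # The gradient w.r.t. the *input states* points toward the locally
        # worst small state disturbance.
        grad = torch.autograd.grad(loss, states)[0]
        perturbed = (states + epsilon * grad.sign()).detach()
        # Train the actor against the adversarially perturbed states.
        return -critic(perturbed, actor(perturbed)).mean()

    # Illustrative usage on random data (8-dim states, 2-dim actions).
    actor = MLP(in_dim=8, out_dim=2)        # pi: state -> action
    critic = MLP(in_dim=8 + 2, out_dim=1)   # Q: (state, action) -> value
    states = torch.randn(32, 8)
    state_conservative_actor_loss(actor, critic, states).backward()

In this reading, the actor is trained on states shifted in the locally worst direction within a small l-infinity ball, which is consistent with the abstract's claims that the method is model-free and needs neither prior knowledge of the disturbance nor control over the simulator.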

Published

2022-06-28

How to Cite

Kuang, Y., Lu, M., Wang, J., Zhou, Q., Li, B., & Li, H. (2022). Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7), 7247-7254. https://doi.org/10.1609/aaai.v36i7.20686

Issue

Vol. 36 No. 7 (2022)
Section

AAAI Technical Track on Machine Learning II