Policy Optimization with Stochastic Mirror Descent

Authors

  • Long Yang, Zhejiang University, China
  • Yu Zhang, Netease Games AI Lab, Hangzhou, China
  • Gang Zheng, Zhejiang University, China
  • Qian Zheng, Zhejiang University, China; Nanyang Technological University, Singapore
  • Pengfei Li, Zhejiang University, China
  • Jianhang Huang, Zhejiang University, China
  • Gang Pan, Zhejiang University, China

DOI:

https://doi.org/10.1609/aaai.v36i8.20863

Keywords:

Machine Learning (ML)

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε⁻³) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms the state-of-the-art policy gradient methods in various settings.
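
For intuition, the sketch below illustrates the generic stochastic mirror descent update that VRMPO builds on, using the negative-entropy mirror map on a single-state softmax policy (an exponentiated-gradient step). It is a minimal illustration with assumed names (mirror_descent_step, a hand-made noisy gradient), not the authors' VRMPO method: the paper's variance-reduced gradient estimator and convergence analysis are not reproduced here.

import numpy as np

def mirror_descent_step(probs, grad, step_size):
    # One stochastic mirror (ascent) step on the probability simplex with
    # the negative-entropy mirror map, i.e. an exponentiated-gradient update:
    # p_i <- p_i * exp(step_size * grad_i), then renormalize.
    logits = np.log(probs) + step_size * grad
    logits -= logits.max()            # for numerical stability
    new_probs = np.exp(logits)
    return new_probs / new_probs.sum()

# Toy usage: 3 actions, noisy gradient estimates that favor action 0.
rng = np.random.default_rng(0)
p = np.ones(3) / 3
for _ in range(200):
    g = np.array([1.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(3)
    p = mirror_descent_step(p, g, step_size=0.1)
print(p)  # probability mass concentrates on action 0

With the negative-entropy mirror map, the mirror descent step reduces to a multiplicative update that stays on the simplex by construction; this is the standard appeal of mirror descent over a plain Euclidean gradient step followed by projection.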

Published

2022-06-28

How to Cite

Yang, L., Zhang, Y., Zheng, G., Zheng, Q., Li, P., Huang, J., & Pan, G. (2022). Policy Optimization with Stochastic Mirror Descent. Proceedings of the AAAI Conference on Artificial Intelligence, 36(8), 8823-8831. https://doi.org/10.1609/aaai.v36i8.20863

Section

AAAI Technical Track on Machine Learning III