Policy Optimization with Stochastic Mirror Descent

Authors

  • Long Yang, Zhejiang University, China
  • Yu Zhang, Netease Games AI Lab, Hangzhou, China
  • Gang Zheng, Zhejiang University, China
  • Qian Zheng, Zhejiang University, China; Nanyang Technological University, Singapore
  • Pengfei Li, Zhejiang University, China
  • Jianhang Huang, Zhejiang University, China
  • Gang Pan, Zhejiang University, China

DOI:

https://doi.org/10.1609/aaai.v36i8.20863

Keywords:

Machine Learning (ML)

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε⁻³) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms the state-of-the-art policy gradient methods in various settings.
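
For intuition, the sketch below illustrates the generic stochastic mirror descent update that VRMPO builds on, using the negative-entropy mirror map on a single-state softmax policy (an exponentiated-gradient step). It is a minimal illustration with assumed names (mirror_descent_step, a hand-made noisy gradient), not the authors' VRMPO method: the paper's variance-reduced gradient estimator and convergence analysis are not reproduced here.

import numpy as np

def mirror_descent_step(probs, grad, step_size):
    # One stochastic mirror (ascent) step on the probability simplex with
    # the negative-entropy mirror map, i.e. an exponentiated-gradient update:
    # p_i <- p_i * exp(step_size * grad_i), then renormalize.
    logits = np.log(probs) + step_size * grad
    logits -= logits.max()            # for numerical stability
    new_probs = np.exp(logits)
    return new_probs / new_probs.sum()

# Toy usage: 3 actions, noisy gradient estimates that favor action 0.
rng = np.random.default_rng(0)
p = np.ones(3) / 3
for _ in range(200):
    g = np.array([1.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(3)
    p = mirror_descent_step(p, g, step_size=0.1)
print(p)  # probability mass concentrates on action 0

With the negative-entropy mirror map, the mirror descent step reduces to a multiplicative update that stays on the simplex by construction; this is the standard appeal of mirror descent over a plain Euclidean gradient step followed by projection.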

Published

2022-06-28

How to Cite

Yang, L., Zhang, Y., Zheng, G., Zheng, Q., Li, P., Huang, J., & Pan, G. (2022). Policy Optimization with Stochastic Mirror Descent. Proceedings of the AAAI Conference on Artificial Intelligence, 36(8), 8823-8831. https://doi.org/10.1609/aaai.v36i8.20863

Section

AAAI Technical Track on Machine Learning III