Model-Based Offline Weighted Policy Optimization (Student Abstract)
Keywords: Offline Reinforcement Learning, Model-Based Reinforcement Learning, Weighted Bellman Update
Abstract
A promising direction for applying reinforcement learning to the real world is learning from offline datasets. Offline reinforcement learning aims to learn policies from pre-collected datasets without online interaction with the environment. Due to the lack of further interaction, offline reinforcement learning faces severe extrapolation error, leading to policy learning failure. In this paper, we investigate the weighted Bellman update in model-based offline reinforcement learning. We explore uncertainty estimation in ensemble dynamics models, then use a variational autoencoder to fit the behavioral prior, and finally propose an algorithm called Model-Based Offline Weighted Policy Optimization (MOWPO), which uses a combination of model confidence and behavioral prior as weights to reduce the impact of inaccurate samples on policy optimization. Experimental results show that MOWPO achieves better performance than state-of-the-art algorithms, and both the model confidence weight and the behavioral prior weight play an active role in offline policy optimization.
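The abstract does not specify the exact form of the weights or how they enter the Bellman update; the sketch below illustrates one plausible reading under stated assumptions: model confidence is derived from ensemble disagreement on next-state predictions, the behavioral prior weight from a VAE reconstruction error on state-action pairs, and the two are multiplied (all three choices are hypothetical, not taken from the paper) to reweight per-sample Bellman targets in the critic loss.

```python
import numpy as np

def ensemble_confidence(next_state_preds):
    """Hypothetical confidence from ensemble disagreement.

    next_state_preds: array (n_models, batch, state_dim) of next-state
    predictions from an ensemble of dynamics models. Lower disagreement
    across models -> higher confidence in the model-generated sample.
    """
    disagreement = next_state_preds.std(axis=0).mean(axis=-1)  # (batch,)
    return np.exp(-disagreement)

def behavioral_prior_weight(vae_recon_error):
    """Hypothetical behavioral-prior weight from a VAE fit to the dataset.

    vae_recon_error: per-sample reconstruction error of (s, a) under the
    VAE; low error suggests the pair is close to the behavior policy's
    support, so it receives a larger weight.
    """
    return np.exp(-vae_recon_error)

def weighted_bellman_loss(q_values, rewards, next_q, dones,
                          confidence, prior_weight, gamma=0.99):
    """Critic loss with per-sample weights on the Bellman error.

    The combination rule (product, normalized to mean 1) is an assumption
    for illustration; the paper only states that a combination of model
    confidence and behavioral prior is used as the weight.
    """
    w = confidence * prior_weight
    w = w / w.mean()  # keep the average weight at 1
    targets = rewards + gamma * (1.0 - dones) * next_q
    return np.mean(w * (q_values - targets) ** 2)
```

In this reading, samples that the dynamics ensemble disagrees on, or that lie far from the behavioral data, contribute less to the critic update, which is one way to curb the extrapolation error the abstract describes.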
How to Cite
Zhou, R., Zhang, Z., & Yu, Y. (2023). Model-Based Offline Weighted Policy Optimization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 37(13), 16392-16393. https://doi.org/10.1609/aaai.v37i13.27056
AAAI Student Abstract and Poster Program