Model-Based Offline Weighted Policy Optimization (Student Abstract)

Authors

  • Renzhe Zhou, Nanjing University
  • Zongzhang Zhang, Nanjing University
  • Yang Yu, Nanjing University

DOI:

https://doi.org/10.1609/aaai.v37i13.27056

Keywords:

Offline Reinforcement Learning, Model-Based Reinforcement Learning, Weighted Bellman Update

Abstract

A promising direction for applying reinforcement learning to the real world is learning from offline datasets. Offline reinforcement learning aims to learn policies from pre-collected datasets without online interaction with the environment. Because no further interaction is possible, offline reinforcement learning suffers from severe extrapolation error, which can cause policy learning to fail. In this paper, we investigate the weighted Bellman update in model-based offline reinforcement learning. We explore uncertainty estimation in ensemble dynamics models, use a variational autoencoder to fit the behavioral prior, and finally propose an algorithm called Model-Based Offline Weighted Policy Optimization (MOWPO), which uses a combination of model confidence and behavioral prior as weights to reduce the impact of inaccurate samples on policy optimization. Experimental results show that MOWPO achieves better performance than state-of-the-art algorithms, and that both the model confidence weight and the behavioral prior weight play an active role in offline policy optimization.
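To make the weighting idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a weighted Bellman update in which per-sample weights combine (i) a model confidence term derived from the disagreement of an ensemble of dynamics models and (ii) a behavioral prior term derived from a VAE fit to the dataset actions. The module names (`ensemble`, `vae`, `q_net`, `policy`) and the exact exponential weighting formulas are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a weighted Bellman update with model-confidence and
# behavioral-prior weights. All network modules are assumed to be provided
# by the caller; the weighting formulas are illustrative, not MOWPO's exact ones.
import torch

def model_confidence(ensemble, s, a):
    # Confidence = exp(-disagreement): higher disagreement among ensemble
    # next-state predictions means the sample is less trustworthy.
    preds = torch.stack([m(torch.cat([s, a], dim=-1)) for m in ensemble])  # (E, B, S)
    disagreement = preds.std(dim=0).mean(dim=-1)                           # (B,)
    return torch.exp(-disagreement)

def behavioral_prior(vae, s, a):
    # Use the VAE's action reconstruction error as a proxy for how likely
    # (s, a) is under the behavior policy that collected the dataset.
    recon = vae(s, a)                                   # reconstructed action, (B, A)
    nll = ((recon - a) ** 2).mean(dim=-1)               # (B,)
    return torch.exp(-nll)

def weighted_bellman_loss(q_net, target_q_net, policy, batch,
                          ensemble, vae, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next).squeeze(-1)
        # Combined weight: down-weights samples the dynamics model is unsure
        # about or that lie far from the behavior distribution.
        w = model_confidence(ensemble, s, a) * behavioral_prior(vae, s, a)
        w = w / (w.mean() + 1e-8)                       # keep the loss scale stable
    td_error = q_net(s, a).squeeze(-1) - target
    return (w * td_error.pow(2)).mean()
```

The resulting scalar can be minimized in place of the standard (unweighted) TD loss inside any actor-critic style offline algorithm; samples with low model confidence or low behavioral likelihood then contribute less to policy optimization.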

Published

2024-07-15

How to Cite

Zhou, R., Zhang, Z., & Yu, Y. (2024). Model-Based Offline Weighted Policy Optimization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 37(13), 16392-16393. https://doi.org/10.1609/aaai.v37i13.27056