Optimistic Model Rollouts for Pessimistic Offline Policy Optimization
DOI:
https://doi.org/10.1609/aaai.v38i15.29607
Keywords:
ML: Reinforcement Learning, PRS: Model-Based Reasoning, RU: Sequential Decision Making
Abstract
Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism, which encourages more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
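The relabeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that model uncertainty is measured by disagreement across an ensemble of dynamics models, and that the O-MDP and P-MDP rewards add or subtract a linear multiple of that uncertainty. The abstract confirms neither choice, and the function names and the coefficients `bonus_coef` and `penalty_coef` are hypothetical.

```python
import numpy as np


def ensemble_uncertainty(next_state_preds):
    """Assumed uncertainty estimate: largest L2 deviation from the ensemble mean.

    next_state_preds has shape (n_models, batch, state_dim).
    Returns an array of shape (batch,).
    """
    mean = next_state_preds.mean(axis=0)
    deviations = np.linalg.norm(next_state_preds - mean, axis=-1)
    return deviations.max(axis=0)


def optimistic_reward(reward, uncertainty, bonus_coef=1.0):
    """Assumed O-MDP reward: an uncertainty bonus encourages OOD rollouts."""
    return reward + bonus_coef * uncertainty


def pessimistic_reward(reward, uncertainty, penalty_coef=1.0):
    """Assumed P-MDP reward: an uncertainty penalty used when relabeling."""
    return reward - penalty_coef * uncertainty


# Toy data standing in for a batch of model rollouts (purely illustrative).
rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 4, 3))   # 5 ensemble members, 4 transitions, 3-dim state
rewards = rng.normal(size=4)         # model-predicted rewards

u = ensemble_uncertainty(preds)
r_opt = optimistic_reward(rewards, u, bonus_coef=0.5)      # trains the rollout policy
r_pes = pessimistic_reward(rewards, u, penalty_coef=2.0)   # relabels for the output policy
```

In this sketch the same state-action pairs carry two different rewards: the rollout policy would be trained on `r_opt`, while the output policy would be trained on the relabeled `r_pes`.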
Published
2024-03-24
How to Cite
Zhai, Y., Li, Y., Gao, Z., Gong, X., Xu, K., Feng, D., … Wang, H. (2024). Optimistic Model Rollouts for Pessimistic Offline Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(15), 16678–16686. https://doi.org/10.1609/aaai.v38i15.29607
Issue
Section
AAAI Technical Track on Machine Learning VI