Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Yuanzhao Zhai; Yiying Li; Zijian Gao; Xudong Gong; Kele Xu; Dawei Feng; Ding Bo; Huaimin Wang

doi:10.1609/aaai.v38i15.29607

Authors

Yuanzhao Zhai: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Yiying Li: Artificial Intelligence Research Center, DII, Beijing, China
Zijian Gao: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Xudong Gong: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Kele Xu: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Dawei Feng: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Ding Bo: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Huaimin Wang: National University of Defense Technology, Changsha, China; State Key Laboratory of Complex & Critical Software Environment, Changsha, China

DOI:

https://doi.org/10.1609/aaai.v38i15.29607

Keywords:

ML: Reinforcement Learning, PRS: Model-Based Reasoning, RU: Sequential Decision Making

Abstract

Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism, which encourages more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
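
The two-stage idea in the abstract (roll out optimistically, then relabel pessimistically) can be illustrated with a toy sketch. The abstract does not say how uncertainty is estimated or what form the optimistic bonus and the penalty take, so everything below is an assumption made for illustration: uncertainty is taken to be ensemble disagreement, the O-MDP reward is assumed to be r + beta * u, the P-MDP reward is assumed to be r - lam * u, and a one-step greedy policy stands in for a trained rollout policy. All function names are hypothetical and are not taken from the paper.

```python
import random
import statistics


def ensemble_predict(models, state, action):
    """Collect each ensemble member's (next_state, reward) prediction."""
    return [model(state, action) for model in models]


def uncertainty(predictions):
    """Assumed estimator: std-dev of predicted next states across the ensemble."""
    return statistics.pstdev([next_state for next_state, _ in predictions])


def optimistic_reward(reward, u, beta=1.0):
    # Assumed O-MDP reward: a bonus for uncertain (likely OOD) state-action pairs.
    return reward + beta * u


def pessimistic_reward(reward, u, lam=1.0):
    # Assumed P-MDP reward: a penalty for uncertain state-action pairs.
    return reward - lam * u


def greedy_optimistic_policy(models, candidate_actions, beta=1.0):
    """Stand-in for a trained rollout policy: maximize the one-step optimistic reward."""
    def policy(state):
        def score(action):
            preds = ensemble_predict(models, state, action)
            mean_reward = statistics.fmean([r for _, r in preds])
            return optimistic_reward(mean_reward, uncertainty(preds), beta)
        return max(candidate_actions, key=score)
    return policy


def rollout_and_relabel(models, rollout_policy, start_state, horizon, lam=1.0, seed=0):
    """Roll out with the optimistic policy, then store penalized rewards."""
    rng = random.Random(seed)
    state, buffer = start_state, []
    for _ in range(horizon):
        action = rollout_policy(state)
        preds = ensemble_predict(models, state, action)
        u = uncertainty(preds)
        next_state, reward = rng.choice(preds)  # sample one ensemble member
        buffer.append((state, action, pessimistic_reward(reward, u, lam), next_state))
        state = next_state
    return buffer


# Toy 1-D ensemble: members disagree more as the action magnitude grows.
models = [lambda s, a, k=k: (s + k * a, -abs(s)) for k in (0.9, 1.0, 1.1)]
policy = greedy_optimistic_policy(models, candidate_actions=[-1.0, 0.0, 2.0])
buffer = rollout_and_relabel(models, policy, start_state=0.0, horizon=5)
```

In this toy setup the optimistic policy always picks the action with the largest ensemble disagreement, while every transition stored in `buffer` carries a reward lower than the raw model reward. An output policy would then be optimized on that buffer with any offline policy optimizer, which is outside the scope of this sketch.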
