Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization

Authors

  • Woosung Kim, Korea University
  • Donghyeon Ki, Korea University
  • Byung-Jun Lee, Korea University

DOI:

https://doi.org/10.1609/aaai.v38i12.29218

Keywords:

ML: Reinforcement Learning, ML: Optimization

Abstract

One of the major challenges of offline reinforcement learning (RL) is dealing with the distribution shift that stems from the mismatch between the trained policy and the data collection policy. Stationary distribution correction estimation (DICE) algorithms address this issue by regularizing policy optimization with an f-divergence between the state-action visitation distributions of the data collection policy and the optimized policy. While such regularization integrates naturally into an objective for the optimal state-action visitation, this implicit policy optimization framework has shown limited performance in practice. We observe that the reduced performance is attributable to the biased estimate and to the properties of the conjugate function arising from the f-divergence regularization. In this paper, we improve the regularized implicit policy optimization framework by mitigating the bias and reshaping the conjugate function through a relaxation of its constraints. We show that the relaxation adjusts the degree to which sub-optimal samples are involved in the optimization, and we derive a new offline RL algorithm that benefits from the relaxed framework, improving upon a previous implicit policy optimization algorithm by a large margin.
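For context, the f-divergence-regularized objective that DICE-style methods solve, and the dual in which the conjugate function appears, can be sketched as follows. This is a minimal sketch in conventional DICE notation; the symbols d^D (dataset visitation distribution), mu_0 (initial state distribution), e_nu (advantage-like residual), and f_* (convex conjugate of f) are standard conventions and are not taken from this page, and the paper's proposed relaxation of the conjugate is not reproduced here.

% Sketch of the regularized primal over state-action visitations d >= 0,
% with Bellman flow constraints and regularization strength alpha > 0:
\begin{align*}
  \max_{d \ge 0} \quad
    & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big]
      - \alpha\, D_f\!\big(d \,\|\, d^{D}\big) \\
  \text{s.t.} \quad
    & \sum_{a} d(s,a)
      = (1-\gamma)\,\mu_0(s)
      + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')
      \quad \forall s.
\end{align*}
% Introducing Lagrange multipliers nu(s) for the flow constraints yields an
% unconstrained dual over nu in which the convex conjugate f_* appears:
\begin{align*}
  \min_{\nu} \quad
  (1-\gamma)\,\mathbb{E}_{s\sim\mu_0}\big[\nu(s)\big]
  + \alpha\, \mathbb{E}_{(s,a)\sim d^{D}}
      \Big[ f_*\!\Big(\tfrac{e_\nu(s,a)}{\alpha}\Big) \Big],
  \qquad
  e_\nu(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\nu(s')\big] - \nu(s).
\end{align*}

The shape of f_* (and the implicit nonnegativity constraint on d) determines how samples with small or negative residuals e_nu contribute to the objective, which is the part of the framework the abstract describes relaxing.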

Published

2024-03-24

How to Cite

Kim, W., Ki, D., & Lee, B.-J. (2024). Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 13185-13192. https://doi.org/10.1609/aaai.v38i12.29218

Section

AAAI Technical Track on Machine Learning III