Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization

Authors

  • Woosung Kim, Korea University
  • Donghyeon Ki, Korea University
  • Byung-Jun Lee, Korea University

DOI:

https://doi.org/10.1609/aaai.v38i12.29218

Keywords:

ML: Reinforcement Learning, ML: Optimization

Abstract

One of the major challenges of offline reinforcement learning (RL) is dealing with the distribution shift that stems from the mismatch between the trained policy and the data collection policy. Stationary distribution correction estimation (DICE) algorithms address this issue by regularizing policy optimization with an f-divergence between the state-action visitation distributions of the data collection policy and the optimized policy. While such regularization integrates naturally into an objective for the optimal state-action visitation, this implicit policy optimization framework has shown limited performance in practice. We observe that the reduced performance is attributable to the biased estimate and to the properties of the conjugate function arising from the f-divergence regularization. In this paper, we improve the regularized implicit policy optimization framework by mitigating the bias and reshaping the conjugate function through a relaxation of its constraints. We show that the relaxation adjusts the degree to which sub-optimal samples are involved in the optimization, and we derive a new offline RL algorithm that benefits from the relaxed framework, improving upon a previous implicit policy optimization algorithm by a large margin.
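For context, the f-divergence-regularized objective that DICE-style methods solve, and the dual in which the conjugate function appears, can be sketched as follows. This is a minimal sketch in conventional DICE notation; the symbols d^D (dataset visitation distribution), mu_0 (initial state distribution), e_nu (advantage-like residual), and f_* (convex conjugate of f) are standard conventions and are not taken from this page, and the paper's proposed relaxation of the conjugate is not reproduced here.

% Sketch of the regularized primal over state-action visitations d >= 0,
% with Bellman flow constraints and regularization strength alpha > 0:
\begin{align*}
  \max_{d \ge 0} \quad
    & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big]
      - \alpha\, D_f\!\big(d \,\|\, d^{D}\big) \\
  \text{s.t.} \quad
    & \sum_{a} d(s,a)
      = (1-\gamma)\,\mu_0(s)
      + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')
      \quad \forall s.
\end{align*}
% Introducing Lagrange multipliers nu(s) for the flow constraints yields an
% unconstrained dual over nu in which the convex conjugate f_* appears:
\begin{align*}
  \min_{\nu} \quad
  (1-\gamma)\,\mathbb{E}_{s\sim\mu_0}\big[\nu(s)\big]
  + \alpha\, \mathbb{E}_{(s,a)\sim d^{D}}
      \Big[ f_*\!\Big(\tfrac{e_\nu(s,a)}{\alpha}\Big) \Big],
  \qquad
  e_\nu(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\nu(s')\big] - \nu(s).
\end{align*}

The shape of f_* (and the implicit nonnegativity constraint on d) determines how samples with small or negative residuals e_nu contribute to the objective, which is the part of the framework the abstract describes relaxing.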

Published

2024-03-24

How to Cite

Kim, W., Ki, D., & Lee, B.-J. (2024). Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 13185-13192. https://doi.org/10.1609/aaai.v38i12.29218

Section

AAAI Technical Track on Machine Learning III