Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization

Authors

  • Uri Gadot Technion - Israel Institute of Technology
  • Esther Derman MILA, Université de Montréal
  • Navdeep Kumar Technion - Israel Institute of Technology
  • Maxence Mohamed Elfatihi IMT Atlantique
  • Kfir Levy Technion - Israel Institute of Technology
  • Shie Mannor Technion - Israel Institute of Technology; NVIDIA Research

DOI:

https://doi.org/10.1609/aaai.v38i19.30101

Keywords:

General

Abstract

In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an α-radius of a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDP and policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
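
The connection between a coupled reward ball and frequency regularization can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: a tabular MDP, an L2 ball of radius alpha around the nominal reward, and an unnormalized discounted state-action visitation frequency. Under those assumptions, the worst-case return of a fixed policy is its nominal return minus alpha times the L2 norm of its visitation frequency, which follows from the Cauchy-Schwarz inequality. The function names and the toy MDP below are hypothetical, and the paper's general statement may use a different norm or normalization.

```python
import numpy as np

def occupancy(P, pi, mu, gamma):
    """Discounted state-action visitation frequency d(s, a) of policy pi.

    P: (S, A, S) transition kernel, pi: (S, A) policy,
    mu: (S,) initial state distribution, gamma: discount factor in [0, 1).
    """
    S = P.shape[0]
    # State-to-state kernel induced by pi: P_pi[s, t] = sum_a pi[s, a] * P[s, a, t].
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Unnormalized state occupancy solves d = mu + gamma * P_pi^T d.
    d_state = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return d_state[:, None] * pi

def robust_return(P, pi, mu, gamma, r0, alpha):
    """Worst-case return over rewards r with ||r - r0||_2 <= alpha (assumed L2 ball)."""
    d = occupancy(P, pi, mu, gamma)
    nominal = float(np.sum(d * r0))
    # Frequency regularization: the adversary costs alpha * ||d||_2.
    return nominal - alpha * float(np.linalg.norm(d.ravel()))

# Hypothetical 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.full((2, 2), 0.5)          # uniform policy
mu = np.array([1.0, 0.0])          # start in state 0
r0 = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, alpha = 0.9, 0.1

d = occupancy(P, pi, mu, gamma)
# The minimizing reward shifts r0 against the direction of d.
worst_r = r0 - alpha * d / np.linalg.norm(d.ravel())
print(robust_return(P, pi, mu, gamma, r0, alpha), float(np.sum(d * worst_r)))
```

The two printed values coincide because `worst_r` attains the minimum over the ball. Note that the penalty couples all state-action pairs through a single norm of `d`, which is what distinguishes this setting from a per-state (rectangular) uncertainty set.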

Published

2024-03-24

How to Cite

Gadot, U., Derman, E., Kumar, N., Elfatihi, M. M., Levy, K., & Mannor, S. (2024). Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(19), 21090–21098. https://doi.org/10.1609/aaai.v38i19.30101

Section

AAAI Technical Track on Safe, Robust and Responsible AI Track