Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions

Authors

  • Reda Ouhamma, EPFL (EPF Lausanne)
  • Maryam Kamgarpour, EPFL (EPF Lausanne)

DOI

https://doi.org/10.1609/aaai.v40i20.38767

Abstract

We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions and without sharing information. Prior work established polynomial-time convergence to an approximate Nash equilibrium under strong reachability and mixing-time assumptions. We propose a convergent algorithm that significantly relaxes these assumptions, requiring only the existence of a single policy with bounded reachability and mixing time. Our key algorithmic novelty is the introduction of Tsallis entropy regularization to smooth the best-response policy updates. By suitably tuning this regularization, we ensure sufficient exploration, thus bypassing the previous stringent assumptions on the MDP. We prove polynomial-time convergence to an approximate Nash equilibrium by establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer.
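The page does not reproduce the paper's formulas, so the following is only a minimal sketch using the standard Tsallis entropy definition, not the paper's own notation. For a policy $\pi(\cdot \mid s)$ and entropic index $q$, the Tsallis entropy is

\[
H_q\big(\pi(\cdot \mid s)\big) = \frac{1}{q-1}\Big(1 - \sum_{a} \pi(a \mid s)^{q}\Big), \qquad q > 0,\ q \neq 1,
\]

which recovers the Shannon entropy $-\sum_{a} \pi(a \mid s)\log \pi(a \mid s)$ as $q \to 1$. A Tsallis-regularized best response then takes the generic smoothed form

\[
\pi^{+}(\cdot \mid s) \in \operatorname*{arg\,max}_{\pi}\ \sum_{a} \pi(a \mid s)\, Q(s,a) \;+\; \tau\, H_q\big(\pi(\cdot \mid s)\big),
\]

where the temperature $\tau > 0$ and the index $q$ stand in for the tuning the abstract alludes to; the symbols $Q$, $\tau$, and $q$ here are illustrative assumptions, not taken from the paper.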

Published

2026-03-14

How to Cite

Ouhamma, R., & Kamgarpour, M. (2026). Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(20), 17170–17178. https://doi.org/10.1609/aaai.v40i20.38767

Issue

Vol. 40 No. 20 (2026)

Section

AAAI Technical Track on Game Theory and Economic Paradigms