Cooperative Policy Agreement: Learning Diverse Policy for Offline MARL

Authors

  • Yihe Zhou, Zhejiang University
  • Yuxuan Zheng, Zhejiang University
  • Yue Hu, Zhejiang University
  • Kaixuan Chen, State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Tongya Zheng, Big Graph Center, Hangzhou City University; State Key Laboratory of Blockchain and Data Security, Zhejiang University
  • Jie Song, Zhejiang University
  • Mingli Song, State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Shunyu Liu, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v39i21.34465

Abstract

Offline Multi-Agent Reinforcement Learning (MARL) aims to learn optimal joint policies from pre-collected datasets without further interaction with the environment. Despite the encouraging results achieved so far, we identify the policy mismatch problem that arises from employing diverse offline MARL datasets, a highly important ingredient for cooperative generalization yet largely overlooked by the existing literature. Specifically, when an offline dataset contains multiple distinct optimal joint policies, policy mismatch occurs when individual actions drawn from different optimal joint actions are combined into a suboptimal joint action. In this paper, we introduce a novel Cooperative Policy Agreement (CPA) method that not only mitigates the policy mismatch problem but also learns to generate diverse joint policies. CPA first introduces an autoregressive decision-making mechanism among agents during offline training. This mechanism enables each agent to access the actions already taken by other agents, thereby facilitating effective joint policy matching. Moreover, diverse joint policies can be obtained directly through sequential action sampling from the autoregressive model. We then incorporate a policy agreement mechanism that converts these autoregressive joint policies into decentralized policies with a non-autoregressive form, while still preserving the diversity of the generated policies. This mechanism guarantees that the proposed CPA adheres to the Centralized Training with Decentralized Execution (CTDE) constraint. Experiments conducted on various benchmarks demonstrate that CPA yields superior performance to state-of-the-art competitors.
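The policy mismatch problem described above can be made concrete with a toy sketch (this is an illustrative example under assumed dynamics, not the paper's implementation): in a two-agent game where the joint actions (0, 0) and (1, 1) are both optimal, agents that sample their marginals independently frequently mix modes into the suboptimal (0, 1) or (1, 0), whereas an autoregressive factorization, in which the second agent conditions on the first agent's action, always lands on an optimal joint action.

```python
import random

# Toy 2-agent game: joint actions (0, 0) and (1, 1) are optimal,
# (0, 1) and (1, 0) are suboptimal. Hypothetical setup for illustration.
OPTIMAL = {(0, 0), (1, 1)}

def independent_sample():
    # Each agent samples its own marginal independently. Because the dataset
    # mixes two optimal modes, roughly half of the combined joint actions
    # are suboptimal -- the policy mismatch problem.
    return (random.randint(0, 1), random.randint(0, 1))

def autoregressive_sample():
    # Agent 0 picks freely among the optimal modes; agent 1 observes agent
    # 0's action and conditions on it, i.e. samples from pi_1(a1 | a0).
    # Diversity is retained (both modes occur), but the joint action is
    # always optimal.
    a0 = random.randint(0, 1)
    a1 = a0
    return (a0, a1)

if __name__ == "__main__":
    n = 10_000
    ind = sum(independent_sample() in OPTIMAL for _ in range(n)) / n
    ar = sum(autoregressive_sample() in OPTIMAL for _ in range(n)) / n
    print(f"independent optimal rate  ~ {ind:.2f}")  # around 0.50
    print(f"autoregressive optimal rate = {ar:.2f}")
```

In CPA terms, the autoregressive model plays this role during centralized training; the subsequent policy agreement step distills the conditional behavior into decentralized per-agent policies so that execution needs no access to teammates' actions.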

Published

2025-04-11

How to Cite

Zhou, Y., Zheng, Y., Hu, Y., Chen, K., Zheng, T., Song, J., … Liu, S. (2025). Cooperative Policy Agreement: Learning Diverse Policy for Offline MARL. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 23018–23026. https://doi.org/10.1609/aaai.v39i21.34465

Section

AAAI Technical Track on Machine Learning VII