Cooperative Policy Agreement: Learning Diverse Policy for Offline MARL

Authors

  • Yihe Zhou, Zhejiang University
  • Yuxuan Zheng, Zhejiang University
  • Yue Hu, Zhejiang University
  • Kaixuan Chen, State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Tongya Zheng, Big Graph Center, Hangzhou City University; State Key Laboratory of Blockchain and Data Security, Zhejiang University
  • Jie Song, Zhejiang University
  • Mingli Song, State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Shunyu Liu, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v39i21.34465

Abstract

Offline Multi-Agent Reinforcement Learning (MARL) aims to learn optimal joint policies from pre-collected datasets without further interaction with the environment. Despite the encouraging results achieved so far, we identify the policy mismatch problem that arises from employing diverse offline MARL datasets, a highly important ingredient for cooperative generalization yet largely overlooked by the existing literature. Specifically, when an offline dataset contains multiple distinct optimal joint policies, policy mismatch occurs when individual actions drawn from different optimal joint actions are combined into a suboptimal joint action. In this paper, we introduce a novel Cooperative Policy Agreement (CPA) method that not only mitigates the policy mismatch problem but also learns to generate diverse joint policies. CPA first introduces an autoregressive decision-making mechanism among agents during offline training. This mechanism enables each agent to access the actions already taken by other agents, thereby facilitating effective joint policy matching. Moreover, diverse joint policies can be obtained directly through sequential action sampling from the autoregressive model. We then incorporate a policy agreement mechanism that converts these autoregressive joint policies into decentralized policies with a non-autoregressive form, while still preserving the diversity of the generated policies. This mechanism guarantees that the proposed CPA adheres to the Centralized Training with Decentralized Execution (CTDE) constraint. Experiments conducted on various benchmarks demonstrate that CPA yields superior performance to state-of-the-art competitors.
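The policy mismatch problem described above can be made concrete with a toy sketch (this is an illustrative example under assumed dynamics, not the paper's implementation): in a two-agent game where the joint actions (0, 0) and (1, 1) are both optimal, agents that sample their marginals independently frequently mix modes into the suboptimal (0, 1) or (1, 0), whereas an autoregressive factorization, in which the second agent conditions on the first agent's action, always lands on an optimal joint action.

```python
import random

# Toy 2-agent game: joint actions (0, 0) and (1, 1) are optimal,
# (0, 1) and (1, 0) are suboptimal. Hypothetical setup for illustration.
OPTIMAL = {(0, 0), (1, 1)}

def independent_sample():
    # Each agent samples its own marginal independently. Because the dataset
    # mixes two optimal modes, roughly half of the combined joint actions
    # are suboptimal -- the policy mismatch problem.
    return (random.randint(0, 1), random.randint(0, 1))

def autoregressive_sample():
    # Agent 0 picks freely among the optimal modes; agent 1 observes agent
    # 0's action and conditions on it, i.e. samples from pi_1(a1 | a0).
    # Diversity is retained (both modes occur), but the joint action is
    # always optimal.
    a0 = random.randint(0, 1)
    a1 = a0
    return (a0, a1)

if __name__ == "__main__":
    n = 10_000
    ind = sum(independent_sample() in OPTIMAL for _ in range(n)) / n
    ar = sum(autoregressive_sample() in OPTIMAL for _ in range(n)) / n
    print(f"independent optimal rate  ~ {ind:.2f}")  # around 0.50
    print(f"autoregressive optimal rate = {ar:.2f}")
```

In CPA terms, the autoregressive model plays this role during centralized training; the subsequent policy agreement step distills the conditional behavior into decentralized per-agent policies so that execution needs no access to teammates' actions.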

Published

2025-04-11

How to Cite

Zhou, Y., Zheng, Y., Hu, Y., Chen, K., Zheng, T., Song, J., … Liu, S. (2025). Cooperative Policy Agreement: Learning Diverse Policy for Offline MARL. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 23018–23026. https://doi.org/10.1609/aaai.v39i21.34465

Section

AAAI Technical Track on Machine Learning VII