Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning

Authors

  • Chao Li State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  • Yupeng Zhang Alibaba DAMO Academy, Hangzhou, China
  • Jianqi Wang Meituan, Beijing, China
  • Yujing Hu NetEase Fuxi AI Lab, Hangzhou, China
  • Shaokang Dong State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  • Wenbin Li State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  • Tangjie Lv NetEase Fuxi AI Lab, Hangzhou, China
  • Changjie Fan NetEase Fuxi AI Lab, Hangzhou, China
  • Yang Gao State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

DOI:

https://doi.org/10.1609/aaai.v38i16.29694

Keywords:

MAS: Multiagent Learning, ML: Reinforcement Learning

Abstract

In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization (RO), which shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, representational limitations and inaccurate sampling of optimal joint actions during learning leave this problem unresolved. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints that rectify the factored global action value function to recover these optimal joint actions, thus overcoming the RO problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness.
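The core mechanism the abstract describes, per-agent instructors that keep optimistic value estimates so teammates' exploratory low-reward actions cannot drag a local action's value down, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual update rule or architecture; the class name `OptimisticInstructor`, the learning rate, and the update logic are assumptions for illustration only.

```python
import numpy as np

class OptimisticInstructor:
    """Hypothetical sketch: an optimistic value estimator for one
    agent's local actions. It updates only toward targets that
    exceed the current estimate, so low returns caused by other
    agents' exploratory or sub-optimal behavior are ignored."""

    def __init__(self, n_actions, lr=0.1):
        self.q = np.zeros(n_actions)  # optimistic per-action values
        self.lr = lr

    def update(self, action, target):
        # Optimistic update: only move toward higher observed targets;
        # lower targets (non-cooperative teammates) leave q unchanged.
        if target > self.q[action]:
            self.q[action] += self.lr * (target - self.q[action])

    def suggest(self):
        # Suggest the local action with the highest optimistic value.
        return int(np.argmax(self.q))

# Usage: the same local action 0 is tried twice; the low reward from a
# teammate's exploration is ignored, the cooperative reward is kept.
ins = OptimisticInstructor(n_actions=2)
ins.update(0, target=-10.0)  # exploratory teammate: no downward update
ins.update(0, target=5.0)    # cooperative outcome: optimistic update
```

In a value-decomposition setting, such suggestions would then constrain the factored global action value function toward the suggested joint action, which is the role the paper's two instructive constraints play.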

Published

2024-03-24

How to Cite

Li, C., Zhang, Y., Wang, J., Hu, Y., Dong, S., Li, W., Lv, T., Fan, C., & Gao, Y. (2024). Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17453-17460. https://doi.org/10.1609/aaai.v38i16.29694

Issue

Section

AAAI Technical Track on Multiagent Systems