Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning

Authors

  • Chao Li State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  • Yupeng Zhang Alibaba DAMO Academy, Hangzhou, China
  • Jianqi Wang Meituan, Beijing, China
  • Yujing Hu NetEase Fuxi AI Lab, Hangzhou, China
  • Shaokang Dong State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  • Wenbin Li State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  • Tangjie Lv NetEase Fuxi AI Lab, Hangzhou, China
  • Changjie Fan NetEase Fuxi AI Lab, Hangzhou, China
  • Yang Gao State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

DOI:

https://doi.org/10.1609/aaai.v38i16.29694

Keywords:

MAS: Multiagent Learning, ML: Reinforcement Learning

Abstract

In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization (RO), which shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, representational limitations and inaccurate sampling of optimal joint actions during learning leave this problem unresolved. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints that rectify the factored global action value function to recover these optimal joint actions, thus overcoming the RO problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness.
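The core mechanism the abstract describes, per-agent instructors that keep optimistic value estimates so teammates' exploratory low-reward actions cannot drag a local action's value down, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual update rule or architecture; the class name `OptimisticInstructor`, the learning rate, and the update logic are assumptions for illustration only.

```python
import numpy as np

class OptimisticInstructor:
    """Hypothetical sketch: an optimistic value estimator for one
    agent's local actions. It updates only toward targets that
    exceed the current estimate, so low returns caused by other
    agents' exploratory or sub-optimal behavior are ignored."""

    def __init__(self, n_actions, lr=0.1):
        self.q = np.zeros(n_actions)  # optimistic per-action values
        self.lr = lr

    def update(self, action, target):
        # Optimistic update: only move toward higher observed targets;
        # lower targets (non-cooperative teammates) leave q unchanged.
        if target > self.q[action]:
            self.q[action] += self.lr * (target - self.q[action])

    def suggest(self):
        # Suggest the local action with the highest optimistic value.
        return int(np.argmax(self.q))

# Usage: the same local action 0 is tried twice; the low reward from a
# teammate's exploration is ignored, the cooperative reward is kept.
ins = OptimisticInstructor(n_actions=2)
ins.update(0, target=-10.0)  # exploratory teammate: no downward update
ins.update(0, target=5.0)    # cooperative outcome: optimistic update
```

In a value-decomposition setting, such suggestions would then constrain the factored global action value function toward the suggested joint action, which is the role the paper's two instructive constraints play.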

Published

2024-03-24

How to Cite

Li, C., Zhang, Y., Wang, J., Hu, Y., Dong, S., Li, W., Lv, T., Fan, C., & Gao, Y. (2024). Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17453-17460. https://doi.org/10.1609/aaai.v38i16.29694

Issue

Section

AAAI Technical Track on Multiagent Systems