Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

Authors

  • Hao Ma, School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
  • Shijie Wang, School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
  • Zhiqiang Pu, School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
  • Siyao Zhao, School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
  • Xiaolin Ai Institute of Automation, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v39i18.34123

Abstract

Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. Existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. Moreover, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
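The optimal-policy guarantee mentioned above comes from the standard structure of potential-based shaping, where the shaping reward takes the form F(s, s') = γΦ(s') − Φ(s). A minimal sketch of this idea, with a stubbed-in stand-in for the paper's VLM potential (the actual models, prompts, and interfaces are not shown here and `vlm_potential` is purely hypothetical):

```python
def vlm_potential(state):
    """Hypothetical stand-in for a VLM that scores how well a visual
    state matches a common-sense instruction (e.g. 'keep formation').
    Here it is a toy potential: larger state value = better state."""
    return float(state)

def shaped_reward(reward, state, next_state, gamma=0.99,
                  potential=vlm_potential):
    """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).
    Adding F to the environment reward leaves the optimal policy
    unchanged; the discounted sum of F over a trajectory telescopes
    to a term depending only on the start and end states."""
    return reward + gamma * potential(next_state) - potential(state)

# Toy trajectory of scalar "states" with sparse environment rewards.
traj = [0.0, 1.0, 2.0, 3.0]
rewards = [0.0, 0.0, 1.0]
shaped = [shaped_reward(r, s, s2)
          for r, s, s2 in zip(rewards, traj, traj[1:])]
```

The top-layer module would, in this sketch, correspond to swapping the `potential` argument at runtime as the selected skill changes; since every choice is still potential-based, each preserves the optimal policy of the underlying task.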

Published

2025-04-11

How to Cite

Ma, H., Wang, S., Pu, Z., Zhao, S., & Ai, X. (2025). Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18), 19287–19295. https://doi.org/10.1609/aaai.v39i18.34123

Section

AAAI Technical Track on Machine Learning IV