GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Authors

  • Shurong Zheng, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Yousong Zhu, School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing, China
  • Hongyin Zhao, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  • Fan Yang, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Yufei Zhan, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Ming Tang, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Jinqiao Wang, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Wuhan AI Research, Wuhan, China

DOI:

https://doi.org/10.1609/aaai.v40i34.40120

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding; however, lacking a unified formulation of generalized grounding tasks, they remain constrained to single-target localization and a narrow range of practical tasks. We therefore propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support it, we systematically categorize and organize existing multi-image grounding tasks according to their cognitive demands and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets in target quantity and image relations. To handle diverse multi-image grounding tasks robustly, we further propose a hybrid reinforcement fine-tuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, exploiting their complementary strengths. The strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% on MIG-Bench and 9.7% on MC-Bench. In single-image grounding, it achieves a 9.1% improvement over the base model on ODinW. Furthermore, our model retains strong general multi-image understanding capabilities.
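The abstract mentions a rule-based reward guiding an R1-like reinforcement fine-tuning algorithm but does not specify its form. As a hedged illustration only, the Python sketch below shows one plausible reward of this kind for multi-target, multi-image grounding: a format term plus an F1-style accuracy term that matches predicted (image index, box) pairs against ground truth by IoU. The function names, the answer-format check, and the matching scheme are all assumptions for illustration, not the paper's actual design.

# Hypothetical rule-based reward for multi-image grounding RFT.
# Illustrative sketch only; NOT the reward actually used by GeM-VG.

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def rule_based_reward(preds, gts, has_valid_format, iou_thresh=0.5):
    """Format reward plus F1-style accuracy reward.

    preds / gts: lists of (image_index, [x1, y1, x2, y2]) pairs.
    has_valid_format: whether the completion parsed into the expected
    answer template (e.g. CoT followed by a final answer block).
    """
    format_r = 1.0 if has_valid_format else 0.0
    # Greedily match each prediction to an unused ground-truth target
    # in the same image whose IoU clears the threshold.
    matched, used = 0, set()
    for img_idx, pbox in preds:
        for j, (g_idx, gbox) in enumerate(gts):
            if j not in used and img_idx == g_idx and iou(pbox, gbox) >= iou_thresh:
                used.add(j)
                matched += 1
                break
    prec = matched / len(preds) if preds else 0.0
    rec = matched / len(gts) if gts else 0.0
    acc_r = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return format_r + acc_r  # in [0, 2]; the weighting is a design choice

# Example: two ground-truth targets across images 0 and 2; one is hit.
gts = [(0, [10, 10, 50, 50]), (2, [20, 30, 80, 90])]
preds = [(0, [12, 11, 49, 52]), (1, [0, 0, 10, 10])]
print(rule_based_reward(preds, gts, has_valid_format=True))  # 1.5

An F1-style accuracy term (rather than plain hit rate) is used here because generalized multi-image grounding involves a variable number of targets, so the reward must penalize both missed and spurious boxes.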

Published

2026-03-14

How to Cite

Zheng, S., Zhu, Y., Zhao, H., Yang, F., Zhan, Y., Tang, M., & Wang, J. (2026). GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28857–28865. https://doi.org/10.1609/aaai.v40i34.40120

Issue

Vol. 40 No. 34 (2026)

Section

AAAI Technical Track on Machine Learning XI