GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Authors

  • Shurong Zheng, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Yousong Zhu, School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing, China
  • Hongyin Zhao, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  • Fan Yang, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Yufei Zhan, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Ming Tang, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Jinqiao Wang, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Wuhan AI Research, Wuhan, China

DOI:

https://doi.org/10.1609/aaai.v40i34.40120

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding; however, lacking a unified formulation of generalized grounding tasks, they remain constrained to single-target localization and a narrow range of practical tasks. We therefore propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support it, we systematically categorize and organize existing multi-image grounding tasks according to their cognitive demands and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets in target quantity and image relations. To handle diverse multi-image grounding tasks robustly, we further propose a hybrid reinforcement fine-tuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, exploiting their complementary strengths. The strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% on MIG-Bench and 9.7% on MC-Bench. In single-image grounding, it achieves a 9.1% improvement over the base model on ODinW. Furthermore, our model retains strong general multi-image understanding capabilities.
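The abstract mentions a rule-based reward guiding an R1-like reinforcement fine-tuning algorithm but does not specify its form. As a hedged illustration only, the Python sketch below shows one plausible reward of this kind for multi-target, multi-image grounding: a format term plus an F1-style accuracy term that matches predicted (image index, box) pairs against ground truth by IoU. The function names, the answer-format check, and the matching scheme are all assumptions for illustration, not the paper's actual design.

# Hypothetical rule-based reward for multi-image grounding RFT.
# Illustrative sketch only; NOT the reward actually used by GeM-VG.

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def rule_based_reward(preds, gts, has_valid_format, iou_thresh=0.5):
    """Format reward plus F1-style accuracy reward.

    preds / gts: lists of (image_index, [x1, y1, x2, y2]) pairs.
    has_valid_format: whether the completion parsed into the expected
    answer template (e.g. CoT followed by a final answer block).
    """
    format_r = 1.0 if has_valid_format else 0.0
    # Greedily match each prediction to an unused ground-truth target
    # in the same image whose IoU clears the threshold.
    matched, used = 0, set()
    for img_idx, pbox in preds:
        for j, (g_idx, gbox) in enumerate(gts):
            if j not in used and img_idx == g_idx and iou(pbox, gbox) >= iou_thresh:
                used.add(j)
                matched += 1
                break
    prec = matched / len(preds) if preds else 0.0
    rec = matched / len(gts) if gts else 0.0
    acc_r = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return format_r + acc_r  # in [0, 2]; the weighting is a design choice

# Example: two ground-truth targets across images 0 and 2; one is hit.
gts = [(0, [10, 10, 50, 50]), (2, [20, 30, 80, 90])]
preds = [(0, [12, 11, 49, 52]), (1, [0, 0, 10, 10])]
print(rule_based_reward(preds, gts, has_valid_format=True))  # 1.5

An F1-style accuracy term (rather than plain hit rate) is used here because generalized multi-image grounding involves a variable number of targets, so the reward must penalize both missed and spurious boxes.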

Published

2026-03-14

How to Cite

Zheng, S., Zhu, Y., Zhao, H., Yang, F., Zhan, Y., Tang, M., & Wang, J. (2026). GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28857–28865. https://doi.org/10.1609/aaai.v40i34.40120

Issue

Vol. 40 No. 34 (2026)

Section

AAAI Technical Track on Machine Learning XI