Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Authors

  • Xu Yuan The Hong Kong Polytechnic University
  • Li Zhou TAO Technology, Alibaba Group
  • Zenghui Sun TAO Technology, Alibaba Group
  • Zikun Zhou Pengcheng Laboratory
  • Jinsong Lan TAO Technology, Alibaba Group

DOI:

https://doi.org/10.1609/aaai.v39i9.33054

Abstract

Large Multimodal Models (LMMs) have progressed significantly by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixel-wise segmentation by integrating segmentation models. Despite these innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even when provided with detailed textual cues. To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation on the MGSC task, we establish a benchmark with masks and captions aligned at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. In addition, we propose a novel unified SegCap data format that unifies heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple/empty segmentation, and reasoning segmentation. These strong properties and the versatility of MGLMM underscore its potential impact on advancing multimodal research.

Published

2025-04-11

How to Cite

Yuan, X., Zhou, L., Sun, Z., Zhou, Z., & Lan, J. (2025). Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9725–9733. https://doi.org/10.1609/aaai.v39i9.33054

Section

AAAI Technical Track on Computer Vision VIII