Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Jinxing Zhou; Yanghao Zhou; Mingfei Han; Tong Wang; Xiaojun Chang; Hisham Cholakkal; Rao Muhammad Anwer

doi:10.1609/aaai.v40i16.38373

Authors

Jinxing Zhou, Mohamed bin Zayed University of Artificial Intelligence
Yanghao Zhou, National University of Singapore
Mingfei Han, Mohamed bin Zayed University of Artificial Intelligence
Tong Wang, Mohamed bin Zayed University of Artificial Intelligence
Xiaojun Chang, Mohamed bin Zayed University of Artificial Intelligence; University of Science and Technology of China
Hisham Cholakkal, Mohamed bin Zayed University of Artificial Intelligence
Rao Muhammad Anwer, Mohamed bin Zayed University of Artificial Intelligence

Abstract

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically learn latent embeddings via multimodal fusion and use them to prompt a tunable SAM/SAM2 decoder for segmentation, an approach that requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process. This mimics the human reasoning procedure: first identify the referred object through multimodal analysis, then perform coarse-grained grounding, and finally perform precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains to fine-tune Ref-Thinker. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R2-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both the standard Ref-AVSBench and the proposed R2-AVSBench.
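The Think-Ground-Segment decomposition described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the function name `think_ground_segment`, the `ThinkResult` container, the box format, the per-frame loop, and the three stub callables are all hypothetical and stand in for Ref-Thinker, Grounding-DINO, and SAM2, whose real interfaces are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box convention (x1, y1, x2, y2); the source does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class ThinkResult:
    """Hypothetical container for a think-answer chain."""
    reasoning: str            # the explicit "think" part
    object_description: str   # the "answer": a description of the referred object


def think_ground_segment(
    frames: Sequence[str],
    audio: str,
    reference: str,
    thinker: Callable[[Sequence[str], str, str], ThinkResult],
    grounder: Callable[[str, str], List[Box]],
    segmenter: Callable[[str, List[Box]], str],
) -> Tuple[ThinkResult, List[str]]:
    """Run the three stages in order: think, then ground, then segment."""
    # Think: reason over textual, visual, and auditory cues to name the object.
    thought = thinker(frames, audio, reference)
    masks = []
    for frame in frames:  # assumption: grounding is applied frame by frame
        # Ground: the inferred description is the explicit text prompt.
        boxes = grounder(frame, thought.object_description)
        # Segment: coarse boxes prompt the segmenter for a precise mask.
        masks.append(segmenter(frame, boxes))
    return thought, masks


# Stub stages so the sketch runs without any model (all hypothetical).
def stub_thinker(frames, audio, reference):
    return ThinkResult("The guitar is the instrument being played.", "guitar")


def stub_grounder(frame, description):
    return [(0.0, 0.0, 1.0, 1.0)] if description == "guitar" else []


def stub_segmenter(frame, boxes):
    return f"mask({frame}, {len(boxes)} box)"


thought, masks = think_ground_segment(
    ["f0", "f1"], "audio", "the object making the sound",
    stub_thinker, stub_grounder, stub_segmenter,
)
print(thought.object_description, masks)
```

The point of the sketch is that the only information passed from the reasoning stage to the later stages is a plain-text object description, which is what makes the intermediate result inspectable.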
