Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Authors

  • Jinxing Zhou, Mohamed bin Zayed University of Artificial Intelligence
  • Yanghao Zhou, National University of Singapore
  • Mingfei Han, Mohamed bin Zayed University of Artificial Intelligence
  • Tong Wang, Mohamed bin Zayed University of Artificial Intelligence
  • Xiaojun Chang, Mohamed bin Zayed University of Artificial Intelligence; University of Science and Technology of China
  • Hisham Cholakkal, Mohamed bin Zayed University of Artificial Intelligence
  • Rao Muhammad Anwer, Mohamed bin Zayed University of Artificial Intelligence

DOI:

https://doi.org/10.1609/aaai.v40i16.38373

Abstract

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for fine-tuning Ref-Thinker. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R2-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both the standard Ref-AVSBench and the proposed R2-AVSBench.

Published

2026-03-14

How to Cite

Zhou, J., Zhou, Y., Han, M., Wang, T., Chang, X., Cholakkal, H., & Anwer, R. M. (2026). Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13665–13673. https://doi.org/10.1609/aaai.v40i16.38373

Section

AAAI Technical Track on Computer Vision XIII