Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Authors

  • Jia Li, University of Texas at Dallas
  • Wenjie Zhao, University of Texas at Dallas
  • Ziru Huang, Tsinghua University
  • Yunhui Guo, University of Texas at Dallas
  • Yapeng Tian, University of Texas at Dallas

DOI:

https://doi.org/10.1609/aaai.v40i8.37542

Abstract

Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context, resulting in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios, including silence, noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that while state-of-the-art AVS methods consistently fail under negative audio conditions, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-zero false positive rates while preserving high-quality segmentation performance.
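The abstract only names the training idea, so the sketch below is an illustrative reading of "balanced training with negative samples", not the authors' released implementation: mix negative-audio clips (silence, noise, off-screen sounds) into each batch and supervise them toward an empty mask. The model class, loss function, and weighting below (ToyAVSegmenter, balanced_avs_loss, neg_weight) are hypothetical placeholders, and the paper's classifier-guided similarity learning component is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAVSegmenter(nn.Module):
    # Placeholder audio-visual segmenter: fuses an audio embedding with
    # per-pixel visual features and predicts sounding-object mask logits.
    # Stands in for whatever AVS backbone is actually used.
    def __init__(self, audio_dim=128, vis_dim=16):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, vis_dim)
        self.vis_encoder = nn.Conv2d(3, vis_dim, kernel_size=3, padding=1)
        self.head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, frames, audio):
        v = self.vis_encoder(frames)                  # (B, C, H, W)
        a = self.audio_proj(audio)[:, :, None, None]  # (B, C, 1, 1)
        return self.head(v * a).squeeze(1)            # (B, H, W) logits

def balanced_avs_loss(logits, masks, is_negative_audio, neg_weight=1.0):
    # Standard segmentation loss on positive clips, plus a penalty that
    # drives the predicted mask toward empty when the paired audio is a
    # negative condition (silence, noise, or an off-screen sound).
    pos = ~is_negative_audio
    loss = logits.new_zeros(())
    if pos.any():
        loss = loss + F.binary_cross_entropy_with_logits(logits[pos], masks[pos])
    if is_negative_audio.any():
        empty = torch.zeros_like(logits[is_negative_audio])
        loss = loss + neg_weight * F.binary_cross_entropy_with_logits(
            logits[is_negative_audio], empty)
    return loss

# Minimal usage: random tensors standing in for a batch that mixes
# positive clips with negative-audio clips.
model = ToyAVSegmenter()
frames = torch.randn(4, 3, 64, 64)
audio = torch.randn(4, 128)
masks = (torch.rand(4, 64, 64) > 0.5).float()
is_negative = torch.tensor([False, False, True, True])
loss = balanced_avs_loss(model(frames, audio), masks, is_negative)
loss.backward()

The key design point this sketch tries to capture is that segmentation quality and audio sensitivity are trained jointly: positive clips keep the usual mask supervision, while negative-audio clips explicitly penalize the visual-salience shortcut the paper identifies.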

Published

2026-03-14

How to Cite

Li, J., Zhao, W., Huang, Z., Guo, Y., & Tian, Y. (2026). Do Audio-Visual Segmentation Models Truly Segment Sounding Objects? Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6172–6180. https://doi.org/10.1609/aaai.v40i8.37542

Section

AAAI Technical Track on Computer Vision V