Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Authors

  • Jia Li, University of Texas at Dallas
  • Wenjie Zhao, University of Texas at Dallas
  • Ziru Huang, Tsinghua University
  • Yunhui Guo, University of Texas at Dallas
  • Yapeng Tian, University of Texas at Dallas

DOI:

https://doi.org/10.1609/aaai.v40i8.37542

Abstract

Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context, resulting in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios, including silence, noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that while state-of-the-art AVS methods consistently fail under negative audio conditions, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-zero false positive rates while preserving high-quality segmentation performance.
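The abstract only names the training idea, so the sketch below is an illustrative reading of "balanced training with negative samples", not the authors' released implementation: mix negative-audio clips (silence, noise, off-screen sounds) into each batch and supervise them toward an empty mask. The model class, loss function, and weighting below (ToyAVSegmenter, balanced_avs_loss, neg_weight) are hypothetical placeholders, and the paper's classifier-guided similarity learning component is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAVSegmenter(nn.Module):
    # Placeholder audio-visual segmenter: fuses an audio embedding with
    # per-pixel visual features and predicts sounding-object mask logits.
    # Stands in for whatever AVS backbone is actually used.
    def __init__(self, audio_dim=128, vis_dim=16):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, vis_dim)
        self.vis_encoder = nn.Conv2d(3, vis_dim, kernel_size=3, padding=1)
        self.head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, frames, audio):
        v = self.vis_encoder(frames)                  # (B, C, H, W)
        a = self.audio_proj(audio)[:, :, None, None]  # (B, C, 1, 1)
        return self.head(v * a).squeeze(1)            # (B, H, W) logits

def balanced_avs_loss(logits, masks, is_negative_audio, neg_weight=1.0):
    # Standard segmentation loss on positive clips, plus a penalty that
    # drives the predicted mask toward empty when the paired audio is a
    # negative condition (silence, noise, or an off-screen sound).
    pos = ~is_negative_audio
    loss = logits.new_zeros(())
    if pos.any():
        loss = loss + F.binary_cross_entropy_with_logits(logits[pos], masks[pos])
    if is_negative_audio.any():
        empty = torch.zeros_like(logits[is_negative_audio])
        loss = loss + neg_weight * F.binary_cross_entropy_with_logits(
            logits[is_negative_audio], empty)
    return loss

# Minimal usage: random tensors standing in for a batch that mixes
# positive clips with negative-audio clips.
model = ToyAVSegmenter()
frames = torch.randn(4, 3, 64, 64)
audio = torch.randn(4, 128)
masks = (torch.rand(4, 64, 64) > 0.5).float()
is_negative = torch.tensor([False, False, True, True])
loss = balanced_avs_loss(model(frames, audio), masks, is_negative)
loss.backward()

The key design point this sketch tries to capture is that segmentation quality and audio sensitivity are trained jointly: positive clips keep the usual mask supervision, while negative-audio clips explicitly penalize the visual-salience shortcut the paper identifies.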

Published

2026-03-14

How to Cite

Li, J., Zhao, W., Huang, Z., Guo, Y., & Tian, Y. (2026). Do Audio-Visual Segmentation Models Truly Segment Sounding Objects? Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6172–6180. https://doi.org/10.1609/aaai.v40i8.37542

Section

AAAI Technical Track on Computer Vision V