When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Authors

  • Qilang Ye, VCIP & TMCC & DISSec, College of Computer Science & College of Cryptology and Cyber Science, Nankai University; Zhongguancun Academy
  • Wei Zeng, Zhongguancun Academy
  • Meng Liu, Zhongguancun Academy; School of Computer Science and Technology, Shandong Jianzhu University
  • Jie Zhang, School of Information Science and Technology, Great Bay University
  • Yupeng Hu, School of Software Engineering, Shandong University
  • Zitong Yu, School of Information Science and Technology, Great Bay University; Dongguan Key Laboratory for Intelligence and Information Technology
  • Yu Zhou, VCIP & TMCC & DISSec, College of Computer Science & College of Cryptology and Cyber Science, Nankai University; Zhongguancun Academy

DOI:

https://doi.org/10.1609/aaai.v40i14.38183

Abstract

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an “Audio-Visual Confusion” scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking the MLLM “Is there a/an {muted-object} sound?”. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM framework built upon Qwen2.5-Omni. RL-CoMM consists of two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as a reference model that generates audio-only reasoning. We then design a Step-wise Reasoning Reward function that enables the MLLM to self-improve its audio-visual reasoning against the audio-only reference. 2) To ensure accurate answer prediction, we introduce Answer-centered Confidence Optimization, which reduces the uncertainty arising from heterogeneous reasoning differences between the models. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10-30% over the baseline model with limited training data.
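The abstract describes the two training signals only at a high level. The sketch below illustrates one plausible reading of them; all names (step_similarity, stepwise_reasoning_reward, answer_confidence_weight) and the token-overlap similarity measure are hypothetical stand-ins for illustration, not the authors' released implementation.

```python
import math

def step_similarity(step_a: str, step_b: str) -> float:
    """Crude token-overlap (Jaccard) similarity between two reasoning steps."""
    a, b = set(step_a.lower().split()), set(step_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def stepwise_reasoning_reward(av_steps: list[str], audio_ref_steps: list[str]) -> float:
    """Stage 1 (sketch): score each audio-visual reasoning step against the
    audio-only reference reasoning from the LALM, so steps that assert a
    sound the audio track does not support earn low reward."""
    if not av_steps or not audio_ref_steps:
        return 0.0
    scores = [
        step_similarity(step, audio_ref_steps[min(i, len(audio_ref_steps) - 1)])
        for i, step in enumerate(av_steps)
    ]
    return sum(scores) / len(scores)

def answer_confidence_weight(answer_token_logprobs: list[float]) -> float:
    """Stage 2 (sketch): weight the final answer by its geometric-mean token
    probability, down-weighting updates when the model is uncertain about
    its answer, one way to damp heterogeneous reasoning differences."""
    if not answer_token_logprobs:
        return 0.0
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

# Example: visually grounded steps contradict the audio-only reference,
# so the reward stays low for the muted-object question.
av_steps = ["A dog appears in the video.", "Therefore a barking sound is present."]
ref_steps = ["The audio track contains no barking sound."]
print(stepwise_reasoning_reward(av_steps, ref_steps))
```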

Published

2026-03-14

How to Cite

Ye, Q., Zeng, W., Liu, M., Zhang, J., Hu, Y., Yu, Z., & Zhou, Y. (2026). When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11955-11963. https://doi.org/10.1609/aaai.v40i14.38183

Section

AAAI Technical Track on Computer Vision XI