When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Authors

  • Qilang Ye, VCIP & TMCC & DISSec, College of Computer Science & College of Cryptology and Cyber Science, Nankai University; Zhongguancun Academy
  • Wei Zeng, Zhongguancun Academy
  • Meng Liu, Zhongguancun Academy; School of Computer Science and Technology, Shandong Jianzhu University
  • Jie Zhang, School of Information Science and Technology, Great Bay University
  • Yupeng Hu, School of Software Engineering, Shandong University
  • Zitong Yu, School of Information Science and Technology, Great Bay University; Dongguan Key Laboratory for Intelligence and Information Technology
  • Yu Zhou, VCIP & TMCC & DISSec, College of Computer Science & College of Cryptology and Cyber Science, Nankai University; Zhongguancun Academy

DOI:

https://doi.org/10.1609/aaai.v40i14.38183

Abstract

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an “Audio-Visual Confusion” scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking the MLLM “Is there a/an {muted-object} sound?”. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM framework built upon Qwen2.5-Omni. RL-CoMM consists of two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as a reference model that generates audio-only reasoning. We then design a Step-wise Reasoning Reward function that enables the MLLM to self-improve its audio-visual reasoning against the audio-only reference. 2) To ensure accurate answer prediction, we introduce Answer-centered Confidence Optimization, which reduces the uncertainty arising from heterogeneous reasoning differences between the models. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10-30% over the baseline model with limited training data.
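The abstract describes the two training signals only at a high level. The sketch below illustrates one plausible reading of them; all names (step_similarity, stepwise_reasoning_reward, answer_confidence_weight) and the token-overlap similarity measure are hypothetical stand-ins for illustration, not the authors' released implementation.

```python
import math

def step_similarity(step_a: str, step_b: str) -> float:
    """Crude token-overlap (Jaccard) similarity between two reasoning steps."""
    a, b = set(step_a.lower().split()), set(step_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def stepwise_reasoning_reward(av_steps: list[str], audio_ref_steps: list[str]) -> float:
    """Stage 1 (sketch): score each audio-visual reasoning step against the
    audio-only reference reasoning from the LALM, so steps that assert a
    sound the audio track does not support earn low reward."""
    if not av_steps or not audio_ref_steps:
        return 0.0
    scores = [
        step_similarity(step, audio_ref_steps[min(i, len(audio_ref_steps) - 1)])
        for i, step in enumerate(av_steps)
    ]
    return sum(scores) / len(scores)

def answer_confidence_weight(answer_token_logprobs: list[float]) -> float:
    """Stage 2 (sketch): weight the final answer by its geometric-mean token
    probability, down-weighting updates when the model is uncertain about
    its answer, one way to damp heterogeneous reasoning differences."""
    if not answer_token_logprobs:
        return 0.0
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

# Example: visually grounded steps contradict the audio-only reference,
# so the reward stays low for the muted-object question.
av_steps = ["A dog appears in the video.", "Therefore a barking sound is present."]
ref_steps = ["The audio track contains no barking sound."]
print(stepwise_reasoning_reward(av_steps, ref_steps))
```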

Published

2026-03-14

How to Cite

Ye, Q., Zeng, W., Liu, M., Zhang, J., Hu, Y., Yu, Z., & Zhou, Y. (2026). When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11955-11963. https://doi.org/10.1609/aaai.v40i14.38183

Section

AAAI Technical Track on Computer Vision XI