EchoBat: Echo-Vision Enhancement and Echo-Layered Sampling for Video LLMs Hallucination Mitigation
DOI:
https://doi.org/10.1609/aaai.v40i42.40875Abstract
Recent advancements in multimodal large language models (MLLMs) have shown remarkable progress in video understanding. However, video MLLMs (VideoMLLMs) still suffer from hallucinations, generating nonsensical or irrelevant content. This issue partly stems from over-reliance on pre-trained knowledge, sometimes neglecting the rich visual information present in the video. Additionally, many existing methods rely on uniform frame sampling, which can overlook critical visual cues. To address these challenges, we present EchoBat, a novel approach that leverages audio information as well as video temporal and logical consistency to improve preference data construction and keyframe extraction. Our method integrates Direct Preference Optimization (DPO) to mitigate hallucinations by leveraging high-quality, contextually rich preference feedback. Specifically, we use GPT-4o to generate high-quality video descriptions and integrate visually relevant segments from Whisper-derived transcripts to construct preference responses. Correspondingly, we use the reference model itself to describe the reversed video, and use GPT-4o to flashback the text and fill in the hallucination to produce non-preferred responses. This strategy enhances the model’s ability to better understand visual content and temporal, logical relationships within videos. Furthermore, we propose an echo-layered sampling strategy for keyframe extraction from videos, which can provide more precise visual supervision compared to uniform sampling. Experimental results on the three latest video hallucination benchmarks demonstrate the effectiveness of our approach.Published
2026-03-14
How to Cite
Liu, S., Chen, D., Pan, Y., Tian, C., Li, Q., & Lin, C. (2026). EchoBat: Echo-Vision Enhancement and Echo-Layered Sampling for Video LLMs Hallucination Mitigation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35635–35643. https://doi.org/10.1609/aaai.v40i42.40875
Issue
Section
AAAI Technical Track on Philosophy and Ethics of AI