EchoBat: Echo-Vision Enhancement and Echo-Layered Sampling for Video LLMs Hallucination Mitigation

Shuai Liu; Da Chen; Yiheng Pan; Chenwei Tian; Qian Li; Chenhao Lin

doi:10.1609/aaai.v40i42.40875

Authors

Shuai Liu School of Software Engineering, Xi’an Jiaotong University
Da Chen School of Software Engineering, Xi’an Jiaotong University; ByteDance
Yiheng Pan School of Software Engineering, Xi’an Jiaotong University
Chenwei Tian School of Software Engineering, Xi’an Jiaotong University
Qian Li School of Cyber Science and Engineering, Xi’an Jiaotong University
Chenhao Lin School of Cyber Science and Engineering, Xi’an Jiaotong University

Abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in video understanding. However, video MLLMs (VideoMLLMs) still suffer from hallucinations, generating nonsensical or irrelevant content. This issue partly stems from over-reliance on pre-trained knowledge, which can lead the model to neglect the rich visual information present in the video. Additionally, many existing methods rely on uniform frame sampling, which can overlook critical visual cues. To address these challenges, we present EchoBat, a novel approach that leverages audio information together with the temporal and logical consistency of video to improve preference data construction and keyframe extraction. Our method applies Direct Preference Optimization (DPO) to mitigate hallucinations using high-quality, contextually rich preference feedback. Specifically, to construct preferred responses, we use GPT-4o to generate high-quality video descriptions and integrate visually relevant segments from Whisper-derived transcripts. To construct non-preferred responses, we have the reference model itself describe the reversed video, then use GPT-4o to rewrite that text as a flashback and fill in hallucinated content. This strategy improves the model’s understanding of visual content and of the temporal and logical relationships within videos. Furthermore, we propose an echo-layered sampling strategy for keyframe extraction, which can provide more precise visual supervision than uniform sampling. Experimental results on three recent video hallucination benchmarks demonstrate the effectiveness of our approach.
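For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, not the authors' training code. It assumes the summed log-probabilities of the preferred and non-preferred responses under the trained policy and the frozen reference model have already been computed; the function name, argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (preferred, non-preferred) response pair.

    All inputs are sequence-level log-probabilities (floats).
    beta is a hypothetical default, not a value reported in the abstract.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and non-preferred log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; the loss decreases as the policy shifts probability toward the preferred response relative to the reference.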
