Taming the Phantom: Token-Asymmetric Filtering for Hallucination Mitigation in Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v40i10.37768

Abstract
Hallucination in Large Vision-Language Models (LVLMs) remains a critical challenge, undermining their reliability in real-world applications. Existing studies have investigated the causes of hallucination at the modality level and proposed effective strategies. However, interaction patterns beyond the modality level remain insufficiently explored. In this paper, we conduct a token-level analysis and identify two key phenomena: (1) a small subset of textual tokens in LVLMs exerts disproportionate influence in the visual-active layers, surpassing that of the visual modality and potentially misleading visual understanding; (2) while LVLMs can correctly identify key visual information, insufficient focus on these cues can sometimes lead to hallucinations. Based on these observations, we attribute hallucinations in LVLMs to two token-level causes: the disproportionate influence of certain textual tokens (phantom tokens) and the underutilization of critical visual cues (anchor tokens). To mitigate these issues, we introduce Token-Asymmetric Filtering (TAF), a training-free, plug-and-play method that modulates intermediate attention maps in LVLMs. TAF isolates the influence of phantom tokens and emphasizes the influence of anchor tokens in the visual-active layers. Experimental results across multiple benchmarks demonstrate that TAF significantly mitigates hallucinations across a range of state-of-the-art LVLMs.

Published
2026-03-14
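The abstract describes TAF as modulating intermediate attention maps: suppressing phantom (over-influential textual) tokens and boosting anchor (critical visual) tokens in the visual-active layers. The sketch below illustrates this general idea on a single attention map; the function name, the scaling factors, and the assumption that token indices are already identified are all illustrative, not the paper's actual algorithm or parameters.

```python
import numpy as np

def token_asymmetric_filter(attn, phantom_idx, anchor_idx,
                            suppress=0.1, boost=2.0):
    """Hedged sketch of asymmetric attention modulation.

    attn: (num_queries, num_keys) attention weights for one head
          in a visual-active layer, rows summing to 1.
    phantom_idx: key positions of over-influential textual tokens.
    anchor_idx:  key positions of critical visual tokens.
    suppress/boost are placeholder factors, not values from the paper.
    """
    out = attn.copy()
    out[:, phantom_idx] *= suppress   # damp phantom-token influence
    out[:, anchor_idx] *= boost       # emphasize anchor-token influence
    # renormalize so each query's weights still form a distribution
    out /= out.sum(axis=1, keepdims=True)
    return out

# Example: uniform attention over 4 key tokens, key 0 phantom, key 3 anchor.
attn = np.full((2, 4), 0.25)
out = token_asymmetric_filter(attn, phantom_idx=[0], anchor_idx=[3])
```

After filtering, each row still sums to 1 while the phantom token's share shrinks and the anchor token's share grows, which is the asymmetry the method's name refers to.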
How to Cite
Ouyang, S., Wang, H., Fang, G., Ma, X., Lin, L., & Wang, X. (2026). Taming the Phantom: Token-Asymmetric Filtering for Hallucination Mitigation in Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8206-8214. https://doi.org/10.1609/aaai.v40i10.37768
Section
AAAI Technical Track on Computer Vision VII