Taming the Phantom: Token-Asymmetric Filtering for Hallucination Mitigation in Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v40i10.37768

Abstract
Hallucination in Large Vision-Language Models (LVLMs) remains a critical challenge, undermining their reliability in real-world applications. Existing studies have investigated the causes of hallucination at the modality level and proposed effective strategies. However, interaction patterns beyond the modality level remain insufficiently explored. In this paper, we conduct a token-level analysis and identify two key phenomena: (1) a small subset of textual tokens in LVLMs exerts disproportionate influence in the visual-active layers, surpassing that of the visual modality and potentially misleading visual understanding; (2) while LVLMs can correctly identify key visual information, insufficient focus on these cues can sometimes lead to hallucinations. Based on these observations, we attribute hallucinations in LVLMs to two token-level causes: the disproportionate influence of certain textual tokens (phantom tokens) and the underutilization of critical visual cues (anchor tokens). To mitigate these issues, we introduce Token-Asymmetric Filtering (TAF), a training-free, plug-and-play method that modulates intermediate attention maps in LVLMs. TAF isolates the influence of phantom tokens and emphasizes the influence of anchor tokens in the visual-active layers. Experimental results across multiple benchmarks demonstrate that TAF significantly mitigates hallucinations across a range of state-of-the-art LVLMs.

Published
2026-03-14
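The abstract describes TAF as modulating intermediate attention maps: suppressing phantom (over-influential textual) tokens and boosting anchor (critical visual) tokens in the visual-active layers. The sketch below illustrates this general idea on a single attention map; the function name, the scaling factors, and the assumption that token indices are already identified are all illustrative, not the paper's actual algorithm or parameters.

```python
import numpy as np

def token_asymmetric_filter(attn, phantom_idx, anchor_idx,
                            suppress=0.1, boost=2.0):
    """Hedged sketch of asymmetric attention modulation.

    attn: (num_queries, num_keys) attention weights for one head
          in a visual-active layer, rows summing to 1.
    phantom_idx: key positions of over-influential textual tokens.
    anchor_idx:  key positions of critical visual tokens.
    suppress/boost are placeholder factors, not values from the paper.
    """
    out = attn.copy()
    out[:, phantom_idx] *= suppress   # damp phantom-token influence
    out[:, anchor_idx] *= boost       # emphasize anchor-token influence
    # renormalize so each query's weights still form a distribution
    out /= out.sum(axis=1, keepdims=True)
    return out

# Example: uniform attention over 4 key tokens, key 0 phantom, key 3 anchor.
attn = np.full((2, 4), 0.25)
out = token_asymmetric_filter(attn, phantom_idx=[0], anchor_idx=[3])
```

After filtering, each row still sums to 1 while the phantom token's share shrinks and the anchor token's share grows, which is the asymmetry the method's name refers to.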
How to Cite
Ouyang, S., Wang, H., Fang, G., Ma, X., Lin, L., & Wang, X. (2026). Taming the Phantom: Token-Asymmetric Filtering for Hallucination Mitigation in Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8206-8214. https://doi.org/10.1609/aaai.v40i10.37768
Section
AAAI Technical Track on Computer Vision VII