MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v39i18.34056
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) highlight their ability to integrate and process multi-modal information. However, hallucinations, where generated content is inconsistent with the visual input and instructions, remain a challenge. In this paper, we analyze LVLMs' layer-wise decoding and identify that hallucinations can arise during the reasoning and factual information injection process. Additionally, as the number of generated tokens increases, the forgetting of the original prompt may also lead to hallucinations. To address this, we propose a training-free decoding method called Mixture of Layer Experts (MoLE). MoLE leverages a heuristic gating mechanism to dynamically select multiple layers of LVLMs as expert layers: the Final Expert, the Second Opinion Expert, and the Prompt Retention Expert. Through the cooperation of these experts, MoLE enhances the robustness and faithfulness of the generation process. Our extensive experiments demonstrate that MoLE significantly reduces hallucinations, outperforming the current state-of-the-art decoding techniques across three mainstream LVLMs and two established hallucination benchmarks. Moreover, our method reveals the potential of LVLMs to independently produce more reliable and accurate outputs.
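The abstract names three expert layers but does not specify the gating rule or how the experts' outputs are combined. As a rough illustration of the general idea only, the sketch below mixes next-token distributions read out from several layers. Everything in it is an assumption made for illustration and not the authors' method: the function name `mix_layer_experts`, the lowest-entropy rule used to pick a second-opinion layer, the caller-supplied `prompt_layer` index, and the fixed mixing weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mix_layer_experts(layer_logits, prompt_layer, weights=(0.6, 0.2, 0.2)):
    """Combine next-token distributions from three layers (hypothetical sketch).

    layer_logits: per-layer next-token logits, last entry = final layer.
    prompt_layer: index of the layer treated as a prompt-retention expert
                  (assumed to be supplied by the caller).
    weights:      assumed mixing weights (final, second opinion, prompt).
    """
    final = len(layer_logits) - 1
    dists = [softmax(logits) for logits in layer_logits]
    # Assumed gate: the most confident (lowest-entropy) non-final layer
    # serves as the second opinion.
    second = min(range(final), key=lambda i: entropy(dists[i]))
    chosen = (final, second, prompt_layer)
    vocab_size = len(dists[final])
    mixed = [
        sum(w * dists[i][t] for w, i in zip(weights, chosen))
        for t in range(vocab_size)
    ]
    return chosen, mixed


# Toy example: three layers, vocabulary of three tokens.
toy_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
chosen, mixed = mix_layer_experts(toy_logits, prompt_layer=0)
next_token = mixed.index(max(mixed))
```

Because the weights sum to 1 and each layer's output is a probability distribution, the mixture is itself a valid distribution. In a real LVLM the per-layer logits would have to come from applying the output head to intermediate hidden states, which this toy example does not attempt.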
Published
2025-04-11
How to Cite
Liang, T., Du, Y., Huang, J., Kong, M., Chen, L., Li, Y., … Zhu, Q. (2025). MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18), 18684–18692. https://doi.org/10.1609/aaai.v39i18.34056
Issue
Section
AAAI Technical Track on Machine Learning IV