MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models

Tian Liang; Yuetian Du; Jing Huang; Ming Kong; Luyuan Chen; Yadong Li; Siye Chen; Qiang Zhu

doi:10.1609/aaai.v39i18.34056

Authors

Tian Liang, Zhejiang University
Yuetian Du, Zhejiang University
Jing Huang, Zhejiang University
Ming Kong, Zhejiang University
Luyuan Chen, Beijing Information Science and Technology University
Yadong Li, Ant Group
Siye Chen, Ant Group
Qiang Zhu, Zhejiang University

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) highlight their ability to integrate and process multi-modal information. However, hallucinations, where generated content is inconsistent with the visual input and instructions, remain a challenge. In this paper, we analyze the layer-wise decoding of LVLMs and find that hallucinations can arise during the reasoning and factual information injection process. Additionally, as the number of generated tokens increases, forgetting of the original prompt may also lead to hallucinations. To address this, we propose a training-free decoding method called Mixture of Layer Experts (MoLE). MoLE leverages a heuristic gating mechanism to dynamically select multiple layers of an LVLM as expert layers: the Final Expert, the Second Opinion Expert, and the Prompt Retention Expert. Through the cooperation of these experts, MoLE enhances the robustness and faithfulness of the generation process. Our extensive experiments demonstrate that MoLE significantly reduces hallucinations, outperforming current state-of-the-art decoding techniques across three mainstream LVLMs and two established hallucination benchmarks. Moreover, our method reveals the potential of LVLMs to independently produce more reliable and accurate outputs.
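
The general idea of combining next-token predictions from several layers can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the gating rule or how the experts are combined, so the sketch assumes the three expert layers have already been selected and mixes their distributions with fixed, hypothetical weights (a weighted average of log-probabilities). The function names `log_softmax`, `mix_experts`, and `greedy_token`, the weights, and the toy logits are all illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mix_experts(expert_logits, weights):
    """Combine per-layer next-token logits into one distribution.

    ASSUMPTION: a weighted average of log-probabilities with fixed weights.
    The paper's heuristic gating mechanism is not described in the abstract
    and is not reproduced here.
    """
    if len(expert_logits) != len(weights):
        raise ValueError("need exactly one weight per expert")
    total = sum(weights)
    norm = [w / total for w in weights]
    log_probs = [log_softmax(logits) for logits in expert_logits]
    vocab_size = len(log_probs[0])
    mixed = [
        sum(w * lp[i] for w, lp in zip(norm, log_probs))
        for i in range(vocab_size)
    ]
    # Renormalize so the result is again a valid log-distribution.
    return log_softmax(mixed)


def greedy_token(log_probs):
    """Index of the most likely token."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)


# Toy logits over a 3-token vocabulary (hypothetical values).
final_expert = [2.0, 1.9, 0.1]      # last layer
second_opinion = [0.5, 2.5, 0.0]    # an earlier layer, assumed pre-selected
prompt_retention = [0.2, 2.0, 0.1]  # a layer assumed to track the prompt

mixed = mix_experts(
    [final_expert, second_opinion, prompt_retention],
    weights=[0.5, 0.25, 0.25],
)
print(greedy_token(log_softmax(final_expert)), greedy_token(mixed))
```

In this toy case the final layer alone narrowly prefers one token, while the other two layers agree on a different one, so the mixture changes the greedy choice. This is the kind of correction a layer mixture can make, under the assumptions stated above.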
