Enhancing Retrieval-Augmented Large Vision Language Models via Knowledge Conflict Mitigation

Authors

  • Wenbin An (Xi'an Jiaotong University; National Engineering Laboratory for Big Data Analytics)
  • Jiahao Nie (Nanyang Technological University)
  • Feng Tian (Xi'an Jiaotong University; National Engineering Laboratory for Big Data Analytics)
  • Mingxiang Cai (Lenovo Research)
  • Yaqiang Wu (Lenovo Research)
  • Xiaoqin Zhang (Zhejiang University of Technology)
  • Shijian Lu (Nanyang Technological University)

DOI

https://doi.org/10.1609/aaai.v40i4.37216

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) has recently been explored to equip Large Vision Language Models (LVLMs) with more comprehensive and up-to-date contextual knowledge, compensating for their limited and coarse-grained parametric knowledge in knowledge-intensive tasks. However, the retrieved contextual knowledge is often not aligned with LVLMs’ internal parametric knowledge, leading to knowledge conflicts and, in turn, unreliable responses. To tackle this issue, we design KCM, a training-free, plug-and-play framework that mitigates knowledge conflicts while incorporating MRAG for more accurate LVLM responses. KCM enhances contextual knowledge utilization by modifying the LVLM architecture from three key perspectives. First, KCM adaptively adjusts attention distributions among multiple attention heads, encouraging LVLMs to focus on contextual knowledge with reduced distraction. Second, KCM identifies and prunes knowledge-centric LVLM neurons that encode coarse-grained parametric knowledge, thereby suppressing interference and enabling more effective integration of contextual knowledge. Third, KCM amplifies the information flow from the input context by injecting supplementary context logits, reinforcing its contribution to the final output. Extensive experiments on multiple LVLMs and benchmarks show that KCM consistently outperforms the state of the art by large margins while requiring neither extra training nor external tools.
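
The three interventions named in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract does not give formulas, so every function name, the fixed `gain` and `alpha` parameters, and the simple scaling, zeroing, and additive rules below are assumptions chosen only to illustrate the general idea on plain Python lists.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def boost_context_attention(weights, context_mask, gain=2.0):
    """Hypothetical attention adjustment: scale the attention weights on
    context positions by `gain`, then renormalize so they sum to 1."""
    scaled = [w * gain if is_ctx else w
              for w, is_ctx in zip(weights, context_mask)]
    total = sum(scaled)
    return [s / total for s in scaled]


def prune_neurons(activations, pruned_indices):
    """Hypothetical neuron pruning: zero the activations of neurons that
    were (by some assumed criterion) flagged as knowledge-centric."""
    pruned = set(pruned_indices)
    return [0.0 if i in pruned else a for i, a in enumerate(activations)]


def inject_context_logits(logits, context_logits, alpha=1.0):
    """Hypothetical logit injection: add context-derived logits, weighted
    by `alpha`, to the model's output logits."""
    return [l + alpha * c for l, c in zip(logits, context_logits)]


if __name__ == "__main__":
    # Two positions: the first is retrieved context, the second is not.
    print(boost_context_attention([0.5, 0.5], [True, False], gain=3.0))

    # Three neurons; the middle one is assumed to be flagged for pruning.
    print(prune_neurons([1.0, 2.0, 3.0], [1]))

    # Parametric knowledge prefers token 0; the context prefers token 1.
    mixed = inject_context_logits([1.0, 0.0], [0.0, 2.0], alpha=1.0)
    print(softmax(mixed))
```

In this toy setting each step shifts weight toward the retrieved context without any training, which is the property the abstract emphasizes; how KCM actually selects heads, neurons, and context logits is described in the paper itself.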

Published

2026-03-14

How to Cite

An, W., Nie, J., Tian, F., Cai, M., Wu, Y., Zhang, X., & Lu, S. (2026). Enhancing Retrieval-Augmented Large Vision Language Models via Knowledge Conflict Mitigation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2318-2326. https://doi.org/10.1609/aaai.v40i4.37216

Section

AAAI Technical Track on Computer Vision I