Enhancing Retrieval-Augmented Large Vision Language Models via Knowledge Conflict Mitigation

Wenbin An; Jiahao Nie; Feng Tian; Mingxiang Cai; Yaqiang Wu; Xiaoqin Zhang; Shijian Lu

doi:10.1609/aaai.v40i4.37216

Authors

Wenbin An, Xi'an Jiaotong University, National Engineering Laboratory for Big Data Analytics
Jiahao Nie, Nanyang Technological University
Feng Tian, Xi'an Jiaotong University, National Engineering Laboratory for Big Data Analytics
Mingxiang Cai, Lenovo Research
Yaqiang Wu, Lenovo Research
Xiaoqin Zhang, Zhejiang University of Technology
Shijian Lu, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v40i4.37216

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) has recently been explored to empower Large Vision Language Models (LVLMs) with more comprehensive and up-to-date contextual knowledge, aiming to compensate for their limited and coarse-grained parametric knowledge in knowledge-intensive tasks. However, the retrieved contextual knowledge is usually not aligned with LVLMs’ internal parametric knowledge, leading to knowledge conflicts and, in turn, unreliable responses. To tackle this issue, we design KCM, a training-free and plug-and-play framework that can effectively mitigate knowledge conflicts while incorporating MRAG for more accurate LVLM responses. KCM enhances contextual knowledge utilization by modifying the LVLM architecture from three key perspectives. First, KCM adaptively adjusts attention distributions among multiple attention heads, encouraging LVLMs to focus on contextual knowledge with reduced distraction. Second, KCM identifies and prunes knowledge-centric LVLM neurons that encode coarse-grained parametric knowledge, thereby suppressing interference and enabling more effective integration of contextual knowledge. Third, KCM amplifies the information flow from the input context by injecting supplementary context logits, reinforcing its contribution to the final output. Extensive experiments across multiple LVLMs and benchmarks show that KCM consistently outperforms the state of the art by large margins while requiring neither extra training nor external tools.
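
The abstract does not say how the supplementary context logits are computed or combined, so the following is only a toy illustration of the general idea behind the third component: pushing the next-token distribution toward what the retrieved context supports. It uses a generic contrastive-style adjustment, which is an assumption and not KCM's actual formulation; the function name `inject_context_logits`, the weight `alpha`, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inject_context_logits(logits_with_context, logits_without_context, alpha=1.0):
    """Hypothetical helper (not from the paper).

    Adds a supplementary term that rewards tokens whose score rises when
    the retrieved context is present:
        adjusted = with + alpha * (with - without)
    alpha = 0 leaves the context-conditioned logits unchanged.
    """
    if len(logits_with_context) != len(logits_without_context):
        raise ValueError("logit vectors must have the same length")
    return [
        w + alpha * (w - wo)
        for w, wo in zip(logits_with_context, logits_without_context)
    ]


# Toy 3-token vocabulary (made-up numbers).
# With context the model prefers token 0; without it, token 1.
with_ctx = [2.0, 1.0, 0.0]
without_ctx = [1.0, 2.0, 0.0]

adjusted = inject_context_logits(with_ctx, without_ctx, alpha=1.0)
probs_before = softmax(with_ctx)
probs_after = softmax(adjusted)
```

In this toy case the probability of the context-supported token 0 increases after the adjustment, while the token favored only by the context-free pass is suppressed. A larger `alpha` strengthens the effect; how KCM itself weights or derives the injected logits is not stated in the abstract.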
