CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding
DOI: https://doi.org/10.1609/aaai.v40i41.40835

Abstract
Medical Visual Question Answering (Med-VQA) aims to generate accurate answers for clinical questions grounded in medical images, and has attracted increasing research attention due to its potential to streamline diagnostics and reduce clinical burden. Recent advances in Large Vision-Language Models (LVLMs) have shown great promise for Med-VQA, but still suffer from two inference-time issues: (1) attention shift, where the LVLM over-relies on textual priors; and (2) attention dispersion, where it fails to focus on critical diagnostic regions. To tackle these issues, we propose Contrastive Mutual Information Decoding (CMID), a training-free inference-time intervention grounded in information theory for Med-VQA. Concretely, CMID first identifies the Principal Focus Area (PFA) from decoder attention maps, then constructs focus-preserving and focus-excluding views to derive dual contrastive signals that simultaneously amplify salient visual cues and suppress background noise. Crucially, these corrective signals are adaptively scaled by a reliability-gated self-correction mechanism, based on the distributional shift induced by the PFA. Extensive experiments on three Med-VQA benchmarks demonstrate the effectiveness of CMID. Further analyses showcase its robust generalizability across diverse medical architectures and tasks.
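For readers who want a concrete picture of the decoding scheme the abstract describes, the following is a minimal sketch of a contrastive logit combination over the three views, not the authors' exact formulation: the function names, the `alpha`/`beta` weights, and the use of Jensen-Shannon divergence as the reliability gate are illustrative assumptions standing in for CMID's PFA-derived dual signals and reliability-gated self-correction.

```python
import torch
import torch.nn.functional as F


def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two next-token distributions.

    Used here as a stand-in "reliability gate": it measures the
    distributional shift induced by restricting the input to the PFA.
    """
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + 1e-12) - torch.log(m + 1e-12)), dim=-1)
    kl_qm = torch.sum(q * (torch.log(q + 1e-12) - torch.log(m + 1e-12)), dim=-1)
    return 0.5 * (kl_pm + kl_qm)  # bounded in [0, ln 2]


def contrastive_decode_step(
    logits_full: torch.Tensor,   # full image + question
    logits_focus: torch.Tensor,  # focus-preserving view (only the PFA visible)
    logits_excl: torch.Tensor,   # focus-excluding view (PFA masked out)
    alpha: float = 1.0,          # weight for amplifying salient visual cues
    beta: float = 1.0,           # weight for suppressing background noise
) -> torch.Tensor:
    """One decoding step combining three next-token logit vectors.

    The correction toward the focus view (amplification) and away from
    the focus-excluding view (suppression) is scaled by the gate, so the
    intervention vanishes when the PFA barely changes the prediction.
    """
    gate = js_divergence(logits_full, logits_focus).unsqueeze(-1)
    amplified = logits_full + gate * alpha * (logits_focus - logits_full)
    corrected = amplified - gate * beta * (logits_excl - logits_full)
    return corrected
```

In practice the three logit vectors would come from three forward passes of the same LVLM on the original, PFA-cropped, and PFA-masked images; the corrected logits then feed the usual sampling or greedy step, keeping the method training-free.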
Published
2026-03-14
How to Cite
Zhu, Z., Zhang, Y., Zhang, F., Xing, B., & Wu, X. (2026). CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35275–35283. https://doi.org/10.1609/aaai.v40i41.40835
Section
AAAI Technical Track on Natural Language Processing VI