CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding

Authors

  • Zhihong Zhu, Tencent Jarvis Lab
  • Yunyan Zhang, Tencent Jarvis Lab
  • Fan Zhang, The Chinese University of Hong Kong
  • Bowen Xing, University of Science and Technology Beijing
  • Xian Wu, Tencent Jarvis Lab

DOI

https://doi.org/10.1609/aaai.v40i41.40835

Abstract

Medical Visual Question Answering (Med-VQA) aims to generate accurate answers for clinical questions grounded in medical images, which has attracted increasing research attention due to its potential to streamline diagnostics and reduce clinical burden. Recent advances in Large Vision-Language Models (LVLMs) have shown great promise for Med-VQA, but still suffer from two inference-time issues: (1) attention shift, where the LVLM over-relies on textual priors; and (2) attention dispersion, where it fails to focus on critical diagnostic regions. To tackle these issues, we propose Contrastive Mutual Information Decoding (CMID), a training-free inference-time intervention grounded in information theory for Med-VQA. Concretely, CMID first identifies the Principal Focus Area (PFA) from decoder attention maps, then constructs focus-preserving and focus-excluding views to derive dual contrastive signals that simultaneously amplify salient visual cues and suppress background noise. Crucially, these corrective signals are adaptively scaled by a reliability-gated self-correction mechanism, based on the distributional shift induced by the PFA. Extensive experiments on three Med-VQA benchmarks demonstrate the effectiveness of CMID. Further analyses showcase its robust generalizability across diverse medical architectures and tasks.
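The decoding procedure summarized in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the gating function, the way the focus-preserving and focus-excluding views are obtained, and all function and variable names (`cmid_decode_step`, `alpha_max`, the exponential gate) are assumptions introduced here for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def cmid_decode_step(logits_full, logits_focus, logits_excl, alpha_max=1.0):
    """One CMID-style contrastive decoding step (illustrative sketch).

    logits_full  : next-token logits from the unmodified image
    logits_focus : logits from the focus-preserving view (PFA kept, rest masked)
    logits_excl  : logits from the focus-excluding view (PFA masked out)
    """
    p_full = softmax(logits_full)
    p_focus = softmax(logits_focus)
    # Reliability gate: scale the correction by the distributional shift
    # the PFA induces (a hypothetical monotone gate in [0, alpha_max)).
    shift = kl(p_focus, p_full)
    alpha = alpha_max * (1.0 - np.exp(-shift))
    # Dual contrastive signal: amplify the focus view, suppress the
    # focus-excluding view, relative to the full-image logits.
    adjusted = logits_full + alpha * (logits_focus - logits_excl)
    return softmax(adjusted)
```

Note that when the three views yield identical logits, the distributional shift is zero, the gate closes, and the step reduces to ordinary decoding from the full-image logits, which matches the training-free, inference-time-only character of the method.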

Published

2026-03-14

How to Cite

Zhu, Z., Zhang, Y., Zhang, F., Xing, B., & Wu, X. (2026). CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35275–35283. https://doi.org/10.1609/aaai.v40i41.40835

Section

AAAI Technical Track on Natural Language Processing VI