CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding
DOI: https://doi.org/10.1609/aaai.v40i41.40835

Abstract
Medical Visual Question Answering (Med-VQA) aims to generate accurate answers for clinical questions grounded in medical images, and has attracted increasing research attention due to its potential to streamline diagnostics and reduce clinical burden. Recent advances in Large Vision-Language Models (LVLMs) have shown great promise for Med-VQA, but still suffer from two inference-time issues: (1) attention shift, where the LVLM over-relies on textual priors; and (2) attention dispersion, where it fails to focus on critical diagnostic regions. To tackle these issues, we propose Contrastive Mutual Information Decoding (CMID), a training-free inference-time intervention grounded in information theory for Med-VQA. Concretely, CMID first identifies the Principal Focus Area (PFA) from decoder attention maps, then constructs focus-preserving and focus-excluding views to derive dual contrastive signals that simultaneously amplify salient visual cues and suppress background noise. Crucially, these corrective signals are adaptively scaled by a reliability-gated self-correction mechanism, based on the distributional shift induced by the PFA. Extensive experiments on three Med-VQA benchmarks demonstrate the effectiveness of CMID. Further analyses showcase its robust generalizability across diverse medical architectures and tasks.
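For readers who want a concrete picture of the decoding scheme the abstract describes, the following is a minimal sketch of a contrastive logit combination over the three views, not the authors' exact formulation: the function names, the `alpha`/`beta` weights, and the use of Jensen-Shannon divergence as the reliability gate are illustrative assumptions standing in for CMID's PFA-derived dual signals and reliability-gated self-correction.

```python
import torch
import torch.nn.functional as F


def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two next-token distributions.

    Used here as a stand-in "reliability gate": it measures the
    distributional shift induced by restricting the input to the PFA.
    """
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + 1e-12) - torch.log(m + 1e-12)), dim=-1)
    kl_qm = torch.sum(q * (torch.log(q + 1e-12) - torch.log(m + 1e-12)), dim=-1)
    return 0.5 * (kl_pm + kl_qm)  # bounded in [0, ln 2]


def contrastive_decode_step(
    logits_full: torch.Tensor,   # full image + question
    logits_focus: torch.Tensor,  # focus-preserving view (only the PFA visible)
    logits_excl: torch.Tensor,   # focus-excluding view (PFA masked out)
    alpha: float = 1.0,          # weight for amplifying salient visual cues
    beta: float = 1.0,           # weight for suppressing background noise
) -> torch.Tensor:
    """One decoding step combining three next-token logit vectors.

    The correction toward the focus view (amplification) and away from
    the focus-excluding view (suppression) is scaled by the gate, so the
    intervention vanishes when the PFA barely changes the prediction.
    """
    gate = js_divergence(logits_full, logits_focus).unsqueeze(-1)
    amplified = logits_full + gate * alpha * (logits_focus - logits_full)
    corrected = amplified - gate * beta * (logits_excl - logits_full)
    return corrected
```

In practice the three logit vectors would come from three forward passes of the same LVLM on the original, PFA-cropped, and PFA-masked images; the corrected logits then feed the usual sampling or greedy step, keeping the method training-free.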
Published
2026-03-14
How to Cite
Zhu, Z., Zhang, Y., Zhang, F., Xing, B., & Wu, X. (2026). CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35275–35283. https://doi.org/10.1609/aaai.v40i41.40835
Section
AAAI Technical Track on Natural Language Processing VI