DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities

Authors

  • Nagur Shareef Shaik, Georgia State University
  • Teja Krishna Cherukuri, Georgia State University
  • Adnan Masood, UST
  • Dong Hye Ye, Georgia State University

DOI:

https://doi.org/10.1609/aaai.v40i11.37835

Abstract

The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context, and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA sets new state-of-the-art BLEU@4 scores of 0.266 and 0.134, respectively. Experimental results show that the proposed method significantly outperforms state-of-the-art models.
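
The abstract names orthogonality and alignment constraints on the latent representations but does not give their form. As a rough illustration only, and not the authors' formulation, the sketch below assumes that orthogonality is penalized by the squared cosine between a modality's shared and modality-specific latent vectors, and that alignment is penalized by one minus the cosine between the image and text shared latents. All function names, the weights `lam_orth` and `lam_align`, and the choice of cosine-based penalties are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(shared, specific):
    """Squared cosine: zero when the two latents are orthogonal (assumed form)."""
    return cosine(shared, specific) ** 2


def alignment_penalty(shared_image, shared_text):
    """One minus cosine: zero when both shared latents point the same way (assumed form)."""
    return 1.0 - cosine(shared_image, shared_text)


def constraint_loss(z_img_shared, z_img_specific,
                    z_txt_shared, z_txt_specific,
                    lam_orth=1.0, lam_align=1.0):
    """Weighted sum of the two hypothetical penalties for one image-text pair."""
    orth = (orthogonality_penalty(z_img_shared, z_img_specific)
            + orthogonality_penalty(z_txt_shared, z_txt_specific))
    align = alignment_penalty(z_img_shared, z_txt_shared)
    return lam_orth * orth + lam_align * align


# Orthogonal shared/specific latents and aligned shared latents incur no penalty.
ideal = constraint_loss([1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0])

# An image-specific latent that duplicates the shared latent is penalized.
entangled = constraint_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In a full model, a term like this would presumably be added to the usual VAE reconstruction and KL terms. How the paper combines them, and how it handles a missing modality, is not stated in the abstract.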

Published

2026-03-14

How to Cite

Shaik, N. S., Cherukuri, T. K., Masood, A., & Ye, D. H. (2026). DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 8814–8823. https://doi.org/10.1609/aaai.v40i11.37835

Issue

Section

AAAI Technical Track on Computer Vision VIII