Medical Vision–Language Pretraining with LLM-Guided Temporal Supervision

Authors

  • Liang Bai, Shanxi University
  • Zhi Wang, Shanxi University
  • Huimin Yan, Shanxi University
  • Xian Yang, University of Manchester

DOI:

https://doi.org/10.1609/aaai.v40i24.39047

Abstract

Medical vision–language pretraining typically relies on static image–text pairs, overlooking temporal cues vital for understanding clinical progression. This limits models' sensitivity to evolving semantics and reduces their effectiveness in real-world clinical reasoning. To address this challenge, we propose TAMM, a temporal alignment framework that leverages weak but semantically rich supervision from large language models (LLMs). Given temporally adjacent clinical reports, LLMs automatically generate (i) coarse-grained trend labels (e.g., improving or worsening), and (ii) fine-grained rationales explaining the supporting clinical evidence. These complementary signals inject temporal semantics without requiring manual annotation, and guide vision–language representation learning to capture trend-sensitive cross-modal alignment and rationale-grounded coherence. Experiments on multiple medical benchmarks demonstrate that TAMM improves retrieval and classification performance while yielding more interpretable, temporally consistent embeddings. Our results highlight the potential of leveraging LLM-derived supervision to equip vision–language models with temporal awareness critical for clinical applications.
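
The supervision step described in the abstract (temporally adjacent reports in, a trend label and a rationale out) can be illustrated with a small sketch. This is not the authors' pipeline: the names Report, TemporalPair, build_temporal_pairs, and stub_labeler are hypothetical, the report texts are made-up toy data, the "stable" label is an assumption beyond the two examples the abstract gives, and the keyword rule merely stands in for an LLM call so that the example runs offline.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Report:
    patient_id: str
    study_date: str  # ISO date string, so lexical order equals time order
    text: str


@dataclass(frozen=True)
class TemporalPair:
    earlier: Report
    later: Report
    trend: str      # coarse-grained label, e.g. "improving" or "worsening"
    rationale: str  # fine-grained explanation of the supporting evidence


# A labeler maps (earlier_text, later_text) to (trend, rationale).
Labeler = Callable[[str, str], Tuple[str, str]]


def build_temporal_pairs(reports: Iterable[Report], labeler: Labeler) -> List[TemporalPair]:
    """Pair each report with the next one for the same patient and label the pair."""
    ordered = sorted(reports, key=lambda r: (r.patient_id, r.study_date))
    pairs = []
    for _, group in groupby(ordered, key=lambda r: r.patient_id):
        visits = list(group)
        # zip over consecutive visits yields only temporally adjacent pairs
        for earlier, later in zip(visits, visits[1:]):
            trend, rationale = labeler(earlier.text, later.text)
            pairs.append(TemporalPair(earlier, later, trend, rationale))
    return pairs


def stub_labeler(earlier_text: str, later_text: str) -> Tuple[str, str]:
    """Hypothetical stand-in for the LLM call; a keyword rule, not a real model."""
    later = later_text.lower()
    if "decreased" in later or "resolved" in later:
        return "improving", "Later report mentions a decreased or resolved finding."
    if "increased" in later or "new" in later:
        return "worsening", "Later report mentions an increased or new finding."
    # "stable" is an assumed third label, not one named in the abstract
    return "stable", "No change keyword found in the later report."


# Toy reports, invented for illustration only
reports = [
    Report("p1", "2020-01-05", "Small left pleural effusion."),
    Report("p1", "2020-01-09", "Left pleural effusion has increased."),
    Report("p1", "2020-01-15", "Effusion has decreased since prior study."),
    Report("p2", "2020-02-01", "No acute findings."),
]
pairs = build_temporal_pairs(reports, stub_labeler)
```

In a real setting the labeler argument would wrap a prompt to an LLM, and the resulting trend and rationale fields would serve as the weak supervision signals for pretraining; how TAMM actually consumes them is specified in the paper, not in this sketch.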

Published

2026-03-14

How to Cite

Bai, L., Wang, Z., Yan, H., & Yang, X. (2026). Medical Vision–Language Pretraining with LLM-Guided Temporal Supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 19666–19674. https://doi.org/10.1609/aaai.v40i24.39047

Issue

Section

AAAI Technical Track on Machine Learning I