Medical Vision–Language Pretraining with LLM-Guided Temporal Supervision

Liang Bai; Zhi Wang; Huimin Yan; Xian Yang

doi:10.1609/aaai.v40i24.39047

Authors

Liang Bai, Shanxi University
Zhi Wang, Shanxi University
Huimin Yan, Shanxi University
Xian Yang, University of Manchester

DOI:

https://doi.org/10.1609/aaai.v40i24.39047

Abstract

Medical vision–language pretraining typically relies on static image–text pairs, overlooking temporal cues vital for understanding clinical progression. This limits models' sensitivity to evolving semantics and reduces their effectiveness in real-world clinical reasoning. To address this challenge, we propose TAMM, a temporal alignment framework that leverages weak but semantically rich supervision from large language models (LLMs). Given temporally adjacent clinical reports, LLMs automatically generate (i) coarse-grained trend labels (e.g., improving or worsening) and (ii) fine-grained rationales explaining the supporting clinical evidence. These complementary signals inject temporal semantics without requiring manual annotation, and guide vision–language representation learning to capture trend-sensitive cross-modal alignment and rationale-grounded coherence. Experiments on multiple medical benchmarks demonstrate that TAMM improves retrieval and classification performance while yielding more interpretable, temporally consistent embeddings. Our results highlight the potential of leveraging LLM-derived supervision to equip vision–language models with temporal awareness critical for clinical applications.
