ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model

Authors

  • Tong Zhao Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Junping Du Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Zhe Xue Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Meiyu Liang Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Aijing Li Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Xiaolong Meng Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Dandan Liu Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia

DOI:

https://doi.org/10.1609/aaai.v40i19.38683

Abstract

Spatial-temporal prediction plays a crucial role in various domains, including intelligent transportation and environmental monitoring. Although large language models have shown advantages in long-range dependency modeling and excellent generalization ability for forecasting, they have a limited understanding of spatial-temporal features. For spatial features in particular, most existing methods still simplify the spatial-temporal prediction task into multiple independent temporal prediction tasks, failing to effectively encode the dynamic evolution of spatial relations. To address these problems, we propose ST-VLM (Spatial-Temporal Forecasting with Vision-Language Model), a novel framework that leverages visual representations to encode the dynamic spatial dependencies within spatial-temporal data and integrates multi-modal information to enhance prediction. The framework transforms spatial-temporal features into three modalities: vision, text, and time series; enhances cross-modal fusion through an attention-aware fusion mechanism in the first layer of the Vision-Language Model (VLM); and optimizes multi-modal feature interaction via adaptive fine-tuning strategies. After fusion, the multi-modal embeddings are used for the final spatial-temporal prediction task. Extensive experiments demonstrate that ST-VLM achieves state-of-the-art performance across various datasets. In particular, the framework exhibits promising results in few-shot scenarios, verifying its strong generalization ability.
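The two ideas in the abstract, rendering spatial-temporal data as an image-like input and fusing the three modality embeddings with an attention mechanism, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the rendering function, the fusion rule, and all shapes and names (`spatial_to_image`, `attention_fuse`, the 224x224 resolution) are assumptions made for illustration.

```python
import numpy as np

def spatial_to_image(st_tensor, out_hw=(224, 224)):
    """Render a (T, H, W) spatial-temporal tensor as pseudo-image frames.

    Hypothetical sketch: min-max normalize the tensor to [0, 1] and
    nearest-neighbor upsample each frame to a VLM-style input resolution.
    """
    T, H, W = st_tensor.shape
    lo, hi = st_tensor.min(), st_tensor.max()
    norm = (st_tensor - lo) / (hi - lo + 1e-8)
    oh, ow = out_hw
    rows = np.arange(oh) * H // oh          # nearest-neighbor row indices
    cols = np.arange(ow) * W // ow          # nearest-neighbor column indices
    return norm[:, rows][:, :, cols]        # (T, oh, ow)

def attention_fuse(vision, text, series):
    """Toy attention-aware fusion of three modality embeddings (each (d,)).

    Weights each modality by the softmax of its scaled dot product with a
    shared query (the mean embedding) -- a stand-in for the paper's
    first-layer fusion mechanism, not its actual formulation.
    """
    M = np.stack([vision, text, series])    # (3, d)
    q = M.mean(axis=0)
    scores = M @ q / np.sqrt(M.shape[1])
    w = np.exp(scores - scores.max())       # numerically stable softmax
    w /= w.sum()
    return w @ M                            # fused embedding, (d,)

rng = np.random.default_rng(0)
img = spatial_to_image(rng.random((12, 32, 32)))   # 12 time steps on a 32x32 grid
fused = attention_fuse(rng.random(64), rng.random(64), rng.random(64))
print(img.shape, fused.shape)                      # (12, 224, 224) (64,)
```

In a full pipeline the pseudo-image frames would go through the VLM's vision encoder while the text and time-series inputs go through their own encoders, with fusion happening on the resulting token embeddings rather than on single vectors as shown here.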

Published

2026-03-14

How to Cite

Zhao, T., Du, J., Xue, Z., Liang, M., Li, A., Meng, X., & Liu, D. (2026). ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16441-16449. https://doi.org/10.1609/aaai.v40i19.38683

Section

AAAI Technical Track on Data Mining & Knowledge Management III