ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model
DOI:
https://doi.org/10.1609/aaai.v40i19.38683

Abstract
Spatial-temporal prediction plays a crucial role in various domains, including intelligent transportation and environmental monitoring. Although large language models have shown advantages in long-range dependency modeling and strong generalization for forecasting, they have a limited understanding of spatial-temporal features. In particular, for spatial features, most existing methods still simplify the spatial-temporal prediction task into multiple independent temporal prediction tasks, failing to effectively encode the dynamic evolution of spatial relations. To address these problems, we propose ST-VLM (Spatial-Temporal Forecasting with Vision-Language Model), a novel framework that leverages visual representations to encode the dynamic spatial dependencies within spatial-temporal data and integrates multi-modal information to enhance prediction. The framework transforms spatial-temporal features into three modalities: vision, text, and time series; enhances cross-modal fusion through an attention-aware fusion mechanism in the first layer of the Vision-Language Model (VLM); and optimizes multi-modal feature interaction via adaptive fine-tuning strategies. After fusion, the multi-modal embeddings are used for the final spatial-temporal prediction task. Extensive experiments demonstrate that ST-VLM achieves state-of-the-art performance across various datasets. In particular, the framework exhibits promising results in few-shot scenarios, verifying its strong generalization ability.

Published
2026-03-14
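The attention-aware cross-modal fusion described in the abstract — time-series features attending over vision and text embeddings — can be sketched as plain scaled dot-product cross-attention. This is a minimal illustrative sketch only; the function name, embedding shapes, and residual connection are assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(ts_emb, vis_emb, txt_emb):
    """Illustrative fusion: time-series tokens (queries) attend over the
    concatenated vision and text tokens (keys/values), then a residual
    connection keeps the original time-series signal."""
    d = ts_emb.shape[-1]
    kv = np.concatenate([vis_emb, txt_emb], axis=0)   # (Nv+Nt, d)
    scores = ts_emb @ kv.T / np.sqrt(d)               # (Nq, Nv+Nt)
    weights = softmax(scores, axis=-1)                # attention per query token
    fused = weights @ kv                              # (Nq, d)
    return ts_emb + fused                             # residual connection

rng = np.random.default_rng(0)
d = 8
fused = cross_modal_fusion(rng.normal(size=(4, d)),   # 4 time-series tokens
                           rng.normal(size=(3, d)),   # 3 vision tokens
                           rng.normal(size=(5, d)))   # 5 text tokens
```

In the paper's setting this fusion happens inside the VLM's first layer and is followed by adaptive fine-tuning; the sketch above only shows the attention mixing itself.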
How to Cite
Zhao, T., Du, J., Xue, Z., Liang, M., Li, A., Meng, X., & Liu, D. (2026). ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16441-16449. https://doi.org/10.1609/aaai.v40i19.38683
Issue
Section
AAAI Technical Track on Data Mining & Knowledge Management III