ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model

Authors

  • Tong Zhao Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Junping Du Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Zhe Xue Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Meiyu Liang Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Aijing Li Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Xiaolong Meng Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia
  • Dandan Liu Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia

DOI:

https://doi.org/10.1609/aaai.v40i19.38683

Abstract

Spatial-temporal prediction plays a crucial role in various domains, including intelligent transportation and environmental monitoring. Although large language models have shown advantages in long-range dependency modeling and excellent generalization ability for forecasting, they have a limited understanding of spatial-temporal features. For spatial features in particular, most existing methods still simplify the spatial-temporal prediction task into multiple independent temporal prediction tasks, failing to effectively encode the dynamic evolution of spatial relations. To address these problems, we propose ST-VLM (Spatial-Temporal Forecasting with Vision-Language Model), a novel framework that leverages visual representations to encode the dynamic spatial dependencies within spatial-temporal data and integrates multi-modal information to enhance prediction. The framework transforms spatial-temporal features into three modalities: vision, text, and time series; enhances cross-modal fusion through an attention-aware fusion mechanism in the first layer of the Vision-Language Model (VLM); and optimizes multi-modal feature interaction via adaptive fine-tuning strategies. After fusion, the multi-modal embeddings are used for the final spatial-temporal prediction task. Extensive experiments demonstrate that ST-VLM achieves state-of-the-art performance across various datasets. In particular, the framework exhibits promising results in few-shot scenarios, verifying its strong generalization ability.
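The two ideas in the abstract, rendering spatial-temporal data as an image-like input and fusing the three modality embeddings with an attention mechanism, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the rendering function, the fusion rule, and all shapes and names (`spatial_to_image`, `attention_fuse`, the 224x224 resolution) are assumptions made for illustration.

```python
import numpy as np

def spatial_to_image(st_tensor, out_hw=(224, 224)):
    """Render a (T, H, W) spatial-temporal tensor as pseudo-image frames.

    Hypothetical sketch: min-max normalize the tensor to [0, 1] and
    nearest-neighbor upsample each frame to a VLM-style input resolution.
    """
    T, H, W = st_tensor.shape
    lo, hi = st_tensor.min(), st_tensor.max()
    norm = (st_tensor - lo) / (hi - lo + 1e-8)
    oh, ow = out_hw
    rows = np.arange(oh) * H // oh          # nearest-neighbor row indices
    cols = np.arange(ow) * W // ow          # nearest-neighbor column indices
    return norm[:, rows][:, :, cols]        # (T, oh, ow)

def attention_fuse(vision, text, series):
    """Toy attention-aware fusion of three modality embeddings (each (d,)).

    Weights each modality by the softmax of its scaled dot product with a
    shared query (the mean embedding) -- a stand-in for the paper's
    first-layer fusion mechanism, not its actual formulation.
    """
    M = np.stack([vision, text, series])    # (3, d)
    q = M.mean(axis=0)
    scores = M @ q / np.sqrt(M.shape[1])
    w = np.exp(scores - scores.max())       # numerically stable softmax
    w /= w.sum()
    return w @ M                            # fused embedding, (d,)

rng = np.random.default_rng(0)
img = spatial_to_image(rng.random((12, 32, 32)))   # 12 time steps on a 32x32 grid
fused = attention_fuse(rng.random(64), rng.random(64), rng.random(64))
print(img.shape, fused.shape)                      # (12, 224, 224) (64,)
```

In a full pipeline the pseudo-image frames would go through the VLM's vision encoder while the text and time-series inputs go through their own encoders, with fusion happening on the resulting token embeddings rather than on single vectors as shown here.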

Published

2026-03-14

How to Cite

Zhao, T., Du, J., Xue, Z., Liang, M., Li, A., Meng, X., & Liu, D. (2026). ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16441-16449. https://doi.org/10.1609/aaai.v40i19.38683

Section

AAAI Technical Track on Data Mining & Knowledge Management III