Zhao, T., Du, J., Xue, Z., Liang, M., Li, A., Meng, X., & Liu, D. (2026). ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16441-16449. https://doi.org/10.1609/aaai.v40i19.38683