WANG, Jiankang; ZHANG, Zhihan; LIU, Zhihang; LI, Yang; GE, Jiannan; XIE, Hongtao; ZHANG, Yongdong. SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability. Proceedings of the AAAI Conference on Artificial Intelligence, [S. l.], v. 40, n. 12, p. 9912–9920, 2026. DOI: 10.1609/aaai.v40i12.37956. Available at: https://ojs.aaai.org/index.php/AAAI/article/view/37956. Accessed: 10 May 2026.