Wang, J., Zhang, Z., Liu, Z., Li, Y., Ge, J., Xie, H., & Zhang, Y. (2026). SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9912–9920. https://doi.org/10.1609/aaai.v40i12.37956