1. Wang J, Zhang Z, Liu Z, Li Y, Ge J, Xie H, et al. SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability. AAAI [Internet]. 2026 Mar. 14 [cited 2026 May 10];40(12):9912-20. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/37956