SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Authors

  • Jiankang Wang, University of Science and Technology of China
  • Zhihan Zhang, University of Science and Technology of China
  • Zhihang Liu, University of Science and Technology of China
  • Yang Li, Renmin University of China
  • Jiannan Ge, University of Science and Technology of China
  • Hongtao Xie, University of Science and Technology of China
  • Yongdong Zhang, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i12.37956

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable progress in temporal or spatial localization tasks, but struggle with joint spatio-temporal video grounding (STVG). We identify two key bottlenecks hindering this capability: (1) the sheer number of visual tokens makes long-range and fine-grained visual modeling challenging; (2) generating a long sequence of bounding boxes as text makes it hard to accurately align each box with its specific video frame. Distinct from prior efforts that rely on attaching complex modules, we argue for a more elegant paradigm that unlocks the inherent potential of MLLMs and leverages their strengths. To this end, we propose SpaceVLLM, an MLLM equipped with spatio-temporal video grounding capabilities. Specifically, we propose Spatio-Temporal Aware Queries, interleaved with video frames, to guide the MLLM in capturing both static appearance and dynamic motion features. We further present a lightweight Query-Guided Space Head that maps queries to precise spatial coordinates, bypassing the need for direct textual coordinate generation and enabling the MLLM to focus on video understanding. To further facilitate research in this area, we propose an automated data synthesis pipeline to construct the V-STG dataset, comprising 110K STVG instances. Extensive experiments show that SpaceVLLM achieves state-of-the-art performance on STVG benchmarks and maintains strong performance on various video understanding tasks, validating the effectiveness of our approach.
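
The two ideas named in the abstract, queries interleaved with video frames and a head that maps each query to spatial coordinates, can be illustrated with a minimal sketch. This is not the authors' implementation. The placeholder token name `<ST_QUERY>`, the choice of one query per frame, the single linear layer with a sigmoid, and the normalized (cx, cy, w, h) box format are all assumptions made for illustration only.

```python
import math


def interleave_queries(frame_tokens, query_token="<ST_QUERY>"):
    """Append one query placeholder after each frame's visual tokens.

    frame_tokens: list of per-frame token lists.
    Returns the flat sequence and the index of each query in it.
    The token name and one-query-per-frame layout are hypothetical.
    """
    sequence = []
    query_positions = []
    for tokens in frame_tokens:
        sequence.extend(tokens)
        query_positions.append(len(sequence))  # index where the query lands
        sequence.append(query_token)
    return sequence, query_positions


def space_head(hidden, weights, bias):
    """Map one query hidden state to a box with values in (0, 1).

    Assumed design: a single linear layer followed by a sigmoid,
    producing four numbers read as normalized (cx, cy, w, h).
    """
    box = []
    for row, b in zip(weights, bias):
        z = sum(w * h for w, h in zip(row, hidden)) + b
        box.append(1.0 / (1.0 + math.exp(-z)))
    return box


# Two frames with two visual tokens each.
frames = [["f0_a", "f0_b"], ["f1_a", "f1_b"]]
sequence, positions = interleave_queries(frames)
# sequence  -> ['f0_a', 'f0_b', '<ST_QUERY>', 'f1_a', 'f1_b', '<ST_QUERY>']
# positions -> [2, 5]
```

Under these assumptions, each box is read from the hidden state at a known query position, so it is tied to its frame by construction and no coordinates have to be generated as text.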

Published

2026-03-14

How to Cite

Wang, J., Zhang, Z., Liu, Z., Li, Y., Ge, J., Xie, H., & Zhang, Y. (2026). SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9912–9920. https://doi.org/10.1609/aaai.v40i12.37956

Section

AAAI Technical Track on Computer Vision IX