SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Jiankang Wang; Zhihan Zhang; Zhihang Liu; Yang Li; Jiannan Ge; Hongtao Xie; Yongdong Zhang

doi:10.1609/aaai.v40i12.37956

Authors

Jiankang Wang, University of Science and Technology of China
Zhihan Zhang, University of Science and Technology of China
Zhihang Liu, University of Science and Technology of China
Yang Li, Renmin University of China
Jiannan Ge, University of Science and Technology of China
Hongtao Xie, University of Science and Technology of China
Yongdong Zhang, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i12.37956

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable progress in temporal or spatial localization tasks, but struggle with joint spatio-temporal video grounding (STVG). We identify two key bottlenecks hindering this capability: (1) the sheer number of visual tokens makes long-range and fine-grained visual modeling challenging; (2) generating a long sequence of bounding boxes in text makes it hard to accurately align each box with its specific video frame. Distinct from prior efforts that rely on attaching complex modules, we argue for a more elegant paradigm that unlocks the inherent potential of MLLMs and leverages their strengths. To this end, we propose SpaceVLLM, an MLLM equipped with spatio-temporal video grounding capabilities. Specifically, we propose Spatio-Temporal Aware Queries, interleaved with video frames, to guide the MLLM in capturing both static appearance and dynamic motion features. We further present a lightweight Query-Guided Space Head that maps queries to precise spatial coordinates, bypassing the need for direct textual coordinate generation and enabling the MLLM to focus on video understanding. To further facilitate research in this area, we propose an automated data synthesis pipeline to construct the V-STG dataset, comprising 110K STVG instances. Extensive experiments show that SpaceVLLM achieves state-of-the-art performance on STVG benchmarks and maintains strong performance on various video understanding tasks, validating our approach's effectiveness.
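
The two mechanisms named in the abstract, queries interleaved with video frames and a head that maps each query to box coordinates, can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the placeholder token name `<ST_QUERY>`, the choice of one query after each frame's tokens, the single linear layer with a sigmoid, the normalized (cx, cy, w, h) box format, and the random stand-ins for the MLLM's hidden states.

```python
import math
import random

QUERY = "<ST_QUERY>"  # hypothetical placeholder token name, not from the paper


def interleave(frames_tokens):
    """Place one query placeholder after each frame's visual tokens (assumed layout)."""
    sequence = []
    query_positions = []
    for tokens in frames_tokens:
        sequence.extend(tokens)
        query_positions.append(len(sequence))  # index where the query will sit
        sequence.append(QUERY)
    return sequence, query_positions


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def space_head(hidden, weights, bias):
    """Toy linear head: one query hidden state -> normalized box (cx, cy, w, h)."""
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weights, bias)
    ]


# Two frames with three visual tokens each (token names are illustrative).
frames = [["f0_a", "f0_b", "f0_c"], ["f1_a", "f1_b", "f1_c"]]
sequence, query_positions = interleave(frames)

# Random vectors standing in for the MLLM's hidden states at the query positions.
rng = random.Random(0)
dim = 8
hidden_states = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in query_positions]
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
bias = [0.0] * 4

# One box per frame, read from that frame's query rather than generated as text.
boxes = [space_head(h, weights, bias) for h in hidden_states]
```

Because each box is read from the query attached to a specific frame, the box-to-frame alignment is fixed by position in the sequence, which is the property the abstract contrasts with generating a long run of coordinates as text.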
