Video Spatial Reasoning with Object-Centric 3D Rollout

Haoran Tang; Meng Cao; Ruyang Liu; Xiaoxi Liang; Linglong Li; Ge Li; Xiaodan Liang

doi:10.1609/aaai.v40i11.37899

Authors

Haoran Tang, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Meng Cao, Mohamed bin Zayed University of Artificial Intelligence; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Ruyang Liu, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Xiaoxi Liang, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Linglong Li, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Ge Li, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Xiaodan Liang, Sun Yat-sen University; Mohamed bin Zayed University of Artificial Intelligence

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning—the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes—remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
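
The core idea of perturbing one object's 3D geometry and projecting the result into 2D can be illustrated with a small sketch. The abstract does not specify the form of the perturbation, so the Gaussian noise, the pinhole camera model, and every name below (`perturb_object_points`, `project_to_image`, `sigma`, `K`) are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def perturb_object_points(points, object_mask, sigma=0.05, seed=0):
    """Add Gaussian noise to the 3D points of the selected object only.

    Gaussian noise is an illustrative assumption; the paper's actual
    perturbation scheme is not described in the abstract.
    """
    rng = np.random.default_rng(seed)
    noisy = points.astype(float).copy()
    noise = rng.normal(0.0, sigma, size=(int(object_mask.sum()), 3))
    noisy[object_mask] += noise  # points outside the mask stay untouched
    return noisy

def project_to_image(points, K):
    """Pinhole projection of camera-frame 3D points (Z > 0) to pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Toy scene: three 3D points in the camera frame; the middle one
# belongs to the (hypothetical) selected object.
points = np.array([[0.0, 0.0, 2.0],
                   [1.0, 0.0, 2.0],
                   [0.0, 1.0, 4.0]])
object_mask = np.array([False, True, False])

# Hypothetical camera intrinsics (focal length 500 px, 640x480 image).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

clean_px = project_to_image(points, K)
noisy_px = project_to_image(perturb_object_points(points, object_mask), K)
```

In this sketch only the pixels belonging to the selected object move, while the rest of the scene projects exactly as before. That mirrors the abstract's description of degrading object-specific cues while leaving contextual cues intact.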
