Video Spatial Reasoning with Object-Centric 3D Rollout

Authors

  • Haoran Tang School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
  • Meng Cao Mohamed bin Zayed University of Artificial Intelligence; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
  • Ruyang Liu School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
  • Xiaoxi Liang School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
  • Linglong Li School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
  • Ge Li School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
  • Xiaodan Liang Sun Yat-sen University; Mohamed bin Zayed University of Artificial Intelligence

DOI:

https://doi.org/10.1609/aaai.v40i11.37899

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning—the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes—remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
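
As a rough illustration of the idea described in the abstract (perturbing the 3D geometry of selected objects, then projecting the altered geometry into 2D), the following minimal sketch may help. It is not the authors' implementation. The abstract does not specify the form of the "structured perturbations", so Gaussian jitter is used here purely as an assumption. The function names, the per-point object mask, the camera-frame point format, and the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are likewise hypothetical.

```python
import random


def perturb_object_points(points, object_mask, sigma, rng):
    """Jitter only the 3D points flagged as belonging to the selected object.

    Assumption: isotropic Gaussian noise stands in for the paper's
    (unspecified) structured perturbation.
    """
    noisy = []
    for (x, y, z), selected in zip(points, object_mask):
        if selected:
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            z += rng.gauss(0.0, sigma)
        noisy.append((x, y, z))
    return noisy


def project_to_pixels(points, fx, fy, cx, cy):
    """Pinhole projection of camera-frame 3D points to 2D pixel coordinates.

    Points at or behind the camera plane (z <= 0) map to None so the
    output stays aligned with the input.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:
            pixels.append(None)
        else:
            pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical points; only the second belongs to the selected object.
    scene = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
    mask = [False, True]
    noisy_scene = perturb_object_points(scene, mask, sigma=0.1, rng=rng)
    print(project_to_pixels(noisy_scene, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

In this sketch, points outside the selected object are left untouched, so only the chosen object's 2D footprint is degraded. The resulting "region-noisy" view could then be paired with the unmodified one during rollout, in the spirit of the training pipeline the abstract describes.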

Published

2026-03-14

How to Cite

Tang, H., Cao, M., Liu, R., Liang, X., Li, L., Li, G., & Liang, X. (2026). Video Spatial Reasoning with Object-Centric 3D Rollout. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9395–9403. https://doi.org/10.1609/aaai.v40i11.37899

Section

AAAI Technical Track on Computer Vision VIII