VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Authors

  • Zishan Xu, South China Normal University; Shanghai Jiao Tong University
  • Yifu Guo, South China Normal University; Sun Yat-sen University
  • Yuquan Lu, South China Normal University; Sun Yat-sen University
  • Fengyu Yang, South China Normal University
  • Junxin Li, South China Normal University; Sun Yat-sen University
  • Lihua Cai, South China Normal University; Xiamen Rekey Medical Technology Co., LTD

DOI

https://doi.org/10.1609/aaai.v40i14.38132

Abstract

Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and provides no explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task-difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks.

Published

2026-03-14

How to Cite

Xu, Z., Guo, Y., Lu, Y., Yang, F., Li, J., & Cai, L. (2026). VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11496–11504. https://doi.org/10.1609/aaai.v40i14.38132

Issue

Section

AAAI Technical Track on Computer Vision XI