LENS: Learning to Segment Anything with Unified Reinforced Reasoning

Authors

  • Lianghui Zhu, Huazhong University of Science and Technology
  • Bin Ouyang, Huazhong University of Science and Technology
  • Yuxuan Zhang, Huazhong University of Science and Technology
  • Tianheng Cheng, Huazhong University of Science and Technology
  • Rui Hu, Huazhong University of Science and Technology
  • Haocheng Shen, vivo Mobile Communication Co., Ltd
  • Longjin Ran, vivo Mobile Communication Co., Ltd
  • Xiaoxin Chen, vivo Mobile Communication Co., Ltd
  • Li Yu, Huazhong University of Science and Technology
  • Wenyu Liu, Huazhong University of Science and Technology
  • Xinggang Wang, Huazhong University of Science and Technology

DOI

https://doi.org/10.1609/aaai.v40i16.38405

Abstract

Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision–language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM).
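For readers unfamiliar with the metric quoted in the abstract, cIoU (cumulative IoU) is conventionally computed on referring-segmentation benchmarks as the total intersection over the total union, accumulated across all samples, rather than as a mean of per-image IoUs. The following is a minimal sketch of that convention, not the authors' evaluation code; the function name `cumulative_iou` and the toy masks are illustrative assumptions.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: sum of intersections divided by sum of unions.

    pred_masks, gt_masks: iterables of same-shaped binary masks
    (hypothetical inputs; any array-like of 0/1 values works).
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        # Accumulate pixel counts over the whole dataset.
        total_inter += int(np.logical_and(pred, gt).sum())
        total_union += int(np.logical_or(pred, gt).sum())
    # Guard against an empty union (no foreground anywhere).
    return total_inter / total_union if total_union > 0 else 0.0

# Toy example with two 2x2 masks (illustrative data only).
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = cumulative_iou(preds, gts)
```

Because pixel counts are pooled before dividing, larger objects weigh more heavily in cIoU than in a per-image mean; in the toy example the pooled score is 2/3, whereas the mean of the per-image IoUs would be 0.75.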

Published

2026-03-14

How to Cite

Zhu, L., Ouyang, B., Zhang, Y., Cheng, T., Hu, R., Shen, H., … Wang, X. (2026). LENS: Learning to Segment Anything with Unified Reinforced Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13952–13960. https://doi.org/10.1609/aaai.v40i16.38405

Issue

Vol. 40 No. 16 (2026)

Section

AAAI Technical Track on Computer Vision XIII