Reasoning via Implicit Self-supervised Emergence for Instruction Segmentation

Authors

  • Qing Zhou Northwestern Polytechnical University
  • Lichang Yang Northwestern Polytechnical University
  • Yuyu Jia Northwestern Polytechnical University
  • Junyu Gao Northwestern Polytechnical University
  • Weiping Ni Northwest Institute of Nuclear Technology
  • Junzheng Wu Northwest Institute of Nuclear Technology
  • Qi Wang Northwestern Polytechnical University

DOI:

https://doi.org/10.1609/aaai.v40i16.38382

Abstract

We challenge the assumption that complex instruction-guided segmentation tasks necessitate equally complex and explicit supervision. This paper introduces RISE (Reasoning via Implicit Self-supervised Emergence), a framework that learns intricate compositional reasoning, spanning spatial relations to world knowledge, without a single ground-truth mask. To achieve this, RISE employs reinforcement learning with GRPO guided by a single, strikingly simple reward: the semantic alignment score between the textual instruction and the predicted image region. Our primary discovery is the implicit emergence of a high-quality chain-of-thought process from this minimalist signal. Within a structured format, the model autonomously learns to understand instructions by accessing its latent knowledge and inferring spatial relationships, capabilities that are inherent in its architecture but unlocked by our simple objective. Remarkably, this emergent reasoning yields highly competitive results: RISE achieves 58.7 gIoU on the ReasonSeg benchmark, on par with methods using geometric rewards. Furthermore, we show extreme data efficiency: a variant trained on only 2,000 ImageNet-label pairs establishes a new state-of-the-art for annotation-free referring segmentation with 79.6 cIoU on RefCOCO.
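
As a rough illustration of the training signal described in the abstract, the sketch below scores a group of candidate regions by semantic alignment with the instruction and converts those scores into group-relative advantages in the style of GRPO. It is a minimal sketch under stated assumptions, not the authors' implementation: cosine similarity stands in for the alignment score, the two-dimensional embeddings are toy values, and the names `cosine_similarity` and `group_relative_advantages` are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed alignment score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no alignment signal
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (a choice made here)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy embeddings (assumed): one instruction, four sampled region predictions.
text_embedding = [1.0, 0.0]
region_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

# One scalar reward per sampled prediction, with no ground-truth mask involved.
rewards = [cosine_similarity(text_embedding, e) for e in region_embeddings]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:+.3f}  advantage={a:+.3f}")
```

In this toy group, the region whose embedding matches the instruction receives the largest advantage and the opposing one the smallest, so a policy update would favor outputs that align better with the text.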

Published

2026-03-14

How to Cite

Zhou, Q., Yang, L., Jia, Y., Gao, J., Ni, W., Wu, J., & Wang, Q. (2026). Reasoning via Implicit Self-supervised Emergence for Instruction Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13746–13754. https://doi.org/10.1609/aaai.v40i16.38382

Section

AAAI Technical Track on Computer Vision XIII