RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation

Authors

  • Fu Rong, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
  • Meng Lan, Hong Kong University of Science and Technology
  • Qian Zhang, Horizon Robotics
  • Lefei Zhang, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University

DOI:

https://doi.org/10.1609/aaai.v40i11.37828

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
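
The abstract says the mask prompt generator takes the visual embeddings and class tokens as input and produces a pseudo-mask used as a dense prompt, but it does not describe the internals. The following is only a minimal sketch of one plausible way such a step could work, assuming a single class token and a scaled dot-product similarity followed by a sigmoid. The function name `pseudo_mask_prompt`, the tensor shapes, and the similarity choice are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def pseudo_mask_prompt(visual_emb: np.ndarray, class_token: np.ndarray) -> np.ndarray:
    """Hypothetical pseudo-mask generator (an assumption, not the paper's method).

    visual_emb:  (H, W, C) per-pixel visual embeddings.
    class_token: (C,) multimodal class token summarizing the referred target.
    Returns an (H, W) soft mask with values in (0, 1).
    """
    if visual_emb.ndim != 3:
        raise ValueError("visual_emb must have shape (H, W, C)")
    channels = visual_emb.shape[-1]
    if class_token.shape != (channels,):
        raise ValueError("class_token must have shape (C,)")

    # Scaled dot product between every pixel embedding and the class token.
    logits = visual_emb @ class_token / np.sqrt(channels)

    # Sigmoid turns similarities into a soft mask usable as a dense prompt.
    return 1.0 / (1.0 + np.exp(-logits))


# Toy usage with random embeddings (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
toy_mask = pseudo_mask_prompt(rng.normal(size=(4, 5, 8)), rng.normal(size=8))
print(toy_mask.shape)
```

In this sketch, pixels whose embeddings align with the class token receive values above 0.5 and anti-aligned pixels receive values below it; a real system would presumably learn this mapping rather than use a fixed similarity.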

Published

2026-03-14

How to Cite

Rong, F., Lan, M., Zhang, Q., & Zhang, L. (2026). RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 8751–8759. https://doi.org/10.1609/aaai.v40i11.37828

Section

AAAI Technical Track on Computer Vision VIII