RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation
DOI:
https://doi.org/10.1609/aaai.v40i11.37828
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although the Segment Anything Model 2 (SAM2) has shown remarkable performance on various segmentation tasks, applying it to RRSIS poses several challenges, including understanding text-described RS scenes and generating effective prompts from text. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features with textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module adapts SAM2 to RS scenes and aligns the adapted visual features with visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as SAM2's dense prompt. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
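The mask prompt generator described above can be illustrated with a minimal NumPy sketch: per-pixel visual embeddings are correlated with a multimodal class token to produce a soft pseudo-mask that would serve as SAM2's dense prompt. All shapes, names, and the dot-product scoring here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_prompt(visual_emb, class_token):
    """Hypothetical simplification of the mask prompt generator:
    score each spatial location against the multimodal class token.

    visual_emb : (H, W, C) aligned visual embeddings
    class_token: (C,)      multimodal class token
    returns    : (H, W)    soft pseudo-mask in [0, 1]
    """
    logits = np.einsum('hwc,c->hw', visual_emb, class_token)
    return sigmoid(logits)  # dense prompt fed to SAM2 in the paper's pipeline

# Toy usage with random features
rng = np.random.default_rng(0)
mask = mask_prompt(rng.standard_normal((8, 8, 16)), rng.standard_normal(16))
```

In practice the generator would be learned end-to-end; the sketch only shows the input/output contract the abstract describes (visual embeddings and class tokens in, pseudo-mask out).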
Published
2026-03-14
How to Cite
Rong, F., Lan, M., Zhang, Q., & Zhang, L. (2026). RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 8751–8759. https://doi.org/10.1609/aaai.v40i11.37828
Section
AAAI Technical Track on Computer Vision VIII