RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation
DOI:
https://doi.org/10.1609/aaai.v40i11.37828
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although the Segment Anything Model 2 (SAM2) has shown remarkable performance on various segmentation tasks, applying it to RRSIS poses several challenges, including understanding text-described RS scenes and generating effective prompts from text. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features with textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module adapts SAM2 to RS scenes and aligns the adapted visual features with visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as SAM2's dense prompt. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
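The mask prompt generator described above can be illustrated with a minimal NumPy sketch: per-pixel visual embeddings are correlated with a multimodal class token to produce a soft pseudo-mask that would serve as SAM2's dense prompt. All shapes, names, and the dot-product scoring here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_prompt(visual_emb, class_token):
    """Hypothetical simplification of the mask prompt generator:
    score each spatial location against the multimodal class token.

    visual_emb : (H, W, C) aligned visual embeddings
    class_token: (C,)      multimodal class token
    returns    : (H, W)    soft pseudo-mask in [0, 1]
    """
    logits = np.einsum('hwc,c->hw', visual_emb, class_token)
    return sigmoid(logits)  # dense prompt fed to SAM2 in the paper's pipeline

# Toy usage with random features
rng = np.random.default_rng(0)
mask = mask_prompt(rng.standard_normal((8, 8, 16)), rng.standard_normal(16))
```

In practice the generator would be learned end-to-end; the sketch only shows the input/output contract the abstract describes (visual embeddings and class tokens in, pseudo-mask out).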
Published
2026-03-14
How to Cite
Rong, F., Lan, M., Zhang, Q., & Zhang, L. (2026). RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 8751–8759. https://doi.org/10.1609/aaai.v40i11.37828
Section
AAAI Technical Track on Computer Vision VIII