Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Authors

  • Bingyu Li, Department of Electronic Engineering and Information Science, University of Science and Technology of China, China; Institute of Artificial Intelligence (TeleAI), China
  • Haocheng Dong, Department of Electronic Engineering and Information Science, University of Science and Technology of China, China; Institute of Artificial Intelligence (TeleAI), China
  • Da Zhang, School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, China; Institute of Artificial Intelligence (TeleAI), China
  • Zhiyuan Zhao, Institute of Artificial Intelligence (TeleAI), China
  • Hao Sun, Institute of Artificial Intelligence (TeleAI), China
  • Junyu Gao, School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, China; Institute of Artificial Intelligence (TeleAI), China

DOI:

https://doi.org/10.1609/aaai.v40i8.37521

Abstract

Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2× faster inference through efficient aggregation.
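As an illustration of the idea behind component (1), the sketch below computes a vision-language cosine-similarity cost map for several rotated copies of an input, rotates each map back to the original frame, and averages them. It is not the RSKT-Seg implementation: the abstract does not specify the encoder, the set of directions, or the aggregation rule, so `toy_encode`, the four quarter-turn rotations, the plain averaging, and all function names here are assumptions chosen only to keep the example small and self-contained.

```python
import math


def rot90(grid, k=1):
    """Rotate a 2-D grid (list of rows) counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def toy_encode(image):
    """Hypothetical stand-in for a visual encoder (assumption, not from the paper).

    Each pixel vector is mixed with its right-hand neighbour, so the output
    depends on the orientation of the input, as a real encoder's would.
    """
    h, w = len(image), len(image[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            right = image[i][j + 1] if j + 1 < w else [0.0] * len(image[i][j])
            row.append([a + 0.5 * b for a, b in zip(image[i][j], right)])
        out.append(row)
    return out


def cost_map(features, text_embeddings):
    """H x W x C map of cosine similarities between pixel and class embeddings."""
    return [[[cosine(f, t) for t in text_embeddings] for f in row] for row in features]


def multi_directional_cost_map(image, text_embeddings, turns=(0, 1, 2, 3)):
    """Average the cost maps obtained from rotated copies of the input."""
    h, w = len(image), len(image[0])
    n_cls = len(text_embeddings)
    total = [[[0.0] * n_cls for _ in range(w)] for _ in range(h)]
    for k in turns:
        rotated_cost = cost_map(toy_encode(rot90(image, k)), text_embeddings)
        aligned = rot90(rotated_cost, -k)  # rotate back to the original frame
        for i in range(h):
            for j in range(w):
                for c in range(n_cls):
                    total[i][j][c] += aligned[i][j][c] / len(turns)
    return total
```

With all four quarter turns included, rotating the input by 90 degrees simply rotates the aggregated map by the same amount, which a single-direction cost map from an orientation-sensitive encoder does not guarantee. This is one simple way to read "rotation-invariant visual cues"; the module described in the abstract may aggregate directions differently.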

Published

2026-03-14

How to Cite

Li, B., Dong, H., Zhang, D., Zhao, Z., Sun, H., & Gao, J. (2026). Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 5982-5991. https://doi.org/10.1609/aaai.v40i8.37521

Issue

Section

AAAI Technical Track on Computer Vision V