MaRS: A Multi-modality Very-high-resolution Remote Sensing Foundation Model with Cross-Granularity Meta-Modality Learning

Authors

  • Ruoyu Yang, Wuhan University
  • Yinhe Liu, Wuhan University
  • Heng Yan, Wuhan University
  • Yiheng Zhou, Wuhan University
  • Yihan Fu, Wuhan University
  • Han Luo, Wuhan University
  • Yanfei Zhong, Wuhan University

DOI:

https://doi.org/10.1609/aaai.v40i14.38153

Abstract

The multi-modality remote sensing foundation model (MM-RSFM) has made notable progress recently. However, most existing approaches remain limited to medium-resolution, single-modality imagery, which restricts their performance in fine-grained downstream applications such as disaster response and urban planning. In this work, MaRS, a multi-modality very-high-resolution (VHR) remote sensing foundation model designed for cross-granularity interpretation of complex multi-modality scenes, is proposed. To this end, a multi-modality VHR SAR-optical dataset, MaRS-16M, comprising over 16 million paired samples, is constructed through large-scale collection and semi-automated processing. Unlike previous work, MaRS tackles two fundamental challenges in VHR SAR-optical self-supervised learning (SSL): cross-granularity contrastive learning (CGCL) is introduced to alleviate alignment inconsistencies caused by imaging differences between the modalities, and meta-modality attention (MMA) is designed to unify their heterogeneous physical characteristics. As a pre-trained backbone, MaRS outperforms existing remote sensing foundation models (RSFMs) and general vision foundation models (VFMs) across nine multi-modality VHR downstream tasks.
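To make the contrastive-alignment idea concrete, the sketch below shows a generic symmetric InfoNCE loss over paired SAR and optical embeddings. This is a minimal illustration of standard cross-modal contrastive learning, not the paper's CGCL formulation: the cross-granularity weighting that MaRS adds to handle SAR-optical imaging differences is not reproduced here, and the function name and shapes are assumptions for the example.

```python
import numpy as np

def info_nce(sar_emb, opt_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired SAR/optical embeddings.

    sar_emb, opt_emb: (B, D) arrays where row i of each is a matched pair.
    A generic cross-modal contrastive sketch; MaRS's CGCL adds
    cross-granularity handling that is not modeled here.
    """
    # L2-normalize so dot products are cosine similarities
    sar = sar_emb / np.linalg.norm(sar_emb, axis=1, keepdims=True)
    opt = opt_emb / np.linalg.norm(opt_emb, axis=1, keepdims=True)
    logits = sar @ opt.T / temperature  # (B, B); positives lie on the diagonal
    idx = np.arange(len(sar))

    def ce(lg):
        # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the SAR→optical and optical→SAR directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Matched pairs drive the loss toward zero, while mismatched pairs keep it high, which is what pulls the two modalities into a shared embedding space during pre-training.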

Published

2026-03-14

How to Cite

Yang, R., Liu, Y., Yan, H., Zhou, Y., Fu, Y., Luo, H., & Zhong, Y. (2026). MaRS: A Multi-modality Very-high-resolution Remote Sensing Foundation Model with Cross-Granularity Meta-Modality Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11685-11693. https://doi.org/10.1609/aaai.v40i14.38153

Section

AAAI Technical Track on Computer Vision XI