CO²IF: Language-Bridging Hyperspectral-Multispectral Image Fusion with Coordinated and Cross-modal Optimal Transport

Authors

  • Mingjin Zhang, Xidian University
  • Zhongkai Yang, Xidian University
  • Fei Gao, Hangzhou Institute of Technology, Xidian University

DOI:

https://doi.org/10.1609/aaai.v40i15.38261

Abstract

Because high-resolution hyperspectral images (HR-HSI) are difficult to acquire directly, fusing low-resolution hyperspectral images (LR-HSI) with high-resolution multispectral images (HR-MSI) has emerged as an effective alternative. Existing methods leverage image-level priors from HR-MSI, but they often lack explicit semantic guidance for precise detail reconstruction. Recognizing that textual scene descriptions encapsulate valuable object attributes and contextual information, we introduce CO²IF, the first Language-Bridging framework for Hyperspectral and Multispectral image fusion. CO²IF uses language semantics as prior knowledge to explicitly guide the reconstruction process. To bridge the modality gap between textual descriptions and high-dimensional hyperspectral data, we design a Cross-modal Optimal Transport (COT) module, which establishes precise semantic correspondences between language features and the visual cues of individual spectral bands. Building upon this semantic alignment, we develop a Multimodal Coordinated State Space Model (CoMamba), which integrates the language-derived priors with spatial information from HR-MSI and spectral information from LR-HSI. This language-guided reconstruction significantly enhances the extraction of crucial spatial-spectral details, leading to superior fidelity in the generated HR-HSI. In addition, we annotate three widely used datasets with text descriptions. Qualitative and quantitative experiments on these public datasets confirm that the proposed method outperforms state-of-the-art (SOTA) methods.
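
As a rough illustration of the kind of alignment an optimal transport module can produce, the sketch below computes an entropy-regularized transport plan between a few text-token features and per-band visual features using Sinkhorn iterations. It is not the authors' COT implementation: the abstract does not specify the cost function, regularization, or marginals, so the cosine cost, the uniform marginals, the value of `eps`, the toy feature vectors, and the function names `cosine_cost` and `sinkhorn` are all assumptions made for illustration.

```python
import math


def cosine_cost(text_feats, band_feats):
    """Cost C[i][j] = 1 - cosine similarity of text token i and band j (assumed cost)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0

    cost = []
    for t in text_feats:
        row = []
        for b in band_feats:
            dot = sum(x * y for x, y in zip(t, b))
            row.append(1.0 - dot / (norm(t) * norm(b)))
        cost.append(row)
    return cost


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT with uniform marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each text token (assumed uniform)
    b = [1.0 / m] * m  # mass on each spectral band (assumed uniform)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy inputs (hypothetical): 2 text-token features and 3 band features in 2-D.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
band_feats = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

plan = sinkhorn(cosine_cost(text_feats, band_feats))
for row in plan:
    print([round(p, 4) for p in row])
```

Each entry `plan[i][j]` can be read as the share of mass matched between text token `i` and band `j`; under the assumed cost, larger entries indicate a stronger token-to-band correspondence.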

Published

2026-03-14

How to Cite

Zhang, M., Yang, Z., & Gao, F. (2026). CO²IF: Language-Bridging Hyperspectral-Multispectral Image Fusion with Coordinated and Cross-modal Optimal Transport. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12654–12662. https://doi.org/10.1609/aaai.v40i15.38261

Section

AAAI Technical Track on Computer Vision XII