CLIPPan: Adapting CLIP as a Supervisor for Unsupervised Pansharpening

Authors

  • Lihua Jian, Zhengzhou University
  • Jiabo Liu, Zhengzhou University
  • Shaowu Wu, Wuhan University
  • Lihui Chen, Chongqing University

DOI:

https://doi.org/10.1609/aaai.v40i7.37451

Abstract

Despite remarkable advances in supervised pansharpening neural networks, these methods face a resolution domain-adaptation challenge stemming from the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution scenarios. To bridge this gap, we propose CLIPPan, an unsupervised pansharpening framework that enables model training directly at full resolution by taking CLIP, a vision-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening is challenging due to its inherent bias toward natural images and its limited understanding of the pansharpening task. We therefore first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss with semantic language constraints that aligns image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald's or Khan's descriptions), enabling CLIPPan to use language as a powerful supervisory signal and to guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves the spectral and spatial fidelity of various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.
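The core mechanism described in the abstract, using CLIP's joint image-text space as a supervisory signal, can be illustrated with a minimal sketch. This is not the authors' released implementation: the directional-alignment formulation, the prompt wording, the assumption that 3-band RGB crops are extracted and preprocessed to CLIP's input resolution, and the use of OpenAI's public `clip` package are all illustrative assumptions.

```python
# Minimal sketch of language-supervised pansharpening in the spirit of CLIPPan.
# Assumptions: `lrms_rgb` and `fused_rgb` are 3-band crops already resized and
# normalized for CLIP; the CLIP model is assumed to have been adapted/fine-tuned
# for remote-sensing imagery as described in the paper.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # full precision so gradients flow cleanly

# Hypothetical protocol-aligned prompts (e.g., paraphrasing Wald's protocol);
# the actual prompt set in the paper may differ.
prompts = clip.tokenize([
    "a low-resolution multispectral image",
    "a high-resolution multispectral image fused from "
    "multispectral and panchromatic inputs",
]).to(device)

def semantic_fusion_loss(lrms_rgb: torch.Tensor, fused_rgb: torch.Tensor) -> torch.Tensor:
    """Encourage the LRMS -> fused transition in CLIP image space to align
    with the corresponding textual transition (LR prompt -> HR prompt)."""
    with torch.no_grad():
        text_feat = F.normalize(clip_model.encode_text(prompts), dim=-1)
        img_lr = F.normalize(clip_model.encode_image(lrms_rgb), dim=-1)
    # Gradients flow only through the fused image produced by the backbone.
    img_hr = F.normalize(clip_model.encode_image(fused_rgb), dim=-1)
    img_dir = F.normalize(img_hr - img_lr, dim=-1)          # image-space transition
    txt_dir = F.normalize(text_feat[1] - text_feat[0], dim=-1)  # text-space transition
    return (1.0 - (img_dir * txt_dir).sum(dim=-1)).mean()   # 1 - cosine similarity
```

In training, such a term would be combined with the usual spectral- and spatial-consistency losses; the sketch is only meant to show how a textual description of the fusion protocol can stand in for a ground-truth high-resolution image.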

Published

2026-03-14

How to Cite

Jian, L., Liu, J., Wu, S., & Chen, L. (2026). CLIPPan: Adapting CLIP as a Supervisor for Unsupervised Pansharpening. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5350–5358. https://doi.org/10.1609/aaai.v40i7.37451

Section

AAAI Technical Track on Computer Vision IV