Any2RSI: Controllable Remote Sensing Text-to-Image Generation via Any Control and Enriched Description

Authors

  • Xu Zhang, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
  • Jianzhong Huang, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
  • Lefei Zhang, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University

DOI:

https://doi.org/10.1609/aaai.v40i15.38283

Abstract

Recent advances in controllable text-to-image (T2I) generation have achieved impressive results in natural images, but remote sensing (RS) T2I remains challenging due to the unique nature of geospatial data. Existing methods struggle to integrate diverse spatial controls and model complex spatial relationships, often failing to maintain semantic consistency with typically vague or incomplete textual descriptions. Moreover, limited by small-scale, low-quality datasets, these models produce outputs with inconsistent layouts and unrealistic content. To address these issues, we propose Any2RSI, a flexible framework for controllable RS T2I generation. It features a Cross-Modal Multi-Control Adapter that extracts modality-agnostic embeddings from heterogeneous spatial inputs, enabling precise spatial guidance. To compensate for sparse or ambiguous text prompts, we introduce a VLM-Empowered Enriched Description Generation module that enhances input descriptions with cross-modal semantics for more coherent image generation. Furthermore, we present RST2I-110K, a new large-scale dataset with over 115,000 high-quality RS image-text pairs across diverse scenes, alleviating data scarcity in this domain. Extensive experiments show that Any2RSI achieves state-of-the-art performance on both existing and new datasets, improving the realism and structural accuracy of generated RS imagery.

Published

2026-03-14

How to Cite

Zhang, X., Huang, J., & Zhang, L. (2026). Any2RSI: Controllable Remote Sensing Text-to-Image Generation via Any Control and Enriched Description. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12852–12860. https://doi.org/10.1609/aaai.v40i15.38283

Section

AAAI Technical Track on Computer Vision XII