Any2RSI: Controllable Remote Sensing Text-to-Image Generation via Any Control and Enriched Description
DOI: https://doi.org/10.1609/aaai.v40i15.38283
Abstract
Recent advances in controllable text-to-image (T2I) generation have achieved impressive results on natural images, but remote sensing (RS) T2I remains challenging due to the unique nature of geospatial data. Existing methods struggle to integrate diverse spatial controls and to model complex spatial relationships, and they often fail to maintain semantic consistency with textual descriptions that are typically vague or incomplete. Moreover, limited by small-scale, low-quality datasets, these models produce outputs with inconsistent layouts and unrealistic content. To address these issues, we propose Any2RSI, a flexible framework for controllable RS T2I generation. It features a Cross-Modal Multi-Control Adapter that extracts modality-agnostic embeddings from heterogeneous spatial inputs, enabling precise spatial guidance. To compensate for sparse or ambiguous text prompts, we introduce a VLM-Empowered Enriched Description Generation module that enhances input descriptions with cross-modal semantics for more coherent image generation. Furthermore, we present RST2I-110K, a new large-scale dataset with over 115,000 high-quality RS image-text pairs across diverse scenes, alleviating data scarcity in this domain. Extensive experiments show that Any2RSI achieves state-of-the-art performance on both existing and new datasets, improving the realism and structural accuracy of generated RS imagery.
Published
2026-03-14
How to Cite
Zhang, X., Huang, J., & Zhang, L. (2026). Any2RSI: Controllable Remote Sensing Text-to-Image Generation via Any Control and Enriched Description. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12852–12860. https://doi.org/10.1609/aaai.v40i15.38283
Issue
Section
AAAI Technical Track on Computer Vision XII