Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Authors

  • Yuhang Qian MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
  • Haiyan Chen College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
  • Wentong Li MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China
  • Ningzhong Liu College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
  • Jie Qin MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China

DOI:

https://doi.org/10.1609/aaai.v40i10.37804

Abstract

Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended into their surroundings and exhibit high visual consistency with them. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging large Vision-Language Models (VLMs), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to fine-tune Stable Diffusion, incorporating a lightweight controller that guides the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.
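The abstract's "lightweight controller" that steers the location and shape of camouflaged objects follows the general pattern of conditioning a frozen diffusion backbone on a spatial hint (e.g., an object mask). A minimal sketch of that pattern, in the ControlNet style of residual injection through a zero-initialized convolution, is shown below; all module names, channel sizes, and the choice of a 1x1 zero convolution are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ZeroConvController(nn.Module):
    """Sketch of a lightweight spatial controller (ControlNet-style).

    A small hint encoder maps a spatial condition (e.g., an object mask)
    to features, which are injected into a frozen backbone feature map
    through a zero-initialized 1x1 convolution. Zero initialization makes
    the controller an exact identity at the start of training, so the
    pretrained diffusion backbone is initially undisturbed.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Hint encoder: lifts a 1-channel mask to backbone feature width.
        self.hint_encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Zero-initialized projection: contributes nothing at step 0.
        self.zero_conv = nn.Conv2d(channels, channels, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, backbone_feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Residual injection of the control signal into the frozen features.
        return backbone_feat + self.zero_conv(self.hint_encoder(mask))

feat = torch.randn(1, 64, 32, 32)   # stand-in for a frozen UNet feature map
mask = torch.rand(1, 1, 32, 32)     # location/shape hint for the object
ctrl = ZeroConvController(64)
out = ctrl(feat, mask)
print(torch.allclose(out, feat))    # True: zero-init => identity at start
```

During fine-tuning only the controller's parameters would receive gradients; as the zero convolution's weights move away from zero, the mask progressively influences where and how the camouflaged object is synthesized.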

Published

2026-03-14

How to Cite

Qian, Y., Chen, H., Li, W., Liu, N., & Qin, J. (2026). Text-guided Controllable Diffusion for Realistic Camouflage Images Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8529-8537. https://doi.org/10.1609/aaai.v40i10.37804

Issue

Section

AAAI Technical Track on Computer Vision VII