Multi-Region Text-Driven Manipulation of Diffusion Imagery
DOI:
https://doi.org/10.1609/aaai.v38i4.28111
Keywords:
CV: Computational Photography, Image & Video Synthesis, CV: Language and Vision, CV: Multi-modal Vision, CV: Learning & Optimization for CV, General
Abstract
Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on editing the attributes of individual objects and run into difficulties with multi-object editing, mainly because they lack consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework that enables manipulation through region-level textual prompts. With MultiDiffusion as a baseline, we focus on automatically generating a rational multi-object spatial distribution in which disparate regions are fused into a unified entity. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when the method is applied to Stable Diffusion, lengthy quality-related but object-agnostic words in the prompt hamper the manipulation. To keep the focus on meaningful object-specific words for efficient guidance and generation, we introduce a keyword selection method. Furthermore, we demonstrate a downstream application of our method, multi-region inversion, which is tailored for manipulating multiple objects in real images. Our approach is compatible with variants of Stable Diffusion models and is readily applicable to manipulating diverse objects across a wide range of images with high-quality generation, showing strong image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion.
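The regional fusion mentioned in the abstract builds on MultiDiffusion, which is commonly described as combining region-wise denoising predictions through a per-pixel weighted average. The following is a minimal sketch of that averaging step only, not the authors' implementation: the function name `fuse_region_predictions`, the nested-list image representation, and the toy values are assumptions made for illustration, and the CLIP guidance and keyword selection steps are not modeled.

```python
def fuse_region_predictions(predictions, masks, eps=1e-8):
    """Per-pixel mask-weighted average of region-wise predictions.

    predictions: list of H x W nested lists, one per region prompt
                 (hypothetical stand-ins for per-region denoising outputs).
    masks:       list of H x W nested lists of non-negative weights,
                 one per region, aligned with `predictions`.
    """
    if not predictions or len(predictions) != len(masks):
        raise ValueError("need one mask per prediction")

    height, width = len(predictions[0]), len(predictions[0][0])
    fused = [[0.0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            weight_sum = sum(m[y][x] for m in masks)
            if weight_sum < eps:
                continue  # pixel covered by no region keeps the default 0.0
            weighted = sum(p[y][x] * m[y][x] for p, m in zip(predictions, masks))
            fused[y][x] = weighted / weight_sum
    return fused


# Toy example: a background region covering the whole 2 x 2 image and an
# object region covering only the right column.
background_pred = [[0.0, 0.0], [0.0, 0.0]]
object_pred = [[2.0, 2.0], [2.0, 2.0]]
background_mask = [[1, 1], [1, 1]]
object_mask = [[0, 1], [0, 1]]

fused = fuse_region_predictions(
    [background_pred, object_pred],
    [background_mask, object_mask],
)
print(fused)  # [[0.0, 1.0], [0.0, 1.0]]
```

In the overlapping right column both regions contribute equally, so the fused value is the mean of the two predictions. This blending is the kind of cross-region interference that the abstract says the region-aware guidance is meant to mitigate.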
Published
2024-03-24
How to Cite
Li, Y., Zhou, P., Sun, J., & Xu, Y. (2024). Multi-Region Text-Driven Manipulation of Diffusion Imagery. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3261-3269. https://doi.org/10.1609/aaai.v38i4.28111
Issue
Vol. 38 No. 4 (2024)
Section
AAAI Technical Track on Computer Vision III