Multi-Region Text-Driven Manipulation of Diffusion Imagery

Authors

  • Yiming Li, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
  • Peng Zhou, China Mobile (Suzhou) Software Technology Co., Ltd., China
  • Jun Sun, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University
  • Yi Xu, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v38i4.28111

Keywords:

CV: Computational Photography, Image & Video Synthesis, CV: Language and Vision, CV: Multi-modal Vision, CV: Learning & Optimization for CV, General

Abstract

Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on image attribute editing for individual objects but encounter challenges with multi-object editing, mainly because of the lack of consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework that enables manipulation through region-level textual prompts. Building on MultiDiffusion as a baseline, we automatically generate a rational multi-object spatial layout in which disparate regions are fused into a unified whole. To mitigate interference arising from this regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when the method is applied to Stable Diffusion, quality-related yet object-agnostic words in lengthy prompts hamper the manipulation; we therefore introduce a keyword selection method that keeps the guidance focused on meaningful object-specific words for efficient generation. Furthermore, we demonstrate a downstream application of our method, multi-region inversion, tailored to manipulating multiple objects in real images. Our approach is compatible with variants of Stable Diffusion and readily applicable to manipulating diverse objects in large images with high-quality generation, showing strong image-control capability. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion.
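The MultiDiffusion baseline the abstract refers to fuses regions in a simple way: at every reverse-diffusion step, each region is denoised under its own textual prompt, and overlapping per-region predictions are averaged per pixel under binary masks. The Python sketch below illustrates only that baseline fusion step; the denoise_step callable, the mask and latent shapes, and the toy stand-in denoiser are illustrative assumptions rather than the authors' released implementation (see the linked repository for that), and the paper's CLIP-based region-aware guidance and keyword selection would operate on top of such a loop.

    # Minimal sketch of MultiDiffusion-style multi-region fusion.
    # Assumptions (not from the paper's code): denoise_step(latent, t, emb)
    # returns a full-frame prediction conditioned on one region's prompt
    # embedding; masks are (1, 1, H, W) tensors in {0, 1}.
    import torch

    def fused_step(latent, t, regions, denoise_step):
        """One reverse step: denoise each region under its own prompt,
        then average the overlapping predictions per pixel."""
        weighted = torch.zeros_like(latent)   # mask-weighted sum of predictions
        coverage = torch.zeros_like(latent)   # how many regions cover each pixel
        for mask, prompt_emb in regions:
            pred = denoise_step(latent, t, prompt_emb)
            weighted += mask * pred
            coverage += mask
        # Covered pixels get the averaged prediction; uncovered pixels keep
        # the current latent (a background region would normally cover them).
        return torch.where(coverage > 0, weighted / coverage.clamp(min=1), latent)

    # Toy usage with a stand-in denoiser (replace with a Stable Diffusion UNet):
    denoise = lambda x, t, emb: x - 0.1 * torch.randn_like(x)
    latent = torch.randn(1, 4, 64, 64)
    left = torch.zeros(1, 1, 64, 64); left[..., :32] = 1
    latent = fused_step(latent, 50, [(left, None), (1 - left, None)], denoise)

Mask-averaged fusion is what ties the regions into one coherent layout; the paper's contribution lies in the guidance that keeps this fusion from letting regions interfere with one another.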

Published

2024-03-24

How to Cite

Li, Y., Zhou, P., Sun, J., & Xu, Y. (2024). Multi-Region Text-Driven Manipulation of Diffusion Imagery. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3261-3269. https://doi.org/10.1609/aaai.v38i4.28111

Issue

Vol. 38 No. 4 (2024)

Section

AAAI Technical Track on Computer Vision III