Multi-Region Text-Driven Manipulation of Diffusion Imagery

Yiming Li; Peng Zhou; Jun Sun; Yi Xu

doi:10.1609/aaai.v38i4.28111

Authors

Yiming Li: Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Peng Zhou: China Mobile (Suzhou) Software Technology Co., Ltd, China
Jun Sun: Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University
Yi Xu: Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

Keywords:

CV: Computational Photography, Image & Video Synthesis, CV: Language and Vision, CV: Multi-modal Vision, CV: Learning & Optimization for CV, General

Abstract

Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on attribute editing of individual objects but struggle with multi-object editing, mainly because they lack consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework that enables manipulation through region-level textual prompts. With MultiDiffusion as the baseline, we focus on automatically generating a plausible multi-object spatial distribution in which disparate regions are fused into a unified whole. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when the method is applied to Stable Diffusion, lengthy quality-related yet object-agnostic words in the prompt hamper the manipulation. To keep guidance and generation focused on meaningful object-specific words, we introduce a keyword selection method. Furthermore, we demonstrate a downstream application of our method, multi-region inversion, which is tailored to manipulating multiple objects in real images. Our approach is compatible with variants of Stable Diffusion and is readily applicable to manipulating diverse objects across a wide range of images with high-quality generation, showing strong image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion.
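
The region fusion mentioned in the abstract can be illustrated with a toy example. The sketch below is not taken from the authors' code; it only assumes that per-region predictions are combined by a per-pixel, mask-weighted average, in the spirit of MultiDiffusion. The function name `fuse_regions`, the nested-list image layout, and the toy values are all hypothetical and chosen for illustration.

```python
def fuse_regions(predictions, masks, eps=1e-8):
    """Mask-weighted per-pixel average of per-region predictions.

    predictions: list of 2D grids (lists of rows of floats), one per region prompt.
    masks: list of 2D grids of the same shape, holding weights in [0, 1].
    Illustrative assumption only, not the paper's implementation.
    """
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total_weight = sum(m[y][x] for m in masks)
            weighted_sum = sum(
                p[y][x] * m[y][x] for p, m in zip(predictions, masks)
            )
            # Pixels covered by no region stay at 0.0 instead of dividing by zero.
            fused[y][x] = weighted_sum / total_weight if total_weight > eps else 0.0
    return fused


# Toy 1x3 "image": the left region predicts 1.0, the right region predicts 3.0,
# and the two masks overlap on the middle pixel.
left_pred = [[1.0, 1.0, 1.0]]
right_pred = [[3.0, 3.0, 3.0]]
left_mask = [[1.0, 1.0, 0.0]]
right_mask = [[0.0, 1.0, 1.0]]

fused = fuse_regions([left_pred, right_pred], [left_mask, right_mask])
print(fused)  # [[1.0, 2.0, 3.0]]
```

In the overlapping pixel the two regional predictions are blended equally, which is the kind of cross-region interference that the abstract's region-aware guidance is meant to mitigate.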
