SpotActor: Training-Free Layout-Controlled Consistent Image Generation

Jiahao Wang; Caixia Yan; Weizhan Zhang; Haonan Lin; Mengmeng Wang; Guang Dai; Tieliang Gong; Hao Sun; Jingdong Wang

doi:10.1609/aaai.v39i7.32831

Authors

Jiahao Wang: School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University; State Key Laboratory of Communication Content Cognition
Caixia Yan: School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University
Weizhan Zhang: School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University
Haonan Lin: School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University
Mengmeng Wang: College of Computer Science and Technology, Zhejiang University of Technology; SGIT AI Lab, State Grid Corporation of China
Guang Dai: SGIT AI Lab, State Grid Corporation of China
Tieliang Gong: School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University
Hao Sun: China Telecom Corporation Ltd. Data&AI Technology Company
Jingdong Wang: Baidu Inc

DOI:

https://doi.org/10.1609/aaai.v39i7.32831

Abstract

Text-to-image diffusion models significantly enhance the efficiency of artistic creation through high-fidelity image generation. However, in typical application scenarios such as comic book production, they can neither place each subject at its expected spot nor keep each subject's appearance consistent across images. To address these issues, we introduce a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and propose a training-free pipeline, SpotActor, which features a layout-conditioned optimizing stage and a consistent sampling stage. In the optimizing stage, we devise a nuanced layout energy function that mimics the attention activations with a sigmoid-like objective. In the sampling stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate performance, we present ActorBench, a dedicated benchmark with hundreds of reasonable prompt-box pairs derived from object detection datasets. Comprehensive experiments demonstrate the effectiveness of our method. The results show that SpotActor fulfills the expectations of this task and showcases the potential for practical applications, with superior layout alignment, subject consistency, prompt conformity, and background diversity.
