SpotActor: Training-Free Layout-Controlled Consistent Image Generation

Authors

  • Jiahao Wang (School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University; State Key Laboratory of Communication Content Cognition)
  • Caixia Yan (School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University)
  • Weizhan Zhang (School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University)
  • Haonan Lin (School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University)
  • Mengmeng Wang (College of Computer Science and Technology, Zhejiang University of Technology; SGIT AI Lab, State Grid Corporation of China)
  • Guang Dai (SGIT AI Lab, State Grid Corporation of China)
  • Tieliang Gong (School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University)
  • Hao Sun (China Telecom Corporation Ltd. Data&AI Technology Company)
  • Jingdong Wang (Baidu Inc)

DOI:

https://doi.org/10.1609/aaai.v39i7.32831

Abstract

Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain a consistent appearance of each subject across images. To address these issues, we introduce a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned optimizing stage and a consistent sampling stage. In the optimizing stage, we devise a nuanced layout energy function that mimics the attention activations with a sigmoid-like objective. In the sampling stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate performance, we present ActorBench, a dedicated benchmark with hundreds of reasonable prompt-box pairs derived from object detection datasets. Comprehensive experiments demonstrate the effectiveness of our method. The results show that SpotActor fulfills the expectations of this task and has potential for practical applications, with superior layout alignment, subject consistency, prompt conformity and background diversity.

Published

2025-04-11

How to Cite

Wang, J., Yan, C., Zhang, W., Lin, H., Wang, M., Dai, G., Gong, T., Sun, H., & Wang, J. (2025). SpotActor: Training-Free Layout-Controlled Consistent Image Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7718-7726. https://doi.org/10.1609/aaai.v39i7.32831

Section

AAAI Technical Track on Computer Vision VI