Progressive Text-to-Image Diffusion with Soft Latent Direction

Authors

  • YuTeng Ye, Huazhong University of Science and Technology
  • Jiale Cai, Huazhong University of Science and Technology
  • Hang Zhou, Huazhong University of Science and Technology
  • Guanwen Li, Huazhong University of Science and Technology
  • Youjia Zhang, Huazhong University of Science and Technology
  • Zikai Song, Huazhong University of Science and Technology
  • Chenxing Gao, Huazhong University of Science and Technology
  • Junqing Yu, Huazhong University of Science and Technology
  • Wei Yang, Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v38i7.28492

Keywords:

CV: Computational Photography, Image & Video Synthesis, CV: Language and Vision

Abstract

Despite rapid progress in text-to-image generation, synthesizing and manipulating multiple entities under specific relational constraints remains a persistent challenge. This paper introduces a progressive synthesis and editing operation that incorporates entities into the target image step by step, enforcing spatial and relational constraints at each stage. Our key insight stems from the observation that a pre-trained text-to-image diffusion model handles one or two entities adeptly but often falters with more. To address this limitation, we harness a Large Language Model (LLM) to decompose intricate, lengthy text descriptions into coherent directives that follow a strict format. To execute directives involving three distinct semantic operations, namely insertion, editing, and erasing, we formulate the Stimulus, Response, and Fusion (SRF) framework: latent regions are softly stimulated according to each operation, and the responsive latent components are then fused to achieve cohesive entity manipulation. The proposed framework yields notable gains in object synthesis, particularly for intricate and lengthy textual inputs, and thereby sets a new benchmark for text-to-image generation.
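The pipeline the abstract describes can be sketched in code. The following is a minimal, hypothetical illustration only: the Directive format, the decompose_prompt, stimulate, and fuse functions, the blending weights, and the toy 2D latent grid are all assumptions made for exposition, not the authors' implementation, which steers the latents of a pretrained diffusion model rather than a float grid.

```python
# Hypothetical sketch of the progressive LLM-decomposition + SRF loop
# described in the abstract. All names and numbers are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

Latent = List[List[float]]  # toy stand-in for a diffusion latent


@dataclass
class Directive:
    op: str                             # "insert", "edit", or "erase"
    entity: str                         # entity description for this step
    region: Tuple[int, int, int, int]   # (row0, col0, row1, col1) constraint


def decompose_prompt(prompt: str) -> List[Directive]:
    """Stand-in for the LLM step: a real system would prompt an LLM to emit
    strictly formatted per-entity directives; here they are hard-coded."""
    return [
        Directive("insert", "a red apple", (0, 0, 4, 4)),
        Directive("insert", "a glass bottle", (0, 4, 4, 8)),
        Directive("edit", "the apple, now sliced", (0, 0, 4, 4)),
    ]


def stimulate(latent: Latent, d: Directive) -> Latent:
    """Softly bias ('stimulate') the latent inside the directive's region.
    A real implementation would steer denoising toward the entity's text
    condition; the additive nudge below is purely illustrative."""
    nudged = [row[:] for row in latent]
    r0, c0, r1, c1 = d.region
    delta = {"insert": 0.5, "edit": 0.2, "erase": -0.5}[d.op]
    for r in range(r0, r1):
        for c in range(c0, c1):
            nudged[r][c] += delta
    return nudged


def fuse(base: Latent, response: Latent,
         region: Tuple[int, int, int, int]) -> Latent:
    """Blend the responsive components back into the base latent so the
    edited region stays coherent with its surroundings."""
    fused = [row[:] for row in base]
    r0, c0, r1, c1 = region
    for r in range(r0, r1):
        for c in range(c0, c1):
            fused[r][c] = 0.5 * fused[r][c] + 0.5 * response[r][c]
    return fused


def progressive_synthesis(prompt: str, latent: Latent) -> Latent:
    """One entity per step: stimulate, then fuse, as in the SRF loop."""
    for d in decompose_prompt(prompt):
        latent = fuse(latent, stimulate(latent, d), d.region)
    return latent


if __name__ == "__main__":
    latent = [[0.0] * 8 for _ in range(4)]
    out = progressive_synthesis("a sliced red apple next to a glass bottle",
                                latent)
    print(out[0])  # first latent row after three progressive steps
```

The key design point the sketch mirrors is that entities are introduced one directive at a time, so each stimulation and fusion acts on an already-consistent latent instead of juggling all entities at once.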

Published

2024-03-24

How to Cite

Ye, Y., Cai, J., Zhou, H., Li, G., Zhang, Y., Song, Z., Gao, C., Yu, J., & Yang, W. (2024). Progressive Text-to-Image Diffusion with Soft Latent Direction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6693-6701. https://doi.org/10.1609/aaai.v38i7.28492

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI