LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Authors

  • Mushui Liu (Zhejiang University; Fuxi AI Lab, NetEase)
  • Yuhang Ma (Fuxi AI Lab, NetEase)
  • Zhen Yang (The Hong Kong University of Science and Technology)
  • Jun Dan (Zhejiang University)
  • Yunlong Yu (Zhejiang University)
  • Zeng Zhao (Fuxi AI Lab, NetEase)
  • Zhipeng Hu (Fuxi AI Lab, NetEase)
  • Bai Liu (Fuxi AI Lab, NetEase)
  • Changjie Fan (Fuxi AI Lab, NetEase)

DOI:

https://doi.org/10.1609/aaai.v39i5.32588

Abstract

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.
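The abstract says that the Cross-Adapter Module (CAM) integrates the text encoder's original features with LLM features, but it does not specify the mechanism. The sketch below shows one plausible reading, assuming the fusion is a residual cross-attention in which text tokens attend to LLM tokens. The function name `cross_adapter_sketch`, the projection matrices `w_q`, `w_k`, `w_v`, the `scale` parameter, and all dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_adapter_sketch(text_feats, llm_feats, w_q, w_k, w_v, scale=1.0):
    """Hypothetical fusion: text tokens (queries) attend to LLM tokens (keys/values).

    text_feats: n_text x d, llm_feats: n_llm x d_llm,
    w_q: d x d, w_k and w_v: d_llm x d. Returns an n_text x d matrix.
    This is an assumed design, not the paper's implementation.
    """
    q = matmul(text_feats, w_q)
    k = matmul(llm_feats, w_k)
    v = matmul(llm_feats, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose keys to d x n_llm
    scores = matmul(q, k_t)                       # n_text x n_llm
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(attn, v)                     # n_text x d
    # Residual connection keeps the original text features intact when scale == 0.
    return [[t + scale * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(text_feats, context)]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_text, n_llm, d, d_llm = 4, 6, 8, 16             # toy sizes, chosen for illustration
text_feats = rand_matrix(n_text, d, rng)
llm_feats = rand_matrix(n_llm, d_llm, rng)
w_q = rand_matrix(d, d, rng)
w_k = rand_matrix(d_llm, d, rng)
w_v = rand_matrix(d_llm, d, rng)

fused = cross_adapter_sketch(text_feats, llm_feats, w_q, w_k, w_v)
unchanged = cross_adapter_sketch(text_feats, llm_feats, w_q, w_k, w_v, scale=0.0)
```

The residual form is chosen here because it matches the plug-and-play claim: the fused features keep the shape the diffusion model already expects, and setting `scale` to zero recovers the original text features.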

Published

2025-04-11

How to Cite

Liu, M., Ma, Y., Yang, Z., Dan, J., Yu, Y., Zhao, Z., Hu, Z., Liu, B., & Fan, C. (2025). LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5523-5531. https://doi.org/10.1609/aaai.v39i5.32588

Issue

Vol. 39 No. 5 (2025)

Section

AAAI Technical Track on Computer Vision IV