DICE: Distilling Classifier-Free Guidance into Text Embeddings

Authors

  • Zhenyu Zhou: Zhejiang University, State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Defang Chen: University at Buffalo, State University of New York
  • Can Wang: Zhejiang University, State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Chun Chen: Zhejiang University, State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Siwei Lyu: University at Buffalo, State University of New York

DOI:

https://doi.org/10.1609/aaai.v40i16.38397

Abstract

Text-to-image diffusion models are capable of generating high-quality images, but suboptimal pre-trained text representations often result in these images failing to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, CFG introduces significant computational overhead. In this paper, we present DIstilling CFG by sharpening text Embeddings (DICE) that replaces CFG in the sampling process with half the computational complexity while maintaining similar generation quality. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Furthermore, examining the enhancement pattern, we identify the underlying mechanism of DICE that sharpens specific components of text embeddings to preserve semantic information while enhancing fine-grained details. Extensive experiments on multiple Stable Diffusion v1.5 variants, SDXL, and PixArt-α demonstrate the effectiveness of our method.
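
To make the cost argument concrete, the following is a minimal sketch and not the authors' implementation. It uses the standard CFG combination, eps_uncond + scale * (eps_cond - eps_uncond), which needs two network evaluations per sampling step, and contrasts it with a single evaluation on a modified text embedding. The names CountingDenoiser, cfg_prediction, sharpened_prediction, and toy_sharpener are hypothetical, the denoiser is a toy linear function, and the sharpener is hand-written for that toy; in DICE the refinement of text embeddings is learned by distillation.

```python
from typing import Callable, List

Vector = List[float]


class CountingDenoiser:
    """Toy stand-in for a diffusion network (assumption); counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, text_emb: Vector) -> Vector:
        self.calls += 1
        # Hypothetical linear "noise prediction", chosen only for illustration.
        return [0.5 * xi + ci for xi, ci in zip(x, text_emb)]


def cfg_prediction(model: CountingDenoiser, x: Vector, cond_emb: Vector,
                   null_emb: Vector, scale: float) -> Vector:
    """Standard CFG: two forward passes, then extrapolate along (cond - uncond)."""
    eps_cond = model(x, cond_emb)
    eps_uncond = model(x, null_emb)
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def sharpened_prediction(model: CountingDenoiser, x: Vector, cond_emb: Vector,
                         sharpen: Callable[[Vector], Vector]) -> Vector:
    """CFG-free sampling step: one forward pass on a refined text embedding."""
    return model(x, sharpen(cond_emb))


# Toy inputs (assumptions): a 2-d "latent", a prompt embedding, a null embedding.
x = [1.0, -2.0]
cond_emb = [0.2, 0.4]
null_emb = [0.0, 0.0]
scale = 7.5


def toy_sharpener(emb: Vector) -> Vector:
    # Hand-written for the linear toy model above; a real sharpener would be trained.
    return [n + scale * (e - n) for e, n in zip(emb, null_emb)]


cfg_model = CountingDenoiser()
guided = cfg_prediction(cfg_model, x, cond_emb, null_emb, scale)

single_model = CountingDenoiser()
single_pass = sharpened_prediction(single_model, x, cond_emb, toy_sharpener)

print("CFG forward passes per step:", cfg_model.calls)
print("Sharpened-embedding forward passes per step:", single_model.calls)
```

In this linear toy the single-pass prediction matches the guided one up to floating-point rounding, because scaling the embedding offset reproduces the CFG direction directly. For a real, nonlinear denoiser no such closed form is available, which is why the method trains the embedding refinement to replicate CFG-based directions.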

Published

2026-03-14

How to Cite

Zhou, Z., Chen, D., Wang, C., Chen, C., & Lyu, S. (2026). DICE: Distilling Classifier-Free Guidance into Text Embeddings. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13880–13888. https://doi.org/10.1609/aaai.v40i16.38397

Issue

Section

AAAI Technical Track on Computer Vision XIII