Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

Authors

  • Xiang Gao, Peking University
  • Zhengbo Xu, Peking University
  • Junhan Zhao, Peking University
  • Jiaying Liu, Peking University

DOI:

https://doi.org/10.1609/aaai.v38i3.27951

Keywords:

CV: Computational Photography, Image & Video Synthesis, CV: Multi-modal Vision

Abstract

Recently, text-to-image diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing flexible image translation via user-provided text prompts. This paper proposes the frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on the Discrete Cosine Transform (DCT), which extracts image features carrying different DCT spectral bands to control the text-to-image generation process of the Latent Diffusion Model, realizing versatile I2I applications including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Unlike related methods, FCDiffusion establishes a unified text-driven I2I framework that suits diverse I2I application scenarios simply by switching among different frequency control branches. The effectiveness and superiority of our method for text-guided I2I are demonstrated both qualitatively and quantitatively through extensive experiments. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/.
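To make the core idea concrete, below is a minimal, self-contained PyTorch sketch of feature-space DCT band filtering in the spirit the abstract describes. This is not the authors' implementation (which lives at the project page above): the function names, the orthonormal separable 2D DCT, the diagonal band mask, and the `cutoff` parameterization are all illustrative assumptions.

```python
import torch


def dct_matrix(n: int) -> torch.Tensor:
    # Orthonormal DCT-II basis matrix of size (n, n); its transpose is its inverse.
    k = torch.arange(n).unsqueeze(1).float()  # frequency index
    i = torch.arange(n).unsqueeze(0).float()  # spatial index
    basis = torch.cos(torch.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / torch.sqrt(torch.tensor(2.0))
    return basis * torch.sqrt(torch.tensor(2.0 / n))


def dct_band_filter(feat: torch.Tensor, band: str = "low", cutoff: float = 0.25) -> torch.Tensor:
    """Keep one DCT spectral band of a latent feature map (illustrative sketch).

    feat: (B, C, H, W) latent features. band='low' keeps frequencies below the
    cutoff fraction of the spectrum; band='high' keeps the complement.
    """
    _, _, h, w = feat.shape
    dh, dw = dct_matrix(h).to(feat), dct_matrix(w).to(feat)
    # Separable 2D DCT over the two spatial dims.
    spec = dh @ feat @ dw.T
    # Binary band mask over (u, v) frequency indices (diagonal split is an assumption).
    u = torch.arange(h, device=feat.device).view(h, 1)
    v = torch.arange(w, device=feat.device).view(1, w)
    low = (u / h + v / w) < 2 * cutoff
    spec = spec * (low if band == "low" else ~low)
    # Inverse 2D DCT (orthonormal basis, so transposes invert the transform).
    return dh.T @ spec @ dw
```

In a ControlNet-style setup, such band-filtered features would serve as the control signal injected into the denoising network, and switching `band`/`cutoff` corresponds to switching among the paper's frequency control branches (e.g., low-frequency control preserving layout for style translation, high-frequency control preserving detail for semantic manipulation).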

Published

2024-03-24

How to Cite

Gao, X., Xu, Z., Zhao, J., & Liu, J. (2024). Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 1824-1832. https://doi.org/10.1609/aaai.v38i3.27951

Issue

Vol. 38 No. 3 (2024)

Section

AAAI Technical Track on Computer Vision II