Decoupled Textual Embeddings for Customized Image Generation

Yufei Cai; Yuxiang Wei; Zhilong Ji; Jinfeng Bai; Hu Han; Wangmeng Zuo

doi:10.1609/aaai.v38i2.27850

Authors

Yufei Cai: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Harbin Institute of Technology
Yuxiang Wei: Harbin Institute of Technology
Zhilong Ji: Tomorrow Advancing Life
Jinfeng Bai: Tomorrow Advancing Life
Hu Han: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Wangmeng Zuo: Harbin Institute of Technology; Pengcheng Lab

DOI:

https://doi.org/10.1609/aaai.v38i2.27850

Keywords:

CV: Computational Photography, Image & Video Synthesis

Abstract

Customized text-to-image generation, which aims to learn user-specified concepts from a few images, has drawn significant attention recently. However, existing methods usually suffer from overfitting and entangle subject-unrelated information (e.g., background and pose) with the learned concept, limiting their ability to compose the concept into new scenes. To address these issues, we propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation. Unlike conventional methods that learn a single concept embedding from the given images, DETEX represents each image using multiple word embeddings during training, i.e., a learnable image-shared subject embedding and several image-specific subject-unrelated embeddings. To decouple irrelevant attributes (i.e., background and pose) from the subject embedding, we further present several attribute mappers that encode each image as several image-specific subject-unrelated embeddings. To encourage these unrelated embeddings to capture the irrelevant information, we combine them with the corresponding attribute words and propose a joint training strategy to facilitate the disentanglement. During inference, we use only the subject embedding for image generation, while selectively using image-specific embeddings to retain image-specific attributes. Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to state-of-the-art methods. Our code will be available at https://github.com/PrototypeNx/DETEX.
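
The split between training and inference prompts described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the placeholder tokens (`<subject>`, `<background_0>`, `<pose_0>`), the prompt templates, and the helper names are all hypothetical, and the sketch only shows how a shared subject token might be paired with per-image attribute tokens during training and then used alone, or with selected attribute tokens, at inference.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical names: the actual tokens and templates used by DETEX are
# not given in the abstract; these are assumptions for illustration only.
SUBJECT_TOKEN = "<subject>"          # one embedding shared by all images
ATTRIBUTES = ("background", "pose")  # the irrelevant attributes named in the abstract


@dataclass(frozen=True)
class TrainingImage:
    image_id: int


def unrelated_token(attribute: str, image_id: int) -> str:
    """Placeholder for an image-specific, subject-unrelated embedding."""
    return f"<{attribute}_{image_id}>"


def training_prompt(image: TrainingImage) -> str:
    """Shared subject token plus one image-specific token per attribute,
    each paired with its attribute word."""
    parts = [f"a photo of {SUBJECT_TOKEN}"]
    for attr in ATTRIBUTES:
        parts.append(f"{unrelated_token(attr, image.image_id)} {attr}")
    return ", ".join(parts)


def inference_prompt(scene: str, keep: Optional[Dict[str, int]] = None) -> str:
    """Subject token only by default; `keep` maps an attribute name to the
    training image whose attribute token should be retained."""
    parts = [f"a photo of {SUBJECT_TOKEN} {scene}"]
    for attr, image_id in (keep or {}).items():
        if attr not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {attr}")
        parts.append(f"{unrelated_token(attr, image_id)} {attr}")
    return ", ".join(parts)


if __name__ == "__main__":
    print(training_prompt(TrainingImage(0)))
    print(inference_prompt("on a beach"))
    print(inference_prompt("on a beach", keep={"pose": 0}))
```

In this sketch, dropping the per-image tokens at inference is what lets the subject appear in a new scene, while passing `keep={"pose": 0}` reuses the pose token of training image 0.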
