Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
DOI:
https://doi.org/10.1609/aaai.v39i19.34264
Abstract
Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current works focus on reducing the cost of lengthy per-subject optimization. These zero-shot customization methods encode an image of the specified subject into a visual embedding, which is then used alongside the textual embedding for diffusion guidance. The visual embedding carries intrinsic information about the subject, while the textual embedding provides a new context. However, existing methods often 1) generate images with the same pose as the input image, and 2) exhibit degraded subject identity when given a pose-varying prompt. We first pinpoint the problem and show that redundant pose information in the visual embedding interferes with the pose indicated in the textual embedding. Conversely, the textual embedding also harms the subject's identity, which is tightly entangled with pose in the visual embedding. As a remedy, we propose a text-orthogonal visual embedding that effectively harmonizes with the given textual embedding. We also adopt a visual-only embedding and inject the subject's clear features via a self-attention swap. Our method is both effective and robust, offering highly flexible zero-shot generation while faithfully preserving the subject's identity.
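The core idea of a text-orthogonal visual embedding can be illustrated in a simplified form: the component of the visual embedding that lies along the textual embedding (carrying redundant, e.g. pose-related, information) is removed by projecting onto the orthogonal complement. This is a minimal sketch with hypothetical names, not the paper's actual implementation, which operates on diffusion-model conditioning embeddings:

```python
import numpy as np

def text_orthogonal_projection(v: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Remove from the visual embedding v the component parallel to
    the textual embedding t, so the result is orthogonal to t.
    (Illustrative sketch only; names are hypothetical.)"""
    t_unit = t / np.linalg.norm(t)          # normalize the text direction
    return v - np.dot(v, t_unit) * t_unit   # subtract the parallel component

# Toy example in 2D: the component along t = [1, 0] is removed.
v = np.array([3.0, 4.0])
t = np.array([1.0, 0.0])
v_orth = text_orthogonal_projection(v, t)   # → [0.0, 4.0], orthogonal to t
```

In the zero-shot customization setting described above, such a projection would let the textual embedding dictate pose while the orthogonalized visual embedding retains subject-specific information that does not conflict with the prompt.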
Published
2025-04-11
How to Cite
Song, Y., Kim, J., Park, W., Shin, W., Rhee, W., & Kwak, N. (2025). Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(19), 20549-20557. https://doi.org/10.1609/aaai.v39i19.34264
Section
AAAI Technical Track on Machine Learning V