Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
DOI:
https://doi.org/10.1609/aaai.v39i19.34264
Abstract
Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current works focus on reducing the cost of lengthy per-subject optimization. These zero-shot customization methods encode an image of the specified subject into a visual embedding, which is then used alongside the textual embedding for diffusion guidance. The visual embedding carries intrinsic information about the subject, while the textual embedding provides a new context. However, existing methods often 1) generate images with the same pose as the input image, and 2) exhibit degraded subject identity when given a pose-varying prompt. We first pinpoint the problem and show that redundant pose information in the visual embedding interferes with the pose indicated in the textual embedding. Conversely, the textual embedding also harms the subject's identity, which is tightly entangled with pose in the visual embedding. As a remedy, we propose a text-orthogonal visual embedding that effectively harmonizes with the given textual embedding. We also adopt a visual-only embedding and inject the subject's clear features via a self-attention swap. Our method is both effective and robust, offering highly flexible zero-shot generation while faithfully preserving the subject's identity.
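The core idea of a text-orthogonal visual embedding can be illustrated in a simplified form: the component of the visual embedding that lies along the textual embedding (carrying redundant, e.g. pose-related, information) is removed by projecting onto the orthogonal complement. This is a minimal sketch with hypothetical names, not the paper's actual implementation, which operates on diffusion-model conditioning embeddings:

```python
import numpy as np

def text_orthogonal_projection(v: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Remove from the visual embedding v the component parallel to
    the textual embedding t, so the result is orthogonal to t.
    (Illustrative sketch only; names are hypothetical.)"""
    t_unit = t / np.linalg.norm(t)          # normalize the text direction
    return v - np.dot(v, t_unit) * t_unit   # subtract the parallel component

# Toy example in 2D: the component along t = [1, 0] is removed.
v = np.array([3.0, 4.0])
t = np.array([1.0, 0.0])
v_orth = text_orthogonal_projection(v, t)   # → [0.0, 4.0], orthogonal to t
```

In the zero-shot customization setting described above, such a projection would let the textual embedding dictate pose while the orthogonalized visual embedding retains subject-specific information that does not conflict with the prompt.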
Published
2025-04-11
How to Cite
Song, Y., Kim, J., Park, W., Shin, W., Rhee, W., & Kwak, N. (2025). Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(19), 20549-20557. https://doi.org/10.1609/aaai.v39i19.34264
Section
AAAI Technical Track on Machine Learning V