ViType: High-Fidelity Visual Text Rendering via Glyph-Aware Multimodal Diffusion

Lishuai Gao; Jun-Yan He; Yingsen Zeng; Yujie Zhong; Xiaopeng Sun; Jie Hu; Zan Gao; Xiaoming Wei

doi:10.1609/aaai.v40i6.42408

Authors

Lishuai Gao, Tianjin University of Technology; Meituan
Jun-Yan He, Meituan
Yingsen Zeng, Meituan
Yujie Zhong, Meituan
Xiaopeng Sun, Meituan
Jie Hu, Meituan
Zan Gao, Tianjin University of Technology
Xiaoming Wei, Meituan

DOI:

https://doi.org/10.1609/aaai.v40i6.42408

Abstract

Current text-to-image models face challenges in visual text rendering: text encoders such as CLIP and T5 lack glyph-level understanding and often struggle to distinguish the specific words to be rendered from their intended semantic meaning within prompts. In addition, inconsistencies between the base model and its plugins further compromise the quality of synthesized images. In this paper, we enhance existing text-to-image methods in three respects: (1) Text-Glyph Alignment in a Visual Question Answering (VQA) manner, to enable glyph understanding in the text encoder. This involves establishing an explicit alignment between the representations of the glyphs and their detailed attribute descriptions, which boosts the model's ability to capture fine-grained visual features of the text. (2) Accurate and harmonious visual text rendering: integrating pre-aligned glyph-visual embeddings with semantic text tokens synchronously through the Multimodal Diffusion Transformer (MMDiT), ensuring coherent feature alignment and enhancing both the robustness and fidelity of visual text rendering. (3) Image Aesthetic Refinement: leveraging a multisource data training strategy that incorporates diverse, high-quality image-text pairs from various domains, exposing the model to extensive linguistic and visual diversity while maintaining superior aesthetic quality throughout training. Our experiments demonstrate that the proposed approach significantly outperforms the existing state-of-the-art method.
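
The abstract does not spell out how the glyph-visual embeddings and the semantic text tokens are combined inside the MMDiT. One plausible reading, sketched below purely as an illustration, is that both conditioning streams are concatenated with the image latent tokens and processed in a single joint self-attention pass, so every stream can attend to every other. Everything in this sketch is an assumption and not taken from the paper: the function name `joint_attention`, the token counts and dimension, and the simplification of using the tokens themselves as queries, keys, and values (a real MMDiT block would apply learned per-modality projections).

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(glyph_tokens, text_tokens, image_tokens):
    """Single-head self-attention over three concatenated token streams.

    Hypothetical simplification: queries, keys, and values are the raw
    tokens (no learned projections), so this only illustrates how the
    streams can exchange information in one attention pass.
    """
    tokens = glyph_tokens + text_tokens + image_tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    mixed = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Attention output: weighted average of all tokens.
        mixed.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )

    # Split the result back into the three original streams.
    n_g, n_t = len(glyph_tokens), len(text_tokens)
    return mixed[:n_g], mixed[n_g:n_g + n_t], mixed[n_g + n_t:]


def random_tokens(n, dim, rng):
    """Stand-in embeddings; real ones would come from trained encoders."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
glyph = random_tokens(3, 8, rng)    # assumed: 3 glyph-visual tokens
text = random_tokens(5, 8, rng)     # assumed: 5 semantic text tokens
image = random_tokens(16, 8, rng)   # assumed: 16 image latent tokens

g_out, t_out, i_out = joint_attention(glyph, text, image)
print(len(g_out), len(t_out), len(i_out))  # prints: 3 5 16
```

Each stream keeps its own length after the pass, but every output token is a weighted mixture of glyph, text, and image tokens, which is one way the "synchronous" integration described above could be realized.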
