ViType: High-Fidelity Visual Text Rendering via Glyph-Aware Multimodal Diffusion

Authors

  • Lishuai Gao, Tianjin University of Technology; Meituan
  • Jun-Yan He, Meituan
  • Yingsen Zeng, Meituan
  • Yujie Zhong, Meituan
  • Xiaopeng Sun, Meituan
  • Jie Hu, Meituan
  • Zan Gao, Tianjin University of Technology
  • Xiaoming Wei, Meituan

DOI:

https://doi.org/10.1609/aaai.v40i6.42408

Abstract

Current text-to-image models face challenges in visual text rendering: text encoders such as CLIP and T5 lack glyph-level understanding and often struggle to distinguish the specific words to be rendered from their intended semantic meaning within prompts. In addition, inconsistencies between the base model and its plugins further compromise the quality of synthesized images. In this paper, we enhance the existing text-to-image method in three respects: (1) Text-Glyph Alignment: performed in a Visual Question Answering (VQA) manner to enable glyph understanding in the text encoder. This establishes an explicit alignment between the representations of the glyphs and their detailed attribute descriptions, which boosts the model's ability to capture fine-grained visual features of the text. (2) Accurate and harmonious visual text rendering: integrating pre-aligned glyph-visual embeddings with semantic text tokens synchronously through the Multimodal Diffusion Transformer (MMDiT), ensuring coherent feature alignment and enhancing both the robustness and fidelity of visual text rendering. (3) Image Aesthetic Refinement: leveraging a multisource data training strategy that incorporates diverse, high-quality image-text pairs from various domains, exposing the model to extensive linguistic and visual diversity while maintaining superior aesthetic quality throughout training. Our experiments demonstrate that the proposed approach significantly outperforms the existing state-of-the-art method.
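
As a rough illustration of the joint processing described in aspect (2), the sketch below concatenates glyph, text, and image tokens into one sequence and applies a single self-attention pass so that every stream can attend to the others. It is not the paper's implementation: the function name `joint_attention`, the use of identity query/key/value projections, and the single-head, pure-Python formulation are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(glyph_tokens, text_tokens, image_tokens):
    """Single-head self-attention over one concatenated token sequence.

    Assumption: identity Q/K/V projections stand in for the learned,
    per-modality projections a real MMDiT block would use.
    """
    # Concatenate along the sequence axis so all streams share one attention pass.
    tokens = glyph_tokens + text_tokens + image_tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    outputs = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of value vectors (here, the tokens themselves).
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )

    # Split the joint output back into the three streams.
    n_g, n_t = len(glyph_tokens), len(text_tokens)
    return outputs[:n_g], outputs[n_g:n_g + n_t], outputs[n_g + n_t:]
```

Calling `joint_attention` with lists of equal-width vectors returns three lists with the same lengths as the inputs, each token now mixed with information from the other two streams.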

Published

2026-03-14

How to Cite

Gao, L., He, J.-Y., Zeng, Y., Zhong, Y., Sun, X., Hu, J., … Wei, X. (2026). ViType: High-Fidelity Visual Text Rendering via Glyph-Aware Multimodal Diffusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4131–4139. https://doi.org/10.1609/aaai.v40i6.42408

Section

AAAI Technical Track on Computer Vision III