ViType: High-Fidelity Visual Text Rendering via Glyph-Aware Multimodal Diffusion
DOI:
https://doi.org/10.1609/aaai.v40i6.42408
Abstract
Current text-to-image models face challenges in visual text rendering: text encoders like CLIP and T5 lack glyph-level understanding and often struggle to distinguish between the specific words to be rendered and their intended semantic meaning within prompts. In addition, inconsistencies between the base model and its plugins further compromise the quality of synthesized images. In this paper, we enhance the existing text-to-image method by addressing the following aspects: (1) Text-Glyph Alignment in a Visual Question Answering (VQA) manner to enable glyph understanding for the text encoder. This involves establishing an explicit alignment between the representations of the glyphs and their detailed attribute descriptions, which boosts the model's ability to capture fine-grained visual features of the text. (2) Accurate and harmonious visual text rendering: integrating pre-aligned glyph-visual embeddings with semantic text tokens synchronously through the Multimodal Diffusion Transformer (MMDiT), ensuring coherent feature alignment and enhancing both the robustness and fidelity of visual text rendering. (3) Image Aesthetic Refinement: leveraging a multisource data training strategy that incorporates diverse, high-quality image-text pairs from various domains, exposing the model to extensive linguistic and visual diversity while maintaining superior aesthetic quality throughout training. Our experiments demonstrate that the proposed approach significantly outperforms the existing state-of-the-art method.
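As a rough illustration of aspect (2), the sketch below shows one way glyph embeddings, semantic text tokens, and image latent tokens could be processed together in an MMDiT-style joint attention step: each stream gets its own projection weights, and attention runs over the concatenated sequence. This is not the authors' implementation. The function name `joint_attention`, the stream names, the token counts, the embedding width, and the random weights are all assumptions made for illustration, and the single-head NumPy attention stands in for a full transformer block.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(streams, weights):
    """Single-head joint attention over several token streams.

    streams: dict mapping a stream name to an array of shape (n_tokens, d).
    weights: dict mapping the same names to a (Wq, Wk, Wv) tuple,
             each of shape (d, d). Hypothetical per-modality weights.
    Returns the per-stream outputs and the full attention matrix.
    """
    names = list(streams)
    # Project each stream with its own weights, then concatenate along tokens.
    q = np.concatenate([streams[n] @ weights[n][0] for n in names], axis=0)
    k = np.concatenate([streams[n] @ weights[n][1] for n in names], axis=0)
    v = np.concatenate([streams[n] @ weights[n][2] for n in names], axis=0)

    # Every token attends to every token of every stream.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v

    # Split the joint output back into the original streams.
    result, start = {}, 0
    for n in names:
        length = streams[n].shape[0]
        result[n] = out[start:start + length]
        start += length
    return result, attn


rng = np.random.default_rng(0)
d = 16  # assumed embedding width, chosen only for the example
streams = {
    "text": rng.normal(size=(5, d)),   # semantic prompt tokens
    "glyph": rng.normal(size=(3, d)),  # pre-aligned glyph embeddings
    "image": rng.normal(size=(8, d)),  # image latent tokens
}
weights = {
    n: tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    for n in streams
}
out, attn = joint_attention(streams, weights)
```

Because all three streams share one attention matrix, image tokens can draw on glyph and semantic tokens in the same step, which is the general idea behind fusing conditions inside the backbone rather than through a separate plugin.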
Published
2026-03-14
How to Cite
Gao, L., He, J.-Y., Zeng, Y., Zhong, Y., Sun, X., Hu, J., … Wei, X. (2026). ViType: High-Fidelity Visual Text Rendering via Glyph-Aware Multimodal Diffusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4131–4139. https://doi.org/10.1609/aaai.v40i6.42408
Section
AAAI Technical Track on Computer Vision III