EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Authors

  • Runnan Lu, National University of Singapore
  • Yuxuan Zhang, The Chinese University of Hong Kong
  • Jiaming Liu, Alibaba
  • Haofan Wang, Liblib AI
  • Yiren Song, National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v40i9.37697

Abstract

Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages remains under-explored. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual text encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning, respectively. Extensive experiments and evaluations demonstrate the effectiveness of our approach in multilingual text rendering, visual quality, and layout-aware text integration.

Published

2026-03-14

How to Cite

Lu, R., Zhang, Y., Liu, J., Wang, H., & Song, Y. (2026). EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7565–7573. https://doi.org/10.1609/aaai.v40i9.37697

Issue

Section

AAAI Technical Track on Computer Vision VI