TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Minghao Li; Tengchao Lv; Jingye Chen; Lei Cui; Yijuan Lu; Dinei Florencio; Cha Zhang; Zhoujun Li; Furu Wei

doi:10.1609/aaai.v37i11.26538

Authors

Minghao Li, Beihang University
Tengchao Lv, Microsoft Corporation
Jingye Chen, Microsoft Corporation
Lei Cui, Microsoft Corporation
Yijuan Lu, Microsoft Corporation
Dinei Florencio, Microsoft Corporation
Cha Zhang, Microsoft Corporation
Zhoujun Li, Beihang University
Furu Wei, Microsoft Corporation

Keywords:

SNLP: Applications, CV: Language and Vision

Abstract

Text recognition is a long-standing research problem in document digitization. Existing approaches are usually built on a CNN for image understanding and an RNN for character-level text generation, and they typically need a separate language model as a post-processing step to improve overall accuracy. In this paper, we propose TrOCR, an end-to-end text recognition approach that uses pre-trained image Transformer and text Transformer models, applying the Transformer architecture to both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective: it can be pre-trained on large-scale synthetic data and fine-tuned on human-labeled datasets. Experiments show that TrOCR outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
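
The abstract contrasts character-level generation with wordpiece-level generation. The sketch below illustrates that difference only; it is not TrOCR's tokenizer. It assumes a greedy longest-match-first split with a BERT-style "##" continuation prefix, and both the function name `wordpiece_tokenize` and the vocabulary `TOY_VOCAB` are hypothetical, chosen for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces.

    Assumption: non-initial pieces carry a "##" prefix (BERT-style convention).
    Returns [unk] if some part of the word cannot be matched.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real model uses a learned vocabulary.
TOY_VOCAB = {"recogn", "##ition", "##ize", "text", "##s"}

word = "recognition"
char_steps = list(word)                             # one decoder step per character
piece_steps = wordpiece_tokenize(word, TOY_VOCAB)   # one decoder step per piece

print(len(char_steps), char_steps)
print(len(piece_steps), piece_steps)
```

With this toy vocabulary, a decoder that emits pieces needs fewer generation steps for the same word than one that emits single characters, which is one practical reason to generate at the wordpiece level.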
