Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition

Authors

  • Shihao Zou, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research
  • Wei Wei, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research
  • Leyang Xu, Joint Laboratory of HUST and Pingan Property & Casualty Research; Pingan Property & Casualty Insurance
  • Kaihe Xu, Joint Laboratory of HUST and Pingan Property & Casualty Research; Pingan Property & Casualty Insurance
  • Wenfeng Xie, Joint Laboratory of HUST and Pingan Property & Casualty Research; Pingan Property & Casualty Insurance

DOI:

https://doi.org/10.1609/aaai.v40i16.38419

Abstract

Masked Image Modeling (MIM) has been widely recognized as a powerful self-supervised paradigm for learning general-purpose visual representations. However, standard MIM based on random masking tends to underperform in domain-specific tasks such as Scene Text Recognition (STR), owing to challenges like information sparsity and appearance discrepancies caused by partial occlusion or distortion. To address this issue, we propose a novel pre-training framework called Appearance Discrepancy-guided Sequence Hybrid Masking (DSHM), specifically designed to learn robust representations for STR. Within this framework, we introduce an Appearance Discrepancy Metric that quantifies the discrepancy level of each image patch by measuring its anisotropic local discrepancy and intra-instance global style discrepancy. The resulting discrepancy scores drive two key components: (1) a Sequence Hybrid Masking strategy, which prioritizes masking high-discrepancy patches in coherent block forms, thereby elevating the pretext task from simple pixel-level completion to more complex structural reasoning; and (2) Discrepancy-Conditioned Tokens (DC-Tokens), which encode prior knowledge about patch difficulty into the decoder, enabling an adaptive reconstruction process and improving the model's robustness in scenarios with partial occlusion or text distortion. We achieve competitive performance on multiple benchmark datasets, including common benchmarks, Union14M benchmarks, and Chinese benchmarks.

Published

2026-03-14

How to Cite

Zou, S., Wei, W., Xu, L., Xu, K., & Xie, W. (2026). Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 14077–14085. https://doi.org/10.1609/aaai.v40i16.38419

Section

AAAI Technical Track on Computer Vision XIII