Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition

Authors

  • Shihao Zou, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research
  • Wei Wei, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research
  • Leyang Xu, Joint Laboratory of HUST and Pingan Property & Casualty Research; Pingan Property & Casualty Insurance
  • Kaihe Xu, Joint Laboratory of HUST and Pingan Property & Casualty Research; Pingan Property & Casualty Insurance
  • Wenfeng Xie, Joint Laboratory of HUST and Pingan Property & Casualty Research; Pingan Property & Casualty Insurance

DOI:

https://doi.org/10.1609/aaai.v40i16.38419

Abstract

Masked Image Modeling (MIM) has been widely recognized as a powerful self-supervised paradigm for learning general-purpose visual representations. However, standard MIM based on random masking tends to underperform in domain-specific tasks such as Scene Text Recognition (STR), owing to challenges like information sparsity and appearance discrepancies caused by partial occlusion or distortion. To address this issue, we propose a novel pre-training framework called Appearance Discrepancy-guided Sequence Hybrid Masking (DSHM), specifically designed to learn robust representations for STR. Within this framework, we introduce an Appearance Discrepancy Metric that quantifies the discrepancy level of each image patch by measuring its anisotropic local discrepancy and intra-instance global style discrepancy. The resulting discrepancy scores drive two key components: (1) a Sequence Hybrid Masking strategy, which prioritizes masking high-discrepancy patches in coherent block forms, thereby elevating the pretext task from simple pixel-level completion to more complex structural reasoning; and (2) Discrepancy-Conditioned Tokens (DC-Tokens), which encode prior knowledge about patch difficulty into the decoder, enabling an adaptive reconstruction process and improving the model's robustness in scenarios with partial occlusion or text distortion. We achieve competitive performance on multiple benchmark datasets, including common benchmarks, Union14M benchmarks, and Chinese benchmarks.

Published

2026-03-14

How to Cite

Zou, S., Wei, W., Xu, L., Xu, K., & Xie, W. (2026). Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 14077–14085. https://doi.org/10.1609/aaai.v40i16.38419

Section

AAAI Technical Track on Computer Vision XIII