Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition
DOI:
https://doi.org/10.1609/aaai.v40i16.38419
Abstract
Masked Image Modeling (MIM) has been widely recognized as a powerful self-supervised paradigm for learning general-purpose visual representations. However, standard MIM with random masking tends to underperform on domain-specific tasks such as Scene Text Recognition (STR), owing to challenges such as information sparsity and the appearance discrepancies caused by partial occlusion or distortion. To address this issue, we propose a novel pre-training framework, Appearance Discrepancy-guided Sequence Hybrid Masking (DSHM), specifically designed to learn robust representations for STR. At its core, we introduce an Appearance Discrepancy Metric that quantifies the discrepancy level of each image patch by measuring its anisotropic local discrepancy and intra-instance global style discrepancy. The resulting discrepancy scores drive two key components: (1) a Sequence Hybrid Masking strategy, which preferentially masks high-discrepancy patches in coherent block form, elevating the pretext task from simple pixel-level completion to more demanding structural reasoning; and (2) Discrepancy-Conditioned Tokens (DC-Tokens), which inject prior knowledge about patch difficulty into the decoder, enabling adaptive reconstruction and improving model robustness under partial occlusion or text distortion. We achieve competitive performance on multiple benchmark suites, including the common benchmarks, the Union14M benchmarks, and the Chinese benchmarks.
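The masking idea in the abstract — preferentially hiding high-discrepancy patches in coherent blocks along the text sequence rather than at random — can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the block shape (vertical bands along the reading direction), and the greedy band selection are not the paper's implementation, and the per-patch discrepancy scores (the paper's Appearance Discrepancy Metric) are taken as a given input.

```python
import numpy as np

def discrepancy_guided_block_mask(scores, mask_ratio=0.5, block_w=2):
    """Toy sketch: mask the highest-discrepancy patches in coherent
    vertical bands of width `block_w` along the text sequence.

    scores: (rows, cols) array of per-patch discrepancy scores
            (assumed precomputed; the paper defines its own metric).
    Returns a boolean mask of the same shape (True = masked).
    """
    rows, cols = scores.shape
    n_target = int(round(mask_ratio * rows * cols))
    mask = np.zeros((rows, cols), dtype=bool)

    # Score each block_w-wide column band by its summed discrepancy,
    # then mask whole bands greedily from most to least discrepant,
    # so masked regions stay contiguous blocks instead of scattered pixels.
    n_bands = cols // block_w
    band_scores = (
        scores[:, : n_bands * block_w]
        .reshape(rows, n_bands, block_w)
        .sum(axis=(0, 2))
    )
    for band in np.argsort(band_scores)[::-1]:
        if mask.sum() >= n_target:
            break
        mask[:, band * block_w : (band + 1) * block_w] = True
    return mask
```

Masking contiguous bands (rather than independent patches) is what makes the pretext task require structural reasoning: the model must infer whole character strokes from context, not just interpolate neighboring pixels.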
Published
2026-03-14
How to Cite
Zou, S., Wei, W., Xu, L., Xu, K., & Xie, W. (2026). Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 14077–14085. https://doi.org/10.1609/aaai.v40i16.38419
Issue
Section
AAAI Technical Track on Computer Vision XIII