Context-Based Contrastive Learning for Scene Text Recognition


  • Xinyun Zhang The Chinese University of Hong Kong
  • Binwu Zhu The Chinese University of Hong Kong
  • Xufeng Yao The Chinese University of Hong Kong
  • Qi Sun The Chinese University of Hong Kong
  • Ruiyu Li Smartmore
  • Bei Yu The Chinese University of Hong Kong



Computer Vision (CV)


Pursuing accurate and robust recognizers has been a long-standing goal for scene text recognition (STR) researchers. Recently, attention-based methods have demonstrated their effectiveness and achieved impressive results on public benchmarks. The attention mechanism enables models to recognize scene text under severe visual distortions by leveraging contextual information. However, recent studies revealed that this implicit over-reliance on context leads to catastrophic out-of-vocabulary performance: in contrast to their superior accuracy on seen text, models are prone to misrecognizing unseen text even when the image quality is good. We propose a novel framework, Context-based Contrastive Learning (ConCLR), to alleviate this issue. Our method first generates characters with different contexts via simple image concatenation operations and then optimizes a contrastive loss on their embeddings. By pulling together clusters of identical characters within various contexts and pushing apart clusters of different characters in the embedding space, ConCLR suppresses the side effect of overfitting to specific contexts and learns a more robust representation. Experiments show that ConCLR significantly improves out-of-vocabulary generalization and achieves state-of-the-art performance on public benchmarks when combined with attention-based recognizers.




How to Cite

Zhang, X., Zhu, B., Yao, X., Sun, Q., Li, R., & Yu, B. (2022). Context-Based Contrastive Learning for Scene Text Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3), 3353-3361.



AAAI Technical Track on Computer Vision III