LinProVSR: Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition

Authors

  • Feng Xue, Hefei University of Technology
  • Baochao Zhu, Hefei University of Technology
  • Wei Jia, Hefei University of Technology
  • Shujie Li, Hefei University of Technology
  • Yu Li, Hefei University of Technology
  • Jinrui Zhang, Hefei University of Technology
  • Shengeng Tang, Hefei University of Technology
  • Dan Guo, Hefei University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i14.38133

Abstract

Visual Speech Recognition (VSR), commonly known as lipreading, recognizes spoken text by analyzing visual features of the lips. Because lip movements are subtle, VSR is much harder than other motion recognition tasks. Existing VSR models suffer from viseme ambiguity: multiple phonemes with similar pronunciations share similar viseme features, which leads to a notable drop in lipreading accuracy. To address this issue, this study proposes the Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition (LinProVSR). First, an ambiguous sample set is constructed based on linguistic knowledge to provide supervisory signals for training. Then, a Progressive Contrastive Disambiguation Network (PCDN) is designed, which progressively enhances the model's ability to capture the subtle viseme differences between similar phonemes through viseme-phoneme contrastive disambiguation in the encoding stage and text contrastive disambiguation in the decoding stage. Furthermore, we introduce the Ambiguous Word Error Rate (AWER) metric for evaluating recognition of phonetically ambiguous text, and verify the effectiveness of the proposed method on multiple public datasets, with notable improvements, especially in distinguishing visually similar phonemes.

Published

2026-03-14

How to Cite

Xue, F., Zhu, B., Jia, W., Li, S., Li, Y., Zhang, J., … Guo, D. (2026). LinProVSR: Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11505–11513. https://doi.org/10.1609/aaai.v40i14.38133

Issue

Vol. 40 No. 14 (2026)

Section

AAAI Technical Track on Computer Vision XI