LinProVSR: Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition

Feng Xue; Baochao Zhu; Wei Jia; Shujie Li; Yu Li; Jinrui Zhang; Shengeng Tang; Dan Guo

doi:10.1609/aaai.v40i14.38133

Authors

Feng Xue, Hefei University of Technology
Baochao Zhu, Hefei University of Technology
Wei Jia, Hefei University of Technology
Shujie Li, Hefei University of Technology
Yu Li, Hefei University of Technology
Jinrui Zhang, Hefei University of Technology
Shengeng Tang, Hefei University of Technology
Dan Guo, Hefei University of Technology

Abstract

Visual Speech Recognition (VSR), commonly known as lipreading, recognizes spoken text by analyzing visual features of the lips. Because lip movements are subtle, VSR is much harder than other motion recognition tasks. Existing VSR models face the challenge of viseme ambiguity: multiple phonemes with similar pronunciations share similar viseme features, which leads to a notable drop in lipreading accuracy. To address this issue, this study proposes LinProVSR, a Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition. First, an ambiguous sample set is constructed based on linguistic knowledge to provide supervisory signals for model training. Then, a Progressive Contrastive Disambiguation Network (PCDN) is designed. It progressively enhances the model's ability to capture the subtle viseme differences between similar phonemes in two stages: viseme-phoneme contrastive disambiguation in the encoding stage and text contrastive disambiguation in the decoding stage. Furthermore, we introduce the Ambiguous Word Error Rate (AWER) metric, designed specifically to evaluate recognition of phonetically ambiguous text, and verify the effectiveness of the proposed method on multiple public datasets, with notable improvements especially in distinguishing visually similar phonemes.
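
To make the idea of viseme ambiguity concrete, the following is a minimal sketch of how an ambiguous sample set could be grouped from a pronunciation lexicon. It is not the authors' construction procedure, which the abstract does not specify. The phoneme-to-viseme map, the toy lexicon, and the function names `viseme_key` and `build_ambiguous_sets` are all hypothetical and chosen only for illustration: words whose phoneme sequences collapse to the same viseme sequence are treated as mutually ambiguous.

```python
from collections import defaultdict

# Hypothetical, heavily simplified phoneme-to-viseme map (an assumption for
# illustration; not taken from the paper). Phonemes in the same class look
# alike on the lips.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "ae": "open_front",
    "ih": "close_front",
}

# Toy pronunciation lexicon (hypothetical): word -> phoneme sequence.
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "fan": ["f", "ae", "n"],
    "van": ["v", "ae", "n"],
    "pit": ["p", "ih", "t"],
}


def viseme_key(phonemes):
    """Collapse a phoneme sequence into its viseme-class sequence."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def build_ambiguous_sets(lexicon):
    """Group words that share an identical viseme sequence.

    Only groups with at least two words are returned, since a word that is
    alone in its group has no visually confusable counterpart in the lexicon.
    """
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[viseme_key(phonemes)].append(word)
    return [sorted(words) for words in groups.values() if len(words) > 1]


ambiguous_sets = build_ambiguous_sets(LEXICON)
for words in ambiguous_sets:
    print(words)
```

In this toy lexicon, "pat", "bat", "mat", and "bad" fall into one group and "fan" and "van" into another, while "pit" has no confusable counterpart. Groups of this kind are the sort of word pairs that a contrastive objective or an ambiguity-focused metric such as AWER would target, although the paper's exact definitions may differ.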
