Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

Authors

  • Jiadong Wang (National University of Singapore; The Chinese University of Hong Kong, Shenzhen)
  • Zexu Pan (National University of Singapore)
  • Malu Zhang (University of Electronic Science and Technology of China)
  • Robby T. Tan (National University of Singapore)
  • Haizhou Li (The Chinese University of Hong Kong, Shenzhen; National University of Singapore)

DOI:

https://doi.org/10.1609/aaai.v38i17.29882

Keywords:

NLP: Speech, CV: Multi-modal Vision

Abstract

Prior studies on audio-visual speech recognition typically assume that the speaking lips are visible, ignoring the fact that visual occlusion occurs in real-world videos and degrades recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework addresses three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two tasks by utilizing the Class Activation Map (CAM) obtained from occluded frame detection to facilitate the masking of occluded areas. Additionally, we introduce a novel synthesis-matching strategy for the reconstruction to ensure the compatibility of audio features with different levels of occlusion. We evaluate the framework with several existing state-of-the-art audio-visual speech recognition methods, reporting Word Error Rate (WER) on the original videos, the videos with occluded lips, and the videos restored by the framework. Experimental results show that our framework significantly mitigates the performance degradation resulting from lip occlusion. Under -5 dB noise conditions, AV-HuBERT's WER increases from 10.62% to 13.87% due to lip occlusion, but recovers to 11.87% when combined with the proposed framework. Furthermore, the framework demonstrates its capacity to produce natural synthesized images in qualitative assessments.
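
For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Illustrative sketch only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words (single-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Read this way, the figures above mean that occlusion adds 3.25 percentage points of WER at -5 dB (10.62% to 13.87%), and restoration removes 2.00 of those points (13.87% to 11.87%).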

Published

2024-03-24

How to Cite

Wang, J., Pan, Z., Zhang, M., Tan, R. T., & Li, H. (2024). Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19144-19152. https://doi.org/10.1609/aaai.v38i17.29882

Issue

Section

AAAI Technical Track on Natural Language Processing II