Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition
DOI:
https://doi.org/10.1609/aaai.v38i17.29882
Keywords:
NLP: Speech, CV: Multi-modal Vision
Abstract
Prior studies on audio-visual speech recognition typically assume that the speaking lips are visible, ignoring the fact that visual occlusion occurs in real-world videos and adversely affects recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework performs three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two tasks together by using the Class Activation Map (CAM) obtained from occluded-frame detection to guide the masking of occluded areas. For reconstruction, we introduce a novel synthesis-matching strategy that keeps the audio features compatible with different levels of occlusion. We evaluate the framework in terms of Word Error Rate (WER) with several existing state-of-the-art audio-visual speech recognition methods on three sets of videos: the original videos, the videos with occluded lips, and the videos restored by the framework. Experimental results show that our framework significantly mitigates the performance degradation caused by lip occlusion. Under -5 dB noise conditions, AV-Hubert's WER increases from 10.62% to 13.87% due to lip occlusion, but recovers to 11.87% when combined with the proposed framework. Furthermore, qualitative assessments show that the framework produces natural-looking synthesized images.
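As a reading aid for the WER figures above, the following is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The abstract does not state which scoring tool was used, so the function names `word_error_rate` and `recovered_fraction` and the example sentences are illustrative assumptions, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def recovered_fraction(clean: float, occluded: float, restored: float) -> float:
    """Share of the occlusion-induced WER increase removed by restoration."""
    return (occluded - restored) / (occluded - clean)


# Hypothetical transcripts: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# WER values reported in the abstract for AV-Hubert at -5 dB noise.
recovery = recovered_fraction(clean=10.62, occluded=13.87, restored=11.87)
```

With the reported numbers, occlusion adds 3.25 WER points and restoration removes 2.00 of them, so roughly 62% of the degradation is recovered under this simple measure.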
Published
2024-03-24
How to Cite
Wang, J., Pan, Z., Zhang, M., Tan, R. T., & Li, H. (2024). Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19144-19152. https://doi.org/10.1609/aaai.v38i17.29882
Issue
Section
AAAI Technical Track on Natural Language Processing II