Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

Jiadong Wang; Zexu Pan; Malu Zhang; Robby T. Tan; Haizhou Li

doi:10.1609/aaai.v38i17.29882

Authors

Jiadong Wang, National University of Singapore; The Chinese University of Hong Kong, Shenzhen
Zexu Pan, National University of Singapore
Malu Zhang, University of Electronic Science and Technology of China
Robby T. Tan, National University of Singapore
Haizhou Li, The Chinese University of Hong Kong, Shenzhen; National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v38i17.29882

Keywords:

NLP: Speech, CV: Multi-modal Vision

Abstract

Prior studies on audio-visual speech recognition typically assume that the speaking lips are visible, ignoring the visual occlusion that occurs in real-world videos and degrades recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework addresses three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two tasks by utilizing the Class Activation Map (CAM) obtained from occluded frame detection to facilitate the masking of occluded areas. Additionally, we introduce a novel synthesis-matching strategy for the reconstruction to ensure the compatibility of audio features with different levels of occlusion. Our framework is evaluated in terms of Word Error Rate (WER) on the original videos, the videos with occluded lips, and the videos restored using the framework, with several existing state-of-the-art audio-visual speech recognition methods. Experimental results substantiate that our framework significantly mitigates the performance degradation resulting from lip occlusion. Under -5 dB noise conditions, AV-Hubert's WER increases from 10.62% to 13.87% due to lip occlusion, but recovers to 11.87% in conjunction with the proposed framework. Furthermore, the framework also demonstrates its capacity to produce natural synthesized images in qualitative assessments.
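The WER figures quoted in the abstract can be put in context with a small sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words); the abstract does not specify the scoring tool, so the function names word_error_rate and recovered_fraction are hypothetical helpers for illustration, not part of the authors' code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and exact string match per word.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def recovered_fraction(clean: float, occluded: float, restored: float) -> float:
    """Share of the occlusion-induced WER increase removed by restoration."""
    return (occluded - restored) / (occluded - clean)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WER values reported in the abstract for AV-Hubert at -5 dB.
    print(round(recovered_fraction(10.62, 13.87, 11.87), 3))
```

Applied to the reported numbers, occlusion adds 3.25 points of WER (10.62% to 13.87%) and restoration removes 2.00 of them, so roughly 62% of the degradation is recovered under this reading.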
