Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

Authors

  • Chen Chen, Nanyang Technological University
  • Yuchen Hu, Nanyang Technological University
  • Qiang Zhang, ZJU-Hangzhou Global Scientific and Technological Innovation Center; Zhejiang University
  • Heqing Zou, Nanyang Technological University
  • Beier Zhu, Nanyang Technological University
  • Eng Siong Chng, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v37i11.26484

Keywords:

SNLP: Speech and Multimodality, SNLP: Applications

Abstract

Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noise.
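
The abstract ties the reward to word error rate (WER) but does not give its exact form. As a minimal sketch only, the Python below computes WER from a word-level edit distance and turns it into a sentence-level reward. The function names and the choice of reward = negative WER are assumptions made for illustration, not the reward defined in the paper.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp_words[:j].
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


def wer_reward(reference, hypothesis):
    """Hypothetical reward (an assumption): lower WER gives a higher reward."""
    return -wer(reference, hypothesis)
```

Under this assumed form, a perfect transcription earns the maximum reward of zero, and every word error lowers the reward in proportion to the reference length.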

Published

2023-06-26

How to Cite

Chen, C., Hu, Y., Zhang, Q., Zou, H., Zhu, B., & Chng, E. S. (2023). Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12607-12615. https://doi.org/10.1609/aaai.v37i11.26484

Section

AAAI Technical Track on Speech & Natural Language Processing