Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning


  • Chen Chen Nanyang Technological University
  • Yuchen Hu Nanyang Technological University
  • Qiang Zhang ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University
  • Heqing Zou Nanyang Technological University
  • Beier Zhu Nanyang Technological University
  • Eng Siong Chng Nanyang Technological University




SNLP: Speech and Multimodality, SNLP: Applications


Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations tend to over-rely on the audio modality, which is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during auto-regressive decoding. We customize a reward function directly related to the task-specific metric (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that MSRL generalizes better than other baselines when the test set contains unseen noises.
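
To make the idea of a reward tied to word error rate concrete, the following is a minimal sketch and not the authors' implementation. It computes WER as the word-level edit distance divided by the reference length, and uses the negative WER as a scalar reward. The function names `word_error_rate` and `wer_reward`, and the choice of negative WER as the reward, are assumptions made for illustration; the abstract does not give the exact form of the MSRL reward.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward: a lower WER yields a higher reward (assumption)."""
    return -word_error_rate(reference, hypothesis)
```

Under this assumed form, a perfect transcription receives a reward of zero and every word error lowers it, so an agent maximizing the reward is pushed toward the metric used at evaluation time.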




How to Cite

Chen, C., Hu, Y., Zhang, Q., Zou, H., Zhu, B., & Chng, E. S. (2023). Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12607-12615. https://doi.org/10.1609/aaai.v37i11.26484



AAAI Technical Track on Speech & Natural Language Processing