Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation

Authors

  • Chih-Chun Yang, Department of Computer Science and Information Engineering, National Taiwan University, Taiwan, R.O.C.
  • Wan-Cyuan Fan, Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, R.O.C.
  • Cheng-Fu Yang, Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, R.O.C.
  • Yu-Chiang Frank Wang, Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, R.O.C.; ASUS Intelligent Cloud Services, Taiwan, R.O.C.

DOI:

https://doi.org/10.1609/aaai.v36i3.20210

Keywords:

Computer Vision (CV)

Abstract

Relating linguistic information observed across visual and audio data is a key challenge in audio-visual speech recognition (AVSR). Addressing it benefits not only audio and visual speech recognition (ASR/VSR) but also the manipulation of data within and across modalities. In this paper, we present a feature disentanglement-based framework for jointly addressing these tasks. By advancing cross-modal mutual learning strategies, our model is able to convert visual or audio-based linguistic features into modality-agnostic representations. The derived linguistic representations not only allow one to perform ASR, VSR, and AVSR, but also to manipulate audio and visual output based on the desired subject identity and linguistic content. We perform extensive experiments on different recognition and synthesis tasks to show that our model performs favorably against state-of-the-art approaches on each individual task, while offering a unified solution that jointly tackles the aforementioned audio-visual learning tasks.
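The disentanglement idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the fixed-position split stands in for learned encoders, and every name here (`Disentangled`, `split_features`, `recombine`, `linguistic_gap`) is hypothetical. It only shows the bookkeeping, assuming each clip's features separate into an identity part and a linguistic part, and that paired audio and visual clips should end up with similar linguistic parts.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class Disentangled:
    """Hypothetical container for one clip's separated features."""
    identity: Vector    # subject-specific part (who is speaking)
    linguistic: Vector  # content part (what is being said)


def split_features(features: Vector, identity_dim: int) -> Disentangled:
    # Toy stand-in for a learned encoder (an assumption for illustration):
    # the first `identity_dim` entries are treated as identity,
    # the remaining entries as linguistic content.
    return Disentangled(features[:identity_dim], features[identity_dim:])


def recombine(identity_src: Disentangled, content_src: Disentangled) -> Vector:
    # Toy stand-in for a decoder input: identity taken from one clip,
    # linguistic content taken from another (possibly another modality).
    return identity_src.identity + content_src.linguistic


def linguistic_gap(a: Disentangled, b: Disentangled) -> float:
    # Mean squared difference between two linguistic codes. A training
    # objective in this spirit would push the gap toward zero for paired
    # audio and visual clips, making the code modality-agnostic.
    diffs = [(x - y) ** 2 for x, y in zip(a.linguistic, b.linguistic)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    audio = split_features([0.9, 0.1, 0.5, 0.5], identity_dim=2)
    visual = split_features([0.2, 0.8, 0.5, 0.4], identity_dim=2)
    print(recombine(audio, visual))
    print(linguistic_gap(audio, visual))
```

In this sketch, recognition would read only the `linguistic` part, while manipulation corresponds to calling `recombine` with identity and content drawn from different clips.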

Published

2022-06-28

How to Cite

Yang, C.-C., Fan, W.-C., Yang, C.-F., & Wang, Y.-C. F. (2022). Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3), 3036-3044. https://doi.org/10.1609/aaai.v36i3.20210

Section

AAAI Technical Track on Computer Vision III