Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Authors

  • Qianqian Dong, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China
  • Rong Ye, ByteDance AI Lab
  • Mingxuan Wang, ByteDance AI Lab
  • Hao Zhou, ByteDance AI Lab
  • Shuang Xu, Institute of Automation, Chinese Academy of Sciences, China
  • Bo Xu, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China
  • Lei Li, ByteDance AI Lab

DOI:

https://doi.org/10.1609/aaai.v35i14.17509

Keywords:

Machine Translation & Multilinguality, Speech & Signal Processing, Semi-Supervised Learning, Multimodal Learning

Abstract

An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of available parallel data. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals to decouple the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to push the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.
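
The idea of triple supervision can be illustrated with a minimal sketch in which three loss terms, one per stage (listen, understand, translate), are combined into a single training objective. The abstract does not state which loss functions or weights LUT uses, so everything below is an assumption made for illustration: the function names, the choice of a mean squared distance to stand in for the BERT-guided semantic signal, the cross-entropy for the translation signal, and the default equal weights are all hypothetical. For the actual method, see the authors' code at the URL above.

```python
import math
from typing import Sequence, Tuple


def mean_squared_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference between two equal-length vectors.

    Hypothetical stand-in for a semantic loss that pulls an encoder
    representation toward a pre-trained text embedding.
    """
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-probability the model assigns to the correct token."""
    return -math.log(probs[target_index])


def triple_supervision_loss(
    acoustic_loss: float,
    semantic_loss: float,
    translation_loss: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three supervision signals.

    The weights are illustrative assumptions, not values from the paper.
    """
    w_acoustic, w_semantic, w_translation = weights
    return (
        w_acoustic * acoustic_loss
        + w_semantic * semantic_loss
        + w_translation * translation_loss
    )


if __name__ == "__main__":
    # Toy values only; a real system would compute these from model outputs.
    semantic = mean_squared_distance([0.2, 0.4], [0.1, 0.5])
    translation = cross_entropy([0.7, 0.2, 0.1], target_index=0)
    acoustic = 1.3  # placeholder for a transcript-level acoustic loss
    print(triple_supervision_loss(acoustic, semantic, translation))
```

In this sketch each stage receives its own training signal, so an error in the acoustic stage is penalized directly rather than only through the final translation loss, which is the decoupling the abstract describes.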

Published

2021-05-18

How to Cite

Dong, Q., Ye, R., Wang, M., Zhou, H., Xu, S., Xu, B., & Li, L. (2021). Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(14), 12749-12759. https://doi.org/10.1609/aaai.v35i14.17509

Section

AAAI Technical Track on Speech and Natural Language Processing I