Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Qianqian Dong; Rong Ye; Mingxuan Wang; Hao Zhou; Shuang Xu; Bo Xu; Lei Li

doi:10.1609/aaai.v35i14.17509

Authors

Qianqian Dong, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Rong Ye, ByteDance AI Lab
Mingxuan Wang, ByteDance AI Lab
Hao Zhou, ByteDance AI Lab
Shuang Xu, Institute of Automation, Chinese Academy of Sciences, China
Bo Xu, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Lei Li, ByteDance AI Lab

DOI:

https://doi.org/10.1609/aaai.v35i14.17509

Keywords:

Machine Translation & Multilinguality, Speech & Signal Processing, Semi-Supervised Learning, Multimodal Learning

Abstract

End-to-end speech-to-text translation (ST) takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of available parallel data. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals that decouples the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to force the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.
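
The idea of triple supervision can be illustrated with a small sketch that combines three loss terms, one per stage (listen, understand, translate). This is not the authors' implementation: the abstract does not state the form of each loss, so the mean-squared distance to a BERT vector, the cross-entropy translation term, the externally supplied acoustic loss, and the weights alpha, beta and gamma are all assumptions made for illustration.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(probs[target])


def triple_loss(
    acoustic_loss: float,            # hypothetical: loss from the acoustic encoder's own supervision
    semantic_vec: Sequence[float],   # hypothetical: output of the upper (semantic) encoder
    bert_vec: Sequence[float],       # hypothetical: BERT representation of the transcript
    target_probs: Sequence[float],   # hypothetical: decoder distribution over target tokens
    target_idx: int,                 # index of the reference target token
    alpha: float = 1.0,              # assumed weight for the acoustic term
    beta: float = 1.0,               # assumed weight for the semantic term
    gamma: float = 1.0,              # assumed weight for the translation term
) -> float:
    """Weighted sum of the three supervision signals."""
    semantic_loss = mse(semantic_vec, bert_vec)
    translation_loss = cross_entropy(target_probs, target_idx)
    return alpha * acoustic_loss + beta * semantic_loss + gamma * translation_loss


if __name__ == "__main__":
    loss = triple_loss(0.5, [1.0, 2.0], [1.0, 4.0], [0.25, 0.75], 1)
    print(f"combined loss: {loss:.4f}")
```

Keeping the three terms separate makes it easy to see which stage each signal supervises; a real system would compute them over batches of sequences with a tensor library.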
