Divergence-Guided Simultaneous Speech Translation
DOI:
https://doi.org/10.1609/aaai.v38i16.29733
Keywords:
NLP: Machine Translation, Multilinguality, Cross-Lingual NLP, NLP: Speech
Abstract
To achieve high-quality translation with low latency, a Simultaneous Speech Translation (SimulST) system relies on a policy module to decide whether to translate immediately or wait for additional streaming input, along with a translation model capable of effectively handling partial speech input. Prior research has tackled these components separately, either using "wait-k" policies based on fixed-length segments or detected word boundaries, or dynamic policies based on different strategies (e.g., meaningful units), while employing offline models for prefix-to-prefix translation. In this paper, we propose Divergence-Guided Simultaneous Speech Translation (DiG-SST), a tightly integrated approach focusing on both translation quality and latency for streaming input. Specifically, we introduce a simple yet effective prefix-based strategy for training translation models with partial speech input, and develop an adaptive policy that makes read/write decisions for the translation model based on the expected divergence in translation distributions resulting from future input. Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods.
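The core idea of the adaptive policy can be illustrated with a minimal sketch: compare the model's translation distribution given the current partial input against an estimate of the distribution after more input arrives, and READ more speech only when the two are expected to diverge substantially. This is an illustrative simplification, not the paper's implementation: the function names, the use of KL divergence as the divergence measure, and the fixed threshold are all assumptions for exposition.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions over target tokens.
    eps guards against log(0) for zero-probability entries."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def decide_action(current_dist, anticipated_dist, threshold=0.1):
    """Hypothetical read/write rule: if future input is expected to change
    the next-token distribution by more than `threshold`, wait for more
    speech (READ); otherwise commit the next translation token (WRITE)."""
    divergence = kl_divergence(anticipated_dist, current_dist)
    return "READ" if divergence > threshold else "WRITE"

# Toy example over a 3-token target vocabulary.
# Stable prediction: more input barely shifts the distribution -> WRITE.
stable = decide_action([0.7, 0.2, 0.1], [0.68, 0.22, 0.1])
# Volatile prediction: future input flips the top candidate -> READ.
volatile = decide_action([0.7, 0.2, 0.1], [0.2, 0.7, 0.1])
```

In the actual system the anticipated distribution cannot be observed at inference time, so the expected divergence must be predicted from the current prefix; this sketch only shows the decision rule itself.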
Published
2024-03-24
How to Cite
Chen, X., Fan, K., Luo, W., Zhang, L., Zhao, L., Liu, X., & Huang, Z. (2024). Divergence-Guided Simultaneous Speech Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17799-17807. https://doi.org/10.1609/aaai.v38i16.29733
Issue
Section
AAAI Technical Track on Natural Language Processing I