Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation

Authors

  • Min Liu National Key Laboratory for Novel Software Technology, Nanjing University
  • Yu Bao ByteDance AI Lab
  • Chengqi Zhao ByteDance AI Lab
  • Shujian Huang National Key Laboratory for Novel Software Technology, Nanjing University; Collaborative Innovation Center of Novel Software Technology and Industrialization

DOI:

https://doi.org/10.1609/aaai.v37i11.26555

Keywords:

SNLP: Machine Translation & Multilinguality

Abstract

Benefiting from sequence-level knowledge distillation, the Non-Autoregressive Transformer (NAT) has achieved great success in neural machine translation. However, existing knowledge distillation has side effects, such as propagating errors from the teacher to NAT students, which may limit further improvements of NAT models yet are rarely discussed in existing research. In this paper, we propose selective knowledge distillation, which introduces an NAT evaluator to select NAT-friendly targets that are of high quality and easy to learn. In addition, we present a simple yet effective progressive distillation method to boost NAT performance. Experimental results on multiple WMT language directions and several representative NAT models show that our approach realizes a flexible trade-off between the quality and complexity of training data for NAT models, achieving strong performance. Further analysis shows that distilling only 5% of the raw translations can help an NAT model outperform its counterpart trained on raw data by about 2.4 BLEU.
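
The sketch below illustrates the general idea of selective distillation described in the abstract: replacing the raw reference with the autoregressive teacher's distilled output only for the fraction of sentences that an NAT evaluator ranks as most NAT-friendly. The scoring function, selection ratio, and all names here are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of constructing a selectively distilled training corpus,
# assuming the "NAT evaluator" can be approximated by a scoring function
# (e.g., an NAT student's log-probability of a candidate target).
# Names, signatures, and the selection criterion are hypothetical.

from typing import Callable, List, Tuple


def build_selective_kd_corpus(
    sources: List[str],
    raw_targets: List[str],
    distilled_targets: List[str],            # outputs of the AT teacher
    nat_score: Callable[[str, str], float],  # higher = easier for the NAT to learn
    select_ratio: float = 0.05,              # e.g., distill only 5% of the corpus
) -> List[Tuple[str, str]]:
    """Keep the teacher's distilled target only for the sentences the
    NAT evaluator ranks as most NAT-friendly; keep raw data elsewhere."""
    # Score how much the evaluator prefers the distilled target over the raw one.
    gains = [
        (nat_score(src, kd) - nat_score(src, raw), idx)
        for idx, (src, raw, kd) in enumerate(
            zip(sources, raw_targets, distilled_targets)
        )
    ]
    gains.sort(reverse=True)

    n_select = int(len(sources) * select_ratio)
    selected = {idx for _, idx in gains[:n_select]}

    return [
        (src, distilled_targets[i] if i in selected else raw_targets[i])
        for i, src in enumerate(sources)
    ]
```

Under this reading, the progressive distillation mentioned in the abstract might correspond to gradually adjusting the selection ratio as training proceeds, though the abstract does not specify the schedule.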

Published

2023-06-26

How to Cite

Liu, M., Bao, Y., Zhao, C., & Huang, S. (2023). Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13246-13254. https://doi.org/10.1609/aaai.v37i11.26555

Issue

Vol. 37 No. 11 (2023)

Section

AAAI Technical Track on Speech & Natural Language Processing