Learning Light-Weight Translation Models from Deep Transformer

Authors

  • Bei Li, Northeastern University, China
  • Ziyang Wang, Northeastern University
  • Hui Liu, Northeastern University
  • Quan Du, Northeastern University; NiuTrans Research
  • Tong Xiao, Northeastern University; NiuTrans Research
  • Chunliang Zhang, Northeastern University, China; NiuTrans Research
  • Jingbo Zhu, Northeastern University, China; NiuTrans Research

Keywords:

Machine Translation & Multilinguality

Abstract

Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach that compresses the deep Transformer model into a shallow model. Experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8 times shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, achieving a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
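The Skipping Sub-Layer idea mentioned above can be illustrated with a minimal sketch. This is not the paper's exact formulation; the function name, the skip probability `p_skip`, and the scalar stand-in for a tensor are illustrative assumptions, shown only to convey how randomly omitting a residual sub-layer during training reduces to an identity mapping for that step.

```python
import random

def sublayer_forward(x, sublayer, p_skip, training):
    """Apply a residual Transformer sub-layer, randomly skipping it
    during training (hypothetical sketch; p_skip is an assumption)."""
    if training and random.random() < p_skip:
        # Sub-layer omitted this step: pass the input through unchanged,
        # introducing perturbation into training as described in the abstract.
        return x
    # Normal residual computation: input plus the sub-layer's output.
    return x + sublayer(x)
```

At inference time (`training=False`) every sub-layer is always applied, so skipping only perturbs training.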

Published

2021-05-18

How to Cite

Li, B., Wang, Z., Liu, H., Du, Q., Xiao, T., Zhang, C., & Zhu, J. (2021). Learning Light-Weight Translation Models from Deep Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13217-13225. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17561

Section

AAAI Technical Track on Speech and Natural Language Processing II