Learning Light-Weight Translation Models from Deep Transformer
Keywords: Machine Translation & Multilinguality
Abstract
Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach to compress the deep Transformer model into a shallow model. Experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8 times shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, which achieves a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
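As a rough illustration of the Skipping Sub-Layer idea, the following minimal sketch randomly omits a residual sub-layer during training. It is not taken from the released code: the function name `residual_sublayer`, the skip probability `p_skip`, the plain residual form `x + F(x)`, and the choice to always apply the sub-layer at inference are assumptions made for this example only.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def residual_sublayer(
    x: Vector,
    sublayer: Callable[[Vector], Vector],
    p_skip: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Apply x + sublayer(x), randomly omitting the sub-layer in training.

    Hypothetical helper: p_skip and the identity fallback are assumptions,
    not details given in the abstract.
    """
    rng = rng or random
    # During training, skip the sub-layer with probability p_skip,
    # so only the residual (identity) path is used.
    if training and rng.random() < p_skip:
        return list(x)
    # Otherwise compute the usual residual connection.
    return [xi + fi for xi, fi in zip(x, sublayer(x))]
```

With `p_skip = 0.0` this reduces to an ordinary residual sub-layer, so the perturbation can be switched off without changing the rest of the model.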
How to Cite
Li, B., Wang, Z., Liu, H., Du, Q., Xiao, T., Zhang, C., & Zhu, J. (2021). Learning Light-Weight Translation Models from Deep Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13217-13225. https://doi.org/10.1609/aaai.v35i15.17561
AAAI Technical Track on Speech and Natural Language Processing II