Accelerating Neural Machine Translation with Partial Word Embedding Compression

Authors

  • Fan Zhang (Communication University of China; Samsung Research China - Beijing)
  • Mei Tu (Samsung Research China - Beijing)
  • Jinyao Yan (Communication University of China)

DOI:

https://doi.org/10.1609/aaai.v35i16.17688

Keywords:

Machine Translation & Multilinguality

Abstract

Large model size and high computational complexity prevent neural machine translation (NMT) models from being deployed on low-resource devices (e.g., mobile phones). Because of the large vocabulary, the word embedding matrix in NMT models requires substantial storage memory; at the same time, constructing the word probability distribution introduces high latency. By reusing the word embedding matrix in the softmax layer, both problems caused by the large vocabulary can be addressed at once. In this paper, we propose Partial Vector Quantization (P-VQ) for NMT models, which both compresses the word embedding matrix and accelerates word probability prediction in the softmax layer. With P-VQ, the word embedding matrix is split into two low-dimensional matrices: a shared part and an exclusive part. We compress the shared part by vector quantization and leave the exclusive part unchanged to preserve the uniqueness of each word. For acceleration, our compression lets us replace most of the multiplication operations in the softmax layer with efficient look-up operations, reducing the computational complexity. Furthermore, we adopt curriculum learning and compress the word embedding matrix gradually to improve the compression quality. Experimental results on the Chinese-to-English translation task show that our method reduces the parameters of the word embedding by 74.35% and the FLOPs of the softmax layer by 74.42%, while the average BLEU score on the WMT test sets drops by only 0.04.
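
The abstract describes the mechanism only at a high level, so below is a minimal NumPy sketch of how a P-VQ softmax could work, under assumptions not stated above: the shared part of each embedding row is replaced by one of K codebook vectors, so the shared contribution to every word's logit is computed once per codeword and then gathered by index, while only the small exclusive part still needs a full matrix multiplication. All sizes and names (codebook, assign, exclusive, pvq_logits) are illustrative, and the single-codebook design is an assumption, not the paper's actual configuration.

```python
import numpy as np

# Illustrative sizes (not from the paper): V = vocabulary size,
# d_s / d_e = shared / exclusive widths (d = d_s + d_e), K = codebook size.
V, d_s, d_e, K = 30000, 384, 128, 256

rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, d_s))    # K codewords for the quantized shared part
assign = rng.integers(0, K, size=V)     # each word's codeword index
exclusive = rng.normal(size=(V, d_e))   # exclusive part, left unquantized

def pvq_logits(h):
    """Approximate the output logits E @ h under P-VQ.

    The shared-part scores are computed once per codeword
    (K x d_s multiplications) and then gathered per word, so
    only the small exclusive part needs a full V x d_e matmul.
    """
    h_s, h_e = h[:d_s], h[d_s:]
    codeword_scores = codebook @ h_s         # shape (K,): one score per codeword
    shared_scores = codeword_scores[assign]  # shape (V,): cheap index look-up
    return shared_scores + exclusive @ h_e   # shape (V,): final logits

h = rng.normal(size=d_s + d_e)   # decoder hidden state for one position
print(pvq_logits(h).shape)       # (30000,)
```

With these made-up sizes, the multiply count per softmax call falls from V x d = 30,000 x 512 (about 15.4M) to K x d_s + V x d_e (about 3.9M), roughly a 74% reduction, which is consistent in spirit with the reported 74.42% FLOPs saving; the paper's true dimensions are not given in the abstract.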

Published

2021-05-18

How to Cite

Zhang, F., Tu, M., & Yan, J. (2021). Accelerating Neural Machine Translation with Partial Word Embedding Compression. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16), 14356-14364. https://doi.org/10.1609/aaai.v35i16.17688

Issue

Vol. 35 No. 16 (2021)

Section

AAAI Technical Track on Speech and Natural Language Processing III