Recurrent Stacking of Layers for Compact Neural Machine Translation Models

Authors

  • Raj Dabre, National Institute of Information and Communications Technology
  • Atsushi Fujita, National Institute of Information and Communications Technology

DOI:

https://doi.org/10.1609/aaai.v33i01.33016292

Abstract

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, it also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers, thereby leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer 6 times, despite its significantly fewer parameters, approaches that of a model that stacks 6 distinct layers. We also show how our method benefits from a prevalent way of improving NMT, namely extending the training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attention of models with and without recurrent stacking. Finally, we explore the limits of parameter sharing by additionally sharing parameters between the encoder and decoder on top of recurrent stacking.
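The following is a minimal PyTorch sketch, not the authors' implementation, illustrating the core idea of recurrent stacking: a single encoder layer is reused at every depth, so the parameter count stays that of one layer while the effective depth is 6. The class name, hyperparameters (d_model=512, nhead=8, dim_feedforward=2048), and the use of the Transformer architecture are illustrative assumptions, not values taken from the paper.

```python
# Sketch: recurrent stacking = applying one shared layer N times,
# versus a conventional encoder with N distinct layers.
import torch
import torch.nn as nn

class RecurrentlyStackedEncoder(nn.Module):
    """Applies a single Transformer encoder layer `num_steps` times,
    so the parameter count is that of one layer regardless of depth."""
    def __init__(self, d_model=512, nhead=8, dim_feedforward=2048, num_steps=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=dim_feedforward, batch_first=True)
        self.num_steps = num_steps

    def forward(self, x, src_key_padding_mask=None):
        # Reuse the same parameters at every "layer" of the stack.
        for _ in range(self.num_steps):
            x = self.shared_layer(x, src_key_padding_mask=src_key_padding_mask)
        return x

# Parameter comparison against a conventional 6-layer encoder
# (nn.TransformerEncoder deep-copies the layer, giving 6 distinct parameter sets).
vanilla = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8,
                               dim_feedforward=2048, batch_first=True),
    num_layers=6)
shared = RecurrentlyStackedEncoder()
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(vanilla), count(shared))  # shared encoder has roughly 1/6 the parameters
```

The same weight-tying trick applies to the decoder, and, as explored in the paper's final experiments, can be pushed further by sharing parameters between the encoder and decoder as well.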

Published

2019-07-17

How to Cite

Dabre, R., & Fujita, A. (2019). Recurrent Stacking of Layers for Compact Neural Machine Translation Models. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 6292-6299. https://doi.org/10.1609/aaai.v33i01.33016292

Section

AAAI Technical Track: Natural Language Processing