Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization
DOI: https://doi.org/10.1609/aaai.v32i1.11999
Keywords: transfer learning, semi-supervised neural machine translation, importance sampling
Abstract
Neural machine translation (NMT) heavily relies on parallel
bilingual data for training. Since large-scale, high-quality
parallel corpora are usually costly to collect, it is appealing
to exploit monolingual corpora to improve NMT. Inspired by
the law of total probability, which connects the probability of
a given target-side monolingual sentence to the conditional
probability of translating from a source sentence to the target
one, we propose to explicitly exploit this connection to
learn from and regularize the training of NMT models using
monolingual data. The key technical challenge of this approach
is that computing the marginal probability of a target monolingual
sentence requires summing the conditional translation probability
over exponentially many candidate source sentences.
We address this challenge by leveraging the dual translation
model (target-to-source translation) to sample several
most likely source-side sentences, thereby avoiding the enumeration
of all possible candidate source sentences. That is, we transfer
the knowledge contained in the dual model to boost the
training of the primal model (source-to-target translation),
and we call such an approach dual transfer learning. Experimental
results on English-French and German-English translation tasks
demonstrate that dual transfer learning achieves significant
improvement over several strong baselines and obtains new
state-of-the-art results.
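
As a sketch of the idea, the law of total probability ties the marginal probability of a target sentence y to the primal translation model, and the intractable sum over source sentences can be estimated by importance sampling with the dual model Q(x|y) as the proposal:

```latex
P(y) \;=\; \sum_{x} P(x)\, P(y \mid x; \theta)
\;\approx\; \frac{1}{K} \sum_{k=1}^{K} \frac{P(x_k)\, P(y \mid x_k; \theta)}{Q(x_k \mid y)},
\qquad x_k \sim Q(\cdot \mid y).
```

The following minimal Python sketch shows one plausible form of the resulting regularizer: a squared gap between an empirical marginal log p̂(y) (e.g., from a target-side language model) and the importance-sampling estimate of log P(y). All interface names (log_p_lm, log_p_src, log_p_primal, sample_dual, log_q_dual) and the exact penalty form are hypothetical placeholders for illustration, not the paper's released implementation.

```python
import math
import random

def log_sum_exp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def marginal_regularizer(y, log_p_lm, log_p_src, log_p_primal,
                         sample_dual, log_q_dual, k=5):
    """Squared gap between log p_hat(y) and an importance-sampling
    estimate of log P(y) = log sum_x P(x) P(y|x; theta).

    Hypothetical interfaces (assumptions for illustration):
      log_p_lm(y)        -- log p_hat(y) from a target-side language model
      log_p_src(x)       -- log P(x) from a source-side language model
      log_p_primal(x, y) -- log P(y|x; theta), source-to-target NMT model
      sample_dual(y)     -- draw x ~ Q(x|y) from the dual
                            (target-to-source) model
      log_q_dual(x, y)   -- log Q(x|y) for a sampled x
    """
    xs = [sample_dual(y) for _ in range(k)]
    # log of each term P(x_k) * P(y|x_k) / Q(x_k|y), kept in log-space.
    terms = [log_p_src(x) + log_p_primal(x, y) - log_q_dual(x, y)
             for x in xs]
    # Average the K importance weights: log((1/K) * sum_k exp(term_k)).
    log_p_y_est = log_sum_exp(terms) - math.log(k)
    return (log_p_lm(y) - log_p_y_est) ** 2

# Toy smoke test with uniform dummy models (illustration only).
if __name__ == "__main__":
    random.seed(0)
    sources = ["x1", "x2", "x3"]
    penalty = marginal_regularizer(
        y="a target sentence",
        log_p_lm=lambda y: -5.0,
        log_p_src=lambda x: -3.0,
        log_p_primal=lambda x, y: -4.0,
        sample_dual=lambda y: random.choice(sources),
        log_q_dual=lambda x, y: -math.log(len(sources)),
        k=5,
    )
    print(f"penalty: {penalty:.3f}")
```

During training, a term of this form computed on target-side monolingual sentences would be added, with a trade-off weight, to the usual maximum-likelihood objective on bilingual data, so that the monolingual corpus regularizes the primal model without requiring parallel pairs.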