Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation

Authors

  • Ke Wang, Peking University
  • Guandan Chen, Alibaba Group
  • Zhongqiang Huang, Alibaba Group
  • Xiaojun Wan, Peking University
  • Fei Huang, Alibaba Group

DOI:

https://doi.org/10.1609/aaai.v35i16.17645

Keywords:

Machine Translation & Multilinguality, Applications, Representation Learning

Abstract

Although neural machine translation already achieves near-human performance on formal texts such as news articles, it still struggles with "user-generated" texts, which exhibit diverse linguistic phenomena but lack large-scale, high-quality parallel corpora. To address this problem, we propose a counterfactual domain adaptation method that better leverages both large-scale source-domain data (formal texts) and small-scale target-domain data (informal texts). Specifically, by considering effective counterfactual conditions (the concatenations of source-domain texts and the target-domain tag), we construct counterfactual representations that fill the latent space of the target domain, which is sparse because target-domain data is scarce, thereby bridging the gap between the source-domain data and the target-domain data. Experiments on English-to-Chinese and Chinese-to-English translation tasks show that our method outperforms the base model trained only on the informal corpus by a large margin, and consistently surpasses different baseline methods by 1.12 to 4.34 BLEU points on different datasets. Furthermore, we show that our method achieves competitive performance on cross-domain language translation on four language pairs.
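
The counterfactual condition described in the abstract (a source-domain text concatenated with the target-domain tag) can be illustrated at the input level. The following is a minimal sketch only, not the authors' implementation: the tag strings `<formal>` and `<informal>`, the choice to prepend the tag, and the helper names `tag_input` and `build_inputs` are all assumptions made for illustration, and the construction of the latent representations themselves is not covered.

```python
from typing import List, Tuple

# Hypothetical tag tokens: the abstract does not specify the tag format.
FORMAL_TAG = "<formal>"      # source domain (formal texts)
INFORMAL_TAG = "<informal>"  # target domain (informal texts)


def tag_input(sentence: str, domain_tag: str) -> str:
    """Concatenate a domain tag and a sentence (prepending is an assumption)."""
    return f"{domain_tag} {sentence}"


def build_inputs(
    formal: List[str], informal: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (factual, counterfactual) input strings.

    Factual inputs pair each sentence with the tag of its own domain.
    Counterfactual inputs pair formal sentences with the informal tag,
    mirroring the "source-domain text + target-domain tag" condition.
    """
    factual = [tag_input(s, FORMAL_TAG) for s in formal]
    factual += [tag_input(s, INFORMAL_TAG) for s in informal]
    counterfactual = [tag_input(s, INFORMAL_TAG) for s in formal]
    return factual, counterfactual


# Toy sentences invented for this sketch, not taken from the paper's data.
factual, counterfactual = build_inputs(
    formal=["The meeting is at noon."],
    informal=["c u l8r"],
)
```

In this sketch the counterfactual list has one entry per formal sentence, so the large source-domain corpus is what supplies additional inputs carrying the target-domain tag.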

Published

2021-05-18

How to Cite

Wang, K., Chen, G., Huang, Z., Wan, X., & Huang, F. (2021). Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16), 13970-13978. https://doi.org/10.1609/aaai.v35i16.17645

Section

AAAI Technical Track on Speech and Natural Language Processing III