Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation

Ke Wang; Guandan Chen; Zhongqiang Huang; Xiaojun Wan; Fei Huang

doi:10.1609/aaai.v35i16.17645

Authors

Ke Wang, Peking University
Guandan Chen, Alibaba Group
Zhongqiang Huang, Alibaba Group
Xiaojun Wan, Peking University
Fei Huang, Alibaba Group

DOI:

https://doi.org/10.1609/aaai.v35i16.17645

Keywords:

Machine Translation & Multilinguality, Applications, Representation Learning

Abstract

Despite the near-human performance already achieved on formal texts such as news articles, neural machine translation still struggles with user-generated texts, which exhibit diverse linguistic phenomena but lack large-scale, high-quality parallel corpora. To address this problem, we propose a counterfactual domain adaptation method that better leverages both large-scale source-domain data (formal texts) and small-scale target-domain data (informal texts). Specifically, by considering effective counterfactual conditions (the concatenations of source-domain texts and the target-domain tag), we construct counterfactual representations to fill the latent space of the target domain, which is sparse because of the small amount of data, thereby bridging the gap between the source-domain data and the target-domain data. Experiments on English-to-Chinese and Chinese-to-English translation tasks show that our method outperforms the base model trained only on the informal corpus by a large margin, and consistently surpasses different baseline methods by 1.12 to 4.34 BLEU points on different datasets. Furthermore, we show that our method achieves competitive performance on cross-domain language translation on four language pairs.
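
The counterfactual condition described in the abstract (a source-domain text concatenated with the target-domain tag) can be illustrated with a minimal Python sketch of the input construction only. The tag tokens, the function names, and the choice to prepend the tag are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
# Hypothetical tag tokens; the abstract does not name the actual tags.
FORMAL_TAG = "<formal>"      # source domain (formal texts)
INFORMAL_TAG = "<informal>"  # target domain (informal texts)


def tag_sentence(sentence: str, domain_tag: str) -> str:
    """Concatenate a domain tag with a sentence (prepending is an assumption)."""
    return f"{domain_tag} {sentence}"


def build_inputs(formal_sentences, informal_sentences):
    """Return (factual, counterfactual) tagged inputs.

    Factual inputs pair each sentence with the tag of its own domain.
    Counterfactual inputs pair formal (source-domain) sentences with the
    informal (target-domain) tag, as the abstract describes.
    """
    factual = [tag_sentence(s, FORMAL_TAG) for s in formal_sentences]
    factual += [tag_sentence(s, INFORMAL_TAG) for s in informal_sentences]
    counterfactual = [tag_sentence(s, INFORMAL_TAG) for s in formal_sentences]
    return factual, counterfactual
```

In this sketch the counterfactual inputs are the only ones whose tag disagrees with the domain the text came from. How the model turns such inputs into counterfactual representations is not covered here.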
