Bilingual Lexicon Induction from Non-Parallel Data with Minimal Supervision

Meng Zhang; Haoruo Peng; Yang Liu; Huanbo Luan; Maosong Sun

doi:10.1609/aaai.v31i1.10988

Authors

Meng Zhang, Tsinghua University
Haoruo Peng, University of Illinois, Urbana-Champaign
Yang Liu, Tsinghua University
Huanbo Luan, Tsinghua University
Maosong Sun, Tsinghua University


Keywords:

Bilingual word representation learning, Bilingual lexicon induction, Resource-scarce settings

Abstract

Building bilingual lexica from non-parallel data is a long-standing problem in natural language processing, and solving it could benefit thousands of resource-scarce languages that lack parallel data. Recent advances in continuous word representations have opened up new possibilities for this task, e.g., by establishing a cross-lingual mapping between word embeddings via a seed lexicon. This method is unreliable, however, when only a limited number of seeds is available, which is a reasonable setting for resource-scarce languages. We tackle this limitation by introducing a novel matching mechanism into bilingual word representation learning. The mechanism captures extra translation pairs exposed by the seeds and uses them to incrementally improve the bilingual word embeddings. In our experiments, the matching mechanism substantially improves the quality of the bilingual vector space, which in turn allows us to induce better bilingual lexica with as few as 10 seeds.
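
To make the seed-based mapping mentioned in the abstract concrete, the following is a minimal sketch of that baseline idea only: fit a linear map from source to target embeddings on the seed pairs, then translate each source word by nearest-neighbor search in the target space. It is not the matching mechanism proposed in the paper, whose details the abstract does not give. The function names (`learn_mapping`, `induce_lexicon`), the toy vocabulary, and the 2-dimensional vectors are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares W minimizing ||src_vecs @ W - tgt_vecs|| (Frobenius norm)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def induce_lexicon(src_emb, tgt_emb, seeds):
    """Map every source word into the target space and pick the
    target word with the highest cosine similarity.

    src_emb, tgt_emb: dicts from word to 1-D numpy array (hypothetical format).
    seeds: list of (source_word, target_word) translation pairs.
    """
    X = np.array([src_emb[s] for s, _ in seeds])
    Y = np.array([tgt_emb[t] for _, t in seeds])
    W = learn_mapping(X, Y)

    tgt_words = list(tgt_emb)
    T = np.array([tgt_emb[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)  # unit-length rows

    lexicon = {}
    for word, vec in src_emb.items():
        mapped = vec @ W
        mapped = mapped / np.linalg.norm(mapped)
        lexicon[word] = tgt_words[int(np.argmax(T @ mapped))]
    return lexicon


# Toy data (invented): the target space is the source space rotated by 90 degrees.
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
src_emb = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "bird": np.array([1.0, 1.0]),
}
tgt_emb = {
    "perro": src_emb["dog"] @ rotation,
    "gato": src_emb["cat"] @ rotation,
    "pajaro": src_emb["bird"] @ rotation,
}

# Only two seed pairs are given; "bird" must be translated via the learned map.
seeds = [("dog", "perro"), ("cat", "gato")]
lexicon = induce_lexicon(src_emb, tgt_emb, seeds)
print(lexicon)
```

In this toy case two seeds determine the map exactly because the spaces are 2-dimensional and related by a pure rotation. With real embeddings of a few hundred dimensions, a handful of seeds leaves the least-squares problem heavily underdetermined, which is the unreliability the abstract refers to.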
