Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching

Authors

  • Michael Wick, Oracle Labs
  • Pallika Kanani, Oracle Labs
  • Adam Pocock, Oracle Labs

DOI

https://doi.org/10.1609/aaai.v30i1.10360

Keywords

NLP, word embeddings, multilingual, sentiment analysis, artificial code switching

Abstract

We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages in which the documents in the corpus are expressed, and it does not rely on parallel corpora to constrain the spaces. Instead, we utilize a small set of human-provided word translations, which are often freely and readily available. We can encode such word translations as hard constraints in the model's objective functions; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code-switching so that words across multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, but that common structure also allows an NLP model trained in a source language to generalize to multiple target languages (achieving up to 80% of the accuracy of models trained with target-language data).
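The core preprocessing step the abstract describes can be sketched as follows: given a word-translation lexicon, randomly replace tokens in a sentence with their translations so that words from both languages appear in shared contexts before embedding training. This is a minimal illustrative sketch, not the authors' implementation; the toy dictionary, the `rate` parameter, and the function name are assumptions.

```python
import random

# Hypothetical toy English->Spanish lexicon; the paper assumes only a small
# set of human-provided word translations as supervision.
EN_ES = {"house": "casa", "dog": "perro", "big": "grande"}

def code_switch(tokens, lexicon, rate=0.3, rng=None):
    """Randomly swap tokens for their dictionary translations, so that an
    embedding model later sees words from both languages in the same contexts."""
    rng = rng or random.Random(0)
    return [lexicon[t] if t in lexicon and rng.random() < rate else t
            for t in tokens]

sentence = "the big dog ran to the house".split()
# With rate=1.0 every dictionary word is switched.
print(code_switch(sentence, EN_ES, rate=1.0))
# → ['the', 'grande', 'perro', 'ran', 'to', 'the', 'casa']
```

A standard skip-gram or CBOW model trained on such artificially code-switched text would then place `dog` and `perro` in similar regions of the embedding space, since they share context words.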

Published

2016-03-05

How to Cite

Wick, M., Kanani, P., & Pocock, A. (2016). Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10360

Section

Technical Papers: NLP and Machine Learning