Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching

Authors

  • Michael Wick, Oracle Labs
  • Pallika Kanani, Oracle Labs
  • Adam Pocock, Oracle Labs

DOI

https://doi.org/10.1609/aaai.v30i1.10360

Keywords

NLP, word embeddings, multilingual, sentiment analysis, artificial code switching

Abstract

We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages in which the documents in the corpus are expressed, and it does not rely on parallel corpora to constrain the spaces. Instead, we utilize a small set of human-provided word translations, which are often freely and readily available. We can encode such word translations as hard constraints in the model's objective functions; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code-switching so that words across multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, but that common structure also allows an NLP model trained in a source language to generalize to multiple target languages (achieving up to 80% of the accuracy of models trained with target-language data).
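The core preprocessing step the abstract describes can be sketched as follows: given a word-translation lexicon, randomly replace tokens in a sentence with their translations so that words from both languages appear in shared contexts before embedding training. This is a minimal illustrative sketch, not the authors' implementation; the toy dictionary, the `rate` parameter, and the function name are assumptions.

```python
import random

# Hypothetical toy English->Spanish lexicon; the paper assumes only a small
# set of human-provided word translations as supervision.
EN_ES = {"house": "casa", "dog": "perro", "big": "grande"}

def code_switch(tokens, lexicon, rate=0.3, rng=None):
    """Randomly swap tokens for their dictionary translations, so that an
    embedding model later sees words from both languages in the same contexts."""
    rng = rng or random.Random(0)
    return [lexicon[t] if t in lexicon and rng.random() < rate else t
            for t in tokens]

sentence = "the big dog ran to the house".split()
# With rate=1.0 every dictionary word is switched.
print(code_switch(sentence, EN_ES, rate=1.0))
# → ['the', 'grande', 'perro', 'ran', 'to', 'the', 'casa']
```

A standard skip-gram or CBOW model trained on such artificially code-switched text would then place `dog` and `perro` in similar regions of the embedding space, since they share context words.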

Published

2016-03-05

How to Cite

Wick, M., Kanani, P., & Pocock, A. (2016). Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10360

Section

Technical Papers: NLP and Machine Learning