InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling

Xiaobao Wu; Xinshuai Dong; Thong Nguyen; Chaoqun Liu; Liang-Ming Pan; Anh Tuan Luu

doi:10.1609/aaai.v37i11.26612

Authors

Xiaobao Wu Nanyang Technological University, Singapore
Xinshuai Dong Carnegie Mellon University, USA
Thong Nguyen National University of Singapore, Singapore
Chaoqun Liu Nanyang Technological University, Singapore; DAMO Academy, Alibaba Group, Singapore
Liang-Ming Pan National University of Singapore, Singapore
Anh Tuan Luu Nanyang Technological University, Singapore

DOI:

https://doi.org/10.1609/aaai.v37i11.26612

Keywords:

SNLP: Text Mining, SNLP: Text Classification, SNLP: Machine Translation & Multilinguality

Abstract

Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from two issues: they produce repetitive topics that hinder further analysis, and their performance declines when the dictionaries have low coverage. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a method for topic alignment with mutual information. It works as a regularizer that properly aligns topics and prevents degenerate topic representations of words, which mitigates the repetitive topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability for cross-lingual classification tasks.
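
The abstract does not say which mutual information estimator is used, so the following is only a minimal sketch under an assumption: it uses InfoNCE, a common contrastive lower bound on mutual information, to illustrate how such an objective can both align linked words and penalize degenerate (collapsed) representations. The function name `info_nce`, the `temperature` value, and the toy arrays are hypothetical and are not taken from the paper.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss for row-aligned pairs: anchor[i] is linked to positive[i].

    Assumption: rows are topic representations of linked cross-lingual words.
    Lower loss corresponds to a tighter lower bound on mutual information.
    """
    # L2-normalize rows so dot products are cosine similarities.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # shape (n, n)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Linked pairs sit on the diagonal.
    return -np.mean(np.diag(log_prob))

# Toy data (hypothetical): 4 words, each assigned to a distinct topic.
en = np.eye(4)
zh_aligned = np.eye(4)                      # linked words share a topic
zh_misaligned = np.roll(np.eye(4), 1, axis=0)  # linked words disagree
collapsed = np.ones((4, 4))                 # every word has the same representation

loss_aligned = info_nce(en, zh_aligned)
loss_misaligned = info_nce(en, zh_misaligned)
loss_collapsed = info_nce(collapsed, collapsed)
```

In this toy setting the aligned pairs give a loss near zero, the misaligned pairs give a large loss, and the collapsed representations are stuck at log(4), since identical representations cannot distinguish a linked word from any other word. This is one way a mutual information objective can discourage the degenerate representations the abstract associates with repetitive topics.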
