InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling

Authors

  • Xiaobao Wu, Nanyang Technological University, Singapore
  • Xinshuai Dong, Carnegie Mellon University, USA
  • Thong Nguyen, National University of Singapore, Singapore
  • Chaoqun Liu, Nanyang Technological University, Singapore; DAMO Academy, Alibaba Group, Singapore
  • Liang-Ming Pan, National University of Singapore, Singapore
  • Anh Tuan Luu, Nanyang Technological University, Singapore

DOI:

https://doi.org/10.1609/aaai.v37i11.26612

Keywords:

SNLP: Text Mining, SNLP: Text Classification, SNLP: Machine Translation & Multilinguality

Abstract

Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from two issues: they produce repetitive topics that hinder further analysis, and their performance declines when the given bilingual dictionary has low coverage. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a topic alignment method based on mutual information. It works as a regularizer that properly aligns topics and prevents degenerate topic representations of words, which mitigates the repetitive-topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations in a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability in cross-lingual classification tasks.
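To make the mutual-information view of topic alignment concrete, the sketch below shows a generic InfoNCE-style regularizer, a standard lower bound on mutual information, applied to topic representations of linked cross-lingual word pairs. This is an illustrative reconstruction, not the paper's exact objective: the function name, the cosine-similarity scoring, and the temperature value are assumptions for the example.

```python
import numpy as np

def infonce_alignment_loss(z_src, z_tgt, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between topic
    representations of linked cross-lingual word pairs (illustrative sketch).

    z_src, z_tgt: (n, k) arrays; row i of each holds the topic
    representation of the i-th linked (source, target) word pair.
    Linked pairs are positives; every other row in z_tgt acts as a negative.
    """
    # L2-normalize so the dot product below is cosine similarity.
    z_src = z_src / np.linalg.norm(z_src, axis=1, keepdims=True)
    z_tgt = z_tgt / np.linalg.norm(z_tgt, axis=1, keepdims=True)
    sim = z_src @ z_tgt.T / temperature          # (n, n) similarity matrix
    # Row-wise log-softmax: each linked pair sits on the diagonal.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Minimizing this loss maximizes the MI lower bound, pulling linked
    # words toward shared topic representations while pushing apart the rest.
    return -np.mean(np.diag(log_prob))
```

Under this kind of objective, perfectly aligned pairs yield a lower loss than mismatched ones, which is what drives linked cross-lingual words toward the same topics.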

Published

2023-06-26

How to Cite

Wu, X., Dong, X., Nguyen, T., Liu, C., Pan, L.-M., & Luu, A. T. (2023). InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13763-13771. https://doi.org/10.1609/aaai.v37i11.26612

Section

AAAI Technical Track on Speech & Natural Language Processing