InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling
DOI:
https://doi.org/10.1609/aaai.v37i11.26612
Keywords:
SNLP: Text Mining, SNLP: Text Classification, SNLP: Machine Translation & Multilinguality
Abstract
Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from two problems: they produce repetitive topics that hinder further analysis, and their performance declines when the dictionaries have low coverage. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a method of topic alignment with mutual information. This works as a regularization that properly aligns topics and prevents degenerate topic representations of words, which mitigates the repetitive topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability for cross-lingual classification tasks.
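The abstract does not give the exact objective, so the following is only a minimal sketch of one common way to maximize a lower bound on mutual information between linked words: an InfoNCE-style contrastive loss. It is not taken from the paper. The function names (`cosine`, `info_nce_loss`), the use of cosine similarity, the `temperature` value, and the toy topic representations are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(src_reps, tgt_reps, links, temperature=0.2):
    """Mean InfoNCE loss over linked (source_index, target_index) word pairs.

    src_reps / tgt_reps: one topic representation (list of floats) per word.
    For each linked pair, the linked target word is the positive and every
    other target word is a negative. Lower loss means linked words are easier
    to tell apart from unlinked ones. (Hypothetical helper, not the paper's.)
    """
    total = 0.0
    for i, j in links:
        logits = [cosine(src_reps[i], t) / temperature for t in tgt_reps]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[j]
    return total / len(links)

# Toy data (assumed): two words per language, two topics.
en = [[0.9, 0.1], [0.1, 0.9]]
zh_aligned = [[0.8, 0.2], [0.2, 0.8]]      # linked words share a topic
zh_collapsed = [[0.5, 0.5], [0.5, 0.5]]    # degenerate: every word looks the same
links = [(0, 0), (1, 1)]                   # e.g. dictionary translations

aligned_loss = info_nce_loss(en, zh_aligned, links)
collapsed_loss = info_nce_loss(en, zh_collapsed, links)
print(aligned_loss < collapsed_loss)
```

In this toy setting the collapsed representations cannot distinguish the linked word from the other candidate, so their loss equals the chance level log 2, while the aligned representations score lower. That is the intuition behind using such a term as a regularizer against degenerate topic representations of words.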
Published
2023-06-26
How to Cite
Wu, X., Dong, X., Nguyen, T., Liu, C., Pan, L.-M., & Luu, A. T. (2023). InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13763-13771. https://doi.org/10.1609/aaai.v37i11.26612
Section
AAAI Technical Track on Speech & Natural Language Processing