TC-DWA: Text Clustering with Dual Word-Level Augmentation

Authors

  • Bo Cheng, School of Artificial Intelligence, Jilin University, China; International Center of Future Science, Jilin University, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China
  • Ximing Li, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China
  • Yi Chang, School of Artificial Intelligence, Jilin University, China; International Center of Future Science, Jilin University, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China

DOI:

https://doi.org/10.1609/aaai.v37i6.25868

Keywords:

ML: Clustering, DMKM: Applications, ML: Unsupervised & Self-Supervised Learning, SNLP: Language Models

Abstract

Pre-trained language models, e.g., ELMo and BERT, have recently achieved promising performance improvements on a wide range of NLP tasks because they output strong contextualized embedded features of words. Inspired by their success, in this paper we fine-tune them to effectively handle the text clustering task, a classic and fundamental challenge in machine learning. Accordingly, we propose a novel BERT-based method, namely Text Clustering with Dual Word-level Augmentation (TC-DWA). Specifically, we formulate a self-training objective and enhance it with a dual word-level augmentation technique. First, we assume that each text contains several highly informative words, called anchor words, which support the semantics of the full text. These anchor words are selected by ranking the norm-based attention weights of words, and their embedded features are used as augmented data. Second, we formulate an expectation form of word augmentation, which is equivalent to generating infinitely many augmented features, and further propose a tractable Taylor-expansion approximation for efficient optimization. To evaluate the effectiveness of TC-DWA, we conduct extensive experiments on several benchmark text datasets. The results demonstrate that TC-DWA consistently outperforms state-of-the-art baseline methods. Code available: https://github.com/BoCheng-96/TC-DWA.
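As an illustration of the anchor-word idea described in the abstract, the sketch below scores each token of a text with an attention-derived weight from BERT and keeps the embeddings of the top-ranked tokens as an augmented view. It is a minimal, hypothetical example: it uses bert-base-uncased and averaged attention from the [CLS] token as a stand-in scoring rule, not the exact norm-based attention weighting or the self-training objective of TC-DWA.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed model choice for illustration only; TC-DWA's actual backbone and
# norm-based attention scoring are described in the paper, not reproduced here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def anchor_word_features(text: str, k: int = 5):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    # Last-layer attention, averaged over heads: (seq, seq).
    attn = out.attentions[-1][0].mean(dim=0)
    # Proxy score for each token: attention it receives from [CLS].
    scores = attn[0].clone()
    scores[0] = float("-inf")    # exclude [CLS]
    scores[-1] = float("-inf")   # exclude [SEP]
    k = min(k, scores.numel() - 2)
    top_idx = scores.topk(k).indices
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    anchors = [tokens[i] for i in top_idx.tolist()]
    # Embedded features of the selected anchor words: (k, hidden_size).
    feats = out.last_hidden_state[0, top_idx]
    return anchors, feats

anchors, feats = anchor_word_features(
    "Pre-trained language models produce strong contextual embeddings of words."
)
print(anchors, feats.shape)

In the method described by the paper, such anchor-word features would serve as one of the augmented views feeding the self-training clustering objective; the second, expectation-based augmentation with its Taylor-expansion approximation is not sketched here.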

Published

2023-06-26

How to Cite

Cheng, B., Li, X., & Chang, Y. (2023). TC-DWA: Text Clustering with Dual Word-Level Augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6), 7113-7121. https://doi.org/10.1609/aaai.v37i6.25868

Issue

Vol. 37 No. 6 (2023)

Section

AAAI Technical Track on Machine Learning I