TC-DWA: Text Clustering with Dual Word-Level Augmentation
DOI: https://doi.org/10.1609/aaai.v37i6.25868
Keywords: ML: Clustering, DMKM: Applications, ML: Unsupervised & Self-Supervised Learning, SNLP: Language Models
Abstract
Pre-trained language models, e.g., ELMo and BERT, have recently achieved promising performance improvements in a wide range of NLP tasks because they output strong contextualized embeddings of words. Inspired by this success, in this paper we aim to fine-tune them to effectively handle the text clustering task, a classic and fundamental challenge in machine learning. Accordingly, we propose a novel BERT-based method, namely Text Clustering with Dual Word-level Augmentation (TC-DWA). Specifically, we formulate a self-training objective and enhance it with a dual word-level augmentation technique. First, we assume that each text contains several highly informative words, called anchor words, which carry the semantics of the full text. We select anchor words by ranking the norm-based attention weights of the words and use their embedded features as augmented data. Second, we formulate an expectation form of word augmentation, which is equivalent to generating infinitely many augmented features, and derive a tractable Taylor-expansion approximation for efficient optimization. To evaluate the effectiveness of TC-DWA, we conduct extensive experiments on several benchmark text datasets. The results demonstrate that TC-DWA consistently outperforms state-of-the-art baseline methods. Code available: https://github.com/BoCheng-96/TC-DWA.
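As a rough illustration of the two augmentation steps described in the abstract, the PyTorch sketch below shows (i) anchor-word selection by ranking norm-based attention weights and (ii) a second-order Taylor approximation of an expected augmentation loss. This is a minimal sketch under our own assumptions: the function names, the Gaussian perturbation model, and the Hutchinson trace estimator are choices we made for illustration, not the authors' released implementation (see the repository linked above for that).

import torch

def select_anchor_words(token_embeddings, attention_weights, k=3):
    # token_embeddings: (seq_len, hidden) contextual features from BERT
    # attention_weights: (seq_len,) e.g. [CLS]-to-token attention averaged over heads
    # Score each word by its attention weight scaled by its embedding norm,
    # then keep the top-k words as "anchor words".
    scores = attention_weights * token_embeddings.norm(dim=-1)
    top_idx = torch.topk(scores, k=min(k, scores.numel())).indices
    # The embedded features of the anchor words serve as augmented views of the text.
    return token_embeddings[top_idx]

def expected_augmentation_loss(loss_fn, features, sigma=0.1, n_probes=1):
    # Approximates E_eps[ loss_fn(features + eps) ] for eps ~ N(0, sigma^2 I),
    # i.e. the effect of infinitely many perturbed augmentations, via a
    # second-order Taylor expansion:
    #   E[loss] ~= loss(x) + (sigma^2 / 2) * tr(Hessian of loss at x),
    # with tr(Hessian) estimated by Hutchinson probes (our hypothetical choice).
    features = features.detach().requires_grad_(True)  # illustrative: detached from the encoder
    base = loss_fn(features)
    grad = torch.autograd.grad(base, features, create_graph=True)[0]
    trace_est = features.new_zeros(())
    for _ in range(n_probes):
        v = torch.randn_like(features)
        # Hessian-vector product via double backprop; E[v^T H v] = tr(H)
        hvp = torch.autograd.grad((grad * v).sum(), features, retain_graph=True)[0]
        trace_est = trace_est + (hvp * v).sum() / n_probes
    return base + 0.5 * sigma ** 2 * trace_est

In practice one might feed select_anchor_words the per-token attention from the final BERT layer and add expected_augmentation_loss as a regularizer to the clustering objective; sigma and n_probes are tuning knobs invented for this sketch.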
Published: 2023-06-26
How to Cite
Cheng, B., Li, X., & Chang, Y. (2023). TC-DWA: Text Clustering with Dual Word-Level Augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6), 7113-7121. https://doi.org/10.1609/aaai.v37i6.25868
Section: AAAI Technical Track on Machine Learning I