TC-DWA: Text Clustering with Dual Word-Level Augmentation
DOI: https://doi.org/10.1609/aaai.v37i6.25868
Keywords: ML: Clustering, DMKM: Applications, ML: Unsupervised & Self-Supervised Learning, SNLP: Language Models
Abstract
Pre-trained language models, e.g., ELMo and BERT, have recently achieved promising performance improvements in a wide range of NLP tasks because they output strong contextualized embeddings of words. Inspired by this success, in this paper we aim to fine-tune them to effectively handle the text clustering task, a classic and fundamental challenge in machine learning. Accordingly, we propose a novel BERT-based method, namely Text Clustering with Dual Word-level Augmentation (TC-DWA). Specifically, we formulate a self-training objective and enhance it with a dual word-level augmentation technique. First, we assume that each text contains several highly informative words, called anchor words, which carry the semantics of the full text. We select anchor words by ranking the norm-based attention weights of the words and use their embedded features as augmented data. Second, we formulate an expectation form of word augmentation, which is equivalent to generating infinitely many augmented features, and derive a tractable Taylor-expansion approximation for efficient optimization. To evaluate the effectiveness of TC-DWA, we conduct extensive experiments on several benchmark text datasets. The results demonstrate that TC-DWA consistently outperforms state-of-the-art baseline methods. Code available: https://github.com/BoCheng-96/TC-DWA.
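As a rough illustration of the two augmentation steps described in the abstract, the PyTorch sketch below shows (i) anchor-word selection by ranking norm-based attention weights and (ii) a second-order Taylor approximation of an expected augmentation loss. This is a minimal sketch under our own assumptions: the function names, the Gaussian perturbation model, and the Hutchinson trace estimator are choices we made for illustration, not the authors' released implementation (see the repository linked above for that).

import torch

def select_anchor_words(token_embeddings, attention_weights, k=3):
    # token_embeddings: (seq_len, hidden) contextual features from BERT
    # attention_weights: (seq_len,) e.g. [CLS]-to-token attention averaged over heads
    # Score each word by its attention weight scaled by its embedding norm,
    # then keep the top-k words as "anchor words".
    scores = attention_weights * token_embeddings.norm(dim=-1)
    top_idx = torch.topk(scores, k=min(k, scores.numel())).indices
    # The embedded features of the anchor words serve as augmented views of the text.
    return token_embeddings[top_idx]

def expected_augmentation_loss(loss_fn, features, sigma=0.1, n_probes=1):
    # Approximates E_eps[ loss_fn(features + eps) ] for eps ~ N(0, sigma^2 I),
    # i.e. the effect of infinitely many perturbed augmentations, via a
    # second-order Taylor expansion:
    #   E[loss] ~= loss(x) + (sigma^2 / 2) * tr(Hessian of loss at x),
    # with tr(Hessian) estimated by Hutchinson probes (our hypothetical choice).
    features = features.detach().requires_grad_(True)  # illustrative: detached from the encoder
    base = loss_fn(features)
    grad = torch.autograd.grad(base, features, create_graph=True)[0]
    trace_est = features.new_zeros(())
    for _ in range(n_probes):
        v = torch.randn_like(features)
        # Hessian-vector product via double backprop; E[v^T H v] = tr(H)
        hvp = torch.autograd.grad((grad * v).sum(), features, retain_graph=True)[0]
        trace_est = trace_est + (hvp * v).sum() / n_probes
    return base + 0.5 * sigma ** 2 * trace_est

In practice one might feed select_anchor_words the per-token attention from the final BERT layer and add expected_augmentation_loss as a regularizer to the clustering objective; sigma and n_probes are tuning knobs invented for this sketch.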
Published: 2023-06-26
How to Cite
Cheng, B., Li, X., & Chang, Y. (2023). TC-DWA: Text Clustering with Dual Word-Level Augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6), 7113-7121. https://doi.org/10.1609/aaai.v37i6.25868
Section: AAAI Technical Track on Machine Learning I