Contrast-Enhanced Semi-supervised Text Classification with Few Labels


  • Austin Cheng-Yun Tsai National Taiwan University
  • Sheng-Ya Lin National Taiwan University
  • Li-Chen Fu National Taiwan University



Keywords: Speech & Natural Language Processing (SNLP), Machine Learning (ML), Domain(s) Of Application (APP)


Traditional text classification requires thousands of annotated examples or an additional Neural Machine Translation (NMT) system, both of which are expensive to obtain in real applications. This paper presents a Contrast-Enhanced Semi-supervised Text Classification (CEST) framework for label-limited settings that does not incorporate any NMT system. We propose a certainty-driven sample selection method and a contrast-enhanced similarity graph to use data more efficiently in self-training, alleviating the annotation-starvation problem. The graph imposes a smoothness constraint on the unlabeled data to improve the coherence and accuracy of the pseudo-labels. Moreover, CEST formulates training as a “learning from noisy labels” problem and performs the optimization accordingly. A salient feature of this formulation is that it explicitly suppresses the severe error propagation problem of conventional semi-supervised learning. With only 30 labeled examples per class for both the training and validation sets, CEST outperforms the previous state-of-the-art algorithms by 2.11% accuracy and comes within 3.04% accuracy of fully supervised fine-tuning of a pre-trained language model on thousands of labeled examples.
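To make the self-training idea above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "certainty" is measured by predictive entropy and that the smoothness constraint can be approximated by one step of neighbour averaging over a similarity graph; the abstract specifies neither. The names `select_certain`, `smooth_over_graph`, and the mixing weight `alpha` are hypothetical and chosen only for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def select_certain(probs, k):
    """Return indices of the k samples with the lowest predictive entropy.

    Assumption: low entropy stands in for the paper's certainty measure.
    """
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]))
    return ranked[:k]

def smooth_over_graph(probs, weights, alpha=0.5):
    """Blend each prediction with the similarity-weighted mean of its neighbours.

    Assumption: a single averaging step approximates a smoothness constraint;
    `weights[i][j]` is a hypothetical similarity between samples i and j.
    """
    n = len(probs)
    smoothed = []
    for i, p in enumerate(probs):
        total = sum(weights[i])
        if total == 0.0:
            # Isolated node: keep its own prediction unchanged.
            smoothed.append(list(p))
            continue
        neigh = [
            sum(weights[i][j] * probs[j][c] for j in range(n)) / total
            for c in range(len(p))
        ]
        smoothed.append(
            [(1 - alpha) * p[c] + alpha * neigh[c] for c in range(len(p))]
        )
    return smoothed

# Toy predictions for four unlabeled samples over two classes.
probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.50, 0.50]]
# Toy similarity graph: samples 0-1 are neighbours, as are samples 2-3.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

smoothed = smooth_over_graph(probs, weights)
chosen = select_certain(smoothed, k=2)
# Pseudo-label each selected sample with its most probable class.
pseudo_labels = {
    i: max(range(len(smoothed[i])), key=lambda c: smoothed[i][c])
    for i in chosen
}
```

In this toy graph, the uncertain sample 1 is pulled toward its confident neighbour 0, so both end up selected and pseudo-labeled as class 0. A real pipeline would retrain the classifier on the selected pseudo-labels and repeat.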




How to Cite

Tsai, A. C.-Y., Lin, S.-Y., & Fu, L.-C. (2022). Contrast-Enhanced Semi-supervised Text Classification with Few Labels. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11394-11402.



AAAI Technical Track on Speech and Natural Language Processing