Constrained Coclustering for Textual Documents

Authors

  • Yangqiu Song, IBM Research - China
  • Shimei Pan, IBM T. J. Watson Research Center
  • Shixia Liu, IBM Research - China
  • Furu Wei, IBM Research - China
  • Michelle Zhou, IBM Research - Almaden Center
  • Weihong Qian, IBM Research - China

DOI:

https://doi.org/10.1609/aaai.v24i1.7680

Keywords:

constrained clustering, co-clustering, semi-supervised learning

Abstract

In this paper, we present a constrained co-clustering approach for clustering textual documents. Our approach combines the benefits of information-theoretic co-clustering and constrained clustering. We use a two-sided hidden Markov random field (HMRF) to model both the document and word constraints. We also develop an alternating expectation maximization (EM) algorithm to optimize the constrained co-clustering model. We have conducted two sets of experiments on a benchmark data set: (1) using human-provided category labels to derive document and word constraints for semi-supervised document clustering, and (2) using automatically extracted named entities to derive document constraints for unsupervised document clustering. Compared to several representative constrained clustering and co-clustering approaches, our approach is shown to be more effective for high-dimensional, sparse text data.
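
The abstract does not spell out how category labels become constraints. One common convention in constrained clustering, assumed here purely for illustration, is to turn labels into pairwise constraints: two labeled items with the same category form a must-link pair, and two with different categories form a cannot-link pair. The sketch below shows that convention only; the function name `derive_pairwise_constraints` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def derive_pairwise_constraints(labels):
    """Build pairwise constraints from a dict of item id -> category label.

    Assumption (not from the paper): same label -> must-link,
    different labels -> cannot-link. Only labeled items are passed in.
    """
    must_link, cannot_link = [], []
    # Visit every unordered pair of labeled items once, in sorted id order.
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            must_link.append((a, b))
        else:
            cannot_link.append((a, b))
    return must_link, cannot_link


# Hypothetical toy labels for three documents.
doc_labels = {"d1": "sports", "d2": "sports", "d3": "politics"}
must, cannot = derive_pairwise_constraints(doc_labels)
```

The same helper could be applied to a dict of word labels to obtain word-side constraints, which is how a two-sided setup might receive input for both documents and words.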

Published

2010-07-03

How to Cite

Song, Y., Pan, S., Liu, S., Wei, F., Zhou, M., & Qian, W. (2010). Constrained Coclustering for Textual Documents. Proceedings of the AAAI Conference on Artificial Intelligence, 24(1), 581-586. https://doi.org/10.1609/aaai.v24i1.7680