Word Segmentation for Chinese Novels


  • Likun Qiu Singapore University of Technology and Design
  • Yue Zhang Singapore University of Technology and Design




Natural Language Processing, Evaluation and Analysis, Information Extraction


Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but the accuracies drop significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary (OOV) words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining for the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.
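
The double-propagation idea in the abstract can be illustrated with a generic bootstrapping loop. The sketch below is not the authors' algorithm: it assumes pre-tokenized toy sentences (the paper works on unsegmented Chinese text), treats a "contextual pattern" as simply the pair of neighboring tokens, and uses a hypothetical function name `double_propagation` and threshold `min_pattern_count` chosen only for illustration. Known entities propose patterns, frequent patterns propose new entities, and the loop repeats until nothing new is found.

```python
from collections import Counter

def double_propagation(sentences, seed_entities, min_pattern_count=2, max_iters=5):
    """Toy bootstrapping loop: entities -> context patterns -> more entities.

    sentences: list of token lists (an illustrative simplification).
    A pattern is the (previous token, next token) pair around an entity.
    """
    entities = set(seed_entities)
    patterns = set()
    for _ in range(max_iters):
        # Step 1: collect the contexts in which known entities occur.
        counts = Counter()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in entities:
                    counts[(tokens[i - 1], tokens[i + 1])] += 1
        new_patterns = {p for p, c in counts.items() if c >= min_pattern_count}

        # Step 2: any token appearing inside a frequent context becomes a candidate entity.
        new_entities = set()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in new_patterns:
                    new_entities.add(tokens[i])

        # Stop once an iteration adds neither patterns nor entities.
        if new_patterns <= patterns and new_entities <= entities:
            break
        patterns |= new_patterns
        entities |= new_entities
    return entities, patterns

# Hypothetical toy data; English tokens stand in for segmented Chinese words.
sentences = [
    ["then", "Zhang", "said", "yes"],
    ["then", "Zhang", "said", "no"],
    ["then", "Li", "said", "ok"],
    ["the", "Li", "river", "flows"],
]
entities, patterns = double_propagation(sentences, {"Zhang"})
print(sorted(entities))  # ['Li', 'Zhang']
```

Here the seed "Zhang" yields the frequent context ("then", "said"), which in turn picks up "Li" as a new entity; the one-off context ("the", "river") falls below the threshold and is not promoted to a pattern. In the approach the abstract describes, the mined entities and patterns would then serve as plug-in features for a segmenter trained on the source domain, a step this sketch does not cover.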




How to Cite

Qiu, L., & Zhang, Y. (2015). Word Segmentation for Chinese Novels. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1). https://doi.org/10.1609/aaai.v29i1.9523