Word Segmentation for Chinese Novels

Likun Qiu; Yue Zhang

doi:10.1609/aaai.v29i1.9523

Authors

Likun Qiu Singapore University of Technology and Design
Yue Zhang Singapore University of Technology and Design

Keywords:

Natural Language Processing, Evaluation and Analysis, Information Extraction

Abstract

Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but accuracy drops significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary (OOV) words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining of the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.
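
The abstract does not spell out the double-propagation procedure, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes the novel has already been tokenized by a baseline segmenter, treats a contextual pattern as the (left neighbor, right neighbor) pair around a token, and accepts a pattern or an entity once it is supported by at least `min_support` distinct items. The function name `double_propagation`, the pattern definition, the threshold and the romanized toy tokens are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def double_propagation(sentences, seed_entities, min_support=2, max_iters=10):
    """Toy bootstrapping loop: entities propose patterns, patterns propose entities.

    sentences: list of token lists (assumed output of a baseline segmenter).
    seed_entities: initial set of known noun entities.
    """
    entities = set(seed_entities)
    patterns = set()

    # Record every (left neighbor, token, right neighbor) occurrence once.
    occurrences = []
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            occurrences.append((padded[i - 1], padded[i], padded[i + 1]))

    for _ in range(max_iters):
        # Entities -> patterns: keep contexts shared by enough known entities.
        pattern_support = defaultdict(set)
        for left, tok, right in occurrences:
            if tok in entities:
                pattern_support[(left, right)].add(tok)
        new_patterns = {p for p, ents in pattern_support.items()
                        if len(ents) >= min_support}

        # Patterns -> entities: keep unknown tokens seen in enough known contexts.
        known_patterns = patterns | new_patterns
        entity_support = defaultdict(set)
        for left, tok, right in occurrences:
            if (left, right) in known_patterns and tok not in entities:
                entity_support[tok].add((left, right))
        new_entities = {e for e, pats in entity_support.items()
                        if len(pats) >= min_support}

        # Stop once neither set grows.
        if new_patterns <= patterns and not new_entities:
            break
        patterns |= new_patterns
        entities |= new_entities

    return entities, patterns


# Toy, romanized data (hypothetical): "Wang" shares two contexts with the seeds.
toy_sentences = [
    ["Zhang", "said", "hello"],
    ["Li", "said", "bye"],
    ["he", "saw", "Zhang", "leave"],
    ["she", "saw", "Li", "leave"],
    ["Wang", "said", "ok"],
    ["they", "saw", "Wang", "leave"],
    ["Zhao", "said", "no"],
]
mined_entities, mined_patterns = double_propagation(toy_sentences, {"Zhang", "Li"})
```

In this toy run, "Wang" is added because it occurs in both mined contexts, while "Zhao" is not because it occurs in only one. In the setting the abstract describes, the mined entities and patterns would then be supplied as plug-in features to the segmentation model trained on the source domain, which is what avoids retraining for each novel; that integration step is not shown here.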
