A Simple and Effective Unsupervised Word Segmentation Approach

Authors

  • Songjian Chen Sun Yat-sen University
  • Yabo Xu Sun Yat-sen University
  • Huiyou Chang Sun Yat-sen University

DOI:

https://doi.org/10.1609/aaai.v25i1.7970

Abstract

In this paper, we propose a new unsupervised approach for word segmentation. The core idea of our approach is a novel word induction criterion called WordRank, which estimates the goodness of word hypotheses (character or phoneme sequences). We devise a method to derive exterior word boundary information from the link structures of adjacent word hypotheses and incorporate interior word boundary information to complete the model. In light of WordRank, word segmentation can be modeled as an optimization problem. A Viterbi-style algorithm is developed to search for the optimal segmentation. Extensive experiments conducted on phonetic transcripts as well as standard Chinese and Japanese data sets demonstrate the effectiveness of our approach. On the standard Brent version of the Bernstein-Ratner corpus, our approach outperforms the state-of-the-art Bayesian models by more than 3%. In addition, our approach is simpler and more efficient than the Bayesian methods. Consequently, our approach is more suitable for real-world applications.
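
To make the "segmentation as optimization" idea concrete, the following is a minimal sketch of a Viterbi-style dynamic program over word hypotheses. It is not the authors' implementation: the names `segment`, `toy_score`, `TOY_SCORES`, and the `max_len` parameter are hypothetical, the toy scores are invented for illustration, and the objective used here (the sum of log-scores, i.e. the product of per-word scores) is an assumption. The abstract does not specify how WordRank is computed or how scores are combined, so any goodness function returning a positive number could be substituted for `toy_score`.

```python
import math

# Hypothetical goodness scores standing in for a WordRank-like criterion.
# These values are invented for illustration only.
TOY_SCORES = {"the": 0.9, "dog": 0.8, "do": 0.3, "g": 0.05, "thedog": 0.01}


def toy_score(hypothesis):
    """Return a positive goodness score for a word hypothesis (assumed form)."""
    if hypothesis in TOY_SCORES:
        return TOY_SCORES[hypothesis]
    # Small floor for unknown single characters so a segmentation always exists;
    # unknown longer strings are treated as impossible.
    return 1e-6 if len(hypothesis) == 1 else 0.0


def segment(text, score, max_len=6):
    """Find the segmentation of `text` that maximizes the summed log-score."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total log-score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            s = score(text[start:end])
            if s <= 0.0 or best[start] == -math.inf:
                continue  # skip impossible words and unreachable prefixes
            total = best[start] + math.log(s)
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("thedog", toy_score))
```

With these toy scores the search prefers "the" + "dog" over the single hypothesis "thedog" or a character-by-character split, because the summed log-score of the two-word path is highest. The running time is proportional to the text length times `max_len`, which is the usual cost of this kind of dynamic program.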

Published

2011-08-04

How to Cite

Chen, S., Xu, Y., & Chang, H. (2011). A Simple and Effective Unsupervised Word Segmentation Approach. Proceedings of the AAAI Conference on Artificial Intelligence, 25(1), 866-871. https://doi.org/10.1609/aaai.v25i1.7970

Section

AAAI Technical Track: Natural Language Processing