A Simple and Effective Unsupervised Word Segmentation Approach

Songjian Chen; Yabo Xu; Huiyou Chang

doi:10.1609/aaai.v25i1.7970

Authors

Songjian Chen Sun Yat-sen University
Yabo Xu Sun Yat-sen University
Huiyou Chang Sun Yat-sen University

DOI:

https://doi.org/10.1609/aaai.v25i1.7970

Abstract

In this paper, we propose a new unsupervised approach for word segmentation. The core idea of our approach is a novel word induction criterion called WordRank, which estimates the goodness of word hypotheses (character or phoneme sequences). We devise a method to derive exterior word boundary information from the link structures of adjacent word hypotheses and incorporate interior word boundary information to complete the model. In light of WordRank, word segmentation can be modeled as an optimization problem. A Viterbi-style algorithm is developed to search for the optimal segmentation. Extensive experiments conducted on phonetic transcripts as well as standard Chinese and Japanese data sets demonstrate the effectiveness of our approach. On the standard Brent version of the Bernstein-Ratner corpus, our approach outperforms the state-of-the-art Bayesian models by more than 3%. Moreover, our approach is simpler and more efficient than the Bayesian methods. Consequently, our approach is more suitable for real-world applications.
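
To illustrate what a Viterbi-style search over segmentations can look like, here is a minimal sketch in Python. It is not the authors' implementation: the function name `segment`, the `max_len` cap, and the toy scores are assumptions made for illustration, and the `score` callback merely stands in for a goodness measure such as WordRank, whose actual definition is not given in the abstract. The sketch assumes the objective is the product of per-word scores.

```python
import math

def segment(text, score, max_len=4):
    """Return the split of `text` that maximizes the product of word scores.

    `score` is a hypothetical stand-in for a word-goodness measure;
    candidates with a non-positive score are treated as invalid words.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            s = score(text[start:end])
            if s <= 0 or best[start] == -math.inf:
                continue
            cand = best[start] + math.log(s)
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    if best[n] == -math.inf:
        return []  # no valid segmentation under this scoring function
    # Follow the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy scores, invented for this example only.
toy = {"the": 0.5, "dog": 0.4, "t": 0.1, "h": 0.1, "e": 0.1,
       "d": 0.1, "o": 0.1, "g": 0.1}
result = segment("thedog", lambda w: toy.get(w, 0.0))
```

With these toy scores, `result` is `["the", "dog"]`, because that split has a higher score product than any split into shorter pieces. The dynamic program visits each end position once and considers at most `max_len` candidate words ending there, so the search is linear in the input length for a fixed `max_len`.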
