Unsupervised Learning Helps Supervised Neural Word Segmentation
DOI:
https://doi.org/10.1609/aaai.v33i01.33017200Abstract
By exploiting unlabeled data for further performance improvement for Chinese word segmentation, this work makes the first attempt at exploring adding unsupervised segmentation information into neural supervised segmenter. We survey various effective strategies, including extending the character embedding, augmenting the word score and applying multi-task learning, for leveraging unsupervised information derived from abundant unlabeled data. Experiments on standard data sets show that the explored strategies indeed improve the recall rate of out-of-vocabulary words and thus boost the segmentation accuracy. Moreover, the model enhanced by the proposed methods outperforms state-of-theart models in closed test and shows promising improvement trend when adopting three different strategies with the help of a large unlabeled data set. Our thorough empirical study eventually verifies the proposed approach outperforms the widelyused pre-training approach in terms of effectively making use of freely abundant unlabeled data.