The current mainstream approach to Chinese word segmentation treats it as a character tagging problem and solves it with traditional machine learning. These methods, however, require features to be configured and extracted from Chinese text by hand, work with a high-dimensional dictionary, and take a long time to train when only CPUs are used. To address these problems, this paper proposes an improved method based on the long short-term memory (LSTM) network model, which uses different tag sets and adds pre-trained character embeddings to perform Chinese word segmentation. Experiments on corpora commonly used for Chinese word segmentation evaluation compare the method with the best Bakeoff results and with state-of-the-art methods. The results show that the LSTM-based method outperforms traditional machine learning methods, and that the six-tag set combined with pre-trained character embeddings gives the best segmentation performance among the configurations tested. In addition, using GPUs greatly shortens the training time of the deep neural network model. The LSTM-based method can also be extended more easily to other sequence labeling tasks in natural language processing (NLP).
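
To make the character tagging formulation concrete, the following minimal sketch converts a segmented sentence into per-character tags and back again. The abstract does not list the labels of its six-tag set, so the scheme used here (S, B, B2, B3, M, E) is an assumption based on a commonly used six-tag convention, and the function names are hypothetical. The LSTM network itself is not shown; in the setting described above, it would predict one such tag for each input character.

```python
from typing import List

# Assumed six-tag set (the abstract does not name the labels):
#   S  = single-character word
#   B, B2, B3 = first, second, and third character of a multi-character word
#   M  = any further middle character
#   E  = last character of a multi-character word
HEAD_TAGS = ["B", "B2", "B3"]


def word_to_tags(word: str) -> List[str]:
    """Return one tag per character for a single word."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    # Tag every character except the last, then close the word with E.
    body = [HEAD_TAGS[i] if i < len(HEAD_TAGS) else "M" for i in range(n - 1)]
    return body + ["E"]


def words_to_tags(words: List[str]) -> List[str]:
    """Flatten a segmented sentence into a character-level tag sequence."""
    tags: List[str] = []
    for word in words:
        tags.extend(word_to_tags(word))
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Recover the segmentation: a word ends after every E or S tag."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Tolerate a predicted sequence that stops in the middle of a word.
        words.append(current)
    return words
```

Under this scheme, segmentation reduces to sequence labeling: training data is produced with `words_to_tags`, and the tags predicted for an unsegmented sentence are turned back into words with `tags_to_words`. A four-tag set (B, M, E, S) would follow from the same code by shortening `HEAD_TAGS` to `["B"]`.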