Chinese word segmentation is a prerequisite for and foundation of natural language processing. This paper applies the alternative covering algorithm, which performs well on classification tasks, to Chinese word segmentation. Segmentation is treated as a character classification problem: each character is placed in the context of its neighboring characters before and after it (a window covering the four characters around it) and assigned to one of four categories: it stands alone, it joins the preceding character, it joins the following character, or it joins both the preceding and the following characters. The classifier is trained on the annotated People's Daily corpus and requires no dictionary. The method handles overlapping ambiguity in Chinese word segmentation well and achieves a segmentation accuracy of 90.6%.
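The character classification view described above can be illustrated with a minimal sketch in Python. It shows only the data preparation and decoding steps: turning a segmented sentence into per-character category labels, extracting a context window around each character, and turning labels back into words. The covering-algorithm classifier itself is not shown. The label names (`S`, `P`, `N`, `B`), the padding symbol `#`, the function names, and the reading of the context as two characters on each side are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List

# Hypothetical label names for the four categories in the abstract.
ALONE = "S"  # the character stands alone (single-character word)
PREV = "P"   # the character joins the preceding character only (word-final)
NEXT = "N"   # the character joins the following character only (word-initial)
BOTH = "B"   # the character joins both neighbors (word-internal)
PAD = "#"    # assumed padding symbol for sentence boundaries


def words_to_labels(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one category label per character."""
    labels: List[str] = []
    for word in words:
        if len(word) == 1:
            labels.append(ALONE)
        else:
            labels.append(NEXT)
            labels.extend([BOTH] * (len(word) - 2))
            labels.append(PREV)
    return labels


def context_windows(text: str, radius: int = 2) -> List[str]:
    """Return, for each character, the window of `radius` characters on
    each side plus the character itself (radius=2 is an assumption)."""
    padded = PAD * radius + text + PAD * radius
    return [padded[i:i + 2 * radius + 1] for i in range(len(text))]


def labels_to_words(text: str, labels: List[str]) -> List[str]:
    """Rebuild words from labels: a word ends after an ALONE or PREV label."""
    words: List[str] = []
    current = ""
    for ch, label in zip(text, labels):
        current += ch
        if label in (ALONE, PREV):
            words.append(current)
            current = ""
    if current:  # flush a trailing unfinished word, if any
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["中文", "分词", "是", "基础"]
    sentence = "".join(segmented)
    labels = words_to_labels(segmented)
    for window, label in zip(context_windows(sentence), labels):
        print(window, label)
    print(labels_to_words(sentence, labels))
```

In such a setup, each (window, label) pair taken from an annotated corpus would serve as one training example for the classifier, and at test time the predicted labels would be decoded into words. The decoding rule here is deliberately simple and does not repair inconsistent label sequences.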