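A new word segmentation algorithm based on neighbor matching, Jlppeccz, is proposed. The algorithm first splits an article into sentences at punctuation marks, then uses neighbor matching to cut each sentence into words of 1 to 4 characters. By searching the lexicon, it regroups the segmented words, merging small words into larger ones, and stores the processed words in a temporary lexicon for lookup by subsequent sentences; it also supports adding words to the lexicon. Compared with the classical MM algorithm and the word-frequency statistics method, the proposed algorithm shows considerable improvement.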
This paper presents Jlppeccz, a new Chinese word segmentation algorithm based on neighbor matching. The traditional MM (maximum matching) algorithm depends strongly on its dictionary and easily produces ambiguous segmentations. Jlppeccz first divides an article into sentences, using punctuation marks as boundaries, and then cuts each sentence into single-character or multi-character words by neighbor matching. The word database is then searched and the segmented words are recombined, with small words merged into larger ones. The resulting words are stored in a temporary table for use when processing subsequent sentences, and new words can be added to the word database. Compared with the classical MM algorithm and the word-frequency statistics algorithm, Jlppeccz shows considerable improvement. Experiments show that the algorithm achieves higher precision and efficiency than the MM algorithm, and an example demonstrates its effectiveness.
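The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the neighbor-matching rule, so this is not the Jlppeccz implementation: the sketch assumes that matching means repeatedly merging two adjacent tokens whenever their concatenation (at most 4 characters) is found in the lexicon. The names `split_sentences`, `merge_neighbors`, `segment`, and the punctuation set are all hypothetical.

```python
import re

MAX_WORD_LEN = 4  # the abstract states that words are 1 to 4 characters long

# Assumed sentence delimiters; the paper's exact punctuation set is not given.
PUNCTUATION = r"[，。！？；：、,.!?;:\s]+"


def split_sentences(text):
    """Split an article into sentences at punctuation marks."""
    return [s for s in re.split(PUNCTUATION, text) if s]


def merge_neighbors(sentence, lexicon, temp_lexicon):
    """Assumed neighbor-matching rule: start from single characters and
    merge adjacent tokens while the merged string is a known word."""
    tokens = list(sentence)
    changed = True
    while changed:  # repeat passes so small words can combine into larger ones
        changed = False
        i = 0
        while i < len(tokens) - 1:
            candidate = tokens[i] + tokens[i + 1]
            known = candidate in lexicon or candidate in temp_lexicon
            if len(candidate) <= MAX_WORD_LEN and known:
                tokens[i:i + 2] = [candidate]  # merge, then retry at the same index
                changed = True
            else:
                i += 1
    return tokens


def segment(text, lexicon):
    """Segment every sentence, recording multi-character words in a
    temporary lexicon that later sentences can consult."""
    temp_lexicon = set()
    result = []
    for sentence in split_sentences(text):
        words = merge_neighbors(sentence, lexicon, temp_lexicon)
        temp_lexicon.update(w for w in words if len(w) > 1)
        result.append(words)
    return result


lexicon = {"我们", "喜欢", "分词", "算法", "分词算法"}
print(segment("我们喜欢分词算法。算法有效！", lexicon))
# [['我们', '喜欢', '分词算法'], ['算法', '有', '效']]
```

In this sketch the temporary lexicon acts only as a cache of words already found, and a word can form only if it can be built from known shorter pieces. Adding a word to the lexicon, as the abstract mentions, would correspond here to `lexicon.add("有效")`, after which the second sentence would segment as `['算法', '有效']`.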