We propose an automatic word segmentation algorithm based on mechanical (dictionary) matching and word frequency. Word length is given first priority, and word frequency is used together with it to segment the text. Strings left unmatched are further processed with improved forward and backward maximum matching, and entropy-rate segmentation is then used to mark every element that may be a word. Words longer than five characters are ignored entirely during matching, which avoids the exponential growth in complexity as word length increases. Experiments show that the method improves segmentation accuracy and increases segmentation efficiency.
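The matching stage can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `lexicon` dictionary (word to frequency), and the tie-breaking rule (fewer words first, then higher total frequency) are assumptions made for illustration, and the entropy-rate step is omitted. The only details taken from the abstract are the five-character cap on candidate words, the use of both forward and backward maximum matching, and the use of word frequency.

```python
MAX_WORD_LEN = 5  # words longer than five characters are never tried


def forward_max_match(text, lexicon, max_len=MAX_WORD_LEN):
    """Scan left to right, taking the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=MAX_WORD_LEN):
    """Scan right to left, taking the longest lexicon word ending at each position."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def segment(text, lexicon):
    """Pick between the two scans (assumed rule): fewer words, then higher frequency."""
    def score(words):
        return (-len(words), sum(lexicon.get(w, 0) for w in words))

    candidates = [
        forward_max_match(text, lexicon),
        backward_max_match(text, lexicon),
    ]
    return max(candidates, key=score)


# Hypothetical toy lexicon with made-up frequencies.
lexicon = {"研究": 5, "研究生": 3, "生命": 4, "命": 1, "的": 10, "起源": 2}
print(segment("研究生命的起源", lexicon))
```

On this toy input the forward scan greedily takes 研究生 and leaves 命 stranded, while the backward scan yields 研究 / 生命 / 的 / 起源. Both produce four words, so the frequency total decides in favor of the backward result. Capping candidates at five characters keeps the inner loop to at most five lookups per position.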