We propose a Chinese word segmentation method that combines dictionary matching with statistics. The method first segments the text with a dictionary-based approach, then applies statistical techniques to the ambiguities and unregistered (out-of-vocabulary) words left over from that first pass. In the dictionary stage, an improved dictionary storage structure speeds up lookup. In the statistical stage, statistics combined with rules improve the accuracy of segmenting crossing (overlapping) ambiguities and, under certain conditions, handle unregistered words that occur with high frequency in context. Experimental results show that DSfenci, the segmentation system implementing this algorithm, achieves a segmentation completeness rate (recall) of 99.52% and an accuracy of 98.52%.
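To make the dictionary stage concrete, here is a minimal sketch of dictionary matching and of one common way to spot crossing ambiguities. It rests on several assumptions that the abstract does not confirm: the dictionary is stored as a character trie (the abstract only says the storage structure is "improved"), the first pass is forward maximum matching, and ambiguous spans are flagged wherever forward and backward maximum matching disagree. All names (`TrieDictionary`, `forward_max_match`, `backward_max_match`, `is_ambiguous`) are hypothetical, and the statistical and rule-based resolution step used by DSfenci is not reproduced.

```python
from typing import Dict, List


class TrieNode:
    """One trie node; children are keyed by a single character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class TrieDictionary:
    """Hypothetical dictionary store: a character trie (an assumption, not the paper's structure)."""

    def __init__(self, words: List[str]) -> None:
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def longest_match(self, text: str, start: int) -> int:
        """Length of the longest dictionary word beginning at `start`, or 0 if none."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def forward_max_match(dictionary: TrieDictionary, text: str) -> List[str]:
    """Greedy left-to-right segmentation; unmatched characters become single-character tokens."""
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        length = dictionary.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


def backward_max_match(reversed_dictionary: TrieDictionary, text: str) -> List[str]:
    """Greedy right-to-left segmentation, done by matching reversed text against a trie of reversed words."""
    reversed_tokens = forward_max_match(reversed_dictionary, text[::-1])
    return [token[::-1] for token in reversed(reversed_tokens)]


def is_ambiguous(forward: List[str], backward: List[str]) -> bool:
    """Flag the sentence for later disambiguation when the two passes disagree."""
    return forward != backward


# Toy dictionary and sentence chosen to trigger a crossing ambiguity.
WORDS = ["研究", "研究生", "生命", "起源"]
forward_dict = TrieDictionary(WORDS)
backward_dict = TrieDictionary([w[::-1] for w in WORDS])

sentence = "研究生命起源"
fwd = forward_max_match(forward_dict, sentence)
bwd = backward_max_match(backward_dict, sentence)
print(fwd, bwd, is_ambiguous(fwd, bwd))
```

On this toy input the forward pass yields 研究生 / 命 / 起源 while the backward pass yields 研究 / 生命 / 起源, so the sentence is flagged. In a system like the one described, a second stage would then choose between such candidates using corpus statistics and rules; which statistics DSfenci actually uses is not stated in the abstract.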