A Chinese Word Segmentation Method Combining Dictionary and Statistics

Journal: 小型微型计算机系统 (Journal of Chinese Computer Systems), 27(9), 1766-1771, 2006
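Classification: TP391.12 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] College of Software, Jilin University, Changchun, Jilin 130012
Funding: Supported by the National Natural Science Foundation of China (60373099).
Related project: Mobile topical crawling technology with incremental characteristics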

Keywords: Chinese word segmentation, dictionary-based Chinese word segmentation, statistics-based Chinese word segmentation, crossing ambiguities in Chinese word segmentation

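Abstract (translated from the Chinese):

A Chinese word segmentation method that combines a dictionary with statistics is proposed. The method first performs dictionary-based segmentation, then uses statistical methods to handle the ambiguities and unregistered (out-of-vocabulary) words produced by the first step. In the dictionary-based stage, an improved dictionary storage structure speeds up dictionary matching. In the statistical stage, a combination of statistics and rules improves the accuracy of segmenting crossing (intersection-type) ambiguities and, under certain conditions, resolves high-frequency unregistered words in context. Experimental results show that DSfenci, the segmentation system implementing this algorithm, achieves a segmentation completeness (分全率) of 99.52% and an accuracy of 98.52%.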

English abstract:

This paper proposes a Chinese word segmentation method that integrates a dictionary with statistics. The method applies dictionary-based segmentation in the first step, then employs statistics-based segmentation to resolve the ambiguities and unregistered words left over from the first step. In the first step, an improved dictionary data structure accelerates dictionary lookup; in the second step, statistics integrated with rules are used to improve the accuracy of crossing-ambiguity segmentation and to deal with unregistered words. The DSfenci system, which implements the proposed method, achieves an integrity of 99.52% and an accuracy of 98.52%.
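
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the DSfenci implementation: the abstract does not describe the improved dictionary structure or the statistical rules, so everything below is an assumption made for illustration. The sketch assumes a plain trie for the dictionary (`TrieDict`), forward maximum matching for the first step, and a toy unigram-frequency score for choosing between the two readings of a crossing ambiguity. The word list and counts in `DEMO_FREQ` are invented, and unregistered-word handling is left out.

```python
END = ""  # end-of-word marker; never collides with a real one-character key


class TrieDict:
    """Hypothetical trie dictionary (an assumption, not the paper's structure)."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True

    def longest_match(self, text, start):
        """Length of the longest dictionary word at text[start:], or 1 if none."""
        node, best = self.root, 1
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if END in node:
                best = i - start + 1
        return best


def forward_max_match(text, trie):
    """Step 1: greedy dictionary-based segmentation, left to right."""
    out, i = [], 0
    while i < len(text):
        n = trie.longest_match(text, i)
        out.append(text[i:i + n])
        i += n
    return out


def score(words, freq, total):
    """Toy statistic: product of add-one-smoothed unigram frequencies."""
    p = 1.0
    for w in words:
        p *= (freq.get(w, 0) + 1) / total
    return p


def segment(text, trie, freq):
    """Step 2: re-examine crossing ambiguities found during matching."""
    total = sum(freq.values()) + 1
    out, i = [], 0
    while i < len(text):
        j = i + trie.longest_match(text, i)
        words, nxt = [text[i:j]], j
        for k in range(i + 1, j):
            end = k + trie.longest_match(text, k)
            if end > j:  # a word starts inside text[i:j] and runs past j
                alt = forward_max_match(text[i:k], trie) + [text[k:end]]
                cur = [text[i:j]] + forward_max_match(text[j:end], trie)
                if score(alt, freq, total) > score(cur, freq, total):
                    words, nxt = alt, end
                break
        out.extend(words)
        i = nxt
    return out


# Invented demo counts, for illustration only.
DEMO_FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 10}
trie = TrieDict(DEMO_FREQ)
print(forward_max_match("研究生命起源", trie))  # ['研究生', '命', '起源']
print(segment("研究生命起源", trie, DEMO_FREQ))  # ['研究', '生命', '起源']
```

In the demo string, the greedy first pass picks 研究生 and strands 命, while the frequency comparison prefers 研究 / 生命. A real system would likely need a stronger model than unigram counts, for example one that uses context or rules as the abstract suggests.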