Resolving Overlapping Ambiguities with a Smoothed Maximum Entropy Model Integrating Character Features

ISSN: 1003-0077
Journal: Journal of Chinese Information Processing (《中文信息学报》)
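Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024
Funding: National Natural Science Foundation of China (60673039, 60973068); National Social Science Foundation of China (08BTQ025); National High-Tech 863 Program (2006AA01Z151); Doctoral Program Foundation of the Ministry of Education (20090041110002)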

Keywords: computer application, Chinese information processing, word segmentation, overlapping ambiguity strings, integrating rich character features, maximum entropy model, smoothing technology

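Chinese abstract (translated):

Segmenting overlapping ambiguities is one of the difficulties that word segmentation must resolve. This paper recasts the resolution of overlapping ambiguities as a classification problem and solves it with a maximum entropy model that integrates rich character features. To overcome data sparseness in maximum entropy modeling, the paper introduces inequality smoothing and Gaussian smoothing. We compared Gaussian smoothing, inequality smoothing, and frequency-discount smoothing on the four datasets of the Second International Chinese Word Segmentation Bakeoff. The results show that inequality smoothing and Gaussian smoothing improve significantly on frequency discounting and perform comparably to each other, but inequality smoothing embeds feature selection seamlessly into parameter estimation and significantly compresses the model. The method finally achieved disambiguation accuracies of 96.27%, 96.83%, 96.56%, and 96.52% on the four test sets. Comparative experiments show that the rich features improved disambiguation performance by 5.87%, 5.64%, 5.00%, and 5.00%, respectively; smoothing improved it by 0.99%, 0.93%, 1.02%, and 1.37%, respectively; and inequality smoothing compressed the classification models by 38.7, 19.9, 44.6, and 9.7, respectively.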
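
The idea of treating overlapping-ambiguity resolution as classification can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-character string "从小学", the feature templates in `char_features`, the toy training contexts, and the use of plain batch gradient ascent are hypothetical and are not taken from the paper. For two classes, a maximum entropy model reduces to logistic regression, and the Gaussian prior appears as the `w / sigma2` penalty term.

```python
import math

def char_features(left, oas, right):
    """Character features for a 3-character overlapping string and its context.
    The templates are illustrative assumptions, not the paper's feature set."""
    feats = {"bias"}
    feats.update(f"c{i}={ch}" for i, ch in enumerate(oas))
    feats.update({
        f"left={left}",
        f"right={right}",
        f"left+c0={left}{oas[0]}",
        f"c2+right={oas[-1]}{right}",
    })
    return feats

def predict_proba(weights, feats):
    """Probability of label 1 (split as AB|C) under a binary maxent model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, sigma2=1.0, lr=0.1, epochs=300):
    """Batch gradient ascent on the log-likelihood with a Gaussian prior.
    sigma2 is the prior variance (a hypothetical default)."""
    weights = {}
    for _ in range(epochs):
        grad = {}
        for feats, label in samples:
            p = predict_proba(weights, feats)
            for f in feats:
                grad[f] = grad.get(f, 0.0) + (label - p)
        for f in set(grad) | set(weights):
            w = weights.get(f, 0.0)
            # The Gaussian prior contributes the shrinkage term -w / sigma2.
            weights[f] = w + lr * (grad.get(f, 0.0) - w / sigma2)
    return weights

# Toy data: label 1 means "从小|学", label 0 means "从|小学".
samples = [
    (char_features("他", "从小学", "电"), 1),  # 他 从小 学 电脑
    (char_features("我", "从小学", "钢"), 1),  # 我 从小 学 钢琴
    (char_features("他", "从小学", "毕"), 0),  # 他 从 小学 毕业
    (char_features("我", "从小学", "到"), 0),  # 我 从 小学 到 ...
]
weights = train(samples)
```

Calling `predict_proba(weights, char_features("她", "从小学", "毕"))` then gives the model's probability for the "从小|学" reading in an unseen left context; a value below 0.5 selects "从|小学".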

English abstract:

Overlapping ambiguity strings (OAS) are one of the difficulties in automatic Chinese word segmentation. This paper treats the resolution of OAS as a classification task and uses a maximum entropy model integrating character features to solve it. To overcome data sparseness in maximum entropy modeling, the paper introduces inequality smoothing and Gaussian smoothing. We compared Gaussian smoothing, inequality smoothing, and frequency discounting on the four datasets of the Second International Chinese Word Segmentation Bakeoff; the results show that Gaussian smoothing and inequality smoothing are much better than the discount method, while inequality smoothing enables the seamless integration of feature selection into parameter estimation, resulting in a significantly compressed model. On the four datasets, the disambiguation precision of the proposed method reaches 96.27%, 96.83%, 96.56%, and 96.52%, respectively, with a relative improvement of 5.87%, 5.64%, 5.00%, and 5.00% from the rich features and of 0.99%, 0.93%, 1.02%, and 1.37% from smoothing. Meanwhile, inequality smoothing compresses the classification models by 38.7, 19.9, 44.6, and 9.7, respectively.
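
For reference, the two smoothing techniques named in the abstract are commonly written as follows. This is a sketch of the standard textbook forms, not a formulation taken from the paper; all symbols ($\lambda_i$, $\sigma_i^2$, $\alpha_i$, $\beta_i$, $A_i$, $B_i$, $\tilde{p}$) are assumptions introduced here.

```latex
% Conditional maxent model with features f_i and weights \lambda_i
p_\lambda(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}
                           {\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}

% Empirical log-likelihood
\tilde{L}(\lambda) = \sum_{x, y} \tilde{p}(x, y) \, \log p_\lambda(y \mid x)

% Gaussian smoothing: a zero-mean Gaussian prior on each weight
\max_{\lambda} \; \tilde{L}(\lambda) - \sum_i \frac{\lambda_i^2}{2 \sigma_i^2}

% Inequality smoothing: the equality constraints are relaxed to
%   -B_i \le E_{\tilde{p}}[f_i] - E_{p}[f_i] \le A_i ,
% whose dual has two non-negative parameters per feature
\max_{\alpha \ge 0, \; \beta \ge 0} \; \tilde{L}(\alpha - \beta)
    - \sum_i \big( A_i \alpha_i + B_i \beta_i \big),
\qquad \lambda_i = \alpha_i - \beta_i
```

In the inequality form, any feature with $\alpha_i = \beta_i = 0$ at the optimum has zero weight and can be dropped, which is consistent with the abstract's remark that feature selection is embedded in parameter estimation and the model is compressed.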