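Chinese Named Entity Recognition: A Conditional Random Field Method Based on Word Trigger Pairs
  • Journal: Zhao Jian, Wang Xiaolong, Guan Yi, Xu Zhiming, Chinese Named Entity Recognition: A Conditional Random Field Method Based on Word Trigger Pairs, 高技术通讯, 2006, 0
  • Date: 0
  • Classification: TP391.41 [Automation and Computer Technology / Computer Application Technology; Automation and Computer Technology / Computer Science and Technology]
  • Author affiliation: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
  • Funding: Supported by the Key Program of the National Natural Science Foundation of China (60435020), the 863 Program (2002AA117010-09), and the Harbin Institute of Technology University Fund (HIT200271).
  • Related project: Research on the theory and methods of danger-based artificial immune networks for intelligent information retrieval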
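Abstract (translated from Chinese):

This paper applies the conditional random field (CRF) model to Chinese named entity recognition for the first time and defines a variety of feature templates based on the characteristics of Chinese. To address the long-distance constraint problem, word trigger pairs are integrated into the CRF model. A word selection method based on word variance is proposed, and word correlation is computed with the average mutual information (AMI) method and the χ^2 statistic. Tests on half a year of People's Daily text show that, with the same feature set, the CRF model outperforms other probabilistic models, and that the CRF model with long-distance trigger pairs raises the system's F-measure by about 1.38%.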
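The abstract reports its gain in terms of the F-measure but does not say how the measure is weighted. As a minimal sketch, assuming the usual entity-level definition with a balanced weighting (beta = 1) unless stated otherwise, the score could be computed as follows. The function name `f_measure` and its parameters are illustrative and do not come from the paper.

```python
def f_measure(num_correct, num_predicted, num_gold, beta=1.0):
    """Entity-level F-measure from raw counts.

    num_correct:   predicted entities that exactly match a gold entity
    num_predicted: all entities output by the system
    num_gold:      all entities in the reference annotation
    beta:          recall weight; beta=1.0 (balanced F1) is an assumption
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because the abstract gives only the relative improvement, no baseline scores are reproduced here; the function only illustrates how two systems evaluated on the same test set would be compared.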

English abstract:

This paper introduces conditional random fields (CRF), a probabilistic model well suited to labeling sequence data, to the task of Chinese named entity recognition (CNER) for the first time. Unlike generative models, CRF spends no effort on modeling the observations and can use rich, overlapping features; moreover, it avoids the label bias problem of other discriminative models. To perform CNER, special features are selected to capture informative traits of the Chinese language. In addition, word triggers are integrated into the CRF to address the long-distance constraint problem; compared with mining parallel information in a large window or across the whole sentence, this keeps the parameter space and memory footprint small. Word triggers are selected in two steps: preparing candidate words and estimating the degree of correlation between two words. Two methods, average mutual information (AMI) and the χ^2 statistic, are used to estimate the correlation. Experimental results on half a year of People's Daily text show that the CRF combined with word triggers extracted by the χ^2 method achieves state-of-the-art performance.
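The second selection step, estimating how strongly two words are correlated, can be illustrated with the standard textbook forms of the two statistics. The sketch below assumes that co-occurrence is counted at the sentence level and summarized in a 2x2 contingency table; the abstract does not specify the counting window, so this is an assumption, and the function names (`contingency`, `chi_square`, `average_mutual_information`) are hypothetical rather than taken from the paper.

```python
import math


def contingency(sentences, a, b):
    """Count sentences by presence of words a and b.

    Returns (n11, n10, n01, n00): both present, only a, only b, neither.
    Sentence-level counting is an assumption of this sketch.
    """
    n11 = n10 = n01 = n00 = 0
    for sent in sentences:
        has_a, has_b = a in sent, b in sent
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # a word that is always or never present carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def average_mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between the presence indicators of a and b."""
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    row = (n11 + n10, n01 + n00)  # a present, a absent
    col = (n11 + n01, n10 + n00)  # b present, b absent
    cells = ((n11, 0, 0), (n10, 0, 1), (n01, 1, 0), (n00, 1, 1))
    ami = 0.0
    for count, i, j in cells:
        if count > 0:  # empty cells contribute nothing to the sum
            ami += (count / n) * math.log2(count * n / (row[i] * col[j]))
    return ami
```

Under these assumptions, candidate pairs would be ranked by either score and the top-ranked pairs kept as triggers; how the paper thresholds the scores and turns the selected pairs into CRF features is not stated in the abstract.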
