Research on Tibetan Person Name Recognition Based on Conditional Random Fields

ISSN: 0469-5097
Journal: 《南京大学学报：自然科学版》 (Journal of Nanjing University, Natural Science Edition)
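Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031; [2] Department of Computer Science, Tibet University, Lhasa 850000
Funding: National Natural Science Foundation of China (61262058)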

Keywords: Tibetan names, conditional random field (CRF), feature selection

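Chinese abstract (translated):

Named entity recognition is an important research topic in text mining, and statistical methods for named entity recognition achieve high recognition rates. This paper uses conditional random fields (CRF) to study Tibetan person name recognition, focusing on the internal structural features of Tibetan names, contextual features, feature selection, and data preprocessing, and analyzes the effectiveness of different features through experiments. First, a name recognition method based on character (syllable) and character-position information is presented. Second, combinations and optimization of features built from trigger words, function words, a name dictionary, and person-denoting noun suffixes are studied, and the role of individual function words in name recognition is examined in finer detail. Finally, experiments on different feature combinations show that: 1) trigger-word and ergative-particle features contribute positively to Tibetan name recognition; 2) the feature window size has some effect on name recognition; 3) CRF-based recognition of Tibetan names reaches an F1 of about 80%, but because two-syllable Tibetan names are highly ambiguous, it does not yet match the recognition performance achieved for other languages.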
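The character (syllable) and character-position idea from the abstract can be sketched as follows. This is only an illustration under stated assumptions: the B/M/E/S/O tag set, the function names `split_syllables` and `position_tags`, and the example sentence are hypothetical and are not taken from the paper, which does not spell out its labeling scheme here.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg mark, dropping empties."""
    return [s for s in text.split(TSHEG) if s]


def position_tags(n_syllables, name_spans):
    """Label each syllable with its position inside a person name.

    name_spans holds (start, end) index pairs, end exclusive.
    B/M/E mark the beginning, middle, and end of a multi-syllable name,
    S marks a single-syllable name, and O marks everything else.
    The tag set is an assumption made for this sketch.
    """
    tags = ["O"] * n_syllables
    for start, end in name_spans:
        if end - start == 1:
            tags[start] = "S"
            continue
        tags[start] = "B"
        for i in range(start + 1, end - 1):
            tags[i] = "M"
        tags[end - 1] = "E"
    return tags


# Illustrative sentence (an assumption, not from the paper's corpus):
# a two-syllable name followed by an ergative particle and two more syllables.
sentence = TSHEG.join(["བཀྲ", "ཤིས", "ཀྱིས", "དེབ", "ཀློག"])
syllables = split_syllables(sentence)
tags = position_tags(len(syllables), [(0, 2)])
print(list(zip(syllables, tags)))
```

Labeling at the syllable level sidesteps the need for a word segmenter before recognition, which may be why a character-position scheme is attractive for this task.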

English abstract:

Named entity recognition is an important research topic in text mining, and statistical methods achieve high recognition rates. This paper studies Tibetan name recognition using conditional random fields (CRF), focusing on the internal structure of Tibetan names, contextual features, feature selection, and data preprocessing, and evaluates the effectiveness of different features through experiments. The contributions are as follows. First, a name recognition method based on character (syllable) and character-position information is presented. Second, trigger words, function words, a name dictionary, and personal-noun suffixes are studied as features, together with their combinations and optimization, and the role of different function words in name recognition is refined. Experimental evaluation of the different combinations shows that: 1) trigger-word and ergative-particle features play a positive role in Tibetan name recognition; 2) the feature window size has an impact on name recognition; 3) CRF-based recognition of Tibetan names reaches an F1 value of about 80%. However, it cannot yet match the results reported for other languages, owing to the high ambiguity of two-syllable Tibetan names.
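To make the role of the feature window, trigger words, and ergative particles more concrete, here is a minimal sketch of window-based feature extraction over syllables. Everything named below is an assumption for illustration: the feature keys, the functions `syllable_features` and `sentence_features`, and both lexicons are hypothetical, and the paper's actual feature templates and word lists are not given in the abstract. The output is a list of per-syllable dictionaries, the kind of input that CRF toolkits such as sklearn-crfsuite accept.

```python
# Free-standing ergative particle forms; treat this list as illustrative,
# not as the paper's inventory.
ERGATIVE_PARTICLES = frozenset({"གིས", "ཀྱིས", "གྱིས", "ཡིས"})

# Hypothetical trigger syllables, taken from the title དགེ་རྒན ("teacher").
EXAMPLE_TRIGGERS = frozenset({"དགེ", "རྒན"})


def syllable_features(syllables, i, window=2,
                      triggers=EXAMPLE_TRIGGERS,
                      particles=ERGATIVE_PARTICLES):
    """Build a feature dict for syllable i from a symmetric context window."""
    feats = {"bias": 1.0, "syl": syllables[i]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"[{offset:+d}]"  # e.g. "[-1]" or "[+2]"
        if 0 <= j < len(syllables):
            ctx = syllables[j]
            feats["syl" + key] = ctx
            if ctx in triggers:
                feats["trigger" + key] = True
            if ctx in particles:
                feats["ergative" + key] = True
        else:
            # Mark positions that fall outside the sentence boundary.
            feats["pad" + key] = True
    return feats


def sentence_features(syllables, **kwargs):
    """Feature dicts for every syllable in one sentence."""
    return [syllable_features(syllables, i, **kwargs)
            for i in range(len(syllables))]


example = ["བཀྲ", "ཤིས", "ཀྱིས", "དེབ", "ཀློག"]
for feats in sentence_features(example, window=1):
    print(feats)
```

Varying `window` changes how much context each syllable sees, which is one simple way to reproduce the kind of window-size comparison the abstract reports.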
