Named entity recognition is an important research topic in text mining, and statistical methods achieve high recognition rates on this task. This paper studies Tibetan personal name recognition using conditional random fields (CRF), focusing on the internal structure of Tibetan names, contextual features, feature selection, and data preprocessing, and evaluates the effectiveness of different features through experiments. First, a name recognition method based on characters (syllables) and character-position information is presented. Second, trigger words, function words, a name dictionary, and person-denoting noun suffixes are studied as features, together with their combinations and optimization, and the contribution of individual function words to name recognition is examined in finer detail. Finally, experiments on different feature combinations show that: 1) trigger-word and ergative-particle features have a positive effect on Tibetan name recognition; 2) the feature window size has some influence on name recognition; 3) CRF-based recognition of Tibetan names reaches an F1 value of about 80%, but because two-syllable Tibetan names are highly ambiguous, it does not yet match the recognition performance reported for other languages.
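To make the syllable-level setup more concrete, the following is a minimal Python sketch of how position labels and windowed features for a CRF might be prepared. It is an illustration under stated assumptions, not the authors' implementation: the B/I/E/S/O tag set, the feature names, the function names `position_tags` and `syllable_features`, and the romanized example syllables and word lists are all hypothetical choices that do not come from the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple


def position_tags(n_syllables: int, name_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Label each syllable by its position inside a name.

    Assumed tag set (not specified in the abstract): B = first syllable,
    I = inner, E = last, S = single-syllable name, O = outside any name.
    Spans are (start, end) with end exclusive.
    """
    tags = ["O"] * n_syllables
    for start, end in name_spans:
        if end - start == 1:
            tags[start] = "S"
            continue
        tags[start] = "B"
        for i in range(start + 1, end - 1):
            tags[i] = "I"
        tags[end - 1] = "E"
    return tags


def syllable_features(
    syllables: Sequence[str],
    i: int,
    window: int,
    trigger_words: Set[str],
    ergative_particles: Set[str],
) -> Dict[str, object]:
    """Build a feature dict for syllable i from a symmetric context window."""
    feats: Dict[str, object] = {"bias": 1.0}
    for off in range(-window, window + 1):
        j = i + off
        tok = syllables[j] if 0 <= j < len(syllables) else "<PAD>"
        feats[f"syl[{off}]"] = tok
        # Context-only cues: is a neighbour a trigger word or an ergative particle?
        if off != 0 and tok in trigger_words:
            feats[f"trigger[{off}]"] = True
        if off != 0 and tok in ergative_particles:
            feats[f"ergative[{off}]"] = True
    return feats


# Hypothetical romanized example; the word lists are placeholders only.
SENTENCE = ["dge", "rgan", "bkra", "shis", "kyis", "yi", "ge", "bris"]
TRIGGERS = {"rgan"}      # hypothetical trigger syllable preceding a name
ERGATIVES = {"kyis"}     # hypothetical ergative particle following a name

TAGS = position_tags(len(SENTENCE), [(2, 4)])
FEATURES = [
    syllable_features(SENTENCE, i, 1, TRIGGERS, ERGATIVES)
    for i in range(len(SENTENCE))
]
```

Each feature dict, paired with its position tag, is the kind of input a CRF toolkit typically accepts. Changing the `window` argument is one simple way to explore the window-size effect mentioned in the abstract, and dictionary or suffix features could be added to the dict in the same manner.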