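Influenced by traditional conventions, the last one or two characters of a Chinese personal name usually give some indication of the bearer's gender. On this basis, we propose using a conditional random field (CRF) model to identify the gender of Chinese personal names automatically. The machine learning method draws on the structure and character usage of names: it constructs a name annotation set, selects six different sets of feature templates, and applies the CRF model. In a closed test on a database of 231,337 names, accuracy reaches 89.30%, nearly 7 percentage points better than a naive Bayes method that relies on the characters used in names. The experiments show that, for gender identification in the name database, the final character of the name contributes more than the surname character, and that accuracy on female names is slightly higher than on male names, generally by 2 to 3 percentage points. From a machine learning perspective, gender differences can thus be reflected in the characters used in names. By analysing the experimental data, we summarize general rules for designing CRF feature templates suited to person-name recognition, which provides a basis for subsequent research.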
Under the influence of traditional conventions, the last one or two characters of a Chinese name usually give some indication of gender. Gender recognition of personal names is useful in natural language processing and is a specific application of named entity recognition. The gender recognition method makes use of the structure and character information of Chinese personal names. The CRF-based experiment is designed by constructing a personal-name annotation set and selecting suitable feature templates using NLP techniques. In a closed test on 231,337 personal names, the method achieves 89.30% accuracy, about seven percentage points higher than the Bayes method. The experiment shows that the final character of a name contributes more to gender recognition than the surname character, and that accuracy on female names is higher than on male names by about two or three percentage points. From a machine learning standpoint, gender differences can be found in the names themselves. General principles of feature template design are also proposed.
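The abstract does not list the six feature template sets or the annotation scheme, so the following is only a minimal sketch of how per-character features of the kind described (structure and character usage of a name) might be built before being handed to a CRF toolkit. The function name `name_to_features`, the position tags `S`/`M`/`E`, the feature keys, and the example name are all illustrative assumptions, not the authors' design.

```python
from typing import Dict, List


def name_to_features(name: str, surname_len: int = 1) -> List[Dict[str, str]]:
    """Build one feature dict per character of a Chinese personal name.

    Hypothetical scheme (not taken from the paper):
      S = surname character, E = final character of the name,
      M = any character in between.
    """
    chars = list(name)
    n = len(chars)
    rows: List[Dict[str, str]] = []
    for i, ch in enumerate(chars):
        if i < surname_len:
            pos = "S"
        elif i == n - 1:
            pos = "E"
        else:
            pos = "M"
        rows.append({
            "char": ch,                                        # the character itself
            "pos": pos,                                        # structural position in the name
            "prev_char": chars[i - 1] if i > 0 else "<BOS>",   # left context
            "next_char": chars[i + 1] if i < n - 1 else "<EOS>",  # right context
            "last_char": chars[-1],                            # final character, visible at every position
        })
    return rows


if __name__ == "__main__":
    # "王秀英" is an arbitrary example name chosen for illustration.
    for row in name_to_features("王秀英"):
        print(row)
```

Exposing the final character at every position reflects the reported finding that it carries more gender information than the surname character; whether the original templates do this is not stated in the abstract. Compound surnames can be handled by passing `surname_len=2`.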