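In today's Internet-driven information society, a large amount of geographic information exists as unstructured text, and toponym recognition is an essential foundation for mining it. Existing toponym recognition methods mostly approach the task from a natural language processing perspective and do not fully account for features such as how toponyms are composed and how they are customarily used, which leads to problems such as low recognition rates or overfitting. This paper introduces linguistic knowledge to analyze the character-usage features of Chinese toponyms. Building on the traditional structure of a specific name plus a generic name (专名 + 通名), it divides toponym morphemes into finer-grained types, summarizes the features of each type, and incorporates these features into a conditional random field (CRF) method, thereby casting toponym recognition as a sequence labeling problem. Based on the features of Chinese toponyms, formal rules are defined and a character-based labeling specification is designed. On this basis, feature templates for Chinese toponyms are designed, and Chinese toponyms in natural language text are recognized through CRF model training and prediction. Experiments on a 1.7-million-character annotated People's Daily corpus show that the proposed method achieves a recall, precision, and F value of 92.69%, 96.73%, and 94.67%, respectively, for Chinese toponym recognition, outperforming previously reported results, and can provide more effective toponym services for research and applications in geographic information science.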
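The character-based labeling and feature templates described above can be sketched as follows. This is only an illustration under stated assumptions: the tag set (`B-S`/`I-S` for specific-name characters, `B-G`/`I-G` for generic-name characters, `O` otherwise), the `GENERIC_CHARS` list, the window size, and both function names are hypothetical and are not taken from the paper, whose morpheme categories are finer-grained.

```python
from typing import Dict, List, Tuple

# Hypothetical set of common generic-name characters; illustrative only.
GENERIC_CHARS = set("省市县区镇乡村山河湖江路街")


def tag_characters(sentence: str,
                   toponyms: List[Tuple[int, int, int]]) -> List[str]:
    """Assign one tag per character.

    Each toponym is (start, generic_start, end), end exclusive:
    characters in [start, generic_start) form the specific name,
    characters in [generic_start, end) form the generic name.
    """
    tags = ["O"] * len(sentence)
    for start, generic_start, end in toponyms:
        for i in range(start, end):
            part = "S" if i < generic_start else "G"
            is_first = i == start or i == generic_start
            tags[i] = ("B-" if is_first else "I-") + part
    return tags


def char_features(sentence: str, i: int, window: int = 2) -> Dict[str, str]:
    """Apply a simple feature template to the character at position i."""
    feats: Dict[str, str] = {}
    # Unigram features: characters in a symmetric context window.
    for off in range(-window, window + 1):
        j = i + off
        feats[f"c[{off}]"] = sentence[j] if 0 <= j < len(sentence) else "<PAD>"
    # Bigram features around the current character.
    feats["c[-1]c[0]"] = feats["c[-1]"] + feats["c[0]"]
    feats["c[0]c[1]"] = feats["c[0]"] + feats["c[1]"]
    # Toponym-specific feature: is this a typical generic-name character?
    feats["is_generic"] = str(sentence[i] in GENERIC_CHARS)
    return feats


sentence = "他在北京市工作"
tags = tag_characters(sentence, [(2, 4, 5)])  # 北京 = specific, 市 = generic
for ch, tag in zip(sentence, tags):
    print(ch, tag)
print(char_features(sentence, 4))
```

Per-character feature dictionaries and tag sequences of this shape are the usual training input for a CRF toolkit; the choice of toolkit and the actual templates are left open here.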
With the rapid development of the World Wide Web, a huge quantity of geographic information is embedded in unstructured text. Toponym recognition is the foundation for mining the potential geographic information in such text. Traditional toponym recognition methods based on natural language processing ignore the structure of Chinese toponyms and the conventions of their usage, which results in low recall and precision. In this paper, linguistic knowledge is introduced to analyze Chinese toponyms, and more specific morpheme categories are identified. The toponym recognition task is then transformed into an equivalent sequence labeling problem based on the conditional random field (CRF). A labeling scheme tailored to Chinese toponyms is also designed to improve recognition accuracy. In the experiments, a tagged corpus of 1.7 million characters from The People's Daily is used to test the proposed method. The recall, precision, and F value of the result are 92.69%, 96.73%, and 94.67%, respectively, which are better than those of other machine learning models. The results show that the proposed method is effective for recognizing Chinese toponyms. This research can provide more precise toponym services for geographic information applications.
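As a quick consistency check on the reported figures, the sketch below recomputes the F value from the stated precision and recall. It assumes the F value is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
f_value = f_measure(precision=0.9673, recall=0.9269)
print(f"{f_value:.2%}")  # prints 94.67%
```

Under that assumption, the recomputed value agrees with the reported 94.67%.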