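The maximum entropy model can effectively integrate many kinds of constraint information and is well suited to Chinese named entity recognition. It is therefore taken as the basic framework, and a multi-feature maximum entropy model for Chinese named entity recognition is proposed. The model integrates multiple local and global features, and heuristic knowledge is introduced to reduce the search space and improve processing efficiency. Experimental results on the test data of the SIGHAN 2008 named entity evaluation task show that the resulting hybrid pattern is an effective way to combine a statistical model with heuristic knowledge for Chinese named entity recognition. Experiments on a different test set indicate that the method behaves consistently across test data sources.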
As natural language processing (NLP) technology develops, automatic named entity recognition (NER) is increasingly needed to improve the performance of information extraction systems. The task of NER, which plays a vital role in NLP, is to tag each named entity (NE) in a document with one of a fixed set of NE types. This paper proposes a hybrid pattern for Chinese NER that is based on the maximum entropy model and fuses multiple features. It differs from most previous approaches mainly in three respects. First, the maximum entropy model is a strong statistical model because it integrates diverse constraints well and is well suited to Chinese NER. Second, local and global features are integrated in the hybrid model to achieve high performance. Third, heuristic human knowledge is introduced into the statistical model to reduce the search space and improve processing efficiency, which can significantly increase recognition performance. Experimental results on the test set of the SIGHAN 2008 NER evaluation task show that the resulting hybrid model is an effective way to combine a statistical model with heuristic human knowledge. Experiments on a second, different test set confirm this conclusion and show that the algorithm performs consistently across test data sources.
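To make the combination of a maximum entropy model with heuristic pruning concrete, the following is a minimal Python sketch and not the implementation evaluated here. It assumes character-level tagging with a toy tag set, two local feature templates (current and previous character), one global feature (whether the character occurs in an entity found elsewhere in the document), and a single heuristic rule that restricts where a person name may begin. The tag set, feature templates, surname list, weights, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy tag set; a real system would cover more entity types.
TAGS = ["B-PER", "I-PER", "O"]

def features(chars, i, tag, doc_entities):
    """Return the names of the binary features active for (context, tag)."""
    prev = chars[i - 1] if i > 0 else "<s>"
    feats = [
        f"cur={chars[i]}|{tag}",   # local feature: current character
        f"prev={prev}|{tag}",      # local feature: previous character
    ]
    # Global feature (assumed form): the character occurs in an entity
    # string already seen elsewhere in the same document.
    if any(chars[i] in entity for entity in doc_entities):
        feats.append(f"in_doc_entity|{tag}")
    return feats

def candidate_tags(chars, i, surnames):
    """Heuristic pruning (assumed rule): B-PER only on known surname characters."""
    return [t for t in TAGS if t != "B-PER" or chars[i] in surnames]

def maxent_probs(chars, i, weights, doc_entities, surnames):
    """p(tag | context) = exp(sum of active feature weights) / Z over the pruned tags."""
    scores = {
        tag: sum(weights.get(f, 0.0) for f in features(chars, i, tag, doc_entities))
        for tag in candidate_tags(chars, i, surnames)
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizer Z(context)
    return {tag: math.exp(s) / z for tag, s in scores.items()}
```

For example, calling `maxent_probs(list("王明说"), 0, {"cur=王|B-PER": 2.0}, set(), {"王"})` scores all three tags at the first character, while the same call at position 2 never scores `B-PER` because the heuristic removes it from the candidate set. Pruning before normalization is one simple way in which heuristic knowledge can shrink the search space of a statistical tagger.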