Entity Mention Detection (EMD) is the task of recognizing all mentions of entities in a document, covering named, nominal, and pronominal mentions. In this paper, we propose an approach to Chinese entity mention detection that integrates multi-level features within the Conditional Random Fields (CRFs) framework. The features include characters, pinyin (phonetic transcriptions), words and their part-of-speech tags, lists of various kinds of proper names, and frequency statistics. All EMD subtasks are organized into a three-stage pipeline in which three different CRF classifiers label the attributes of each mention sequentially in a predefined order. A system based on this approach was our submission to the NIST ACE07 (Automatic Content Extraction 2007) Chinese EMD evaluation, where it ranked second by ACE Value.
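To make the idea of multi-level feature integration concrete, the following is a minimal sketch of how per-character feature dictionaries for a CRF tagger might be assembled. It is an illustration under stated assumptions, not the system's actual feature templates: the function `char_features`, the toy `PINYIN` and `GAZETTEER` tables, the frequency threshold, and all feature names are hypothetical, and the three-stage pipeline itself is not shown.

```python
from collections import Counter

# Hypothetical toy resources; the real lexicons are not given in the abstract.
PINYIN = {"张": "zhang", "三": "san", "在": "zai", "北": "bei", "京": "jing"}
GAZETTEER = {"张三": "PER", "北京": "GPE"}


def char_features(words, pos_tags, char_freq, high_freq=2):
    """Build one feature dict per character of a segmented, POS-tagged sentence.

    Each dict combines character-level, phonetic, word-level, gazetteer,
    and frequency features (all feature names are assumptions).
    """
    # Flatten the sentence into characters, remembering each character's word.
    chars, owner = [], []
    for w_idx, word in enumerate(words):
        for offset, ch in enumerate(word):
            chars.append(ch)
            owner.append((w_idx, offset))

    feats = []
    for i, ch in enumerate(chars):
        w_idx, offset = owner[i]
        word = words[w_idx]
        # Position of the character inside its word: Single, Begin, End, Middle.
        if len(word) == 1:
            pos_in_word = "S"
        elif offset == 0:
            pos_in_word = "B"
        elif offset == len(word) - 1:
            pos_in_word = "E"
        else:
            pos_in_word = "M"
        feats.append({
            "char": ch,
            "prev_char": chars[i - 1] if i > 0 else "<BOS>",
            "next_char": chars[i + 1] if i + 1 < len(chars) else "<EOS>",
            "pinyin": PINYIN.get(ch, "<UNK>"),
            "word": word,
            "pos": pos_tags[w_idx],
            "pos_in_word": pos_in_word,
            "gazetteer": GAZETTEER.get(word, "O"),
            "freq_bucket": "high" if char_freq[ch] >= high_freq else "low",
        })
    return feats


if __name__ == "__main__":
    words = ["张三", "在", "北京"]
    pos_tags = ["NR", "P", "NR"]
    freq = Counter("".join(words))  # stand-in for corpus-level counts
    for f in char_features(words, pos_tags, freq):
        print(f)
```

Feature dictionaries of this shape are the usual input to CRF toolkits; in a pipeline like the one described, each stage's classifier could consume the same features plus the labels produced by the preceding stage.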