Named entities are important information-bearing linguistic units in text, and their recognition and analysis play a very important role in fields such as Web information extraction, Web content management, and knowledge engineering. Research tasks on named entities include named entity recognition, disambiguation, cross-lingual coreference resolution, attribute extraction, and relation detection. Focusing on named entity recognition, disambiguation, and cross-lingual coreference resolution, this paper surveys the state of the art of these tasks, including their challenges, evaluations, existing methods, and current performance, and then analyzes and discusses the problems that most need to be solved next. The paper argues that the performance of current systems for named entity recognition, disambiguation, and cross-lingual coreference resolution is still far from meeting the requirements of large-scale practical applications, and that more in-depth research is needed. In terms of methodology, research on these tasks should move beyond the confines of natural language text and deal directly with Web page information that is large-scale, redundant, heterogeneous, ill-formed, and very noisy.