Traditional entity relationship extraction methods mainly target text whose semantic information is relatively complete: they apply extraction patterns to find entity relationships in the text, then use heuristic algorithms or probabilistic models to select among the extracted candidate relationships. These methods do not carry over well to semi-structured web pages, where entity information is displayed in HTML modules rather than in complete sentences. This paper proposes an entity relationship extraction system that handles semi-structured pages reasonably well. The system consists of three core functional modules (data extraction rule learning, data extraction, and entity relationship computation) and provides users with a query interface to the relationship base. To issue a query, a user enters a keyword and selects a matching type. The system first queries the entity information base with the keyword and matching type, then uses the entities that satisfy the condition to query the entity relationship base, and finally returns the relationships that contain those entities to the user.
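The two-stage query flow described in the abstract can be illustrated with a minimal sketch. It is not the system's actual implementation: the in-memory lists standing in for the entity information base and the entity relationship base, the `Relation` record, the function names, the sample data, and the three matching types (`exact`, `prefix`, `contains`) are all assumptions made for illustration, since the abstract does not say which matching types the system supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical record in the entity relationship base."""
    head: str
    relation: str
    tail: str


# Hypothetical sample data; the real bases are populated by the extraction modules.
ENTITY_BASE = ["Alice Chen", "Alice Wang", "Bob Li", "Tsinghua University"]
RELATION_BASE = [
    Relation("Alice Chen", "works_at", "Tsinghua University"),
    Relation("Bob Li", "coauthor_of", "Alice Wang"),
    Relation("Bob Li", "works_at", "Tsinghua University"),
]


def matches(name: str, keyword: str, match_type: str) -> bool:
    """Apply one of the assumed matching types to an entity name."""
    if match_type == "exact":
        return name == keyword
    if match_type == "prefix":
        return name.startswith(keyword)
    if match_type == "contains":
        return keyword in name
    raise ValueError(f"unknown match type: {match_type!r}")


def query_relations(keyword, match_type, entity_base, relation_base):
    """Stage 1: find matching entities. Stage 2: return relations containing them."""
    matched = {e for e in entity_base if matches(e, keyword, match_type)}
    return [r for r in relation_base if r.head in matched or r.tail in matched]


if __name__ == "__main__":
    for rel in query_relations("Alice", "prefix", ENTITY_BASE, RELATION_BASE):
        print(rel.head, rel.relation, rel.tail)
```

Keeping entity lookup separate from relationship lookup mirrors the order given in the abstract: the keyword and matching type are only ever applied to the entity information base, and the relationship base is queried with the resulting entities.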