D-EEM: A DOM-Tree Based Entity Extraction Mechanism for Deep Web
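  • Journal: Journal of Computer Research and Development (计算机研究与发展)
  • Date: 0
  • Pages: 858-865
  • Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
  • Author affiliations: [1] College of Information Science and Engineering, Northeastern University, Shenyang 110004; [2] Business Software Division, Neusoft Corporation, Shenyang 110179
  • Funding: National Natural Science Foundation of China (60673139, 60973021); National High Technology Research and Development Program of China ("863" Program) (2008AA01Z146); Fundamental Research Funds for the Central Universities (NO90304005)
  • Related project: Research on key technologies for multi-mode queries and data integration in dataspaces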
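Chinese abstract (translated):

As Web databases continue to grow, accessing the Deep Web is gradually becoming the main means of obtaining information. How to effectively extract the entity information contained in Deep Web result pages has therefore become a problem worth studying. Based on an analysis of the characteristics of Deep Web result pages, this paper proposes a DOM-tree based entity extraction mechanism for Deep Web (D-EEM) to address entity extraction in the Deep Web environment. D-EEM adopts a DOM-tree based automatic entity extraction strategy that uses the textual content and hierarchical structure of the DOM tree to determine data regions and entity regions, which improves extraction accuracy. In addition, a semantic annotation method based on context distance and co-occurrence count is proposed to merge the extraction results from different data sources. Experiments verify the feasibility and effectiveness of the key techniques used in D-EEM; compared with other entity extraction strategies, D-EEM shows certain advantages in extraction efficiency and extraction accuracy.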
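The abstract names context distance and co-occurrence count as the two signals behind semantic annotation but does not give the scoring formula. The sketch below is only an illustration of how the two signals might be combined: the function `rank_labels`, the input format, and the score `count / (1 + mean distance)` are all assumptions made here, not the paper's method.

```python
from collections import defaultdict


def rank_labels(observations):
    """Rank candidate labels for one extracted attribute (hypothetical helper).

    observations: iterable of (label, distance) pairs, one per result page or
    record in which the candidate label was seen near the attribute value.
    distance is how far (e.g. in DOM nodes or tokens) the label was from the value.
    """
    counts = defaultdict(int)            # co-occurrence count per label
    total_distance = defaultdict(float)  # summed context distance per label
    for label, distance in observations:
        counts[label] += 1
        total_distance[label] += distance

    # Assumed score: frequent labels score higher, distant labels score lower.
    scores = {
        label: counts[label] / (1.0 + total_distance[label] / counts[label])
        for label in counts
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    observed = [
        ("Price", 1), ("Price", 1), ("Price", 2),
        ("Title", 3), ("Title", 4),
        ("Author", 5),
    ]
    for label, score in rank_labels(observed):
        print(f"{label}: {score:.3f}")
```

Under this assumed score, a label that appears often and close to the value ranks first; any other monotone combination of the two signals would fit the abstract's description equally well.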

English abstract:

With the growth of Web databases, accessing the Deep Web is becoming the main way to acquire information. The large-scale unstructured content, heterogeneous results, and dynamic data in the Deep Web pose new challenges for entity extraction, so it is important to extract entities from Deep Web result pages effectively. By analyzing the characteristics of result pages, a DOM-tree based entity extraction mechanism for Deep Web (called D-EEM) is presented to solve this problem. D-EEM is modeled as three levels: the expression level, the extraction level, and the collection level. Within this model, the region location and semantic annotation components are the core parts studied in this paper. D-EEM applies a DOM-tree based automatic entity extraction strategy to determine the data regions and the entity regions respectively, which improves extraction accuracy by considering both the textual content and the hierarchical structure of DOM trees. In addition, a semantic annotation method based on Web context and co-occurrence is proposed to support data integration. An experimental study is conducted to evaluate the feasibility and effectiveness of the key techniques of D-EEM. Compared with various other entity extraction strategies, D-EEM is superior in both the accuracy and the efficiency of extraction.
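The abstract states that data regions and entity regions are located using both the textual content and the hierarchical structure of the DOM tree, without giving the algorithm. The following is a minimal sketch of that general idea, not the D-EEM algorithm itself: it assumes the data region is the node whose children repeat the same tag structure most often, weighted by the amount of text they carry. `Node`, `TreeBuilder`, `locate_data_region`, and the scoring rule are all hypothetical names and choices introduced for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (tags and text only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of its children."""
    return (node.tag, tuple(signature(child) for child in node.children))


def text_length(node):
    return len(node.text) + sum(text_length(child) for child in node.children)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def locate_data_region(html):
    """Return (region_node, entity_nodes) under the assumed scoring rule."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_sig, best_score = None, None, 0
    for node in iter_nodes(builder.root):
        if len(node.children) < 2:
            continue
        sig, repeats = Counter(signature(c) for c in node.children).most_common(1)[0]
        score = repeats * text_length(node)  # assumed: structure x text content
        if repeats >= 2 and score > best_score:
            best, best_sig, best_score = node, sig, score
    if best is None:
        return None, []
    return best, [c for c in best.children if signature(c) == best_sig]
```

For a result page whose records are rendered as sibling `<li>` elements with identical inner structure, this sketch would pick the enclosing `<ul>` as the data region and each `<li>` as an entity region; real result pages are far less regular, which is the difficulty the paper addresses.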
