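As Web databases continue to grow, accessing the Deep Web is gradually becoming a primary means of obtaining information. How to effectively extract the entity information contained in Deep Web result pages has therefore become a problem worth studying. By analyzing the characteristics of Deep Web result pages, this paper proposes a DOM-tree based entity extraction mechanism for the Deep Web (D-EEM), which addresses the entity extraction problem in the Deep Web environment. D-EEM adopts a DOM-tree based automatic entity extraction strategy that uses the textual content and hierarchical structure of the DOM tree to determine data regions and entity regions, improving extraction accuracy. In addition, a semantic annotation method based on context distance and co-occurrence counts is proposed, which effectively combines the extraction results from different data sources. Experiments verify the feasibility and effectiveness of the key techniques used in D-EEM; compared with other entity extraction strategies, D-EEM has certain advantages in extraction efficiency and extraction accuracy.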
With the growth of Web databases, accessing the Deep Web is becoming a main way to acquire information. The large-scale unstructured content, heterogeneous results, and dynamic data in the Deep Web pose new challenges for entity extraction, so it is important to extract entities from Deep Web result pages effectively. By analyzing the characteristics of result pages, a DOM-tree based entity extraction mechanism for the Deep Web (called D-EEM) is presented to solve this problem. D-EEM is modeled as three levels: the expression level, the extraction level, and the collection level. Among its components, region location and semantic annotation are the core parts studied in this paper. D-EEM applies a DOM-tree based automatic entity extraction strategy to determine the data regions and the entity regions respectively; by considering both the textual content and the hierarchical structure of the DOM tree, it improves the accuracy of extraction. In addition, a semantic annotation method based on Web context and co-occurrence is proposed to support the integration of extraction results from different data sources. An experimental study is conducted to verify the feasibility and effectiveness of the key techniques of D-EEM. Compared with other entity extraction strategies, D-EEM shows advantages in both the accuracy and the efficiency of extraction.
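To make the idea of locating data regions and entity regions from DOM structure and text more concrete, the following is a minimal sketch and not the D-EEM algorithm itself, whose details are not given in this abstract. It assumes a common simplification: the data region is the node whose text-bearing children most often share the same structural signature, and each such child is treated as one entity region. The names `Node`, `TreeBuilder`, `signature`, and `find_data_region`, as well as the sample page, are hypothetical and introduced only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A simplified DOM node: tag, parent, children, and direct text pieces."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (hypothetical helper, not part of D-EEM)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent            # climb to the matching open element
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.texts.append(data.strip())


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def all_texts(node):
    """Text pieces of a subtree: the node's own text first, then each child's."""
    out = list(node.texts)
    for child in node.children:
        out.extend(all_texts(child))
    return out


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_data_region(root, min_repeats=2):
    """Assumed heuristic: pick the node with the most text-bearing children
    that share one structural signature; those children are the entities."""
    best, best_sig, best_count = None, None, 0
    for node in iter_nodes(root):
        sigs = Counter(signature(c) for c in node.children if all_texts(c))
        if not sigs:
            continue
        sig, count = sigs.most_common(1)[0]
        if count >= min_repeats and count > best_count:
            best, best_sig, best_count = node, sig, count
    if best is None:
        return None, []
    entities = [all_texts(c) for c in best.children
                if signature(c) == best_sig and all_texts(c)]
    return best, entities


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


# Hypothetical result page used only to exercise the sketch.
SAMPLE = """
<html><body>
<div><a>Home</a></div>
<ul>
<li><b>Book A</b><span>$10</span></li>
<li><b>Book B</b><span>$12</span></li>
<li><b>Book C</b><span>$9</span></li>
</ul>
</body></html>
"""

region, entities = find_data_region(parse(SAMPLE))
print(region.tag, entities)
```

On the sample page, the three `li` elements share one signature, so the `ul` is selected as the data region and each `li` yields one entity. A real result page would need more robust matching (for example, tolerance for optional fields), which this sketch deliberately leaves out.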