To address the automatic extraction of data regions and data records from heterogeneous Deep Web result pages, this paper proposes a Web extraction method based on the DOM tree and a domain ontology. The method labels DOM tree nodes using data content features and a domain ontology library, and locates candidate data regions according to the regularities with which result pages present their data. An improved simple tree matching algorithm is then used to identify the data regions and the data records within them. Experimental results show that the F-measure of the method for locating data regions and data records is 2.93% to 6.67% higher than that of traditional extraction methods.
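For orientation, the classic simple tree matching (STM) algorithm, on which the improved variant builds, can be sketched as follows. This is only the textbook baseline under stated assumptions: the abstract does not describe the improvement itself, and the `Node` class, the function names, and the normalization by average tree size are illustrative choices rather than the paper's definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(c) for c in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic STM: size of the maximum order-preserving matching.

    Roots with different tags do not match at all. Otherwise the
    children are aligned with a dynamic program similar to longest
    common subsequence, weighted by the recursive match of each pair.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best match between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1] + pair,
            )
    return table[m][n] + 1  # +1 for the matched roots


def normalized_similarity(a: Node, b: Node) -> float:
    """STM score divided by the average size of the two trees (assumed)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two candidate record subtrees, e.g. rows of a result table.
row_a = Node("tr", [Node("td"), Node("td", [Node("a")]), Node("td")])
row_b = Node("tr", [Node("td"), Node("td", [Node("a")])])
score = normalized_similarity(row_a, row_b)
```

In a record extraction setting, sibling subtrees whose normalized similarity exceeds some chosen threshold would be treated as instances of the same data record template; the threshold value is left open here because the abstract does not state one.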