This paper presents a new vision-based approach for locating data tables in Web pages. The approach simulates human visual cognition and locates data tables by their visual features. First, a layout engine is introduced, which makes it possible to simulate how a browser renders a page and to obtain the visual features of every table in it. Second, a series of rules is proposed for splitting the DOM tree into several independent TABLEs. Finally, a set of visual feature indicators is extracted and the tables are ranked by these indicators, which yields the final data tables of the Web page.
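The split-then-rank pipeline can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the rendered area of each table has already been obtained from a layout engine and is passed in as a plain dictionary (`areas`, a hypothetical input). The splitting rule (each nested table becomes its own record, and its cells are not counted in the parent) and the `score` function (area weighted by grid regularity) are illustrative stand-ins for the paper's rules and indicators, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Table:
    """One independent TABLE; cells of nested tables are not counted here."""
    index: int                                # position in document order
    depth: int                                # nesting depth, 0 = top level
    rows: list = field(default_factory=list)  # cell count of each row


class TableSplitter(HTMLParser):
    """Splits a page into independent TABLE records, nested ones included."""

    def __init__(self):
        super().__init__()
        self.tables = []  # every table found, in document order
        self._open = []   # stack of tables that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table = Table(index=len(self.tables), depth=len(self._open))
            self.tables.append(table)
            self._open.append(table)
        elif tag == "tr" and self._open:
            self._open[-1].rows.append(0)
        elif tag in ("td", "th") and self._open and self._open[-1].rows:
            self._open[-1].rows[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            self._open.pop()


def score(table, area):
    """Hypothetical indicator: rendered area weighted by grid regularity."""
    if len(table.rows) < 2 or max(table.rows) < 2:
        return 0.0  # a single row or a single column is treated as layout
    common = max(set(table.rows), key=table.rows.count)
    regularity = table.rows.count(common) / len(table.rows)
    return area * regularity


def rank_tables(html, areas):
    """Returns all tables sorted by descending score.

    `areas` maps a table's index to its rendered area in square pixels;
    in a real system this would come from the layout engine (assumption).
    """
    splitter = TableSplitter()
    splitter.feed(html)
    return sorted(
        splitter.tables,
        key=lambda t: score(t, areas.get(t.index, 0.0)),
        reverse=True,
    )
```

For a page in which a one-cell layout table wraps a three-row, two-column table, `rank_tables` places the inner table first even when the outer table covers a larger area, because the outer table has no grid structure of its own once the nested cells are excluded.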