As the Semantic Web develops, research on web page semantics continues to advance. At present, however, non-semantic web pages still make up the largest part of information systems on the web. Integrating information systems also requires an understanding of the semantic structure of web pages in order to acquire and analyse their information. This paper proposes a method for analysing the semantic structure of web pages based on visual feature selection. Without relying on the semantics of the page itself, the method uses the visual and content features of the page structure to analyse the semantic relations among the different structures in a page, and applies cluster analysis to infer the semantic structure of the semi-structured information it contains. The method was applied to a set of randomly selected web pages, and the results show that it has relatively good analytical capability.
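The clustering step described above can be illustrated with a minimal sketch. The abstract does not specify which features or which clustering algorithm the method uses, so everything here is an assumption made for illustration: the `Block` fields (`left`, `width`, `font_size`, `text_len`, `link_ratio`), the min-max normalisation, the single-linkage agglomerative clustering, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Block:
    """A rendered page block. All fields are hypothetical example features."""
    label: str
    left: float        # visual: horizontal offset in pixels
    width: float       # visual: rendered width in pixels
    font_size: float   # visual: dominant font size
    text_len: int      # content: number of characters
    link_ratio: float  # content: share of text inside links


def features(b):
    """Combine visual and content features into one vector."""
    return [b.left, b.width, b.font_size, float(b.text_len), b.link_ratio]


def normalise(rows):
    """Min-max scale each column to [0, 1] so no feature dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(blocks, threshold=0.5):
    """Single-linkage agglomerative clustering (an assumed choice).

    Repeatedly merges the two closest clusters until the closest
    pair is farther apart than `threshold`.
    """
    vecs = normalise([features(b) for b in blocks])
    clusters = [[i] for i in range(len(blocks))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(vecs[a], vecs[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(blocks[k].label for k in c) for c in clusters]


# Made-up sample blocks, not data from the paper.
blocks = [
    Block("nav1", 0, 200, 12, 10, 1.0),
    Block("nav2", 0, 200, 12, 12, 1.0),
    Block("art1", 220, 600, 14, 800, 0.05),
    Block("art2", 220, 600, 14, 950, 0.02),
    Block("title", 220, 600, 28, 40, 0.0),
]
groups = cluster(blocks)
print(groups)
```

In this sketch, blocks that look alike and carry similar content fall into the same group, and each group can then be read as one candidate semantic unit of the page (for example navigation links versus article paragraphs). A real system would extract the features from the rendered page rather than hard-code them.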