Many online shopping sites exist on the Web, and extracting the product information from their pages can provide value-added services for e-commerce and Web search. This paper proposes an automatic Web information extraction method for such sites. The method obtains the topic content of a single Web page by detecting repeated patterns in the page and analyzing the characteristics of the topic content, and it requires no human intervention during extraction. Tests on 10 online shopping sites show that the proposed method is effective.
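The idea of detecting repeated patterns in a single page can be illustrated with a minimal sketch. It is not the paper's algorithm, whose details the abstract does not give. It assumes that product records appear as sibling elements sharing the same nested tag structure, and it simply picks the largest such group. The names `Node`, `TreeBuilder`, `find_repeated_records`, and the `min_repeats` threshold are hypothetical, and the step of analyzing topic-content characteristics is not modeled.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM-like node: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Structural fingerprint: the tag plus the fingerprints of its children.
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"

    def all_text(self):
        # Direct text first, then the text of each child in order.
        parts = list(self.text) + [child.all_text() for child in self.children]
        return " ".join(part for part in parts if part)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.text.append(data)


def find_repeated_records(html, min_repeats=3):
    """Return the text of the largest group of structurally identical siblings."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    best = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        # Only siblings with nested structure count as candidate records.
        counts = Counter(c.signature() for c in node.children if c.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats and n > len(best):
            best = [c for c in node.children if c.signature() == sig]
    return [record.all_text() for record in best]


page = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/cart">Cart</a></div>
<ul>
<li><b>Pen</b><span>$1</span></li>
<li><b>Ink</b><span>$5</span></li>
<li><b>Pad</b><span>$3</span></li>
</ul>
</body></html>
"""
print(find_repeated_records(page))  # ['Pen $1', 'Ink $5', 'Pad $3']
```

Choosing the largest group is a deliberate simplification. On a real shopping page, navigation menus and footers also repeat, so a practical system would need a further step to separate the topic area from such noise, which is the role the abstract assigns to analyzing the characteristics of the topic content.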