[Purpose] The main content of a web page is often buried in a large amount of irrelevant structure and content, which prevents users from reaching the topic quickly and limits the usability of web resources. This paper applies information extraction techniques to address the problem. [Method] Information extraction based on the Document Object Model (DOM) can extract the required content simply and accurately, but it depends on the structure of the page itself. Techniques based on the line-block distribution algorithm are free of that dependence on page structure and are not restricted to specific data sources, but they require manual intervention. This paper combines DOM techniques, the line-block distribution algorithm, and regular expressions to implement web page information collection and extraction. [Conclusion] The method extracts web page information automatically and accurately. [Limitations] Extraction results are unsatisfactory for English pages and for pages with complex structure, and the extracted content is limited to text.
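
As a rough illustration of the line-block distribution idea named in the method, the following Python sketch strips markup with regular expressions, sums the text length of each window of consecutive lines, and keeps the longest dense run of lines as the main text. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `strip_markup` and `extract_main_text`, the block width of 3 lines, and the start threshold of 40 characters are all hypothetical choices, and the DOM-based step is not shown.

```python
import re
from html import unescape
from typing import List

BLOCK_WIDTH = 3       # lines per block (assumed value, not from the paper)
START_THRESHOLD = 40  # characters needed for a block to start a run (assumed)


def strip_markup(page: str) -> List[str]:
    """Drop scripts, styles, comments and tags; return one text string per line."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", page)
    page = re.sub(r"(?s)<!--.*?-->", "", page)
    page = re.sub(r"<[^>]+>", "", page)
    return [re.sub(r"\s+", " ", unescape(line)).strip()
            for line in page.split("\n")]


def extract_main_text(page: str,
                      width: int = BLOCK_WIDTH,
                      threshold: int = START_THRESHOLD) -> str:
    """Return the longest dense run of text lines found in the page."""
    lines = strip_markup(page)
    n_blocks = max(len(lines) - width + 1, 0)
    # blocks[i] is the total text length of lines i .. i + width - 1.
    blocks = [sum(len(s) for s in lines[i:i + width]) for i in range(n_blocks)]

    best = ""
    i = 0
    while i < len(blocks):
        if blocks[i] < threshold:
            i += 1
            continue
        # A dense block starts a run; the run ends at the first empty block.
        end = i
        while end < len(blocks) and blocks[end] > 0:
            end += 1
        # If the run reaches the last block, keep the trailing lines as well.
        stop = end if end < len(blocks) else len(lines)
        candidate = "\n".join(s for s in lines[i:stop] if s)
        if len(candidate) > len(best):
            best = candidate
        i = end + 1
    return best
```

Calling `extract_main_text(html_source)` on a page whose navigation and footer are separated from the body by blank lines returns only the body paragraphs. The width and threshold would need tuning per site, which reflects the manual intervention that the abstract attributes to this family of techniques.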