Building on the W3C Document Object Model (DOM) and the principle of meta search engines, this paper presents a system that automatically searches for and extracts Internet news. The system first retrieves web pages for relevant news through search engines, then analyzes them to obtain their metadata, and finally uses the information expressed by the metadata to extract the news body text. The method does not depend on the structure or domain of the original pages and requires no manual intervention, which makes it automatic, reliable, and general. Experimental results show that the extraction method achieves high precision, above 96% on average.
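The abstract does not say how the metadata guides extraction, so the following is only a minimal sketch of one possible reading, not the system described here. It assumes that the metadata is the page title plus the `description` and `keywords` meta tags, and that the news body is the block-level DOM element whose own text shares the most words with that metadata. The names `PageParser`, `own_text`, and `extract_body` are hypothetical, and the search-engine retrieval step is omitted.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate containers for the news body.
BLOCK_TAGS = {"div", "td", "article", "section"}
VOID_TAGS = {"br", "img", "link", "input", "hr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """A simplified DOM element; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class PageParser(HTMLParser):
    """Builds a simplified DOM tree and collects metadata strings."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root
        self.metadata = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumption: only description and keywords count as metadata.
            if (attrs.get("name") or "").lower() in ("description", "keywords"):
                self.metadata.append(attrs.get("content") or "")
            return
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating unclosed tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag == "title":
            self.metadata.append(data)
        elif self.current.tag not in SKIP_TAGS and data.strip():
            self.current.children.append(data.strip())


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def own_text(node):
    """Text inside a node, excluding text that belongs to nested blocks."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in BLOCK_TAGS:
            parts.append(own_text(child))
    return " ".join(p for p in parts if p)


def extract_body(html):
    """Return the text of the block that best matches the page metadata."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    terms = tokens(" ".join(parser.metadata))
    best, best_score = "", (0, 0)
    stack = [parser.root]
    while stack:
        node = stack.pop()
        stack.extend(c for c in node.children if isinstance(c, Node))
        if node.tag in BLOCK_TAGS:
            text = own_text(node)
            # Rank by metadata overlap first, then by text length.
            score = (len(terms & tokens(text)), len(text))
            if score[0] > 0 and score > best_score:
                best, best_score = text, score
    return best
```

Because candidates are scored by word overlap with the metadata rather than by tag names, ids, or fixed paths, the sketch does not rely on any site-specific page template, which is in the spirit of the structure independence claimed above. A real system would likely need a more robust tokenizer, especially for Chinese text, where `\w+` does not segment words.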