This paper designs and implements an adaptive focused crawler with online incremental learning, built on the original classifier-based focused crawler. The crawler comprises a base web page classifier and an adaptive link classifier that learns online and incrementally. The base page classifier uses domain knowledge to classify the topical relevance of the content of each fetched page. The adaptive link classifier adjusts its classification model on the fly, using the pages the crawler has fetched and their link information, so that the topical relevance of links is computed more reasonably. The link ranking module of the system uses the TopicalRank topical relevance computation method to determine the order in which links are fetched. The crawler is applied to the agriculture domain; experimental results and analysis show that, in this domain, it achieves better crawling performance than the original classifier-based focused crawler, which relies only on page relevance and link importance.
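The interaction between the two classifiers and the link queue can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `OnlineLinkClassifier`, `Frontier`, `crawl`, `fetch`, and `page_is_relevant` are hypothetical, the link classifier is a simple logistic regression over anchor-text tokens updated one example at a time, and the queue priority is the raw classifier score rather than the TopicalRank computation, whose details the abstract does not give.

```python
import heapq
import math


class OnlineLinkClassifier:
    """Hypothetical stand-in: logistic regression over anchor-text tokens,
    updated one labelled example at a time (online, incremental)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def score(self, tokens):
        # Estimated probability that the link leads to an on-topic page.
        z = self.bias + sum(self.weights.get(t, 0.0) for t in set(tokens))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens, relevant):
        # One stochastic-gradient step on the log-loss for this example.
        error = (1.0 if relevant else 0.0) - self.score(tokens)
        self.bias += self.learning_rate * error
        for t in set(tokens):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: earlier pushes win on equal scores

    def push(self, url, priority):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def crawl(seeds, fetch, page_is_relevant, max_pages=100):
    """fetch(url) -> (page_text, [(target_url, anchor_tokens), ...]).
    page_is_relevant(text) plays the role of the base page classifier."""
    link_clf = OnlineLinkClassifier()
    frontier = Frontier()
    anchor_of = {}  # anchor tokens under which each URL was first discovered
    for url in seeds:
        frontier.push(url, 1.0)
    harvested, fetched = [], 0
    while len(frontier) and fetched < max_pages:
        url = frontier.pop()
        text, out_links = fetch(url)
        fetched += 1
        relevant = page_is_relevant(text)
        if relevant:
            harvested.append(url)
        # The page verdict becomes a training label for the link that led here.
        if url in anchor_of:
            link_clf.update(anchor_of[url], relevant)
        for target, anchor_tokens in out_links:
            anchor_of.setdefault(target, anchor_tokens)
            frontier.push(target, link_clf.score(anchor_tokens))
    return harvested
```

In this sketch the base page classifier supplies the labels, and every fetched page immediately refines the link model that ranks the links still waiting in the queue, which is the feedback loop the abstract describes. A real system would use richer link features and the paper's own ranking method.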