Focused crawlers are important tools to support applications such as specialized Web portals, online searching, and Web search engines. A topic-driven crawler chooses the best URLs and the most relevant pages to pursue during Web crawling, and dealing with irrelevant pages is difficult. This paper presents a novel focused crawler framework. In our focused crawler, we propose a method to overcome some of the limitations of dealing with irrelevant pages. We also describe the implementation of our focused crawler and present some important metrics and an evaluation function for ranking page relevance. The experimental results show that our crawler can obtain more "important" pages and achieves high precision and recall values.
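As a rough illustration of the kind of evaluation function the abstract refers to, the sketch below scores a fetched page against the crawl topic with cosine similarity over term-frequency vectors. This is an assumption for illustration only; the function name, the similarity measure, and the threshold idea are hypothetical and do not reproduce the metrics defined later in the paper.

import math
from collections import Counter


def relevance_score(page_tokens, topic_tokens):
    """Hypothetical page-relevance score: cosine similarity between the
    page's term-frequency vector and the topic's term-frequency vector."""
    page_vec = Counter(page_tokens)
    topic_vec = Counter(topic_tokens)
    # Dot product over the shared vocabulary.
    dot = sum(page_vec[t] * topic_vec[t] for t in page_vec if t in topic_vec)
    norm = (math.sqrt(sum(v * v for v in page_vec.values()))
            * math.sqrt(sum(v * v for v in topic_vec.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Example: score a page against the crawl topic; a focused crawler would
    # only follow the outgoing URLs of pages scoring above some threshold.
    topic = "focused crawler web search engine".split()
    page = "a focused crawler downloads pages relevant to a search topic".split()
    print(f"relevance = {relevance_score(page, topic):.3f}")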