A New Framework for Focused Web Crawling

ISSN: 1003-7985
Journal: Journal of Southeast University (English Edition)
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] College of Computer Science and Technology / Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130012, Jilin, China
Funding: Supported by the National Natural Science Foundation of China (60373099)

Keywords: Web, focused crawlers, irrelevant pages, relevance metrics

Abstract:

Focused crawlers are important tools for supporting applications such as specialized Web portals, online searching, and Web search engines. A topic-driven crawler chooses the best URLs and relevant pages to pursue during Web crawling, but irrelevant pages are difficult to deal with. This paper presents a novel focused crawler framework, including a method that overcomes some of the limitations of dealing with irrelevant pages. We also introduce the implementation of our focused crawler and present some important metrics and an evaluation function for ranking page relevance. The experimental results show that our crawler can obtain more "important" pages and achieves high precision and recall.
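
For orientation only: this record does not reproduce the paper's framework, metrics, or evaluation function, so the following is a generic best-first focused crawl rather than the authors' method. It is a minimal sketch that assumes a toy in-memory link graph (`pages`), a keyword-overlap score (`relevance`), and a hand-labeled set of relevant URLs. All of these names and data are hypothetical.

```python
import heapq

def relevance(text, topic_terms):
    """Fraction of topic terms that occur in the page text (0.0 to 1.0).
    A stand-in scoring rule, not the paper's evaluation function."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def focused_crawl(pages, seeds, topic_terms, budget):
    """Best-first crawl over an in-memory graph; returns URLs in visit order.

    pages maps url -> (text, outgoing_links). Each link inherits the
    relevance score of the page it was found on, and the highest-scored
    link is fetched next. Seeds must be keys of pages.
    """
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        visited.append(url)
        score = relevance(text, topic_terms)
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

def precision_recall(visited, relevant):
    """Precision and recall of a non-empty crawl against a labeled set."""
    hits = len(set(visited) & relevant)
    return hits / len(visited), hits / len(relevant)

# Hypothetical toy data: page text and outgoing links.
pages = {
    "seed": ("web crawler index", ["a", "b"]),
    "a": ("focused crawler topic relevance", ["c"]),
    "b": ("cooking recipes pasta", ["d"]),
    "c": ("crawler relevance metrics", []),
    "d": ("travel hotels", []),
}
topic = ["crawler", "relevance"]
relevant = {"seed", "a", "c"}  # assumed ground-truth labels

order = focused_crawl(pages, ["seed"], topic, budget=4)
print(order, precision_recall(order, relevant))
```

With a budget of three pages the sketch visits only the on-topic pages. The fourth fetch falls back to the off-topic page `b`, which lowers precision while recall stays the same.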
