Crawling Web Resources Using Joint Link Similarity Evaluation

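Journal: 计算机学报 (Chinese Journal of Computers)
Date: 0
Pages: 2267-2280
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072; [2] School of Computer Science, Wuhan University, Wuhan 430072
Funding: Supported by the National Natural Science Foundation of China (60970018)
Related project: Research on Personality Mining and Ranking of Web Community Users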

Keywords: focused crawling, topic similarity, link evaluation, Web link graph, Q learning

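Chinese abstract (translated):

Acquiring resources of interest from the Web is an important problem in many areas of Web research. Domain-specific Web resources are currently gathered mainly by focused crawling, but existing focused-crawling techniques still struggle to deliver efficient crawling and high-quality results at the same time. This paper proposes a crawling algorithm based on joint link similarity evaluation, which combines direct and indirect evidence when estimating the topic similarity of a link. The direct evidence is the topic similarity of the link's anchor text; the indirect evidence comes from a Q-learning-based algorithm that incrementally learns a Web link graph. The algorithm first builds a Web link graph from the result pages obtained during focused crawling, then learns the graph online to obtain a mapping between links and their topic similarity. Each link is modeled as a multi-attribute feature vector, so the link evaluator can map the current link into the link space of the Web link graph and obtain its approximate topic similarity. Experiments on three topic domains show that the algorithm significantly improves the precision and recall of the crawl results.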

English abstract:

For many fields of Web research, how to fetch resources of interest is crucial. At present, the chief method for obtaining domain-specific resources on the Web is focused crawling. However, most current focused-crawling techniques have difficulty achieving efficient crawling and high-quality crawl results at the same time. This paper proposes an algorithm based on joint link similarity evaluation. When evaluating the similarity between a link and a specific topic, the algorithm combines direct evidence with indirect evidence of the link's topic similarity. The direct evidence is obtained by computing the topic similarity of the anchor text corresponding to the link. For the indirect evidence, this paper presents a Q-learning-based algorithm for incrementally learning a Web link graph. The algorithm first builds a Web link graph from the on-topic Web pages fetched by the focused crawler and then obtains the mapping between links and topic similarity through online learning. By modeling each link as a multi-attribute vector, the system enables the link evaluator to map the current link into the space of the Web link graph and thus obtain its approximate topic similarity. Experimental results for three specific topics show that the algorithm significantly improves the precision and recall of the crawl results.
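
The abstract does not give the paper's formulas, so the following is only a minimal Python sketch of the general idea: a direct score from anchor-text similarity is combined with an indirect score read from a Q-value table learned over previously crawled links. Everything named here is an assumption made for illustration and is not taken from the paper: the bag-of-words cosine measure, the `LinkGraphLearner` class, the tuple-valued link features, the nearest-neighbour lookup, the linear `weight`, and the `alpha` and `gamma` defaults.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


class LinkGraphLearner:
    """Hypothetical stand-in for the learned Web link graph.

    Each crawled link is stored as a feature tuple with a Q-value that
    estimates how much on-topic content following the link leads to.
    """

    def __init__(self, alpha: float = 0.5, gamma: float = 0.8):
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = {}         # feature tuple -> Q-value

    def update(self, features, reward, next_features=()):
        """Textbook Q-learning update after a link has been followed.

        reward: topic similarity of the page the link led to.
        next_features: feature tuples of the out-links found on that page.
        """
        best_next = max((self.q.get(f, 0.0) for f in next_features), default=0.0)
        old = self.q.get(features, 0.0)
        target = reward + self.gamma * best_next
        self.q[features] = old + self.alpha * (target - old)

    def estimate(self, features) -> float:
        """Indirect evidence: Q-value of the nearest known link in feature space."""
        if not self.q:
            return 0.0
        nearest = min(self.q, key=lambda f: math.dist(f, features))
        return self.q[nearest]


def joint_score(anchor_text, topic, features, learner, weight=0.5) -> float:
    """Linear blend of direct (anchor text) and indirect (learned) evidence."""
    direct = cosine_similarity(anchor_text, topic)
    indirect = learner.estimate(features)
    return weight * direct + (1.0 - weight) * indirect
```

In a crawler built along these lines, `joint_score` would order the frontier, and `update` would be called each time a page is fetched so that the table grows incrementally during the crawl. The linear blend is only one possible way to combine the two kinds of evidence.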

Project associated with this journal paper

Research on Personality Mining and Ranking of Web Community Users

Journal papers: 15; conference papers: 6; patents: 3

Journal papers from the same project

Credibility assessment of Web resources based on a FOAF evolutionary game

Web temporal inconsistency modeling based on web time axis

WPBL: A webpage block labeling based approach for web information extraction

Belief reasoning recommendation: Mashing up web information fusion and FOAF

Subposition Assembly-Based Construction of Non-Frequent Concept Semi-Lattice

A Deep Web data extraction method based on a hierarchical tree model

A lattice-space-based data extraction algorithm for restricted Deep Web sources

A data clustering algorithm integrating particle swarm optimization and K-means

Retrieving deep web data through multi-attributes interfaces with structured queries

Web Database Sampling Approach Based on Attribute Correlation

A global closed frequent itemset mining algorithm based on superposition and semi-integration of Iceberg concept lattices

Schema matching and interface integration for the Chinese Deep Web

Ranking Deep Web data sources based on data quality