How to obtain resources of interest from the Web is an important problem in many areas of Web research. At present, domain-specific Web resources are mainly gathered through focused crawling. However, existing focused crawling techniques still have many difficulties in achieving efficient crawling and high-quality crawl results at the same time. This paper proposes a crawling algorithm based on joint link similarity evaluation. When evaluating the topic similarity of a link, the algorithm combines direct and indirect evidence about that similarity. The direct evidence is obtained by computing the topic similarity of the link's anchor text, while the indirect evidence is obtained through a Q-learning-based algorithm that incrementally learns a Web link graph. This algorithm first builds a Web link graph from the pages fetched during focused crawling, and then learns the graph online to obtain a mapping between links and their topic similarity. By modeling each link as a multi-attribute feature vector, the link evaluator can map the current link into the link space of the Web link graph and thus obtain its approximate topic similarity. Experiments on three topic domains show that the algorithm significantly improves the precision and recall of the crawl results.
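The joint evaluation described above can be illustrated with a minimal sketch. It is not the implementation from this paper: the abstract does not specify the similarity measure, the link attributes, the reward, or how the two kinds of evidence are combined. The sketch assumes cosine similarity over bag-of-words anchor text for the direct evidence, a tabular Q-learning style update keyed by a hypothetical numeric feature tuple for the indirect evidence, a nearest-neighbor lookup to map an unseen link into the learned link space, and a simple weighted sum controlled by an assumed `weight` parameter. All names (`LinkGraphLearner`, `joint_score`, `alpha`, `gamma`) are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def direct_evidence(anchor_text, topic_terms):
    """Direct evidence: topic similarity of the link's anchor text (assumed cosine)."""
    anchor = Counter(anchor_text.lower().split())
    topic = Counter(t.lower() for t in topic_terms)
    return cosine(anchor, topic)


class LinkGraphLearner:
    """Hypothetical incremental learner over links seen during the crawl."""

    def __init__(self, alpha=0.5, gamma=0.8):
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = {}         # feature tuple -> estimated topic similarity

    def update(self, features, reward, next_features=()):
        """Q-learning style update after the page behind a link is fetched.

        reward: topic similarity of the fetched page (assumption).
        next_features: feature tuples of that page's out-links.
        """
        best_next = max((self.q.get(f, 0.0) for f in next_features), default=0.0)
        old = self.q.get(features, 0.0)
        self.q[features] = old + self.alpha * (reward + self.gamma * best_next - old)

    def indirect_evidence(self, features):
        """Map a link to its nearest known link in feature space (assumption)."""
        if not self.q:
            return 0.0
        nearest = min(self.q, key=lambda f: math.dist(f, features))
        return self.q[nearest]


def joint_score(anchor_text, features, topic_terms, learner, weight=0.5):
    """Weighted sum of direct and indirect evidence (combination rule assumed)."""
    direct = direct_evidence(anchor_text, topic_terms)
    indirect = learner.indirect_evidence(features)
    return weight * direct + (1.0 - weight) * indirect


if __name__ == "__main__":
    learner = LinkGraphLearner()
    # A previously crawled link whose target page turned out to be on topic.
    learner.update((1.0, 0.0), reward=1.0)
    score = joint_score("machine learning tutorial", (0.9, 0.1),
                        ["machine", "learning"], learner)
    print(round(score, 3))
```

In a crawler built along these lines, the frontier would be a priority queue ordered by `joint_score`, and `update` would be called each time a page is fetched, so that the indirect estimate improves as the link graph grows.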