General-purpose search engines cover a very large number of web pages, whereas search engines that adopt a focused (topic) crawler search strategy retrieve pages with higher relevance and collect fewer irrelevant pages. To improve the search efficiency of focused crawlers, a focused-crawler search strategy based on the cuckoo search algorithm is designed. The URLs of crawled pages are treated as nest individuals, and the relevance of every page in the candidate URL set is computed. Lévy flights are applied over multiple iterations to find the URLs with high relevance, and a random number is then compared with the discovery probability Pa to generate new URLs. Experimental results show that, compared with other focused-crawling techniques, this strategy crawls topic-relevant pages more efficiently.
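The strategy summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `levy_step` and `cuckoo_select`, the use of Mantegna's algorithm for the Lévy step, the mapping of a step onto an index shift in the candidate list, the rule that a nest is abandoned when the random number falls below Pa, and all default parameter values are assumptions made for illustration. The relevance function is assumed to be supplied by the caller.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one Levy-distributed step (Mantegna's algorithm, an assumption here)."""
    sigma = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    # Guard against a vanishing denominator.
    return u / max(abs(v), 1e-12) ** (1 / beta)


def cuckoo_select(candidates, relevance, n_nests=5, pa=0.25, iterations=20, seed=0):
    """Return the URLs held by the nests, most relevant first.

    candidates: list of URL strings waiting to be crawled
    relevance:  hypothetical caller-supplied function, URL -> score
    pa:         discovery probability Pa
    """
    rng = random.Random(seed)
    n = len(candidates)
    scores = [relevance(u) for u in candidates]          # relevance of every candidate
    nests = [rng.randrange(n) for _ in range(n_nests)]   # each nest holds a URL index

    for _ in range(iterations):
        # Levy flight: propose a new URL for each nest, keep it only if more relevant.
        for i, idx in enumerate(nests):
            new_idx = (idx + round(levy_step(rng))) % n
            if scores[new_idx] > scores[idx]:
                nests[i] = new_idx

        # Discovery: compare a random number with Pa; abandoned nests get a new URL.
        best = max(nests, key=lambda k: scores[k])
        for i, idx in enumerate(nests):
            if idx != best and rng.random() < pa:
                nests[i] = rng.randrange(n)

    ranked = sorted(set(nests), key=lambda k: scores[k], reverse=True)
    return [candidates[k] for k in ranked]
```

In this sketch the most relevant nest is never abandoned, so the best URL found so far is always retained between iterations. A real crawler would recompute the candidate set as new pages are fetched and would replace the toy index shift with a move that reflects the link structure or the feature space of the pages.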