To increase the speed at which Internet web pages are crawled, this paper proposes an improved distributed crawler model based on T-Spider. In the URL-parsing stage, the crawler cuts each page into segments so that they can be parsed in parallel; in the page-scheduling stage, it uses an improved method for computing link priority. Together, these changes improve the crawler's speed and stability. Analysis of the experimental results verifies the effectiveness of the method.