针对并行爬虫系统在多任务并发执行时所遇到的模块间负载平衡问题,提出流水线负载平衡模型(PLB),将不同的任务抽象为独立模块而达到各模块的处理速度相等,采用多线程的方式实现基于PLB的并行爬虫,根据线程的休眠和缓冲区的变化对线程数量进行动态调整以实现PLB。实验结果表明该方法具有良好的运行效率和稳定性。
This paper proposes a load balancing model named Pipeline Load Balancing(PLB), to address the load balancing problem among concurrent modules in a parallel crawling system. Different tasks in PLB are implemented as independent modules which have similar processing abilities. Dynamic multi-threading and buffering mechanisms are employed to implement a PLB-based parallel crawler. The number of threads is adjusted according to the changing in buffer size and waiting interval of a thread. Experimental results show that the PLB-based crawler provides high performance as well as good stability.