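To address the characteristics of Dual-Structural Networks and the challenges they pose for URL deduplication, a URL deduplication mechanism based on a scalable, dynamic splittable Bloom Filter is proposed, building on the working principles of the Bloom Filter, and is implemented and deployed in a prototype system. Experimental results show that the mechanism is well suited to large-scale, high-performance, distributed crawler applications for Dual-Structural Networks.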
In this paper, the concept of the Dual-Structural Network is first introduced and the principles of the Bloom Filter are surveyed. Then, the basic requirements for detecting duplicate URLs in a Dual-Structural Network are analyzed. On this basis, a dynamic splittable Bloom Filter for web crawlers is proposed, which can increase its capacity according to application requirements and suits large-scale, high-performance, distributed web crawlers. Finally, the feasibility and efficiency of the proposed Bloom Filter are demonstrated by a series of experiments.
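As an illustration of the general idea of a Bloom Filter whose capacity grows on demand, the following minimal Python sketch chains fixed-size filters and opens a new one when the current one is full. It is not the mechanism proposed in this paper, whose splitting rules are not given in the abstract; the class names `BloomFilter` and `GrowableBloomFilter`, the parameters `slice_capacity` and `error_rate`, and the SHA-256 double-hashing scheme are all assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter (illustrative; hashing scheme is an assumption)."""

    def __init__(self, capacity, error_rate=0.01):
        self.capacity = capacity
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0  # number of items inserted so far

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class GrowableBloomFilter:
    """Hypothetical growable filter: appends a new slice when the newest one is full."""

    def __init__(self, slice_capacity=1000, error_rate=0.01):
        self.slice_capacity = slice_capacity
        self.error_rate = error_rate
        self.slices = [BloomFilter(slice_capacity, error_rate)]

    def __contains__(self, url):
        # A URL counts as seen if any slice reports it (false positives are possible).
        return any(url in s for s in self.slices)

    def add_if_new(self, url):
        """Record url and return True if it was not seen before, else return False."""
        if url in self:
            return False
        if self.slices[-1].count >= self.slice_capacity:
            self.slices.append(BloomFilter(self.slice_capacity, self.error_rate))
        self.slices[-1].add(url)
        return True
```

A crawler would call `add_if_new(url)` before enqueueing a URL and skip it when the call returns `False`. Note that in this simple chained design every lookup probes all slices, and the overall false-positive rate grows with the number of slices; a production design would need to bound both.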