To obtain product reviews from e-commerce websites as early as possible, and thereby track public opinion on products in real time, a Storm-based method for collecting online product review information is proposed. The method applies the concept of stream computing to the web crawler and uses the SHHD (Simhash Hamming Distance) algorithm to dynamically adjust the collection period. Experimental results show that information collection on the Storm platform offers high throughput and strong scalability. The SHHD algorithm effectively reduces the collection system's consumption of network bandwidth and system resources, yielding an adaptive, incremental process for collecting online product reviews. SHHD also shows a clear advantage over methods such as Poisson and SART in the lag time of review acquisition.
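The abstract does not give the SHHD update rule itself, so the following is only a minimal sketch of the general idea it names: fingerprint two successive snapshots of a review page with Simhash, measure their Hamming distance, and shorten or lengthen the collection period accordingly. The function names, the distance threshold, the halving/doubling rule, and the interval bounds are all illustrative assumptions, not the authors' algorithm or parameters.

```python
import hashlib
import re

BITS = 64  # fingerprint width; assumption for this sketch


def simhash(text: str) -> int:
    """Compute a 64-bit Simhash fingerprint over whitespace/word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Use the first 8 bytes of MD5 as a stable 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def next_interval(current: float, distance: int, threshold: int = 3,
                  factor: float = 2.0, min_interval: float = 60.0,
                  max_interval: float = 86400.0) -> float:
    """Hypothetical adjustment rule (not taken from the paper):
    if the page changed noticeably, revisit sooner; otherwise back off."""
    if distance > threshold:
        proposed = current / factor
    else:
        proposed = current * factor
    # Clamp to the allowed range of collection periods (in seconds).
    return max(min_interval, min(max_interval, proposed))


if __name__ == "__main__":
    old_page = "great phone battery lasts long"
    new_page = "great phone battery lasts long screen cracked after a week"
    d = hamming_distance(simhash(old_page), simhash(new_page))
    print("distance:", d, "next interval:", next_interval(600.0, d))
```

In a Storm topology, a rule of this kind would plausibly sit in a bolt that receives fetched pages and emits the next scheduled fetch time back to the URL queue; that placement is likewise an assumption rather than something the abstract specifies.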