The Apriori algorithm for association rule mining suffers from long computation times and low efficiency when applied to the massive data sets of the information explosion era. To address these problems, this paper stores the data in a specific data structure to reduce the number of passes over the data, performs the pruning operation before the join operation, and changes the condition used to decide pruning. The improved algorithm, IApriori, is then combined with Apache Spark, an in-memory parallel computing framework for big data, to give a Spark-based improved Apriori algorithm (Spark+IApriori). Experimental results show that Spark+IApriori outperforms Apriori in both cluster scalability and speedup.
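The two single-machine ideas named above (a storage structure that avoids repeated scans, and pruning before the join) can be illustrated with a minimal Python sketch. The abstract does not specify the data structure or the new pruning condition, so both choices here are assumptions: the data are held as item-to-transaction-id sets built in one pass, and a (k-1)-itemset is dropped before the join when it contains an item that appears in fewer than k-1 frequent (k-1)-itemsets. The function names are hypothetical, items are assumed to be mutually comparable (for example, all strings), and the Spark distribution step is not shown.

```python
from collections import Counter
from itertools import combinations


def build_tidsets(transactions):
    """Scan the data once, mapping each item to the ids of transactions containing it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def prune_before_join(prev_level, k):
    """Assumed pruning rule: an item in a frequent k-itemset must occur in at
    least k-1 frequent (k-1)-itemsets, so itemsets with rarer items are dropped."""
    counts = Counter(item for itemset in prev_level for item in itemset)
    return [s for s in prev_level if all(counts[i] >= k - 1 for i in s)]


def frequent_itemsets(transactions, min_support):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    tidsets = build_tidsets(transactions)
    # level maps each frequent (k-1)-itemset to its transaction-id set
    level = {(item,): tids for item, tids in tidsets.items()
             if len(tids) >= min_support}
    result = {s: len(t) for s, t in level.items()}
    k = 2
    while level:
        survivors = sorted(prune_before_join(list(level), k))
        next_level = {}
        for a, b in combinations(survivors, 2):
            if a[:-1] != b[:-1]:          # join only itemsets sharing a prefix
                continue
            candidate = a + (b[-1],)
            # classic Apriori check: every (k-1)-subset must be frequent
            if any(sub not in level for sub in combinations(candidate, k - 1)):
                continue
            tids = level[a] & level[b]    # support via intersection, no rescan
            if len(tids) >= min_support:
                next_level[candidate] = tids
        result.update({s: len(t) for s, t in next_level.items()})
        level = next_level
        k += 1
    return result
```

With this layout the original transactions are read only once; every later support count is a set intersection. In a Spark setting, the candidate counting would presumably be spread across partitions, but how the paper does this is not stated in the abstract.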