To address the processing-speed and computing-resource bottlenecks of the traditional Apriori algorithm, as well as the limitations of the MapReduce framework on Hadoop (it cannot handle node failures, does not support iterative computation well, and cannot compute in memory), this paper proposes an optimized parallel association rule algorithm on Spark. The algorithm scans the transaction database only twice and stores itemsets in Spark's in-memory RDDs. Compared with the traditional Apriori algorithm, it greatly reduces the number of database scans; compared with Apriori on Hadoop, it simplifies the computation, supports iteration, and reduces I/O overhead by caching intermediate results in memory. Experimental results show that the algorithm improves the efficiency of association rule mining on large-scale data.
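The abstract does not say how the two-scan bound is achieved. As a rough illustration only, one common way to get there is sketched below in plain Python, with no Spark dependency: the first scan counts single items, the second scan builds an in-memory map from each frequent item to the set of transaction IDs that contain it, and all later levels are computed by intersecting those sets. The function name `frequent_itemsets`, the transaction-ID-set layout, and the use of a Python dict in place of a cached RDD are assumptions made for this sketch, not details taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support count}, reading the data only twice."""
    # Scan 1: count how many transactions contain each single item.
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    frequent_items = {i for i, c in counts.items() if c >= min_support}

    # Scan 2: map each frequent item to the IDs of the transactions holding it.
    # This dict stands in for an RDD cached in memory (an assumption).
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in set(t):
            if item in frequent_items:
                tidsets[item].add(tid)

    # From here on, only in-memory data is touched; the database is not rescanned.
    level = {frozenset([i]): tids for i, tids in tidsets.items()}
    result = {s: len(tids) for s, tids in level.items()}
    k = 2
    while level:
        next_level = {}
        for a, b in combinations(list(level), 2):
            cand = a | b
            if len(cand) != k or cand in next_level:
                continue
            # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
            if any(cand - {x} not in level for x in cand):
                continue
            # Support of the union is the size of the intersected ID sets.
            tids = level[a] & level[b]
            if len(tids) >= min_support:
                next_level[cand] = tids
        result.update({s: len(t) for s, t in next_level.items()})
        level = next_level
        k += 1
    return result
```

Calling `frequent_itemsets(transactions, min_support)` with a list of item lists and an absolute support threshold returns every frequent itemset with its count. In a Spark setting, the two scans would presumably become two passes over an RDD of transactions, with the per-item ID sets persisted in memory between iterations.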