基于Spark大数据框架,将传统Apriori算法进行并行化处理,提出了一种改进的并行化AMRDD算法,使Apriori算法能够适用于大数据关联规则的挖掘.该算法利用Spark基于内存计算的抽象对象存储频繁项集,通过引入矩阵概念减少扫描事务数据库的次数,应用局部剪枝和全局剪枝方法缩减生成候选频繁项集的数量.通过搭建Spark平台实现该算法,并与传统Apriori算法和基于Hadoop的Apriori算法进行性能上的比较.结果表明,该算法能够较大程度地提高大数据关联规则挖掘的效率.
The AMRDD algorithm is proposed on the basis of the traditional Apriori algorithm,which is a distributed association rules algorithm based on Spark.To reduce the times of scanning the database,the matrix is introduced,and the number of candidate frequent itemsets is reduced by using local pruning strategy and global pruning strategy.The algorithm is realized on Spark platform,and compare with the traditional Apriori algorithm and the Apriori algorithm based on Hadoop.The experimental results show that AMRDD algorithm performs effectively on big data for mining frequent itemsets.