To address the Apriori algorithm's repeated database scans, large candidate itemsets, and long computation time, this paper proposes DG-Apriori, an algorithm based on the Hadoop platform. The algorithm improves how frequent itemsets are joined: frequent k-itemsets are generated simply by joining frequent (k-1)-itemsets with frequent 1-itemsets, which greatly reduces the number of join operations and avoids generating huge candidate itemsets. The improved Apriori algorithm is then ported to the Hadoop platform, where frequent itemsets are computed in parallel, reducing computation time. Experimental results show that DG-Apriori substantially improves on the performance of the Apriori algorithm.
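
The join step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`count_support`, `frequent_itemsets`), the rule of extending an itemset only with items larger than its current maximum, and the in-process simulation of partitioned counting are all assumptions made for illustration. No Hadoop API is used; the partition loop only mimics the idea of counting support locally and then summing the partial counts.

```python
from collections import Counter


def count_support(partitions, candidates):
    """Count candidate support per partition, then sum the partial counts.

    This only imitates a map/reduce flow inside one process (assumption);
    it does not call any Hadoop API.
    """
    total = Counter()
    for part in partitions:
        local = Counter()  # "map" side: counts local to one partition
        for transaction in part:
            for candidate in candidates:
                if candidate <= transaction:
                    local[candidate] += 1
        total.update(local)  # "reduce" side: add partial counts together
    return total


def frequent_itemsets(transactions, min_support, n_partitions=2):
    """Level-wise mining that joins frequent (k-1)-itemsets with frequent 1-itemsets.

    Items are assumed to be mutually comparable (e.g. all strings) so that
    each candidate is generated only once (hypothetical design choice).
    """
    transactions = [frozenset(t) for t in transactions]
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]

    # Frequent 1-itemsets.
    item_counts = Counter(item for t in transactions for item in t)
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_support)
    result = {frozenset([i]): item_counts[i] for i in frequent_items}

    current = [frozenset([i]) for i in frequent_items]
    while current:
        # Join each frequent (k-1)-itemset with frequent 1-items only.
        candidates = set()
        for itemset in current:
            largest = max(itemset)
            for item in frequent_items:
                if item > largest:
                    candidates.add(itemset | {item})

        counts = count_support(partitions, candidates)
        current = [c for c in candidates if counts[c] >= min_support]
        result.update({c: counts[c] for c in current})
    return result
```

Because every subset of a frequent itemset is itself frequent, extending each frequent (k-1)-itemset with a larger frequent item still reaches every frequent k-itemset, while the number of joins per level is bounded by the number of frequent (k-1)-itemsets times the number of frequent items.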