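A parallelization strategy based on gene itemsets, called GP, is proposed to exploit the pruning power of the serial algorithm. Its basic idea is to use the complete inclusion relation among gene itemsets to assign equivalence classes greedily to processors, and to partition and replicate database records according to the needs of each equivalence class, so that the processors can compute asynchronously. This achieves good load balance, high pruning efficiency, and little replication of database records, which shortens the algorithm's execution time. Analysis and experiments show that the parallel algorithm based on the GP strategy scales well and outperforms existing algorithms of the same kind.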
Mining frequent itemsets is a central problem in data mining applications, and it has been shown to be NP-hard. Parallel techniques are widely used to improve the efficiency of mining algorithms. This paper proposes GP, a novel parallel strategy for mining maximal frequent itemsets. The basic idea is to increase pruning efficiency by distributing work greedily among the processors according to the complete inclusion relation among gene itemsets, and to replicate database records selectively, as each equivalence class requires, so that each processor can compute frequent itemsets independently. These techniques eliminate the need for synchronization and substantially reduce I/O overhead. Analysis and experimental results show that the approach scales well and outperforms previous algorithms of the same kind.
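
The two steps named above, greedy distribution of equivalence classes and selective replication of records, can be illustrated with a minimal sketch. It is not the GP algorithm itself: the abstract does not define how gene itemsets or their inclusion relation drive the assignment, so the sketch assumes that each equivalence class is identified by a single prefix item and carries an estimated workload, and it substitutes a generic heaviest-first greedy heuristic. The names `assign_classes`, `partition_records`, and `class_weights` are hypothetical.

```python
import heapq


def assign_classes(class_weights, num_procs):
    """Assign each class to the currently least-loaded processor, heaviest first.

    Assumption: class_weights maps a class prefix item to an estimated workload.
    This is a generic greedy heuristic, not the gene-itemset criterion of GP.
    """
    heap = [(0, p) for p in range(num_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Sort by descending weight; ties are broken by prefix for determinism.
    for prefix, weight in sorted(class_weights.items(),
                                 key=lambda kv: (-kv[1], kv[0])):
        load, proc = heapq.heappop(heap)
        assignment[proc].append(prefix)
        heapq.heappush(heap, (load + weight, proc))
    return assignment


def partition_records(transactions, assignment):
    """Copy a record to every processor that owns a class prefix it contains.

    Records matching no assigned prefix are not copied anywhere, and records
    matching prefixes on several processors are replicated to each of them.
    """
    local = {proc: [] for proc in assignment}
    for record in transactions:
        items = set(record)
        for proc, prefixes in assignment.items():
            if items.intersection(prefixes):
                local[proc].append(record)
    return local


# Hypothetical toy input: four classes, two processors.
class_weights = {"a": 6, "b": 5, "c": 4, "d": 3}
transactions = [["a", "b"], ["c"], ["d", "e"], ["e"]]

assignment = assign_classes(class_weights, num_procs=2)
local_db = partition_records(transactions, assignment)
```

After this split each processor holds every record relevant to its own classes, so under the stated assumptions it could mine them without communicating with the others. The record `["a", "b"]` is copied to both processors, which is the replication cost that the strategy aims to keep small.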