Effectively reducing the number of frequent itemsets is an active topic in data mining research, and clustering frequent itemsets is one approach to the problem. Because generators are a lossless concise representation of all frequent itemsets, clustering the generators has the same effect as clustering all frequent itemsets. This paper proposes a generator-based algorithm for clustering frequent itemsets. First, the rationale for choosing generators as the objects to cluster is discussed using the minimum description length principle. Second, pruning strategies and a mining algorithm for generators are given. Finally, a clustering algorithm for generators is presented, based on a new similarity measure for itemsets. Experimental results show that the method effectively reduces the number of itemsets and achieves high mining efficiency.
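To make the notion of a generator concrete, the following is a minimal sketch, not the mining algorithm of this paper. It uses the standard definition (an itemset is a generator if no proper subset has the same support) and a brute-force enumeration in place of the pruning strategies mentioned above. The toy transactions, the `min_support` value, and all function names are assumptions made for illustration only.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all non-empty frequent itemsets.

    Exponential in the number of items; suitable for toy data only.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
    return result


def generators(frequent, n_transactions):
    """Keep the itemsets whose support differs from that of every proper subset.

    Because support never increases when items are added, it is enough
    to compare against the subsets obtained by removing a single item.
    The empty set is treated as having support `n_transactions`.
    """
    result = {}
    for itemset, count in frequent.items():
        is_generator = True
        for item in itemset:
            subset = itemset - {item}
            subset_support = frequent[subset] if subset else n_transactions
            if subset_support == count:
                is_generator = False
                break
        if is_generator:
            result[itemset] = count
    return result


# Hypothetical toy database (an assumption, not data from the paper).
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"c"}),
]
min_support = 2

frequent = frequent_itemsets(transactions, min_support)
gens = generators(frequent, len(transactions))

for itemset, count in sorted(gens.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this toy database there are seven frequent itemsets but only five generators: {a, b} is dropped because it has the same support as {a}, and {a, b, c} is dropped because it has the same support as {a, c}. A clustering step would then operate on the five generators instead of all seven itemsets.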