In existing association rule mining algorithms, a large share of the time needed to obtain rules is spent scanning itemsets, repeatedly scanning the database, and generating redundant candidate frequent itemsets. Traditional methods yield fairly comprehensive mining results, but not every rule they produce is important. Because they do not single out the important association rules, their results are less effective and make it harder to reach the important target results. To mine such important targets, an improved Apriori algorithm based on heap sort and a linked-list structure is proposed. The algorithm scans the database to count how often each itemset occurs across all transactions, then heap-sorts the itemsets by frequency. From the resulting heap it obtains all candidate k-itemsets and computes their support, and it compares each itemset's support with the preset minimum support. An itemset that meets the minimum support is added to the linked list as a frequent itemset; otherwise it is removed according to the pruning strategy. The candidate (k+1)-itemsets produced by the join operation are processed in the same way to yield the frequent (k+1)-itemsets. This substantially streamlines the generation of frequent itemsets and accelerates the generation of important association rules, improving overall running speed.
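The procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `apriori_heap`, the use of `heapq` for the frequency ordering, and `collections.deque` as a stand-in for the linked list are all assumptions made for illustration, and items are assumed to be mutually comparable (for example, strings) so that frequency ties in the heap can be broken.

```python
import heapq
from collections import Counter, deque
from itertools import combinations


def apriori_heap(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (illustrative sketch)."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)

    # Pass 1: count single items and build a heap ordered by descending frequency.
    counts = Counter(item for t in tx for item in t)
    heap = [(-c, item) for item, c in counts.items()]
    heapq.heapify(heap)

    frequent = {}     # itemset -> support
    level = deque()   # stand-in for the linked list of frequent k-itemsets

    # Pop items from most to least frequent; stop at the first one below the threshold,
    # since every item still in the heap is at most as frequent.
    while heap:
        neg_count, item = heapq.heappop(heap)
        support = -neg_count / n
        if support < min_support:
            break
        itemset = frozenset([item])
        level.append(itemset)
        frequent[itemset] = support

    k = 1
    while level:
        prev = set(level)
        candidates = set()
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        for a, b in combinations(level, 2):
            union = a | b
            # Prune step: keep a candidate only if all its k-subsets are frequent.
            if len(union) == k + 1 and all(
                frozenset(sub) in prev for sub in combinations(union, k)
            ):
                candidates.add(union)

        level = deque()
        for cand in candidates:
            # Support = fraction of transactions that contain the candidate.
            support = sum(1 for t in tx if cand <= t) / n
            if support >= min_support:
                level.append(cand)
                frequent[cand] = support
        k += 1

    return frequent
```

For example, calling `apriori_heap([{"bread", "milk"}, {"bread", "beer"}], 0.5)` returns the supports of the itemsets that occur in at least half of the transactions. In this sketch the heap ordering only allows the first pass to stop early; how the original algorithm uses the heap at higher orders is not specified in the abstract.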