随着现实待挖掘数据库规模不断增长,系统可使用的内存成为用FP-GROWTH算法进行关联规则挖掘的瓶颈.为了摆脱内存的束缚,对大规模数据库中的数据进行关联规则挖掘,基于磁盘的关联规则挖掘成为重要的研究方向.对此,改进原始的FP-TREE数据结构,提出了一种新颖的基于磁盘表的DTRFP-GROWTH(disktableresidentFP-TREEgrowth)算法.该算法利用磁盘表存储FP-TREE,降低内存使用,在传统FP-GROWTH算法占用过多内存、挖掘工作无法进行时,以独特的磁盘表存储FP-TREE技术,减少内存使用,能够继续完成挖掘工作,适合空间性能优先的场合.不仅如此,该算法还将关联规则挖掘和关系型数据库整合,克服了基于文件系统相关算法效率较低、开发难度较大等问题.在真实数据集上进行了验证实验以及性能分析.实验结果表明,在内存空间有限的情况下,DTRFP-GROWTH算法是一种有效的基于磁盘的关联规则挖掘算法.
As the size of the database to be mined is increasing constantly, the size of physical memory available has become a bottleneck when using FP-GROWTH algorithm for association rules mining. So, it is necessary to tackle space scalability by some new algorithms in order to mine association rules in huge database. Nowadays, disk-resident algorithm has become the main target. Therefore, the original data structure of FP-TREE is improved and a novel algorithm called DTRFP-GROWTH (disk table resident FP-TREE growth) is presented. This algorithm uses disk table for storing FP-TREE to decrease memory usage. When the mining works failed for FP-GROWTH using too much memory, DTRFP-GROWTH can continue to mine association rules from huge database by its special skill called disk table resident FP-TREE, which is suitable to occasions of space performance priority. In addition, this algorithm also integrates association rules mining with RDBMS system. It overcomes the problems of some relative solutions based on file system, such as low performance, high difficulty in development, etc. The correctness verification and performance analysis of the above mentioned algorithm are presented. The experiment results on real world data sets also support effectiveness of this algorithm. The study shows that in memory limited occasion, DTRFP-GROWTH is an effective association rules mining algorithm based on disk.