Processing massive data sets has long been an important problem in data mining. Many serial and parallel algorithms have been developed to handle massive data, but most of them cannot satisfactorily resolve the trade-off between speed and accuracy. Distributed computing offers clear advantages for data processing, so this paper partitions an original massive data set into many independent small subsets for distributed processing. Based on the characteristics of Rough Set theory, a definition of the best partition is first proposed, and a massive-data partition algorithm is then developed to find this best partition. Experiments show that the distributed processing scheme combined with the proposed partition algorithm handles massive data quickly, while its accuracy remains as good as that of algorithms processing the entire data set.
Processing huge data sets is an important topic in data mining. Although many serial and parallel algorithms have been developed to deal with huge data sets, most of them cannot satisfactorily resolve the conflict between speed and accuracy. In this paper, the whole huge data set is partitioned into many small subsets to exploit the advantages of distributed computing. First, a definition of the best partition is proposed. Then, a rough-set-based partition algorithm is developed to find this best partition. Experimental results show that the distributed information-processing method based on the rough-set-based partition algorithm is effective for huge data sets: it is faster than the original rough-set-based algorithms, and its accuracy is as good as that of algorithms processing the original data set as a whole.
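To make the overall scheme concrete, the following is a minimal Python sketch of the distributed workflow the abstract describes: split the data into small subsets and process each subset in a separate worker. The round-robin split and the `process_subset` placeholder are illustrative assumptions only; the paper's actual algorithm instead searches for the best partition using rough-set criteria, which is not reproduced here.

```python
from concurrent.futures import ProcessPoolExecutor

def partition_dataset(records, k):
    """Split records into k roughly equal subsets.

    Assumption: a naive round-robin split stands in for the paper's
    rough-set-based search for the best partition.
    """
    subsets = [[] for _ in range(k)]
    for i, rec in enumerate(records):
        subsets[i % k].append(rec)
    return subsets

def process_subset(subset):
    """Placeholder for a rough-set learner run on one subset
    (e.g., computing a reduct or decision rules)."""
    return len(subset)  # stand-in result

if __name__ == "__main__":
    data = list(range(100_000))  # stand-in for a huge data set
    with ProcessPoolExecutor() as pool:
        # Each subset is handled by an independent worker process,
        # mirroring the distributed processing described above.
        results = list(pool.map(process_subset, partition_dataset(data, 8)))
    print(results)
```

Because the subsets are processed independently, the workers need no communication during learning; the quality of the final result then hinges entirely on how well the partition step preserves the structure of the original data set, which is the problem the proposed algorithm addresses.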