结合Rough Set理论研究了分布式处理海量数据中的关键问题,即分割海量数据集的问题。经典的Rough Set算法要求数据常驻内存,因此不能有效地处理海量数据。为了能够直接处理海量数据集,根据最佳分割的定义,结合属性约简的思想,提出基于属性约简的粗糙集海量数据分割算法(Mass Data Partition for Rough Set on Attribute Reduction,MDPRS-AR)。通过实验表明,MDPRS-AR算法的分割效率比传统的算法约高70%,而且与处理整个数据集的算法相比,正确性损失不大。
A rough-set-based method is developed for a key problem in the distributed processing of massive data: partitioning a massive data set. Most existing rough-set-based algorithms are designed only for memory-resident data, so they cannot handle massive data sets effectively. Based on the definition of the best partition and combined with the idea of attribute reduction, an algorithm called Mass Data Partition for Rough Set on Attribute Reduction (MDPRS-AR) is proposed to process massive data sets directly. Simulation experiments show that MDPRS-AR is about 70% faster than the original rough-set-based algorithms, while its accuracy is close to that of algorithms that process the original data set as a whole.