Knowledge reduction is an important step of knowledge acquisition in data mining applications. Classical knowledge reduction algorithms load a small dataset into main memory in one pass to perform reduction, while traditional parallel knowledge reduction exploits only task parallelism to improve efficiency; neither approach can handle massive data. By analyzing classical knowledge reduction algorithms, indiscernibility object pairs are constructed, knowledge reduction algorithms that preserve the boundary region partition are proposed, and the relationships among these boundary-region-preserving reduction algorithms are discussed. The feasibility of simultaneous data and task parallelism in knowledge reduction is analyzed in depth, a framework model for boundary-region-preserving knowledge reduction in a cloud computing environment is proposed, and a cloud computing environment is built on the Hadoop platform for the corresponding experiments. The experimental results show that the proposed knowledge reduction algorithms can process massive datasets.
Knowledge reduction in rough set theory is a critical step of knowledge acquisition in data mining applications. Classical knowledge reduction algorithms assume the entire dataset can be loaded into main memory, while existing parallel knowledge reduction algorithms only execute reduction tasks concurrently; both are infeasible for large-scale datasets. Massive, high-dimensional data makes attribute reduction a challenging task. To address this problem, the concept of indiscernibility object pairs is defined and a new knowledge reduction algorithm that preserves the boundary region partition is proposed, and the relationships among these algorithms are analyzed in detail. Then, strategies for simultaneous data and task parallelism are discussed and implemented, and a corresponding framework model for boundary-region-preserving attribute reduction in cloud computing is presented. Experimental results on the Hadoop platform demonstrate that the proposed knowledge reduction algorithms can efficiently process massive datasets in a cloud computing environment.
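To make the boundary-region-preserving criterion concrete, the following is a minimal, non-distributed Python sketch of the underlying rough set notions: it groups objects into indiscernibility classes, derives the boundary region, enumerates indiscernible object pairs, and checks whether a candidate attribute subset leaves the boundary region unchanged. The toy decision table, attribute indices, and helper names are illustrative assumptions and not the paper's implementation; in the proposed framework this computation would be partitioned across data blocks and executed as parallel tasks on Hadoop.

```python
from collections import defaultdict

# Toy decision table (illustrative only): object -> (condition attribute values, decision).
objects = {
    'x1': ((1, 0, 0), 'yes'),
    'x2': ((1, 0, 0), 'no'),
    'x3': ((0, 1, 0), 'yes'),
    'x4': ((0, 1, 1), 'yes'),
    'x5': ((1, 1, 0), 'no'),
}

def indiscernibility_classes(objs, attrs):
    """Group objects that agree on every condition attribute index in `attrs`."""
    classes = defaultdict(set)
    for name, (cond, _) in objs.items():
        classes[tuple(cond[i] for i in attrs)].add(name)
    return list(classes.values())

def boundary_region(objs, attrs):
    """Objects whose indiscernibility class contains conflicting decisions,
    i.e. the boundary region that the reduction is required to preserve."""
    bnd = set()
    for cls in indiscernibility_classes(objs, attrs):
        if len({objs[name][1] for name in cls}) > 1:  # inconsistent class
            bnd |= cls
    return bnd

def indiscernible_object_pairs(objs, attrs):
    """Pairs of distinct objects that no attribute in `attrs` can tell apart;
    a simple stand-in for the indiscernibility object pairs of the abstract."""
    pairs = set()
    for cls in indiscernibility_classes(objs, attrs):
        members = sorted(cls)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                pairs.add((a, b))
    return pairs

if __name__ == '__main__':
    all_attrs = [0, 1, 2]
    full_bnd = boundary_region(objects, all_attrs)
    print('boundary region with all attributes:', sorted(full_bnd))
    print('indiscernible pairs:', sorted(indiscernible_object_pairs(objects, all_attrs)))
    # A candidate reduct must keep the boundary region partition unchanged.
    for candidate in ([0, 1], [0, 2], [1, 2]):
        preserved = boundary_region(objects, candidate) == full_bnd
        print(f'attributes {candidate} preserve boundary region: {preserved}')
```

On this toy table, dropping the third attribute leaves the boundary region intact, whereas dropping either of the first two does not, which is exactly the kind of check a boundary-region-preserving reduction performs when deciding whether an attribute is dispensable.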