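Incomplete data imputation is an important research topic in data preprocessing. Traditional imputation algorithms have high time complexity and lack distributed processing capability, so they cannot meet the requirement of fast data processing in big data environments. This paper proposes an incomplete data imputation algorithm based on distributed subtractive clustering. The algorithm first clusters the entire data set with an improved subtractive clustering algorithm. To improve clustering efficiency, cloud computing technology is used to optimize the clustering algorithm, yielding a distributed subtractive clustering algorithm based on multi-level MapReduce. The missing values are then filled in according to the clustering results and a weighted distance, which substantially reduces the processing time of the filling process while maintaining imputation accuracy. Experimental results show that the proposed method can cluster big data quickly while effectively preserving the accuracy of the imputed missing values.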
Incomplete data imputation is an important problem in data analysis and preprocessing. Existing imputation algorithms have high time complexity and are not designed for distributed processing, so they cannot meet the processing requirements of big data environments. This paper proposes a novel imputation algorithm for incomplete data based on distributed subtractive clustering. The algorithm first clusters the incomplete data directly using a newly designed similarity metric; cloud computing technology is then used to improve clustering efficiency through a multi-level MapReduce-based distributed clustering algorithm. Next, the data objects in the same cluster as the target object, together with a weighted distance, are used to fill in the missing values. The proposed algorithm significantly reduces the processing time of the filling process while preserving imputation accuracy. Experiments demonstrate that the proposed algorithm can cluster incomplete big data directly and effectively maintain the accuracy of the filled-in values.
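The filling step summarized above can be illustrated with a minimal single-machine sketch. It is not the paper's algorithm: it assumes the cluster labels have already been produced by some clustering step, and the function names `partial_distance` and `impute`, the partial Euclidean distance, and the inverse-distance weighting are all illustrative assumptions, since the abstract does not specify the similarity metric or the weighting scheme.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over attributes observed in both rows (assumed metric).

    The squared sum is rescaled to the full dimensionality so that rows with
    few shared attributes are not favoured. Returns None if nothing is shared.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def impute(rows, labels, eps=1e-9):
    """Fill each missing value (None) from rows in the same cluster.

    Each donor row contributes its value for the missing attribute, weighted
    by the inverse of its partial distance to the target row (assumed scheme).
    Values with no usable donor are left as None.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            numerator = denominator = 0.0
            for k, other in enumerate(rows):
                # Donors must be a different row, in the same cluster,
                # and must have the attribute observed.
                if k == i or labels[k] != labels[i] or other[j] is None:
                    continue
                d = partial_distance(row, other)
                if d is None:
                    continue
                weight = 1.0 / (d + eps)
                numerator += weight * other[j]
                denominator += weight
            if denominator > 0:
                filled[i][j] = numerator / denominator
    return filled


if __name__ == "__main__":
    data = [[0.0, 0.0], [2.0, 10.0], [1.0, None], [50.0, 80.0]]
    clusters = [0, 0, 0, 1]
    print(impute(data, clusters))
```

In this sketch the row in cluster 1 never influences the fill for cluster 0, which is the point of clustering before imputing: only nearby, similar objects act as donors. In a MapReduce setting the per-cluster loops would be natural units of parallel work, but that distribution is not shown here.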