With the continuing development of information technology, cloud computing, the Internet, and social networks, data volumes are growing explosively. While massive data sets carry rich information, preprocessing them efficiently has become a research focus, and handling missing data is one of the major challenges in data preprocessing. Most traditional filling (imputation) methods consider only the case in which values in an incomplete data set are completely missing. In massive data sets, however, human or mechanical factors damage data to varying degrees: some values are completely missing, while others are only partially missing. Traditional methods do not distinguish between these degrees of damage; they treat every damaged value as completely missing and thereby ignore the effect that partially missing data has on the filling result. This paper therefore proposes a filling method based on generalized center clustering (GCF). The method clusters the data using the idea of generalized center clustering and then fills the missing values using the randomly damaged data together with the clustering results, in order to improve the accuracy of the filled data set. Experiments show that the proposed GCF strategy achieves good filling accuracy on data set samples with different degrees of missingness.
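The general idea of cluster-then-fill can be illustrated with a minimal sketch. This is not the GCF algorithm itself: the abstract does not specify how generalized center clustering works or how partially missing values are represented, so the sketch substitutes a plain k-means-style loop that measures distance over observed attributes only, and it treats every damaged value as fully missing (`None`). The names `partial_distance`, `column_means`, and `cluster_fill`, the seeding rule, and the toy data are all assumptions made for illustration.

```python
import math


def partial_distance(row, center):
    """Mean squared difference over the attributes observed in `row`."""
    pairs = [(x, c) for x, c in zip(row, center) if x is not None]
    if not pairs:
        return math.inf  # nothing observed: equally far from every center
    return sum((x - c) ** 2 for x, c in pairs) / len(pairs)


def column_means(rows, n_cols, fallback):
    """Per-attribute mean of observed values; fallback[j] if none observed."""
    means = []
    for j in range(n_cols):
        vals = [r[j] for r in rows if r[j] is not None]
        means.append(sum(vals) / len(vals) if vals else fallback[j])
    return means


def cluster_fill(data, k, n_iter=20):
    """Cluster rows on observed attributes, then fill gaps from cluster centers.

    Stand-in for the paper's method (assumption): ordinary k-means-style
    updates, not generalized center clustering.
    """
    n_cols = len(data[0])
    # Assumption: seed the centers with the first k fully observed rows.
    complete = [r for r in data if all(v is not None for v in r)]
    if len(complete) < k:
        raise ValueError("need at least k fully observed rows to seed centers")
    centers = [list(r) for r in complete[:k]]

    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assign each row to the nearest center using observed attributes only.
        labels = [
            min(range(k), key=lambda c: partial_distance(row, centers[c]))
            for row in data
        ]
        # Recompute each center from the observed values of its members.
        new_centers = []
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            new_centers.append(column_means(members, n_cols, centers[c]))
        if new_centers == centers:
            break
        centers = new_centers

    # Fill each missing entry with its cluster center's value for that attribute.
    filled = [
        [centers[lab][j] if v is None else v for j, v in enumerate(row)]
        for row, lab in zip(data, labels)
    ]
    return filled, labels


# Toy data (hypothetical): two well-separated groups with two missing entries.
data = [
    [1.0, 2.0], [1.2, None], [9.0, 10.0],
    [None, 10.4], [0.8, 1.8], [9.2, 9.8],
]
filled, labels = cluster_fill(data, k=2)
```

Filling from the center of the assigned cluster, rather than from the global mean, is the reason for clustering first: a missing attribute is estimated from rows that resemble the damaged row on the attributes it still has. A method that also exploits partially missing values, as the abstract describes, would need an additional representation for the surviving part of a damaged value, which this sketch does not attempt.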