This paper proposes a sampling-based distributed K-Means clustering algorithm for the MapReduce framework, addressing the high time cost of running K-Means in parallel on massive datasets. The algorithm uses sampling to reduce the size of the dataset while preserving its data distribution, and optimizes the clustering procedure within the MapReduce framework. Experimental results show that the algorithm effectively shortens clustering time while maintaining good clustering quality, and that it achieves high execution efficiency and good scalability on large-scale datasets.
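The general idea of sampling first and then running K-Means as map and reduce steps can be illustrated with a minimal single-process sketch. The abstract does not specify the sampling scheme or the MapReduce optimizations, so everything here is an assumption made for illustration: uniform random sampling without replacement, random initial centroids, and the hypothetical names `sample_points`, `map_phase`, `reduce_phase`, and `kmeans_on_sample`. It is not the authors' implementation and does not use Hadoop.

```python
import random


def sample_points(points, fraction, seed=0):
    """Uniform random sample without replacement (assumed sampling scheme)."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * fraction))
    return rng.sample(points, size)


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    def dist2(i):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[i]))
    return min(range(len(centroids)), key=dist2)


def map_phase(split, centroids):
    """Map: emit (cluster index, point) for every point in one input split."""
    return [(nearest(p, centroids), p) for p in split]


def reduce_phase(pairs, centroids):
    """Reduce: average the points of each cluster to get new centroids."""
    sums, counts = {}, {}
    for idx, p in pairs:
        if idx in sums:
            sums[idx] = [s + x for s, x in zip(sums[idx], p)]
            counts[idx] += 1
        else:
            sums[idx] = list(p)
            counts[idx] = 1
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans_on_sample(points, k, fraction=0.1, n_splits=4, iters=20, seed=0):
    """Cluster a sample of the data; points are tuples of numbers."""
    sample = sample_points(points, fraction, seed)
    # Assumed initialization: k random sample points (needs len(sample) >= k).
    centroids = random.Random(seed).sample(sample, k)
    # Splits stand in for the input blocks handled by separate mappers.
    splits = [sample[i::n_splits] for i in range(n_splits)]
    for _ in range(iters):
        pairs = [kv for split in splits for kv in map_phase(split, centroids)]
        new_centroids = reduce_phase(pairs, centroids)
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids
```

In this sketch each iteration touches only the sampled points, which is where the time saving would come from. Whether the resulting centroids stay close to those of the full dataset depends on how well the sample preserves the data distribution.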