Clustering is one of the most important families of algorithms in machine learning and has been studied in depth and applied widely for many years. Research on clustering centers on grouping data according to a similarity distance. Among these methods, the Kmeans algorithm is especially widely used and is included in many data mining software packages. The traditional Kmeans algorithm, however, cannot meet the demands of today's big data applications. This paper uses Spark to redesign Kmeans as a parallel algorithm and then optimizes it.
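
To illustrate why Kmeans lends itself to parallelization, the following is a minimal sketch in plain Python of one iteration written in a map/reduce style. It is not the design proposed in this paper and does not use the Spark API; the names `closest` and `kmeans_step`, the use of Euclidean distance, and the choice to keep an empty cluster's old center are all assumptions made for illustration. Each partition summarizes its own points independently, and only the small per-cluster summaries are merged, which is the general pattern a distributed framework can exploit.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)


def closest(point, centers):
    """Return the index of the center nearest to `point` (hypothetical helper)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans_step(partitions, centers):
    """Run one K-means iteration over data already split into partitions."""
    dim = len(centers[0])

    # "Map" side: each partition independently builds per-cluster
    # [coordinate sums, point count]; no partition reads another's points.
    partials = []
    for part in partitions:
        acc = defaultdict(lambda: [[0.0] * dim, 0])
        for p in part:
            entry = acc[closest(p, centers)]
            entry[0] = [s + x for s, x in zip(entry[0], p)]
            entry[1] += 1
        partials.append(acc)

    # "Reduce" side: merge the small per-partition summaries.
    total = defaultdict(lambda: [[0.0] * dim, 0])
    for acc in partials:
        for k, (sums, n) in acc.items():
            total[k][0] = [a + b for a, b in zip(total[k][0], sums)]
            total[k][1] += n

    # New center = mean of assigned points.
    # Assumption: an empty cluster keeps its previous center.
    return [
        [s / total[k][1] for s in total[k][0]] if k in total else list(centers[k])
        for k in range(len(centers))
    ]
```

In a full run, `kmeans_step` would be repeated until the centers stop moving or an iteration limit is reached; in a distributed setting only the current centers need to be sent to every partition on each pass.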