Clustering is a widely used data mining technique for analyzing relationships among data objects. It divides a set of objects into groups (clusters) such that objects in the same cluster are more similar to each other than to objects in other clusters. Density peaks clustering is a clustering algorithm recently published in Science that clusters data by evaluating two attributes of each object: its density value ρ and its δ value. Compared with traditional clustering algorithms, it has the advantages of being interactive, non-iterative, and free of assumptions about the data distribution. However, computing ρ and δ for every object requires measuring the distance between every pair of objects, at a cost of O(N^2) distance computations, which limits the algorithm's practicality on large, high-dimensional data sets. To improve efficiency and scalability, we propose EDDPC (efficient distributed density peaks clustering), a distributed density peaks clustering algorithm that uses Voronoi partitioning together with careful data replication and filtering to avoid a large amount of useless distance computation and data shuffling. Experimental results show that EDDPC improves performance by up to about 40x over a naive MapReduce implementation.
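To make the O(N^2) cost concrete, the following is a minimal single-machine sketch of the ρ and δ computation, not the EDDPC algorithm itself. It assumes the usual definitions from the original density peaks formulation, which the abstract does not spell out: ρ of an object is the number of other objects within a cutoff distance `d_c`, and δ is the distance to the nearest object of strictly higher density (or the largest distance from the object when no denser object exists). The function name `density_peaks_naive` and the parameter `d_c` are illustrative.

```python
import math


def density_peaks_naive(points, d_c):
    """Return (rho, delta) for `points` using all-pairs distances.

    Assumed definitions (not stated in the abstract):
      rho[i]   = number of other points closer than d_c to point i
      delta[i] = distance to the nearest point with strictly higher rho,
                 or the largest distance from point i if none is denser
    """
    n = len(points)
    # All-pairs Euclidean distances: this is the O(N^2) step.
    dist = [[math.dist(p, q) for q in points] for p in points]

    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]

    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Points tied for the highest density all take the fallback value.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
    rho, delta = density_peaks_naive(pts, d_c=1.5)
    print(rho, delta)
```

Objects with both large ρ and large δ are the candidate cluster centers. Both the distance matrix and the δ scan touch every pair of objects, and this is the work that a distributed variant has to partition or prune.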