To address the inability of centralized clustering algorithms to handle massive data sets, a distributed K-Means clustering algorithm is proposed in which the confidence radius is determined by the Fisher discriminant ratio at local nodes. The computing power, storage capacity, and bandwidth of the nodes in a P2P network are used to spread the time and space costs of clustering across the nodes. Fisher linear discriminant analysis is applied at each node to separate the dense and sparse distributions of the data within the same cluster. The resulting ratio is used to determine the confidence radius quickly and to guide the next clustering step, so that the distributed clustering is sped up while clustering accuracy is maintained. The algorithm was evaluated by numerical simulation and by experiments on real data. The results show that, compared with the DFEKM clustering algorithm, the proposed algorithm achieves a good balance between clustering accuracy and clustering speed under different data distributions, which indicates that it is more robust.
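The abstract does not spell out how the Fisher discriminant ratio is turned into a confidence radius, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that a node sorts the distances of its local cluster members to the cluster centroid, scores every split into an inner (dense) and outer (sparse) group with the two-class Fisher ratio J = (m1 - m2)^2 / (s1^2 + s2^2), and takes the best-scoring split point as the radius. The function names `fisher_ratio` and `confidence_radius` and the parameter `min_group` are hypothetical.

```python
import math
from statistics import mean, pvariance


def fisher_ratio(inner, outer):
    """Two-class Fisher discriminant ratio for 1-D samples:
    squared difference of means over the sum of variances."""
    between = (mean(inner) - mean(outer)) ** 2
    within = pvariance(inner) + pvariance(outer)
    # Zero within-class scatter means the groups are perfectly separated.
    return between / within if within > 0 else math.inf


def confidence_radius(points, centroid, min_group=2):
    """Assumed rule (not from the paper): choose the radius that best
    separates dense inner distances from sparse outer distances."""
    if not points:
        raise ValueError("points must not be empty")
    dists = sorted(math.dist(p, centroid) for p in points)
    # Fallback when there are too few points to split: cover everything.
    best_ratio, best_radius = -1.0, dists[-1]
    for i in range(min_group, len(dists) - min_group + 1):
        ratio = fisher_ratio(dists[:i], dists[i:])
        if ratio > best_ratio:
            best_ratio = ratio
            # Place the radius midway between the two groups.
            best_radius = (dists[i - 1] + dists[i]) / 2
    return best_radius


# Illustrative data: a tight core around the origin plus two outliers.
local_points = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1),
                (3.0, 0.0), (0.0, 4.0)]
radius = confidence_radius(local_points, centroid=(0.0, 0.0))
```

For this toy data the chosen radius falls in the gap between the dense core and the two distant points. In a distributed setting each node could, under the same assumption, report only its centroid and radius rather than its raw data, which is one way the per-node cost sharing described in the abstract might be realized.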