K-means聚类算法是基于划分的经典聚类算法之一,因其简洁、高效得到了广泛的应用。K-means算法具有容易实现、时间和空间复杂度较小的优点。但该算法的初始聚类数K通常不能通过有效的手段事先确定,其初始聚类中心往往是随机选取的,易收敛于局部最优解,造成聚类结果的不准确。基于多叉树确定K值的动态K-means聚类算法是对传统算法的改进,力求在迭代过程中动态分裂合并簇来确定最合理的聚类数,并且能在一定程度上解决聚类结果收敛于局部最优解的问题。文中还探索了相应的数据模型以支持所改进算法的研究,并从横向与纵向两方面与二分K-means算法作了对比实验。实验结果表明,改进后的K-means算法不依赖于全局数据集,更适用于分布式平台运算;算法相对效率随着数据集规模的增大,特别是在洪量数据集下具有明显的优势。
K-means algorithm is the one of most classical clustering algorithms with repartition and has been widely used because it' s re- ally concise and efficient. What' s more,it has advantages such as being easy to be implemented and low cost of complexity in running time and storing space. However, it' s normally initial number called K-value which cannot be precisely predicted by effective method. The initial clustering center used to be chosen randomly, so that the result usually converges to local optimal solution, which makes the latest clustering results inaccurate. The dynamic clustering algorithm of K-means based on multi-branches tree to determine the K-value is an improved one. The improved algorithm has been designed to determine the most reasonable K-value by dynamically dividing and merging cluster during the iterative process and partly solved the problem that clustering result converges to local optimal solution. Fur- thermore, exploration for corresponding data structure model has also been conducted to the investigation of the algorithm mentioned. Hor- izontal and vertical comparison with the binary K-means algorithm has been achieved. The comparison and analysis results show that the improved K-means algorithm is independent of improved global data sets, which makes it more suitable for distributed computing plat- form and that relative efficiency has been increased with increase of the size of the data set,especially in magnanimity data set. Therefore the improved K-means algorithm has promoted the clustering performance and can lead to a more stable clustering result.