为提高层次凝聚聚类(HAC)算法的执行效率和结果质量,对其进行了动态分析,研究了一次合并对后续合并的影响。分析表明,合并两个类会生成一个新类,并使被合并的类的共享邻居的邻居数减小1;当新生成的类或邻居数减小的类参与后续合并时,会影响执行效率;一次合并会改变参与合并的类和它们的候选邻居之间的准则函数值,从而影响后续合并提高质量的程度。基于上述分析并结合模块性的定义,研究了现有准则函数对凝聚过程的影响以及它们的缺陷,并设计了两个新的准则函数。在大量数据集上的买验表明,新的准则函数提高了层次凝聚聚类算法的执行效率和结果质量。
To improve the efficiency and the result quality of the hierarchical agglomerative clustering (HAC) algorithm, its dynamic analysis was conducted and the problem that how a merge influences the subsequent merges was stud- ied, with the conclusions below : merging two clusters generates a new cluster, and reduces the number of neighbors of the shared neighbors of the two clusters; The new cluster and those clusters whose number of neighbors are de- creased will be involved in the subsequent merges, and the efficiency will be influenced; A merge changes the val- ue of the criterion function over the involved clusters and their candidate neighbors, and thus influences the quality of the subsequent merges. Based on the above analyses and the definition of modularity, existing criterion func- tions' influence on agglomeration and their limitations were investigated, and two new criterion functions were de- signed. The results of the experiments conducted based on many datasets show that the new criterion functions can improve the efficiency and the result quality of the HAC algorithm.