The number of clusters directly determines the quality of a clustering result, yet classical algorithms such as K-means offer no adequate theory for choosing it; it is usually set from experience or by trial and error. This demands considerable human-computer interaction and trial computation, and because the optimal number of clusters is often hard to obtain, the accuracy of the clustering result suffers. This paper proposes an algorithm, ADNC (adaptively determining the number of clusters), that approaches the optimal number of clusters adaptively. The approach is a process of repeated clustering. After each iteration, the average standard deviation of each image feature value (each component of the input vector) is evaluated over the output clusters, and these values are combined into a multi-feature composite error. The number of clusters is then adjusted following the principle of gradient descent, so that the composite error decreases step by step as the optimal number of clusters is approached. The optimal number of clusters generally lies at the position immediately before the composite error begins to oscillate. Running K-means with this number of clusters minimizes the heterogeneity of feature values within each class and gives satisfactory clustering results. The paper also introduces the concept of an inappropriate number of clusters, namely the number of clusters likely to maximize the clustering error. Experiments show that both the optimal and the inappropriate numbers of clusters are of practical value for improving clustering accuracy.
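The iteration described above can be illustrated with a minimal sketch. It is not the ADNC implementation itself, and several details are assumptions made only for illustration: the function names (`kmeans`, `composite_error`, `adnc_sketch`) are hypothetical, the composite error is taken to be the plain mean over features of the per-cluster standard deviations, the number of clusters is adjusted one step at a time, and the first step at which the error fails to decrease is treated as the onset of oscillation.

```python
from statistics import fmean, pstdev


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means with a deterministic, evenly spaced initialisation."""
    n = len(points)
    centroids = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for step in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if step > 0 and new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(fmean(col) for col in zip(*members))
    return labels


def composite_error(points, labels, k):
    """Assumed composite error: mean over features of the mean per-cluster std."""
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    clusters = [members for members in clusters if members]  # skip empty clusters
    per_feature = [
        fmean([pstdev([p[f] for p in members]) for members in clusters])
        for f in range(len(points[0]))
    ]
    return fmean(per_feature)


def adnc_sketch(points, k_start=2, k_max=10):
    """Increase k while the composite error keeps falling; return the last such k."""
    k = k_start
    history = {k: composite_error(points, kmeans(points, k), k)}
    while k < k_max:
        history[k + 1] = composite_error(points, kmeans(points, k + 1), k + 1)
        if history[k + 1] >= history[k]:
            break  # error stopped decreasing: assumed onset of oscillation
        k += 1
    return k, history
```

Calling `adnc_sketch(points, k_start, k_max)` on a list of equal-length numeric tuples returns the selected number of clusters together with the error recorded for every number of clusters tried, so the descent and the point at which it stops can be inspected. The value of `k_max` should not exceed the number of points.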