Existing similarity definitions for categorical variables have shortcomings, so a new similarity definition is proposed to address them. Using this definition, a data set is abstracted as an undirected graph, and the clustering process is reduced to finding the connected components of that graph. On this basis, a clustering algorithm for categorical variables based on connected components is proposed. To analyze the clustering results quantitatively, a new evaluation index is proposed for data sets whose class labels are known. Experimental results show that the proposed algorithm achieves higher clustering precision and faster execution than several existing algorithms.
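The reduction from clustering to connected components can be illustrated with a minimal sketch. The similarity definition proposed in the paper is not given in this abstract, so the sketch assumes a plain simple-matching similarity (the fraction of attributes on which two records agree) and a hypothetical `threshold` parameter that decides when two records are joined by an edge. The function names `simple_matching` and `cluster_by_components` are illustrative only and are not taken from the paper.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of attributes on which two records agree.

    Assumed stand-in; the paper's own similarity definition is not shown here.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_components(records, threshold, similarity=simple_matching):
    """Group record indices into the connected components of a similarity graph.

    Two records share an edge when their similarity is >= threshold
    (the threshold is a hypothetical parameter of this sketch).
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the endpoints of every edge of the implicit undirected graph.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect indices by root; each group is one connected component.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "medium", "flat"),
]
clusters = cluster_by_components(records, threshold=0.6)
print(clusters)
```

With the toy records above, records 0 and 1 agree on two of three attributes, as do records 2 and 3, so each pair forms one component and record 4 stays alone. Lowering the threshold adds edges and merges components, which is why the choice of similarity definition matters so much for this kind of algorithm. The pairwise loop is quadratic in the number of records; it is used here for clarity, not as a claim about the efficiency of the published algorithm.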