To address limited memory capacity when clustering large-scale datasets, this paper proposes a fast clustering algorithm constrained by the number of clusters. The algorithm scans the original dataset only once, and its radius threshold changes dynamically as clustering proceeds. An inter-cluster dissimilarity measure that incorporates the frequency of categorical attribute values is also defined, so the algorithm can be applied to datasets with mixed (numeric and categorical) attributes. Time and space complexity are approximately linear in both the dataset size and the number of attributes. Experimental results on the KDDCUP99 dataset show that the proposed algorithm requires few input parameters, has good clustering properties, and is suitable for large-scale datasets.
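The abstract does not give the exact dissimilarity formula or the rule by which the radius threshold changes, so the following Python sketch is only one plausible reading, not the paper's method. It assumes that numeric attributes are pre-scaled to [0, 1], that categorical attributes are compared by the total variation distance between their value-frequency distributions, and that the radius grows geometrically whenever the number of clusters exceeds the cap. The names `Cluster`, `dissimilarity`, `merge_close`, `one_pass_cluster`, `k_max`, `radius`, and `growth` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cluster:
    """Cluster summary: size, per-attribute numeric sums, categorical counts."""
    n: int
    num_sum: list
    cat_freq: list  # one Counter per categorical attribute

    @classmethod
    def from_record(cls, nums, cats):
        # A single record is treated as a cluster of size 1.
        return cls(1, list(nums), [Counter([c]) for c in cats])

    def merge(self, other):
        # Absorb another summary in place; no raw records are stored.
        self.n += other.n
        self.num_sum = [a + b for a, b in zip(self.num_sum, other.num_sum)]
        for mine, theirs in zip(self.cat_freq, other.cat_freq):
            mine.update(theirs)


def dissimilarity(a, b):
    """Assumed measure: mean per-attribute difference between two summaries."""
    total = 0.0
    for sa, sb in zip(a.num_sum, b.num_sum):
        # Numeric attribute: distance between centroids.
        total += abs(sa / a.n - sb / b.n)
    for fa, fb in zip(a.cat_freq, b.cat_freq):
        # Categorical attribute: total variation distance between
        # the two value-frequency distributions (0 = identical, 1 = disjoint).
        values = set(fa) | set(fb)
        total += 0.5 * sum(abs(fa[v] / a.n - fb[v] / b.n) for v in values)
    return total / (len(a.num_sum) + len(a.cat_freq))


def merge_close(clusters, radius):
    """Greedily merge each cluster into the first kept one within the radius."""
    merged = []
    for c in clusters:
        for m in merged:
            if dissimilarity(m, c) <= radius:
                m.merge(c)
                break
        else:
            merged.append(c)
    return merged


def one_pass_cluster(records, k_max, radius=0.1, growth=1.5):
    """Single scan over (nums, cats) records, keeping at most k_max clusters."""
    if k_max < 1 or radius <= 0 or growth <= 1:
        raise ValueError("need k_max >= 1, radius > 0 and growth > 1")
    clusters = []
    for nums, cats in records:
        point = Cluster.from_record(nums, cats)
        if clusters:
            nearest = min(clusters, key=lambda c: dissimilarity(c, point))
            if dissimilarity(nearest, point) <= radius:
                nearest.merge(point)
                continue
        clusters.append(point)
        # Too many clusters: enlarge the radius (assumed rule) and re-merge.
        while len(clusters) > k_max:
            radius *= growth
            clusters = merge_close(clusters, radius)
    return clusters, radius
```

Calling `one_pass_cluster(records, k_max=2)` on an iterable of `(numeric_tuple, categorical_tuple)` pairs returns the cluster summaries together with the final radius. Because only per-cluster summaries are held, memory in this sketch depends on `k_max`, the number of attributes, and the number of distinct categorical values rather than on the number of records.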