Real data sets often have unevenly distributed densities, and the monotonic search used by most grid- and density-based clustering algorithms then fails to produce effective clusters. To address this problem, this paper proposes GDD, a clustering algorithm based on grid density and distance information. GDD partitions the data space into grid cells and constructs a transition function based on the distance to the cluster center. It then examines the density transition ratios of grid cells within a local neighborhood and compares them with the transition function value computed for the current grid cell to decide whether to keep extending the current cluster. Experiments with a concrete transition function on real and synthetic data sets show that GDD discovers clusters of arbitrary shape, is insensitive to noise, and has time complexity linear in the number of grid cells, which makes it suitable for clustering large-scale real data sets.
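The expansion rule described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the form of the transition function, so `transition`, its parameters `t0`, `slope`, and `t_max`, the noise floor `min_density`, and the choice of the densest unlabeled cell as cluster center are all assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import product
import math


def grid_density(points, cell_size):
    """Count points per grid cell; a cell is a tuple of integer indices."""
    return Counter(tuple(math.floor(x / cell_size) for x in p) for p in points)


def transition(dist, t0=0.2, slope=0.1, t_max=0.8):
    """Hypothetical transition function (assumed form, not from the paper):
    the minimum density ratio a neighbor must reach, rising with the grid
    distance from the cluster center and capped at t_max."""
    return min(t_max, t0 + slope * dist)


def neighbors(cell):
    """Yield all cells adjacent to `cell`, including diagonal ones."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            yield tuple(c + o for c, o in zip(cell, offset))


def gdd_sketch(points, cell_size, min_density=2):
    """Label grid cells with cluster ids; sparse cells stay unlabeled (noise)."""
    density = grid_density(points, cell_size)
    labels = {}
    next_label = 0
    # Assumption: the densest cell that is still unlabeled seeds a new cluster.
    for seed, dens in density.most_common():
        if seed in labels or dens < min_density:
            continue
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            cur = queue.popleft()
            for nb in neighbors(cur):
                if nb in labels or density.get(nb, 0) < min_density:
                    continue
                # Density transition ratio between neighbor and current cell.
                ratio = density[nb] / density[cur]
                # Extend the cluster only if the ratio reaches the threshold
                # given by the transition function at this distance.
                if ratio >= transition(math.dist(seed, nb)):
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return labels
```

Because the test compares a ratio against a distance-dependent threshold, a neighbor denser than the current cell can still join the cluster, so the search is not restricted to monotonically decreasing density. Each occupied cell is labeled at most once and inspects a fixed number of neighbors for a given dimension, although the initial ordering of cells by density in this sketch adds a sorting step.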