东篱科研大数据发现系统（DRDS）

位置：成果数据库 > 期刊 > 期刊详情页

基于网格密度和距离信息特征的聚类算法

时间：0
分类：TP311[自动化与计算机技术—计算机软件与理论;自动化与计算机技术—计算机科学与技术]
作者机构：[1]天津大学计算机科学与技术学院,天津300072, [2]天津大学软件学院,天津300072
相关基金：基金项目：国家自然科学基金资助项目（60603027）;天津市科技计划项目（08ZCKFGX01800,08ZCKFGX01600）
相关项目：基于信息几何方法的维数约减和信息抽象模型研究

关键词：聚类, 密度, 网格, 距离, 跃迁函数, clustering , density , grid , distance , transition function

中文摘要：

真实数据集通常密度分布不均，多数基于网格和密度的聚类算法采用的单调性搜索方法难以形成有效聚类．为此，文中提出了基于网格密度和距离信息特征的聚类算法（GDD）．该算法将数据空间划分成网格单元，并构建基于簇中心距离信息的跃迁函数，通过考察局域范围内网格单元的密度跃迁比，并比对计算出的当前网格单元的跃迁函数值，以决定是否继续扩展和增长聚类簇规模．具体的跃迁函数在真实和模拟集上的实验结果表明：GDD算法能够发现任意形状的簇，对噪音数据不敏感，且具有线性于网格数目的时间复杂性，适合对大规模真实数据集的聚类．

英文摘要：

When disposing of a real data set with skewed data distribution using most grid- and density-based clustering algorithms, effective clustering cannot be obtained due to the monotonic search employed in the algorithms. In order to solve this problem, a new clustering algorithm GDD based on grid density and distance is proposed. In GDD, the data space is divided into many grid cells and a transition function related to the distance from the current clustering center is constructed. Then, the density transition ratios of grid cells in the local area are compared with the computed transition function values of the current grid cell to determine whether the current cluster should be extended. Moreover, by using a transition function, some experiments are made with real and synthetic data sets. The results show that the proposed algorithm which is insensitive to noise data, can discover clusters with arbitrary shape, with a time complexity linear to grid number, and that the algorithm is suitable for the clustering of real large-scale data sets.

同期刊论文项目