An Improved Shared Nearest Neighbor Clustering Algorithm

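Journal: Computer Engineering and Applications (计算机工程与应用)
Date: 2011.3.3
Pages: 138-142
Classification: TP393 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006
Funding: National Natural Science Foundation of China (Grant No. 61070061)
Related project: Research on learning algorithms for imbalanced data and their applications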

Authors: 李霞 (Li Xia), 蒋盛益 (Jiang Shengyi)

Keywords: shared nearest neighbor clustering algorithm, one-pass clustering algorithm, large dataset

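Abstract (translated from Chinese):

Clustering is an unsupervised machine learning method whose task is to discover the natural clusters in data. The shared nearest neighbor (SNN) clustering algorithm clusters well on datasets whose clusters differ in size, shape, and density, but it has the following shortcomings: (1) its time complexity is O(n^2), which makes it unsuitable for large datasets; (2) it gives no simple, practical guidance for choosing its parameter thresholds; (3) it can only handle datasets with numeric attributes. This paper improves the SNN algorithm so that it can handle mixed-attribute datasets and gives a simple method for selecting the parameter thresholds. The running time of the improved algorithm is approximately linear in the dataset size, making it suitable for large, high-dimensional datasets. Experimental results on real and synthetic datasets show that the proposed algorithm is effective and feasible.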
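To make the O(n^2) cost concrete, the following is a minimal sketch of the shared-nearest-neighbor similarity that SNN-style clustering builds on. It uses the commonly cited definition (two points are similar in proportion to the neighbors they share, provided each appears in the other's k-nearest-neighbor list). This is an illustration under that assumption, not the paper's improved algorithm; the function names `k_nearest` and `snn_similarity` and the sample points are hypothetical.

```python
import math

def k_nearest(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Brute-force scan over all other points: repeating this for every
        # point is the O(n^2) step that limits plain SNN on large datasets.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_similarity(neighbors, i, j):
    """Number of shared neighbors if i and j list each other, else 0."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0

# Two well-separated squares of four points each (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
neighbors = k_nearest(points, k=3)

print(snn_similarity(neighbors, 0, 1))  # points in the same square
print(snn_similarity(neighbors, 0, 4))  # points in different squares
```

Points in the same square share neighbors and get a positive similarity, while points in different squares get zero because neither appears in the other's neighbor list.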

English abstract:

Clustering is an unsupervised learning method in machine learning, the typical task of which is to discover "natural" clusters present in the data. The shared nearest neighbor (SNN) algorithm is one of the most effective clustering algorithms and can handle datasets of different sizes, shapes, and densities. However, the algorithm still has some shortcomings. SNN cannot handle large datasets because of its high complexity. There is no definite method for choosing the algorithm's thresholds. SNN cannot process datasets with mixed attributes. This paper improves the SNN algorithm to process data with categorical attributes and gives a simple, definite method for selecting the algorithm's thresholds. The time complexity of the improved algorithm is nearly linear in the size of the dataset, so it can be applied to large datasets. The experimental results on real and synthetic datasets show that the improved algorithm is effective and practicable.
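The abstract does not say how distances are measured on mixed-attribute data. As one common possibility, and only as an assumption for illustration, the sketch below combines squared differences on numeric fields with a 0/1 mismatch on categorical fields. The function `mixed_distance`, its `categorical` parameter, and the sample records are hypothetical and are not taken from the paper.

```python
import math

def mixed_distance(a, b, categorical):
    """Distance between two records with numeric and categorical fields.

    `categorical` is the set of field positions treated as categorical;
    all other positions are assumed to hold numbers on comparable scales.
    """
    total = 0.0
    for idx, (x, y) in enumerate(zip(a, b)):
        if idx in categorical:
            # Categorical fields contribute 0 when equal and 1 when different.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric fields contribute their squared difference.
            total += (x - y) ** 2
    return math.sqrt(total)

# Made-up records: two numeric fields followed by one categorical field.
r1 = (0.2, 0.5, "tcp")
r2 = (0.2, 0.1, "udp")
print(mixed_distance(r1, r2, categorical={2}))
```

A function of this shape could stand in for the Euclidean distance in a nearest-neighbor search, which is one way an SNN-style method might be extended beyond purely numeric data.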
