

Asyn-SimRank: A Large-Scale SimRank Algorithm That Can Be Executed Asynchronously

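ISSN: 1000-1239
Journal: 计算机研究与发展 (Journal of Computer Research and Development)
Date: 2015.7.15
Pages: 1567-1579
Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: School of Computer Science and Engineering, Northeastern University, Shenyang 110819
Funding: National Natural Science Foundation of China (61300023, 61528203, 61272179); Fundamental Research Funds for the Central Universities (N141605001, N120816001)
Related project: Research on Hadoop-based distributed parallel online analytical processing technology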

Keywords: density peaks, data clustering, Voronoi partition, MapReduce, big data

Abstract (translated from the Chinese):

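Cluster analysis is a method frequently used in data mining to analyze relationships among data. It partitions a set of data objects into several distinct groups, or clusters, such that the objects within a cluster are more similar to one another than to objects in other clusters. Density peaks clustering is a novel clustering algorithm recently published in Science; it clusters by evaluating two attribute values of each data object, the density value ρ and the repulsion value δ. Compared with traditional clustering algorithms, its advantages lie in interactivity, the absence of iteration, and independence from the data distribution. However, computing the density value and the repulsion value of every data object requires distance computations of O(N^2) complexity, so the algorithm's efficiency suffers greatly when it processes massive high-dimensional data. To improve its efficiency and scalability, an efficient distributed density peaks clustering algorithm, EDDPC (efficient distributed density peaks clustering), is proposed. It uses Voronoi partitioning together with judicious data replication and filtering to avoid a large amount of useless distance computation overhead and data transfer overhead. Experimental results show that, compared with a simple distributed MapReduce implementation, EDDPC achieves a performance improvement of about 40x.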
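The O(N^2) cost mentioned in the abstract can be made concrete with a minimal single-machine sketch. It assumes the usual formulation of density peaks clustering: ρ is the number of other points within a cutoff distance, and δ is the distance to the nearest point of higher density. The function name `density_peaks_naive` and the parameter name `d_c` are illustrative choices and do not come from this record.

```python
import math

def density_peaks_naive(points, d_c):
    """Return (rho, delta) lists computed from all pairwise distances."""
    n = len(points)
    # All-pairs distance matrix: this is the O(N^2) step.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points closer to point i than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances from point i to every strictly denser point.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser point gets its largest distance instead.
        # Simplification: points tied for the top density all take this branch.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Points with both a large ρ and a large δ are the candidate cluster centers. Every point needs distances to every other point, which is what makes a naive distributed version expensive in both computation and data shuffling.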

English abstract:

Clustering is a commonly used method for analyzing relationships among data in data mining. A clustering algorithm divides a set of objects into several groups (clusters) such that the data objects in the same group are more similar to each other than to those in other groups. Density peaks clustering is a recently proposed clustering algorithm published in Science, which performs clustering based on each data object's ρ value and δ value. Its advantages over traditional clustering algorithms are its interactivity, its non-iterative process, and the absence of assumptions about the data distribution. However, computing each data object's ρ and δ values requires measuring the distance between every pair of objects, at a computational cost of O(N^2). This limits the practicality of the algorithm when clustering high-volume, high-dimensional data sets. To improve efficiency and scalability, we propose an efficient distributed density peaks clustering algorithm, EDDPC, which leverages the Voronoi diagram and careful data replication and filtering to eliminate a large amount of useless distance measurement cost and data shuffle cost. Our results show that EDDPC improves performance significantly (up to 40x) compared with a naive MapReduce implementation.
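The Voronoi partitioning idea can be illustrated with a small sketch that groups points by their nearest pivot, so that each cell could be handed to a separate worker. This is only the partitioning step under stated assumptions: the function name `voronoi_partition` is hypothetical, the pivots are supplied by the caller, and the replication and filtering rules that EDDPC applies across cell boundaries are not described in the abstract and are not reproduced here.

```python
import math
from collections import defaultdict

def voronoi_partition(points, pivots):
    """Group points by the index of their nearest pivot (their Voronoi cell)."""
    cells = defaultdict(list)
    for p in points:
        # Index of the closest pivot; ties go to the lowest index.
        nearest = min(range(len(pivots)), key=lambda k: math.dist(p, pivots[k]))
        cells[nearest].append(p)
    return dict(cells)

# Usage: two pivots split four points into two cells.
cells = voronoi_partition([(0, 0), (1, 1), (9, 9), (10, 10)], [(0, 0), (10, 10)])
```

In a MapReduce setting, the cell index would plausibly serve as the key emitted by the map step, so that points in the same cell meet at the same reducer. Points near a cell boundary may have relevant neighbors in adjacent cells, which is presumably why the abstract pairs the partitioning with data replication and filtering.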
