对基于密度的分布式聚类算法DBDC(density based distributed clustering)进行改进,提出了一种基于密度的分布式聚类算法DBDC*.该算法在局部筛选代表点时结合贝叶斯信息准则BIC,得到少量精准反映局部站点数据分布的BIC核心点,有效降低了分布式聚类过程中的数据通信量,全局聚类时综合考虑了各站点数据的分布情况.实验结果表明,算法DBDC*的效率优于DBDC,聚类效果好.
A large number of data are distributed with the application of networks. Distributed clustering is a challenging research topic due to variety of the real-life constrains including bandwidth, the storage of the site memory, etc. An effective density-based distributed clustering algorithm (DBDC * ) is proposed to improve efficiency of the distributed clustering algorithm (DBDC). DBDC * , which is combined with the Bayesian Information Criterion, only selecting less BIC_ core_ points to represent each local site, effectively decrease network overload and improves the quality of global clustering. DBDC * is carried out on two different levels, i.e. the local level and the global level. On the local level, all sites carry out a DBSCAN clustering independently from each other. After having completed the clustering, a BIC core points local model is de/ermined. Next the local model is transferred to a central site, where the local models are merged in order to form a global model on the global level by analyzing the local BIC core points. To each local representatives a global cluster-identifier is assigned. This resulting global clustering is broadcasted to all local sites. Then all local models are updated. Experimental results show that the efficiency of the algorithm DBDC * is superior to that of the algorithm DBDC.