The K-medoids algorithm selects its initial cluster centers at random, which makes the quality of the clustering results sensitive to that selection. To address this problem, this paper improves the algorithm using the principle of tag co-occurrence. Tags are first clustered according to their co-occurrence frequency and similarity; based on the resulting tag clusters, K resources represented by those clusters are then selected as the initial cluster center nodes. Optimizing the choice of cluster centers in this way reduces the randomness of sampling-based selection. Finally, the algorithm is parallelized with the MapReduce framework, and experiments on tag data from Douban Books verify its practicability.
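The seeding idea summarized above can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not specify the similarity measure, the tag clustering procedure, or how a representative resource is chosen, so the Jaccard-style similarity, the average-linkage merging, the best-match seeding rule, and the toy data below are all assumptions made for illustration. The MapReduce parallelization is not shown.

```python
from collections import Counter
from itertools import combinations

def jaccard_distance(a, b):
    """Distance between two tag sets (assumed measure, not from the paper)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def tag_similarity(resources):
    """Build a tag-to-tag similarity from co-occurrence counts (assumed form)."""
    freq = Counter(t for tags in resources.values() for t in tags)
    co = Counter()
    for tags in resources.values():
        for a, b in combinations(sorted(tags), 2):
            co[(a, b)] += 1

    def sim(a, b):
        if a == b:
            return 1.0
        c = co[(min(a, b), max(a, b))]
        return c / (freq[a] + freq[b] - c)

    return sorted(freq), sim

def cluster_tags(tags, sim, k):
    """Merge the two most similar tag clusters (average linkage) until k remain."""
    clusters = [{t} for t in tags]

    def linkage(x, y):
        return sum(sim(a, b) for a in x for b in y) / (len(x) * len(y))

    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

def seed_medoids(resources, tag_clusters):
    """For each tag cluster, pick the unused resource whose tags match it best."""
    seeds = []
    for cluster in tag_clusters:
        candidates = [r for r in sorted(resources) if r not in seeds]
        seeds.append(min(candidates,
                         key=lambda r: jaccard_distance(resources[r], cluster)))
    return seeds

def assign(resources, medoids):
    """Assign every resource to its nearest medoid."""
    groups = {m: [] for m in medoids}
    for r, tags in resources.items():
        nearest = min(medoids, key=lambda m: jaccard_distance(tags, resources[m]))
        groups[nearest].append(r)
    return groups

def k_medoids(resources, medoids, max_iter=20):
    """Alternate assignment and medoid update, starting from the given seeds."""
    groups = assign(resources, medoids)
    for _ in range(max_iter):
        updated = [
            min(members or [m],
                key=lambda c: sum(jaccard_distance(resources[c], resources[o])
                                  for o in members))
            for m, members in groups.items()
        ]
        if updated == medoids:
            break
        medoids = updated
        groups = assign(resources, medoids)
    return groups

# Hypothetical toy data: resource id -> set of tags.
resources = {
    "book_a": {"python", "programming"},
    "book_b": {"python", "programming", "data"},
    "book_c": {"programming", "data"},
    "book_d": {"novel", "romance"},
    "book_e": {"novel", "romance", "classic"},
    "book_f": {"novel", "classic"},
}

tags, sim = tag_similarity(resources)
tag_clusters = cluster_tags(tags, sim, k=2)
seeds = seed_medoids(resources, tag_clusters)
groups = k_medoids(resources, seeds)
print(seeds, groups)
```

Because the seeds are derived from the tag clusters rather than drawn at random, repeated runs on the same data start from the same medoids, which is the property the abstract attributes to the improved algorithm.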