In web page clustering, hierarchical agglomerative clustering (HAC) and K-means are both widely used, but each has its own shortcomings. This paper proposes a two-stage clustering method. In the first stage, HAC clusters the titles of the web search results. In the second stage, K-means clusters the titles together with the abstracts, using the first-stage result as its initial cluster centers, to obtain a more reasonable clustering. Because titles are generally short, restricting HAC to the titles greatly reduces its running time. The method therefore meets the response-time requirement of web search while still producing good clustering results.
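The two stages described above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the text representation, the similarity measure, the HAC linkage, or how the initial centers are derived, so the term-frequency vectors, cosine similarity, average linkage, and the choice to seed each K-means center with the mean title-plus-abstract vector of one HAC cluster are all assumptions made here. The function names (`vectorize`, `hac`, `kmeans`, `two_stage`) are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Sparse term-frequency vector (assumed representation)."""
    return dict(Counter(text.lower().split()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def hac(vectors, k):
    """Stage 1: average-link agglomerative clustering down to k clusters.

    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None  # (similarity, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                sim = sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

def kmeans(vectors, centers, max_iter=20):
    """Stage 2: K-means (cosine similarity) started from the given centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = centroid(members)
    return labels

def two_stage(titles, abstracts, k):
    """HAC on titles only, then K-means on titles plus abstracts."""
    title_vecs = [vectorize(t) for t in titles]
    full_vecs = [vectorize(t + " " + a) for t, a in zip(titles, abstracts)]
    seeds = hac(title_vecs, k)
    # Assumption: each initial center is the mean full vector of one HAC cluster.
    centers = [centroid([full_vecs[i] for i in members]) for members in seeds]
    return kmeans(full_vecs, centers)
```

In this sketch the quadratic pairwise comparisons of HAC only ever touch the short title vectors, which is where the time saving claimed above would come from, while the longer title-plus-abstract vectors are used only in the cheaper K-means passes.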