Clustering web pages by structural similarity is an important technique in web data mining. Traditional web page clustering methods do not define a representation for the center of a page cluster, so computing the similarity between a page and a cluster, or between two clusters, requires computing the similarity of many pairs of pages. Such algorithms are generally slower than clustering algorithms that use cluster centers, and they struggle to meet the demands of fast, large-scale incremental clustering. To address this problem, this paper proposes FPC (Fast Page Clustering), a fast incremental web page clustering method. FPC first introduces a new method for computing the similarity between two web pages that runs 500 times as fast as the Simple Tree Matching algorithm. It then defines a representation of the web page cluster center and, on that basis, clusters pages with MKmeans (Merge-Kmeans), a variant of the Kmeans algorithm, which improves efficiency at the level of the clustering algorithm. Finally, it uses locality-sensitive hashing to quickly find the most similar cluster within a very large set of page clusters, which improves efficiency at the level of incremental merging.
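To make the last step more concrete, the following is a minimal sketch of how locality-sensitive hashing can narrow the search for the most similar cluster. It is not the paper's implementation: the abstract does not specify the hash family, the page features, or the form of the cluster center. The sketch assumes that a page and a cluster center can each be summarized as a set of tag-path strings, uses MinHash with banding as the hash family, and ranks candidates by Jaccard similarity. All names (`ClusterIndex`, `minhash_signature`, `most_similar`) and parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def _stable_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(features, num_hashes=32):
    """MinHash signature of a non-empty set of feature strings."""
    if not features:
        raise ValueError("features must be non-empty")
    return tuple(
        min(_stable_hash(seed, f) for f in features)
        for seed in range(num_hashes)
    )


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ClusterIndex:
    """Hypothetical LSH index over cluster centers (assumed to be feature sets)."""

    def __init__(self, num_hashes=32, bands=16):
        if num_hashes % bands != 0:
            raise ValueError("num_hashes must be divisible by bands")
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)  # (band index, band slice) -> cluster ids
        self.centers = {}                # cluster id -> frozenset of features

    def _band_keys(self, signature):
        # Split the signature into bands; each band is one bucket key.
        for b in range(self.bands):
            yield (b, signature[b * self.rows:(b + 1) * self.rows])

    def add(self, cluster_id, center_features):
        center = frozenset(center_features)
        self.centers[cluster_id] = center
        for key in self._band_keys(minhash_signature(center, self.num_hashes)):
            self.buckets[key].add(cluster_id)

    def most_similar(self, page_features):
        """Return (cluster_id, similarity), or (None, 0.0) if nothing matches."""
        page = frozenset(page_features)
        candidates = set()
        for key in self._band_keys(minhash_signature(page, self.num_hashes)):
            candidates.update(self.buckets.get(key, ()))
        # Only clusters sharing at least one bucket are compared exactly.
        best_id, best_sim = None, 0.0
        for cid in candidates:
            sim = jaccard(page, self.centers[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim
        return best_id, best_sim


index = ClusterIndex()
index.add("news", {"html/body/div/h1", "html/body/div/p", "html/body/div/span"})
index.add("shop", {"html/body/table/tr/td", "html/body/table/tr/th", "html/body/form"})
match = index.most_similar({"html/body/div/h1", "html/body/div/p", "html/body/div/span"})
```

Here `match` is `("news", 1.0)`, because the query's feature set is identical to the "news" center. The point of the design is that an incoming page is compared exactly only against the clusters that share a bucket with it, not against every cluster. Because LSH is probabilistic, a page that only partly overlaps a center may occasionally miss it; the number of bands and rows per band trades recall against the number of candidates examined.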