To make the results returned by search engines easier to browse, this paper proposes STCC, a hierarchical clustering method for Chinese web pages that combines the STC algorithm with the Chameleon algorithm. The method uses the Jaccard coefficient to redefine the similarity between base clusters in STC, and then applies the Chameleon algorithm to the resulting base-cluster similarity matrix to cluster the web pages. Experimental results show that STCC improves clustering precision by nearly 10% compared with STC, avoids the chaining effect of the single-link algorithm, and is suitable for large-scale web page clustering.
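
As a rough illustration of the similarity step, the sketch below computes a Jaccard-based similarity matrix over base clusters. It assumes each base cluster is represented as a set of document IDs, which the abstract does not specify; the names `jaccard`, `similarity_matrix`, and the sample data are hypothetical, and the subsequent Chameleon clustering step is not shown.

```python
from typing import List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two document-ID sets."""
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(base_clusters: List[Set[int]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Jaccard similarities between base clusters."""
    n = len(base_clusters)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # a base cluster is treated as identical to itself
        for j in range(i + 1, n):
            s = jaccard(base_clusters[i], base_clusters[j])
            sim[i][j] = s
            sim[j][i] = s
    return sim


# Hypothetical base clusters: each set holds the IDs of pages sharing a phrase.
base_clusters = [{1, 2, 3, 4}, {3, 4, 5}, {6, 7}]
matrix = similarity_matrix(base_clusters)
for row in matrix:
    print(row)
```

In this sketch the first two base clusters share two of five distinct pages, so their similarity is 2/5, while the third shares no pages with either and scores 0. A matrix of this form is the kind of input a graph-based method such as Chameleon could take as edge weights.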