现有的搜索引擎查询结果聚类算法大多针对用户查询生成的网页摘要进行聚类,由于网页摘要篇幅较短,质量良莠不齐,聚类效果难以有较大的提高(比如后缀树算法,Lingo算法);而传统的基于全文的聚类算法运算复杂度较高,且难以生成高质量的类别标签,无法满足在线聚类的需求(比如KMeans算法)。该文提出一种基于全文最大频繁项集的网页在线聚类算法MFIC(Maximal Frequent Itemset Clustering)。算法首先基于全文挖掘最大频繁项集,然后依据网页集合之间最大频繁项集的共享关系进行聚类,最后依据类别包含的频繁项生成类别标签。实验结果表明MFIC算法降低了基于网页全文聚类的时间,聚类精度提高15%左右,且能生成可读性较好的类别标签。
Most of existing web page clustering algorithms are based on short and uneven snippets of web pages, which often causes bad clustering performance (e. g. , STC and Lingo algorithms). On the other hand, the classical clustering algorithms for full web pages are too complex to provide good cluster label in addition to the incapability online clustering (for example, Kmeans algorithm). To address above problems, this paper presents an online web page clustering algorithm based on maximal frequent itemsets (MFIC). At first, the maximal frequent itemsets are mined, and then the web pages are clustered based on shared frequent item sets. Finally, clusters are labelled based on the frequent items. Experimental results show that MFIC can effectively reduce clustering time, improve clustering accrucy by 15%, and generate understandable labels.