

A Search Engine Query Result Clustering Algorithm Based on Maximal Frequent Itemsets

Journal: Journal of Chinese Information Processing (中文信息学报), 2010.02
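Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055
Funding: National High-Tech R&D Program of China (863 Program), goal-oriented project (2006AA01Z197); National Natural Science Foundation of China (60703015)
Related project: Research on Key Technologies in Heterogeneous Information Interaction Models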

Keywords: computer application, Chinese information processing, search engine, Web page clustering, frequent itemset

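Chinese abstract (English translation):

Most existing algorithms for clustering search engine query results operate on the page snippets generated for the user's query. Because snippets are short and uneven in quality, clustering quality is hard to improve substantially (for example, the suffix tree algorithm and the Lingo algorithm). Traditional full-text clustering algorithms, by contrast, have high computational complexity and struggle to produce high-quality cluster labels, so they cannot meet the needs of online clustering (for example, the KMeans algorithm). This paper proposes MFIC (Maximal Frequent Itemset Clustering), an online web page clustering algorithm based on maximal frequent itemsets mined from the full text. The algorithm first mines maximal frequent itemsets from the full text, then clusters pages according to the maximal frequent itemsets shared among sets of web pages, and finally generates cluster labels from the frequent items each cluster contains. Experimental results show that MFIC reduces the time needed for full-text web page clustering, improves clustering precision by about 15%, and generates readable cluster labels.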

English abstract:

Most existing web page clustering algorithms operate on snippets of web pages, which are short and uneven in quality, and this often leads to poor clustering performance (e.g., the STC and Lingo algorithms). On the other hand, classical clustering algorithms for full web pages are too complex, fail to provide good cluster labels, and are not capable of online clustering (for example, the K-means algorithm). To address these problems, this paper presents an online web page clustering algorithm based on maximal frequent itemsets (MFIC). First, the maximal frequent itemsets are mined; then the web pages are clustered based on shared frequent itemsets. Finally, clusters are labelled based on the frequent items. Experimental results show that MFIC can effectively reduce clustering time, improve clustering accuracy by 15%, and generate understandable labels.
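The three stages named in the abstract (mine maximal frequent itemsets, cluster pages by the itemsets they share, label clusters from the frequent items) can be illustrated with a small sketch. This is not the paper's MFIC implementation: the abstract gives no details of the mining procedure or the assignment rule, so the Apriori-style search, the `min_support` threshold, the tie-breaking rule, the "other" bucket, and all function names below are assumptions made for illustration. Each page is assumed to be already tokenized into a set of terms.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Apriori-style level-wise search. Returns {itemset: ids of docs containing it}."""
    # Level 1: single terms that occur in at least min_support documents.
    cover = {}
    for doc_id, terms in enumerate(docs):
        for term in terms:
            cover.setdefault(frozenset([term]), set()).add(doc_id)
    level = {s: ids for s, ids in cover.items() if len(ids) >= min_support}
    frequent = dict(level)
    while level:
        next_level = {}
        # Join pairs of k-itemsets that differ in exactly one item.
        for a, b in combinations(level, 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            ids = level[a] & level[b]  # docs that contain every item of the candidate
            if len(ids) >= min_support:
                next_level[candidate] = ids
        frequent.update(next_level)
        level = next_level
    return frequent


def maximal_itemsets(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return {s: ids for s, ids in frequent.items()
            if not any(s < other for other in frequent)}


def cluster_by_shared_itemsets(docs, min_support=2):
    """Group docs by a shared maximal frequent itemset; the itemset is the label."""
    maximal = maximal_itemsets(frequent_itemsets(docs, min_support))
    clusters = {}
    for doc_id in range(len(docs)):
        matches = [s for s, ids in maximal.items() if doc_id in ids]
        if not matches:
            # Assumption: pages sharing no maximal itemset go to a catch-all bucket.
            clusters.setdefault("other", []).append(doc_id)
            continue
        # Assumed tie-break: largest itemset, then widest coverage, then alphabetical.
        best = min(matches, key=lambda s: (-len(s), -len(maximal[s]), sorted(s)))
        clusters.setdefault(" ".join(sorted(best)), []).append(doc_id)
    return clusters


# Toy input: five "pages", each reduced to a set of terms (hypothetical data).
EXAMPLE_DOCS = [
    {"apple", "iphone", "price"},
    {"apple", "iphone", "review"},
    {"apple", "fruit", "juice"},
    {"apple", "fruit", "pie"},
    {"weather"},
]

print(cluster_by_shared_itemsets(EXAMPLE_DOCS, min_support=2))
```

On the toy input, pages 0 and 1 fall under the label "apple iphone", pages 2 and 3 under "apple fruit", and page 4 under "other". The single term "apple" is frequent but not maximal, so it does not merge the two groups; this is the reason for restricting the sketch to maximal itemsets.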
