Clu-GML: A Structural Clustering Algorithm for GML Documents

Journal: Journal of Nanjing University, 2008, 44(2): 188-194
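Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] Department of Computer Science, Nanjing Normal University, Nanjing 210097
Funding: National Natural Science Foundation of China (40771163)
Related project: Research on Spatial Clustering Analysis and Anomaly Detection Methods for GML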

Keywords: geography markup language (GML), clustering by structure, maximal frequent induced subtree

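Chinese abstract:

A new algorithm, Clu-GML, is proposed for clustering geography markup language (GML) documents by structure. Unlike other related algorithms, it introduces the computation of representative trees into agglomerative hierarchical clustering: the representative tree of a cluster is obtained by computing the maximal frequent induced subtree, new clusters are discovered by comparing representative trees, and the representative trees of the new clusters are updated to complete the clustering. This not only reduces the time cost of clustering but also yields a description for each cluster. Experimental results show that Clu-GML is effective and outperforms other algorithms of its kind.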

English abstract:

This paper proposes Clu-GML, an algorithm for clustering geography markup language (GML) documents by structure. Unlike other tree-based clustering algorithms, it introduces the computation of representative trees into the agglomerative hierarchical clustering process. The representative tree of a cluster is obtained by computing the maximal frequent induced subtrees. New clusters are found by comparing representative trees, and the representative trees of the new clusters are then updated to complete the clustering. Previous approaches to this problem compute the similarity or distance between every pair of documents, which is time-consuming and gives unsatisfactory running times on large datasets. In contrast, the proposed algorithm only needs to compute the similarity between the representative trees of two clusters, so it is fast and scales to very large datasets. It not only reduces the running time but also produces a description for every cluster. Experimental results show that Clu-GML is effective and that it outperforms other GML clustering algorithms.
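
The clustering loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The representative tree is approximated by the set of root-to-element tag paths shared by at least `min_support` of a cluster's documents, a much simpler stand-in for the maximal frequent induced subtree. Similarity is taken to be Jaccard overlap. The function names and the `threshold` and `min_support` parameters are all hypothetical.

```python
from itertools import combinations
import xml.etree.ElementTree as ET


def label_paths(xml_text):
    """Return the set of root-to-element tag paths of one XML/GML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def representative(member_paths, min_support=0.6):
    """Assumed stand-in for a representative tree: the tag paths present in at
    least min_support of the cluster's documents. The set is prefix-closed,
    so it still describes a rooted tree."""
    counts = {}
    for paths in member_paths:
        for p in paths:
            counts[p] = counts.get(p, 0) + 1
    needed = min_support * len(member_paths)
    return frozenset(p for p, c in counts.items() if c >= needed)


def similarity(rep_a, rep_b):
    """Jaccard overlap of two path sets (an assumed similarity measure)."""
    union = rep_a | rep_b
    return len(rep_a & rep_b) / len(union) if union else 1.0


def cluster(docs, threshold=0.5, min_support=0.6):
    """Agglomerative clustering that compares cluster representatives only."""
    # Each cluster is (document indices, path sets of its members).
    clusters = [([i], [label_paths(d)]) for i, d in enumerate(docs)]
    reps = [representative(m, min_support) for _, m in clusters]
    while len(clusters) > 1:
        # Pick the most similar pair of representatives (a < b).
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(reps[ij[0]], reps[ij[1]]))
        if similarity(reps[a], reps[b]) < threshold:
            break
        ids = clusters[a][0] + clusters[b][0]
        members = clusters[a][1] + clusters[b][1]
        del clusters[b], reps[b]  # b > a, so index a stays valid
        clusters[a] = (ids, members)
        reps[a] = representative(members, min_support)  # update representative
    return [(sorted(ids), rep) for (ids, _), rep in zip(clusters, reps)]
```

Calling `cluster(list_of_gml_strings)` returns one `(document indices, representative)` pair per cluster, so each cluster carries its own structural description. For brevity the sketch rescans all pairs of representatives in every round; a practical version would cache similarities and recompute only those involving the merged cluster.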
