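This paper proposes Clu-GML, a new algorithm for clustering geography markup language (GML) documents by structure. Unlike other related algorithms, it introduces the computation of representative trees into agglomerative hierarchical clustering: the representative tree of each cluster is obtained by computing maximal frequent induced subtrees, new clusters are discovered by comparing representative trees, and the representative trees of the new clusters are then updated to complete the clustering. This not only reduces the time cost of clustering but also produces a description for every cluster. Experimental results show that Clu-GML is effective and that it outperforms other algorithms of the same kind.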
This paper proposes Clu-GML, an algorithm for clustering geography markup language (GML) documents by structure. Unlike other tree-based clustering algorithms, it introduces the computation of representative trees into the agglomerative hierarchical clustering process. The representative tree of each cluster is obtained by computing the maximal frequent induced subtrees of its documents. New clusters are found by comparing representative trees, and the representative trees of the new clusters are then updated to complete the clustering. Previous work on this problem computes the similarity or distance between every pair of documents, which is time-consuming and performs poorly on large datasets. In contrast, the algorithm presented here only needs to compute the similarity between the representative trees of two clusters, which makes it fast and scalable on very large datasets. It not only reduces the running time but also produces a description for every cluster. Experimental results show that Clu-GML is effective and that its performance is superior to that of other GML clustering algorithms.
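
The clustering loop summarized above can be illustrated with a minimal sketch. It is not the Clu-GML implementation: as a loudly stated assumption, the maximal frequent induced subtree is replaced by a much simpler stand-in, the set of parent-child tag edges that occur in at least a fraction `min_support` of a cluster's documents, and representatives are compared with Jaccard similarity. The function names, the two thresholds, and the four toy documents are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def edge_set(xml_text):
    """Return the set of (parent_tag, child_tag) edges of an XML/GML document."""
    root = ET.fromstring(xml_text)
    edges, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def representative(member_docs, min_support):
    """Edges present in at least min_support (a fraction) of the documents.

    Assumption: this frequent-edge set is a simplified stand-in for the
    maximal frequent induced subtree; it is not the same computation.
    """
    counts = Counter(edge for doc in member_docs for edge in doc)
    needed = min_support * len(member_docs)
    return frozenset(e for e, c in counts.items() if c >= needed)


def jaccard(a, b):
    """Jaccard similarity of two edge sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, min_support=0.6, min_similarity=0.5):
    """Agglomerative clustering that compares one representative per cluster."""
    clusters = [[i] for i in range(len(docs))]  # clusters hold document indices
    reps = [representative([docs[i]], min_support) for i in range(len(docs))]
    while len(clusters) > 1:
        # Find the pair of clusters whose representatives are most similar.
        best_sim, (i, j) = max(
            (jaccard(reps[a], reps[b]), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < min_similarity:
            break
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i] = merged
        # Update the representative of the newly formed cluster.
        reps[i] = representative([docs[k] for k in merged], min_support)
    return clusters, reps


# Hypothetical toy documents: two road-like and two building-like structures.
road1 = ("<FeatureCollection><featureMember><Road><name/><centerLineOf>"
         "<LineString/></centerLineOf></Road></featureMember></FeatureCollection>")
road2 = ("<FeatureCollection><featureMember><Road><name/><lanes/><centerLineOf>"
         "<LineString/></centerLineOf></Road></featureMember></FeatureCollection>")
bldg1 = ("<FeatureCollection><featureMember><Building><extentOf><Polygon/>"
         "</extentOf></Building></featureMember></FeatureCollection>")
bldg2 = ("<FeatureCollection><featureMember><Building><height/><extentOf>"
         "<Polygon/></extentOf></Building></featureMember></FeatureCollection>")

docs = [edge_set(x) for x in (road1, bldg1, road2, bldg2)]
clusters, reps = cluster(docs)
print(clusters)  # [[0, 2], [1, 3]]
```

In this sketch each merge step compares one representative per cluster rather than every pair of member documents, and the final `reps` entries serve as a rough description of each cluster (here, the edges shared by both road documents and by both building documents). A faithful implementation would mine maximal frequent induced subtrees instead of edge sets.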