This paper presents TBCClustering, a new algorithm for clustering GML documents by structure, based on maximal frequent induced subtrees. The algorithm mines the maximal frequent induced subtrees of a GML document collection to construct a feature space and then optimizes that feature space by selecting a subset of the subtree patterns as clustering features. It clusters the documents with the CLOPE algorithm; both the minimum support and the number of clusters are determined automatically, so the user does not have to set them. The approach reduces the dimensionality of the features while achieving higher clustering precision. Experimental results show that TBCClustering is effective and performs better than the PBClustering algorithm.
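To illustrate why the clustering step described above needs no preset number of clusters, the following is a minimal sketch of the allocation pass of CLOPE as it is commonly described, not the authors' implementation. It assumes each document has already been reduced to a non-empty set of feature identifiers (here, hypothetical string labels standing in for maximal frequent induced subtrees). The names `Cluster`, `clope_single_pass`, and the repulsion value `r=2.0` are illustrative assumptions; the full CLOPE procedure also repeats refinement passes until no document changes cluster, which is omitted here.

```python
from collections import Counter


class Cluster:
    """Running statistics of one CLOPE cluster."""

    def __init__(self):
        self.n = 0            # number of documents in the cluster
        self.size = 0         # total number of feature occurrences
        self.occ = Counter()  # occurrences per feature

    @property
    def width(self):
        # Number of distinct features present in the cluster.
        return len(self.occ)

    def delta_add(self, doc, r):
        """Change in the profit numerator if `doc` joined this cluster."""
        new_size = self.size + len(doc)
        new_width = self.width + sum(1 for f in doc if f not in self.occ)
        gain = new_size * (self.n + 1) / new_width ** r
        if self.n:
            gain -= self.size * self.n / self.width ** r
        return gain

    def add(self, doc):
        self.n += 1
        self.size += len(doc)
        self.occ.update(doc)


def clope_single_pass(docs, r=2.0):
    """Assign each feature set to the cluster with the largest profit gain.

    A new cluster is opened whenever that yields at least as much gain as
    joining any existing cluster, so the cluster count is not an input.
    """
    clusters, labels = [], []
    for doc in docs:
        if not doc:
            raise ValueError("each document needs at least one feature")
        # Gain of opening a fresh cluster that holds only this document.
        best, best_gain = None, Cluster().delta_add(doc, r)
        for idx, cluster in enumerate(clusters):
            gain = cluster.delta_add(doc, r)
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:
            clusters.append(Cluster())
            best = len(clusters) - 1
        clusters[best].add(doc)
        labels.append(best)
    return labels


# Hypothetical feature sets: two documents share subtrees a/b, two share x/y.
docs = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
labels = clope_single_pass(docs)
```

With these four toy feature sets the pass yields the labels `[0, 0, 1, 1]`: two clusters emerge without the count being specified. A larger repulsion `r` favors more, tighter clusters, so that parameter still influences the outcome in this sketch.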