Owing to its extensibility, semi-structured nature, and self-describing character, the eXtensible Markup Language (XML) has become the standard format for data representation and exchange. An efficient, fast XML document clustering mechanism can greatly shorten information retrieval time, improve the efficiency of data queries, and uncover potential information value. To this end, an improved k-medoids algorithm is proposed for clustering XML documents, based on the structure of XML documents and the similarity between them. Fuzzy clustering is used to determine the number of clusters, and the global search capability of a genetic algorithm is used to find the best cluster centers (medoids), thereby improving clustering quality on large-scale XML document collections. Experimental results show that, compared with clustering based on the traditional k-medoids algorithm, the improved method achieves higher clustering accuracy and better convergence.
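
The idea of searching for k-medoids cluster centers with a genetic algorithm can be illustrated with a minimal sketch. The abstract does not give the operators, the fitness function, or the XML similarity measure, so everything below is an assumption made for illustration: the pairwise distance matrix `dist` is taken as already computed, the number of clusters `k` is taken as given (the paper determines it by fuzzy clustering, which is not shown), and the function names, selection scheme, crossover, and mutation are hypothetical choices rather than the authors' method.

```python
import random


def clustering_cost(dist, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]


def ga_k_medoids(dist, k, pop_size=20, generations=50, mutation_rate=0.2, seed=0):
    """Search for k medoids with a simple elitist genetic algorithm.

    Assumed, illustrative operators (not taken from the paper):
    selection keeps the better half, crossover samples k medoids from
    the union of two parents, mutation swaps one medoid for a non-medoid.
    """
    rng = random.Random(seed)
    n = len(dist)
    # Each individual is a sorted tuple of k distinct point indices.
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: clustering_cost(dist, ind))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice([i for i in range(n) if i not in child]))
            children.append(tuple(sorted(child)))
        population = survivors + children
    best = min(population, key=lambda ind: clustering_cost(dist, ind))
    return best, assign(dist, best)
```

Given a symmetric distance matrix, `ga_k_medoids(dist, k)` returns the chosen medoid indices and, for each document, the index of the medoid it is assigned to. Because the better half of the population is carried over unchanged, the best cost never increases from one generation to the next.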