This paper describes a text clustering algorithm based on an improved similarity calculation formula, designed for the text feature matrix produced by domain ontology matching. Experiments on texts from the biomedical field show that, in terms of both entropy and purity, the clustering algorithm improved with the domain ontology outperforms the K-means algorithm and the agglomerative hierarchical clustering algorithm.
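The comparison above rests on two external evaluation measures, entropy and purity. As a minimal sketch of how such scores are commonly computed, the Python function below uses the standard textbook definitions: purity is the fraction of documents that fall in the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy. The function name `purity_and_entropy`, the base-2 logarithm, and the absence of normalization are assumptions made for illustration; the abstract does not state which variant the authors used.

```python
import math
from collections import Counter


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against reference classes.

    Hypothetical helper for illustration; uses log base 2 and no
    normalization, which may differ from the paper's exact formulas.
    """
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(cluster_ids)

    # Count how many documents of each reference class land in each cluster.
    clusters = {}
    for cid, label in zip(cluster_ids, class_labels):
        clusters.setdefault(cid, Counter())[label] += 1

    purity = 0.0
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        # Purity: the majority class of each cluster counts as "correct".
        purity += max(counts.values()) / n
        # Entropy of the class distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        # Weight each cluster's entropy by its share of the documents.
        entropy += (size / n) * h
    return purity, entropy


if __name__ == "__main__":
    # Toy data: six documents, two clusters, two reference classes.
    ids = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "b"]
    p, e = purity_and_entropy(ids, labels)
    print(f"purity={p:.4f} entropy={e:.4f}")
```

Under these definitions, higher purity and lower entropy both indicate clusters that agree more closely with the reference classes, which is the sense in which one algorithm can be said to be better than another on both measures.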