Text similarity algorithms based on the traditional vector space model consider neither the high dimensionality of text vectors nor slight variations in keywords, so their similarity results are not precise enough. To address this problem, this paper proposes TSABCLDA (Text Similarity Algorithm Based on Clustering and LD Algorithm), a text similarity algorithm for the case of slight keyword variation. First, the texts are preprocessed by removing digits, punctuation marks, and stop words. Next, clustering is used to reduce the low-frequency words in the texts, the LD algorithm is used to compute the similarity between feature words, and a text similarity matrix is built. Finally, the similarity between texts is computed from space vectors constructed from the feature-word similarities and their weights. This approach not only accounts for slight keyword variation but also effectively addresses the high dimensionality of text vectors; applied to text mining, it can improve the efficiency of mining similar texts. Experimental results show that, because slight keyword variation is taken into account, the accuracy of the algorithm's text similarity results improves markedly within a certain threshold range.
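The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: that "LD" denotes the Levenshtein (edit) distance, that feature-word weights are plain term frequencies, and that the final score is a soft cosine measure over a thresholded word-similarity matrix (a technique swapped in here for illustration). The clustering-based reduction of low-frequency words is omitted. The names `STOP_WORDS`, `preprocess`, `word_similarity`, `text_similarity`, and the default `threshold` of 0.8 are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Lowercase the text, drop digits and punctuation, remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def levenshtein(a, b):
    """Edit distance between two strings (assumed meaning of "LD")."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def word_similarity(a, b):
    """Normalize the edit distance into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def text_similarity(text_a, text_b, threshold=0.8):
    """Soft cosine over term-frequency vectors.

    Word pairs whose similarity falls below `threshold` contribute nothing,
    so threshold=1.0 reduces to the ordinary cosine of the count vectors.
    """
    ca = Counter(preprocess(text_a))
    cb = Counter(preprocess(text_b))

    def soft_dot(x, y):
        total = 0.0
        for wx, fx in x.items():
            for wy, fy in y.items():
                s = word_similarity(wx, wy)
                if s >= threshold:
                    total += fx * fy * s
        return total

    denom = math.sqrt(soft_dot(ca, ca) * soft_dot(cb, cb))
    return soft_dot(ca, cb) / denom if denom else 0.0
```

In this sketch, `text_similarity("The colour of the sky", "color of sky")` scores higher than the exact-match variant with `threshold=1.0`, because "colour" and "color" differ by a single edit and are treated as near matches rather than as unrelated dimensions.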