Traditional text similarity algorithms based on word-frequency statistics usually consider only the weights of feature items within a document and ignore the semantic relations among those feature items. This paper takes into account both the importance of feature items in a document and the semantic relations among them, and proposes constructing a weighted semantic network model of a document's feature items to compute the similarity between documents. In constructing the model, the selection of feature items and the calculation of their weights are also improved. Experiments show that, compared with the traditional algorithm, the similarity algorithm based on the weighted semantic network further improves the precision of document similarity calculation.
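The abstract does not state the model's formulas, so the following is only a minimal sketch of the general idea it describes: combining per-document feature weights with pairwise semantic relations between feature items. It uses a soft cosine measure as a stand-in technique, not the paper's weighted semantic network. The functions `term_weights`, `semantic_similarity`, and `relatedness`, the `RELATED` table, and the sample documents are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def term_weights(tokens):
    """Relative term frequency, used here as a stand-in for feature-item weights."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def semantic_similarity(weights_a, weights_b, relatedness):
    """Soft cosine: weighted sum over all term pairs, scaled by their relatedness."""
    def soft_dot(x, y):
        return sum(
            wx * wy * relatedness(tx, ty)
            for tx, wx in x.items()
            for ty, wy in y.items()
        )

    denom = math.sqrt(soft_dot(weights_a, weights_a) * soft_dot(weights_b, weights_b))
    return soft_dot(weights_a, weights_b) / denom if denom else 0.0


# Hypothetical relatedness scores; a real system might derive them from a thesaurus.
RELATED = {frozenset(("car", "automobile")): 0.9}


def relatedness(term_1, term_2):
    """Return 1.0 for identical terms, a table score for known pairs, else 0.0."""
    if term_1 == term_2:
        return 1.0
    return RELATED.get(frozenset((term_1, term_2)), 0.0)


def exact_match(term_1, term_2):
    """Relatedness that ignores semantics, which reduces to ordinary cosine similarity."""
    return 1.0 if term_1 == term_2 else 0.0


doc_a = ["car", "engine", "repair"]
doc_b = ["automobile", "engine", "service"]

weights_a = term_weights(doc_a)
weights_b = term_weights(doc_b)

plain = semantic_similarity(weights_a, weights_b, exact_match)
soft = semantic_similarity(weights_a, weights_b, relatedness)
print(f"frequency only: {plain:.3f}, with semantic relations: {soft:.3f}")
```

In this toy example the two documents share only the word "engine", so the frequency-only score is low. Once "car" and "automobile" are treated as related, the score rises, which is the kind of effect the abstract attributes to modelling semantic relations among feature items.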