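With the rapid development of network and information technology, a large volume of electronic text has been produced online, and computing the similarity between texts is an important technique in text processing. For large-scale text collections, the vector space model (VSM) is commonly used for text representation, but this approach suffers from high-dimensional text vectors and from the difficulty of measuring semantic similarity between texts. This paper proposes an improved text similarity computation method: representative metadata feature vector elements are selected from the large feature space to reduce the dimensionality of the vector space; a domain concept tree is constructed and a text similarity algorithm based on it is designed, which handles the synonyms that are widespread among domain concepts, so as to improve the performance of semantic similarity measurement between texts. Experimental results show that dimensionality reduction combined with concept similarity computation improves the performance of text similarity computation.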
With the rapid development of network and information technology, a large number of electronic documents have appeared on the network, and computing the similarity between documents is an important means of document processing. For large-scale document collections, the vector space model (VSM) is usually used for document representation, but this method faces two problems: the document vectors are high-dimensional, and semantic similarity between documents is difficult to measure. An improved method for computing document similarity is proposed. Representative metadata feature vector elements are selected from the large feature space, which reduces the dimension of the vector space. A domain concept tree is constructed, and a document similarity algorithm based on the domain concept tree is designed. To improve the performance of semantic similarity measurement between documents, the synonyms that are widespread among domain concepts are processed. The experimental results show that dimensionality reduction and concept similarity computation improve the performance of document similarity computation.
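The abstract does not give the selection rule or the similarity formula, so the following is only a minimal sketch of the general idea under stated assumptions, not the proposed algorithm itself. It assumes a hypothetical synonym table (`SYNONYMS`), a hypothetical domain concept tree stored as child-to-parent links (`PARENT`), and a hypothetical pre-selected feature set (`VOCABULARY`) standing in for the dimensionality-reduction step. Concept similarity is approximated with a Wu-Palmer-style depth ratio, and document similarity with a soft cosine over concept counts; both are stand-ins chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical synonym table: surface term -> canonical concept.
SYNONYMS = {"automobile": "car", "auto": "car", "lorry": "truck"}

# Hypothetical domain concept tree, stored as child -> parent.
PARENT = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "entity",
    "apple": "fruit",
    "fruit": "entity",
}

# Hypothetical reduced feature set (the retained vector dimensions).
VOCABULARY = {"car", "truck", "apple"}


def canonical(term):
    """Map a term to its canonical concept, merging synonyms."""
    return SYNONYMS.get(term, term)


def path_to_root(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def concept_similarity(a, b):
    """Wu-Palmer-style similarity: 2*depth(lca) / (depth(a) + depth(b))."""
    a, b = canonical(a), canonical(b)
    if a == b:
        return 1.0
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # First shared node on the way up from a is the lowest common ancestor.
    lca = next((c for c in path_a if c in ancestors_b), None)
    if lca is None:
        return 0.0
    # Depth counts nodes from the concept to the root, so the root has depth 1.
    return 2 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def doc_vector(tokens, vocabulary=VOCABULARY):
    """Count canonical concepts, keeping only the selected features."""
    return Counter(c for c in map(canonical, tokens) if c in vocabulary)


def soft_dot(u, v):
    """Dot product in which every concept pair is weighted by its similarity."""
    return sum(
        wu * wv * concept_similarity(cu, cv)
        for cu, wu in u.items()
        for cv, wv in v.items()
    )


def document_similarity(u, v):
    """Soft cosine similarity between two concept-count vectors."""
    denom = math.sqrt(soft_dot(u, u) * soft_dot(v, v))
    return soft_dot(u, v) / denom if denom else 0.0


d1 = doc_vector(["automobile", "lorry", "the"])
d2 = doc_vector(["car", "truck"])
d3 = doc_vector(["apple"])
```

In this toy setup `d1` and `d2` share no surface terms, yet they map to the same concept vector once synonyms are merged, while `d3` is related to them only through the root of the tree and therefore scores lower. A plain cosine over raw terms would score `d1` against `d2` as zero, which is the gap the concept-based measure is meant to close.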