This paper presents a method for computing document relevance based on lexical cohesion. Using HowNet as the knowledge base, the method links semantically related words in a document into lexical chains according to the lexical cohesion principle, computes a weight for each chain, and represents the document formally as a set of weighted lexical chains. The relevance between documents is then computed by matching these representations. Experiments on a corpus organized by the Chinese Library Classification achieved a precision of 85.4%. The results indicate that the method captures the semantic information of documents to a certain extent and lifts the comparison between documents from direct matching of characters or words to a comparison at an approximately conceptual level, making it an effective way to compute relevance between documents.
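The pipeline outlined in the abstract (chain construction, chain weighting, chain-based matching) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the chaining rule, the weighting formula, or the matching function, so all three are assumptions here. `TOY_CONCEPTS` is a hypothetical stand-in for HowNet, chains are built greedily, a chain's weight is assumed to be its share of the document's tokens, and two chains are compared by the fraction of related word pairs.

```python
from collections import Counter

# Hypothetical stand-in for HowNet: each word maps to a set of concept labels.
TOY_CONCEPTS = {
    "bank": {"finance"}, "loan": {"finance"}, "interest": {"finance"},
    "river": {"water"}, "flood": {"water"},
    "doctor": {"medicine"}, "hospital": {"medicine"},
}

def related(a, b):
    """Assumed relatedness test: identical words, or a shared concept label."""
    if a == b:
        return True
    return bool(TOY_CONCEPTS.get(a, set()) & TOY_CONCEPTS.get(b, set()))

def build_chains(words):
    """Greedily attach each distinct word to the first chain holding a related word."""
    chains = []  # each chain maps word -> frequency in the document
    for word, freq in Counter(words).items():
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain[word] = freq
                break
        else:
            chains.append({word: freq})  # no related chain found: start a new one
    return chains

def chain_weights(chains):
    """Assumed weighting: a chain's share of all tokens covered by the chains."""
    total = sum(sum(chain.values()) for chain in chains)
    return [sum(chain.values()) / total for chain in chains]

def chain_similarity(chain_a, chain_b):
    """Fraction of cross-chain word pairs that are related."""
    pairs = [(a, b) for a in chain_a for b in chain_b]
    return sum(related(a, b) for a, b in pairs) / len(pairs)

def relevance(words_a, words_b):
    """Weighted sum of chain similarities; lies in [0, 1]."""
    chains_a, chains_b = build_chains(words_a), build_chains(words_b)
    weights_a, weights_b = chain_weights(chains_a), chain_weights(chains_b)
    return sum(
        wa * wb * chain_similarity(ca, cb)
        for ca, wa in zip(chains_a, weights_a)
        for cb, wb in zip(chains_b, weights_b)
    )

finance_doc = ["bank", "loan", "interest", "bank"]
mixed_doc = ["loan", "bank", "river"]
medical_doc = ["doctor", "hospital"]
print(relevance(finance_doc, mixed_doc), relevance(finance_doc, medical_doc))
```

In this toy setting the two documents about finance score higher than the finance and medical pair even where they share no exact word, which mirrors the shift from word-level to approximately concept-level comparison that the abstract describes. The score is not normalized, so a document compared with itself scores 1 only when all of its words fall into a single chain.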