This paper proposes a method for computing document similarity from multiple kinds of contextual information. The method first extracts feature words from each document. For documents that share feature words with the same (or similar) meaning, it collects several kinds of information about the words that co-occur with each feature word in context: part of speech, semantic information, positional relation, and average co-occurrence probability. This information is quantified and expressed as a similarity function. A similarity evaluation value is then derived from the similarity functions of each pair of documents and serves as an important basis for judging how similar the documents are. The method is evaluated by re-ranking initially retrieved documents, using the Chinese documents in the cross-language information retrieval data set of NTCIR-3. The results show that average precision over the top 10 and top 100 retrieved documents improves by 15.45%-18.49% and 11.96%-15.35%, respectively. In a second experiment on Web pages carrying the same information, the average F1-measure of document similarity exceeds 95% for each of several document categories.
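The abstract does not give the form of the similarity function, so the following Python sketch is only an illustration of the general idea, not the authors' formulation. It assumes each document is a list of (word, part-of-speech tag) pairs, treats every word as a feature word, and compares the relative frequencies of co-occurring words together with their tags and positions inside a fixed window. The function names, the window size, and the overlap measure are all hypothetical, and semantic information is left out.

```python
from collections import Counter


def context_profile(tokens, feature, window=2):
    """Count (word, tag, offset) triples seen within `window` tokens of `feature`."""
    profile = Counter()
    for i, (word, _tag) in enumerate(tokens):
        if word != feature:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                co_word, co_tag = tokens[j]
                # The offset j - i records the positional relation to the feature word.
                profile[(co_word, co_tag, j - i)] += 1
    return profile


def profile_similarity(p, q):
    """Overlap of two context profiles in [0, 1], based on relative frequencies."""
    total_p, total_q = sum(p.values()), sum(q.values())
    if total_p == 0 or total_q == 0:
        return 0.0
    return sum(min(p[k] / total_p, q[k] / total_q) for k in p.keys() & q.keys())


def document_similarity(doc_a, doc_b, window=2):
    """Average profile overlap over the words shared by both documents.

    Assumption: every shared word counts as a feature word.
    """
    shared = {w for w, _ in doc_a} & {w for w, _ in doc_b}
    if not shared:
        return 0.0
    scores = [
        profile_similarity(
            context_profile(doc_a, w, window),
            context_profile(doc_b, w, window),
        )
        for w in shared
    ]
    return sum(scores) / len(scores)


doc_a = [("cat", "N"), ("sat", "V"), ("mat", "N")]
doc_d = [("cat", "N"), ("sat", "V"), ("rug", "N")]
print(document_similarity(doc_a, doc_d))
```

Under these assumptions, a score of this kind could be used to re-order an initial retrieval list, for example by sorting candidate documents by their similarity to the top-ranked ones.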