Computing the semantic similarity between words is fundamental to many natural language processing applications. This paper proposes a new method for the task that is efficient, practical, and reasonably accurate. Starting from the classic distributional hypothesis that "similar words occur in similar contexts", we argue that a word's collocates in two-word noun phrases, rather than its adjacent words in sentences, better reflect its semantic features and therefore give better results. From a large collection of automatically constructed two-word noun phrases, we first build tf-idf weighted vectors of direct and indirect collocates, and then take the cosine similarity between two words' collocate vectors as their semantic similarity. To allow comparison with related approaches, we construct a benchmark test set for Chinese word semantic similarity based on human ratings. On the nouns, verbs, and adjectives in this test set, the method achieves correlation coefficients of 0.703, 0.509, and 0.700, respectively, with 100% coverage.
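The vector construction and similarity step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy English phrases, the function names `build_collocate_vectors` and `cosine_similarity`, and the smoothed idf formula `log(1 + N / df)` are all assumptions made for illustration. Only direct collocates are handled, because the abstract does not define how indirect collocates are derived.

```python
import math
from collections import Counter, defaultdict


def build_collocate_vectors(phrases):
    """Build tf-idf weighted collocate vectors from two-word phrases.

    phrases: iterable of (first_word, second_word) pairs.
    Assumption: each word of a phrase is treated as a collocate of the
    other, regardless of position.
    """
    counts = defaultdict(Counter)  # word -> Counter of its collocates
    for first, second in phrases:
        counts[first][second] += 1
        counts[second][first] += 1

    n_words = len(counts)
    # Document frequency of a collocate: how many words it collocates with.
    df = Counter()
    for collocates in counts.values():
        df.update(collocates.keys())

    vectors = {}
    for word, collocates in counts.items():
        # Assumed weighting: raw count times a smoothed idf (always > 0).
        vectors[word] = {
            c: tf * math.log(1 + n_words / df[c])
            for c, tf in collocates.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data; the paper uses automatically built Chinese phrases.
toy_phrases = [
    ("apple", "juice"), ("orange", "juice"),
    ("apple", "tree"), ("orange", "tree"),
    ("apple", "pie"),
    ("steel", "bridge"), ("stone", "bridge"),
]
vectors = build_collocate_vectors(toy_phrases)
sim_fruit = cosine_similarity(vectors["apple"], vectors["orange"])
sim_mixed = cosine_similarity(vectors["apple"], vectors["steel"])
```

In this toy data, "apple" and "orange" share the collocates "juice" and "tree", so their similarity is positive, while "apple" and "steel" share none and score zero. Words that never occur in any phrase have no vector here, which is why coverage depends on the size of the phrase collection.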