Word similarity measures based on corpus statistics have several drawbacks: they are computationally expensive, their feature vectors are high-dimensional and sparse, and they ignore the semantic information carried by words. This paper proposes a word similarity measure based on the latent Dirichlet allocation (LDA) model, in which each word's feature vector is mapped to a distribution over topics and similarity is computed between these topic distributions. A comparison with a word similarity measure based on HowNet (《知网》) shows that the proposed method effectively reduces the dimensionality of the feature space and yields good word similarity results.
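The idea of comparing words through their topic distributions can be illustrated with a minimal sketch. It assumes that an LDA model has already been trained and that its topic-word probabilities P(word | topic) are available. The abstract does not say how the topic distribution of a word is obtained or which similarity function is used, so the Bayes-rule inversion, the cosine measure, the function names, and the toy values in `PHI` and `PRIOR` are all illustrative assumptions, not the paper's method or data.

```python
import math


def word_topic_distribution(phi, topic_prior, word):
    """Return P(topic | word) from P(word | topic) using Bayes' rule.

    phi: list of dicts, phi[k][w] = P(w | topic k)  (assumed given by a trained LDA model)
    topic_prior: list of floats, topic_prior[k] = P(topic k)  (assumed)
    """
    joint = [topic_prior[k] * phi[k].get(word, 0.0) for k in range(len(phi))]
    total = sum(joint)
    if total == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every topic")
    return [p / total for p in joint]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(phi, topic_prior, w1, w2):
    """Similarity of two words measured on their topic distributions.

    Cosine is an assumed choice; other measures over distributions
    (e.g. Jensen-Shannon divergence) could be substituted.
    """
    return cosine(
        word_topic_distribution(phi, topic_prior, w1),
        word_topic_distribution(phi, topic_prior, w2),
    )


# Toy, made-up topic-word probabilities for two topics (not from the paper).
PHI = [
    {"car": 0.50, "truck": 0.40, "apple": 0.10},
    {"car": 0.05, "truck": 0.05, "apple": 0.90},
]
PRIOR = [0.5, 0.5]

if __name__ == "__main__":
    print(word_similarity(PHI, PRIOR, "car", "truck"))
    print(word_similarity(PHI, PRIOR, "car", "apple"))
```

Each word is represented by a vector whose length equals the number of topics rather than the vocabulary size, which is the dimensionality reduction the abstract refers to.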