Because word frequencies in documents follow a power-law distribution, the topic distributions learned by the LDA model are skewed toward high-frequency words: the many words that could represent a topic are drowned out by a small number of high-frequency words, which weakens the expressive power of the topics. This paper improves the topic distributions of the LDA topic model by weighting feature words with a Gaussian function. Experiments show that the weighted LDA model yields both lower correlation between topics and lower perplexity, indicating that the improved model performs better in topic expression and in predictive performance.
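The abstract does not give the exact form of the Gaussian weighting function, so the following is only a minimal sketch of one plausible reading. It assumes the weight is a Gaussian of a word's log corpus frequency, centred on the mean log frequency over the vocabulary, so that very frequent words receive small weights. The function name `gaussian_word_weights`, the use of log frequency, and the choice of centre and width are all illustrative assumptions, not the authors' definition.

```python
import math
from collections import Counter


def gaussian_word_weights(docs, sigma=None):
    """Map each word to a weight in (0, 1], Gaussian in its log corpus frequency.

    Assumption (not from the paper): the Gaussian is centred on the mean log
    frequency of the vocabulary, and its width defaults to the standard
    deviation of the log frequencies.
    """
    # Corpus-level term counts over all tokenised documents.
    counts = Counter(word for doc in docs for word in doc)
    log_freq = {word: math.log(count) for word, count in counts.items()}

    # Centre of the Gaussian: mean log frequency across the vocabulary.
    mu = sum(log_freq.values()) / len(log_freq)

    if sigma is None:
        variance = sum((x - mu) ** 2 for x in log_freq.values()) / len(log_freq)
        # Fall back to 1.0 when every word has the same frequency.
        sigma = math.sqrt(variance) or 1.0

    # Words far from the centre, in particular very frequent ones, get small weights.
    return {
        word: math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        for word, x in log_freq.items()
    }
```

In a weighted LDA sampler, such weights would typically scale each word's contribution to the topic-word counts instead of counting every occurrence as 1; how the paper applies them is not stated in the abstract.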