Probabilistic topic models have received considerable attention in recent years. Latent Dirichlet Allocation (LDA) is one of the representative probabilistic generative models among topic models, and it can detect the latent topics of a text. This paper proposes a topic feature based on the LDA model: the feature measures the distance between the topic distribution of a document and the topic distribution of a sentence. This feature is combined with features commonly used in conventional multi-document automatic summarization to compute sentence weights, and the summary is formed by extracting sentences according to their scores. Experimental results show that adding the LDA-based topic feature significantly improves automatic summarization performance.
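The topic feature and the sentence scoring step can be illustrated with a minimal sketch. The abstract does not state which distance is used or how the features are combined, so the Jensen-Shannon divergence, the linear weighting, and all names here (`topic_feature`, `score_sentences`, `topic_weight`) are assumptions made for illustration only. The topic distributions are taken as given, as if already inferred by some LDA implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def topic_feature(doc_topics, sent_topics):
    """Similarity in [0, 1]: 1 when the sentence's topic distribution
    matches the document's, 0 when the two share no topic mass.
    Assumption: the source only says 'distance', not which one."""
    return 1.0 - js_divergence(doc_topics, sent_topics) / math.log(2)


def score_sentences(doc_topics, sentences, topic_weight=0.5):
    """Rank sentences by a weighted sum of the topic feature and a
    precomputed score from other features (hypothetical combination).

    sentences: list of (text, sentence_topic_distribution, other_score).
    Returns (score, text) pairs, highest score first.
    """
    scored = []
    for text, sent_topics, other_score in sentences:
        score = (topic_weight * topic_feature(doc_topics, sent_topics)
                 + (1.0 - topic_weight) * other_score)
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Made-up distributions over three topics, for illustration only.
    doc = [0.7, 0.2, 0.1]
    candidates = [
        ("sentence close to the document's topics", [0.6, 0.3, 0.1], 0.5),
        ("sentence about a minor topic", [0.1, 0.2, 0.7], 0.5),
    ]
    for score, text in score_sentences(doc, candidates):
        print(round(score, 3), text)
```

A summary would then be formed by taking the top-ranked sentences up to a length limit. Turning the distance into a similarity keeps the direction of the score consistent with the other features, so a higher value always means a better candidate.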