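[Purpose/significance] Latent Dirichlet Allocation (LDA) is used in scientific and technical intelligence analysis to discover disciplinary topics, mine research hotspots, and predict research trends. This paper evaluates LDA topic extraction on common text corpora of scientific literature (keywords, abstracts, keywords + abstracts) to reveal how well topics are extracted from each corpus and to improve the effectiveness of LDA in scientific and technical intelligence analysis. [Method/process] LDA topic models built on the three corpora above are compared, and topic extraction is evaluated by combining a quantitative analysis based on recall, precision, F-score, and information entropy with a qualitative analysis based on the breadth of topic extraction and topic granularity. [Result/conclusion] An empirical study on scientific literature data from the domestic wind energy field finds that, in both the quantitative and the qualitative analysis, LDA topic extraction with abstracts and with keywords + abstracts as the corpus outperforms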
[Purpose/significance] Latent Dirichlet Allocation (LDA) is used in scientific and technical intelligence analysis to discover disciplinary topics, identify research hotspots, and predict research trends. This paper evaluates LDA topic extraction on three common corpora of scientific literature, built from keywords, abstracts, or keywords plus abstracts, in order to show how the choice of corpus affects topic extraction and to improve the application of LDA in scientific and technical intelligence analysis. [Method/process] We compare LDA topic models trained on the three corpora and evaluate the results in two ways: a quantitative analysis based on precision, recall, F-score, and information entropy, and a qualitative analysis along two dimensions, the breadth of topic extraction and topic granularity. [Result/conclusion] Experiments on scientific literature from the domestic wind energy field show that, in both the quantitative and the qualitative analysis, LDA with abstracts or with keywords plus abstracts extracts topics better than LDA with keywords alone. The abstract corpus and the keywords-plus-abstracts corpus suit different application scenarios: the former covers a broader range of topics, and the latter yields finer-grained topics.
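The quantitative indexes named above can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so this assumes the standard set-based precision, recall, and F-score computed against a reference topic set, and Shannon entropy (base 2) over a probability distribution. The function names `precision_recall_f` and `entropy` and the sample topic labels are hypothetical and used only for illustration.

```python
import math


def precision_recall_f(extracted, reference):
    """Set-based precision, recall, and F-score (assumed standard definitions)."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)  # topics found in both sets
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical example: topics extracted by a model vs. a reference list.
extracted_topics = {"turbine blade", "grid integration", "wind forecast", "gearbox"}
reference_topics = {"turbine blade", "grid integration", "offshore foundation"}
p, r, f = precision_recall_f(extracted_topics, reference_topics)

# Entropy of a (hypothetical) uniform two-topic distribution.
h = entropy([0.5, 0.5])
```

Under these assumed definitions, a higher F-score indicates closer agreement with the reference topic set. How entropy is applied to the extracted topics (for example, over topic-word or document-topic distributions) is not specified in the abstract, so the sketch leaves that choice open.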