Latent Dirichlet allocation (LDA) takes a long time to extract the latent topic information in large-scale document collections or corpora. To address this, we propose a parallel method for building LDA topic models based on the MapReduce architecture, and study its implementation with a distributed programming model. Experiments on the Hadoop parallel computing platform show that, when processing large-scale text, the method achieves near-linear speedup and also improves the quality of the resulting topic model.
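The abstract does not say which inference procedure is parallelized, so the following is only a minimal sketch of one common way to fit LDA in a map/reduce style. It assumes collapsed Gibbs sampling with approximate count synchronization: each mapper resamples the topics of its own shard of documents against a private copy of the global topic-word counts, and the reducer sums the per-shard count changes. All function names and parameters here are hypothetical, and the mappers are simulated sequentially in one process rather than run on Hadoop.

```python
import random
from collections import Counter


def count_topic_word(docs, z, num_topics):
    """Build topic-word counts from the current token-topic assignments."""
    topic_word = [Counter() for _ in range(num_topics)]
    for doc, zs in zip(docs, z):
        for w, k in zip(doc, zs):
            topic_word[k][w] += 1
    return topic_word


def map_shard(shard_docs, shard_z, global_tw, num_topics, vocab_size,
              alpha, beta, rng):
    """Map step (assumed): one Gibbs sweep over a shard of documents.

    Works on a private copy of the global counts, updates shard_z in place,
    and returns only the change in topic-word counts caused by this shard.
    """
    local_tw = [Counter(c) for c in global_tw]
    topic_total = [sum(c.values()) for c in local_tw]
    for doc, zs in zip(shard_docs, shard_z):
        doc_topic = Counter(zs)
        for i, w in enumerate(doc):
            k = zs[i]
            # Remove the token's current assignment from all counts.
            doc_topic[k] -= 1
            local_tw[k][w] -= 1
            topic_total[k] -= 1
            # Collapsed Gibbs conditional for each candidate topic.
            weights = [
                (doc_topic[t] + alpha) * (local_tw[t][w] + beta)
                / (topic_total[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            zs[i] = k
            doc_topic[k] += 1
            local_tw[k][w] += 1
            topic_total[k] += 1
    # Emit only the nonzero differences from the shared snapshot.
    delta = []
    for t in range(num_topics):
        changes = {w: n - global_tw[t][w] for w, n in local_tw[t].items()}
        delta.append({w: d for w, d in changes.items() if d})
    return delta


def reduce_counts(global_tw, deltas):
    """Reduce step (assumed): add every shard's count changes to the global counts."""
    for delta in deltas:
        for t, changes in enumerate(delta):
            for w, d in changes.items():
                global_tw[t][w] += d


def run_parallel_lda(docs, num_topics, vocab_size, num_shards=2, iters=5,
                     alpha=0.1, beta=0.01, seed=0):
    """Driver: alternate map sweeps over shards with a reduce that merges counts."""
    rng = random.Random(seed)
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    global_tw = count_topic_word(docs, z, num_topics)
    shards = [range(s, len(docs), num_shards) for s in range(num_shards)]
    for _ in range(iters):
        # Every mapper sees the same snapshot; counts change only in the reduce.
        deltas = [
            map_shard([docs[i] for i in idx], [z[i] for i in idx], global_tw,
                      num_topics, vocab_size, alpha, beta, rng)
            for idx in shards
        ]
        reduce_counts(global_tw, deltas)
    return z, global_tw
```

Documents are lists of integer word ids, for example `run_parallel_lda([[0, 1, 2, 0], [3, 4, 3]], num_topics=2, vocab_size=5)`. Because each mapper samples against counts that are slightly stale with respect to the other shards, this scheme is an approximation to sequential Gibbs sampling; the trade-off is that the per-iteration work splits across mappers, which is what makes near-linear speedup plausible on large corpora.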