A parallel belief propagation algorithm based on OpenMP is proposed to rapidly infer the parameters of the latent Dirichlet allocation (LDA) topic model via shared memory on a multi-core server, establishing the connection between latent topics and the surface words of documents. Using the Enron and Wikipedia datasets, three groups of experiments compare the traditional Gibbs sampling algorithm with the parallel belief propagation algorithm. The results show that the parallel belief propagation algorithm infers LDA parameters quickly, processes large-scale data efficiently, and achieves higher accuracy than the traditional Gibbs sampling algorithm.
Probabilistic topic models such as Latent Dirichlet Allocation (LDA) are widely employed in many fields, including document topic detection and automatic document summarization. To learn the parameters of the LDA model, a parallel Belief Propagation (BP) algorithm is designed and implemented. Running on a multi-core server in a shared-memory fashion, the algorithm infers LDA parameters to uncover the relationships between latent topics and the words within documents. Experimental results on the Enron and Wikipedia datasets confirm that the proposed parallel BP algorithm can process large-scale data efficiently and achieves much better accuracy, in terms of perplexity, than the traditional Gibbs Sampling (GS) algorithm.