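The Replicated Softmax model is an undirected probabilistic topic model for text data mining that provides a powerful framework for describing the topic distribution of a corpus. However, because it is an undirected probabilistic graphical model, the presence of the normalizing constant makes its parameters very difficult to learn. To address this problem, we use a Markov chain Monte Carlo sampling method based on tempered transitions, together with the idea of approximate maximum likelihood learning, to learn the model parameters. The algorithm uses Markov chain Monte Carlo sampling based on tempered transitions to efficiently explore probability distributions with many isolated modes and to approximate those distributions more closely, which improves both the efficiency and the accuracy of parameter learning. Experimental results demonstrate the algorithm's advantages in three respects: training time, generalization ability, and document retrieval.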
The Replicated Softmax model, an undirected topic model for text data mining, provides a powerful framework for extracting semantic topics from document collections. Compared to directed topic models, it handles documents of different lengths better, and computing the posterior distribution over the latent topic values is easy. However, due to the presence of the global normalizing constant, exact maximum likelihood learning for this model is intractable. The Contrastive Divergence (CD) algorithm is one of the dominant learning schemes for restricted Boltzmann machines (RBMs) based on Markov chain Monte Carlo (MCMC) methods. It approximates the negative phase contribution to the gradient with samples drawn from a short alternating Gibbs Markov chain that starts from the observed training sample. However, using these short chains yields a low-variance but biased estimate of the gradient, which makes the learning procedure rather slow. The main problem is the inability of the Markov chain to efficiently explore distributions with many isolated modes. In this paper, a new class of stochastic approximation algorithms is considered for learning the Replicated Softmax model. To efficiently explore highly multimodal distributions, we use an MCMC sampling scheme based on tempered transitions to generate sample states of a thermodynamic system. The tempered transitions move systematically from the desired distribution to an easily sampled distribution and back to the desired distribution. This allows the Markov chain to produce less correlated samples between successive parameter updates, and hence considerably improves the parameter estimates. The experiments are conducted on three popular text datasets, and the results demonstrate that we can successfully learn a good generative model of real text data that performs well on topic modelling and document retrieval.
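To make the idea of a tempered transition concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes a toy one-dimensional target with two well-separated Gaussian modes in place of the Replicated Softmax model, a geometric ladder of inverse temperatures, and a random-walk Metropolis kernel at each temperature. The names `log_p`, `metropolis_step`, `tempered_transition`, and `betas` are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_p(x):
    # Unnormalized log-density of a toy bimodal target (assumption):
    # two Gaussian modes at -4 and +4 with standard deviation 0.5.
    return np.logaddexp(-0.5 * ((x - 4.0) / 0.5) ** 2,
                        -0.5 * ((x + 4.0) / 0.5) ** 2)

def metropolis_step(x, beta, rng, step=1.0):
    # One random-walk Metropolis step that leaves p(x)^beta invariant.
    # The proposal is widened at high temperature (small beta).
    prop = x + (step / np.sqrt(beta)) * rng.normal()
    if np.log(rng.uniform()) < beta * (log_p(prop) - log_p(x)):
        return prop
    return x

def tempered_transition(x, betas, rng):
    # betas[0] must be 1.0 (the desired distribution); later entries
    # decrease toward a flatter, easily sampled distribution.
    log_ratio = 0.0
    state = x
    # Upward pass: desired distribution -> easily sampled distribution.
    for i in range(1, len(betas)):
        log_ratio += (betas[i] - betas[i - 1]) * log_p(state)
        state = metropolis_step(state, betas[i], rng)
    # Downward pass: back to the desired distribution.
    for i in range(len(betas) - 1, 0, -1):
        state = metropolis_step(state, betas[i], rng)
        log_ratio += (betas[i - 1] - betas[i]) * log_p(state)
    # Accept or reject the whole trajectory as a single proposal.
    if np.log(rng.uniform()) < log_ratio:
        return state, True
    return x, False

rng = np.random.default_rng(0)
betas = np.geomspace(1.0, 0.01, 20)  # temperature ladder (assumption)
x = 4.0
samples = []
for _ in range(200):
    x, _accepted = tempered_transition(x, betas, rng)
    samples.append(x)
```

Because the Metropolis kernel is reversible, the same kernel serves in both passes. The whole up-and-down trajectory is treated as one proposal, so a rejected trajectory leaves the chain at its starting state. In a learning setting such as the one described above, the resulting states would stand in for the negative-phase samples, with the Gibbs updates of the model replacing the toy Metropolis kernel.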