The arrival of the big data era has caused an explosion in the volume of text data, so accurately and efficiently identifying and analyzing the latent structure of text data has become increasingly important. Mining patterns and knowledge from such massive data requires powerful computational tools, which is why machine learning researchers proposed probabilistic topic models. Classical probabilistic topic models, represented by the Latent Dirichlet Allocation (LDA) model, have been widely applied across data mining. However, LDA is poor at distinguishing similar topics, which limits its practical performance. To address this problem, this thesis proposes a new model based on LDA, named NRLDA. Documents on similar topics contain many noise words that contribute nothing to telling those topics apart, so NRLDA introduces indicator variables to separate useful words from noise words: noise words are generated from the word distribution of a single noise topic, while useful words are generated from the word distributions of the feature topics, which weakens the harmful influence of noise words. In addition, we use Gibbs sampling to infer NRLDA's parameters, which are essential for analyzing the latent structure of text data. Experimental results show that NRLDA has a much stronger ability to distinguish similar topics, which also validates the soundness of our modeling idea.
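For concreteness, the following is a minimal sketch of one plausible reading of the generative process described above, written in Python with NumPy. All symbol names (theta, lam, phi, phi_noise) and all hyper-parameter values are illustrative assumptions, not the specification actually used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and priors (illustrative values, not taken from the thesis).
K, V, D, N_d = 5, 1000, 20, 50    # feature topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01           # Dirichlet priors on topic mixtures and word distributions
gamma = (1.0, 1.0)                # Beta prior on the per-document noise/content switch

# Word distributions: K feature topics plus one shared noise topic.
phi = rng.dirichlet(np.full(V, beta), size=K)   # feature-topic word distributions
phi_noise = rng.dirichlet(np.full(V, beta))     # noise-topic word distribution

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))    # per-document mixture over feature topics
    lam = rng.beta(*gamma)                      # probability that a word is a content word
    doc = []
    for _ in range(N_d):
        if rng.random() < lam:                  # indicator variable says: content word
            z = rng.choice(K, p=theta)          # choose a feature topic ...
            w = rng.choice(V, p=phi[z])         # ... and draw the word from its distribution
        else:                                   # indicator variable says: noise word
            w = rng.choice(V, p=phi_noise)      # draw the word from the shared noise topic
        doc.append(w)
    corpus.append(doc)
```

Because the indicator variable routes uninformative words to a single shared distribution, the feature topics are left to model only the words that actually discriminate between topics.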
With the arrival of the big data era, recognizing and analyzing the hidden structure of text data accurately and efficiently has become more and more important. Powerful computational tools are needed to understand text data better, and probabilistic topic models, especially the Latent Dirichlet Allocation (LDA) model, have been proposed and widely applied in machine learning and text mining. However, the LDA model is poor at distinguishing similar topics, which harms its practical performance. To solve this important problem, a new topic model named Noise Reduction Latent Dirichlet Allocation (NRLDA) is proposed on the basis of LDA. Documents on similar topics contain many noise words that make no contribution to discriminating those topics, so NRLDA takes this phenomenon into account by introducing new variables that distinguish the generative processes of noise words and non-noise words, which is beyond LDA's ability. Besides, a Gibbs sampler is developed to infer NRLDA's parameters, which are critical to investigating the structure of a text corpus. Experimental results show that the NRLDA model has a much stronger ability to differentiate similar topics, which demonstrates that the idea behind our model is reasonable.
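As a rough illustration of how such parameters could be inferred, below is a sketch of a collapsed Gibbs sampler for an LDA variant with one shared noise topic, in the spirit of the model described above. The update formulas follow standard collapsed-LDA derivations with an added Bernoulli indicator; every name, prior value, and function here is an assumption made for illustration, not the sampler actually derived in the thesis.

```python
import numpy as np

def gibbs_nrlda_sketch(docs, K, V, n_iter=200, alpha=0.1, beta=0.01, gamma=(1.0, 1.0), seed=0):
    """Collapsed Gibbs sampler for an LDA variant with one shared noise topic.
    docs: list of documents, each a list of word ids in [0, V). Names and priors are illustrative."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    # Count tables (the current token is always removed before it is re-sampled).
    n_dk = np.zeros((D, K))          # content tokens in doc d assigned to feature topic k
    n_kw = np.zeros((K, V))          # occurrences of word w assigned to feature topic k
    n_k = np.zeros(K)                # total tokens per feature topic
    n_noise_w = np.zeros(V)          # occurrences of word w assigned to the noise topic
    n_noise = 0                      # total noise tokens
    c_d = np.zeros((D, 2))           # per-document counts of (noise, content) tokens

    # Random initialisation of the indicator x and topic z for every token.
    assign = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            x, z = rng.integers(2), rng.integers(K)   # x: 0 = noise word, 1 = content word
            if x == 1:
                n_dk[d, z] += 1; n_kw[z, w] += 1; n_k[z] += 1
            else:
                n_noise_w[w] += 1; n_noise += 1
            c_d[d, x] += 1
            doc_assign.append([x, z])
        assign.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                x, z = assign[d][i]
                # Remove the current token from all counts.
                c_d[d, x] -= 1
                if x == 1:
                    n_dk[d, z] -= 1; n_kw[z, w] -= 1; n_k[z] -= 1
                else:
                    n_noise_w[w] -= 1; n_noise -= 1
                # Unnormalised posterior over (content word, topic k) for each feature topic k ...
                p_topics = (
                    (c_d[d, 1] + gamma[1])
                    * (n_dk[d] + alpha) / (c_d[d, 1] + K * alpha)
                    * (n_kw[:, w] + beta) / (n_k + V * beta)
                )
                # ... and over (noise word, shared noise topic).
                p_noise = (c_d[d, 0] + gamma[0]) * (n_noise_w[w] + beta) / (n_noise + V * beta)
                p = np.append(p_topics, p_noise)
                choice = rng.choice(K + 1, p=p / p.sum())
                # Add the token back under its new assignment.
                if choice < K:
                    x, z = 1, choice
                    n_dk[d, z] += 1; n_kw[z, w] += 1; n_k[z] += 1
                else:
                    x, z = 0, 0
                    n_noise_w[w] += 1; n_noise += 1
                c_d[d, x] += 1
                assign[d][i] = [x, z]

    # Posterior mean estimates of the word distributions and topic mixtures.
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    phi_noise = (n_noise_w + beta) / (n_noise + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return theta, phi, phi_noise
```

At each sweep the sampler jointly re-draws the indicator and the topic for every token, so words that do not help discriminate topics tend to migrate to the shared noise topic; the returned theta, phi, and phi_noise estimate the per-document topic mixtures, the feature-topic word distributions, and the noise-topic word distribution, respectively.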