The arrival of the big data era has caused an explosion in the volume of text data, so accurately and efficiently identifying and analyzing the latent structure of text data has become increasingly important. Mining patterns and knowledge from such massive data requires powerful computational tools, which is why machine learning researchers proposed probabilistic topic models. Classical probabilistic topic models, represented by the Latent Dirichlet Allocation (LDA) model, have been widely applied across data mining. However, LDA is poor at distinguishing similar topics, which limits its practical performance. To address this problem, this thesis proposes a new model based on LDA, named NRLDA. Documents on similar topics contain many noise words that contribute nothing to telling those topics apart, so NRLDA introduces indicator variables to separate useful words from noise words: noise words are generated from the word distribution of a single noise topic, while useful words are generated from the word distributions of the feature topics, which weakens the harmful influence of noise words. In addition, we use Gibbs sampling to infer NRLDA's parameters, which are essential for analyzing the latent structure of text data. Experimental results show that NRLDA has a much stronger ability to distinguish similar topics, which also validates the soundness of our modeling idea.
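For concreteness, the following is a minimal sketch of one plausible reading of the generative process described above, written in Python with NumPy. All symbol names (theta, lam, phi, phi_noise) and all hyper-parameter values are illustrative assumptions, not the specification actually used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and priors (illustrative values, not taken from the thesis).
K, V, D, N_d = 5, 1000, 20, 50    # feature topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01           # Dirichlet priors on topic mixtures and word distributions
gamma = (1.0, 1.0)                # Beta prior on the per-document noise/content switch

# Word distributions: K feature topics plus one shared noise topic.
phi = rng.dirichlet(np.full(V, beta), size=K)   # feature-topic word distributions
phi_noise = rng.dirichlet(np.full(V, beta))     # noise-topic word distribution

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))    # per-document mixture over feature topics
    lam = rng.beta(*gamma)                      # probability that a word is a content word
    doc = []
    for _ in range(N_d):
        if rng.random() < lam:                  # indicator variable says: content word
            z = rng.choice(K, p=theta)          # choose a feature topic ...
            w = rng.choice(V, p=phi[z])         # ... and draw the word from its distribution
        else:                                   # indicator variable says: noise word
            w = rng.choice(V, p=phi_noise)      # draw the word from the shared noise topic
        doc.append(w)
    corpus.append(doc)
```

Because the indicator variable routes uninformative words to a single shared distribution, the feature topics are left to model only the words that actually discriminate between topics.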
With the arrival of the big data era, recognizing and analyzing the hidden structure of text data accurately and efficiently has become more and more important. Powerful computational tools are needed to understand text data better, and probabilistic topic models, especially the Latent Dirichlet Allocation (LDA) model, have been proposed and widely applied in machine learning and text mining. However, the LDA model is poor at distinguishing similar topics, which harms its practical performance. To solve this important problem, a new topic model named Noise Reduction Latent Dirichlet Allocation (NRLDA) is proposed on the basis of LDA. Documents on similar topics contain many noise words that make no contribution to discriminating those topics, so NRLDA takes this phenomenon into account by introducing new variables that distinguish the generative processes of noise words and non-noise words, which is beyond LDA's ability. Besides, a Gibbs sampler is developed to infer NRLDA's parameters, which are critical to investigating the structure of a text corpus. Experimental results show that the NRLDA model has a much stronger ability to differentiate similar topics, which demonstrates that the idea behind our model is reasonable.
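As a rough illustration of how such parameters could be inferred, below is a sketch of a collapsed Gibbs sampler for an LDA variant with one shared noise topic, in the spirit of the model described above. The update formulas follow standard collapsed-LDA derivations with an added Bernoulli indicator; every name, prior value, and function here is an assumption made for illustration, not the sampler actually derived in the thesis.

```python
import numpy as np

def gibbs_nrlda_sketch(docs, K, V, n_iter=200, alpha=0.1, beta=0.01, gamma=(1.0, 1.0), seed=0):
    """Collapsed Gibbs sampler for an LDA variant with one shared noise topic.
    docs: list of documents, each a list of word ids in [0, V). Names and priors are illustrative."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    # Count tables (the current token is always removed before it is re-sampled).
    n_dk = np.zeros((D, K))          # content tokens in doc d assigned to feature topic k
    n_kw = np.zeros((K, V))          # occurrences of word w assigned to feature topic k
    n_k = np.zeros(K)                # total tokens per feature topic
    n_noise_w = np.zeros(V)          # occurrences of word w assigned to the noise topic
    n_noise = 0                      # total noise tokens
    c_d = np.zeros((D, 2))           # per-document counts of (noise, content) tokens

    # Random initialisation of the indicator x and topic z for every token.
    assign = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            x, z = rng.integers(2), rng.integers(K)   # x: 0 = noise word, 1 = content word
            if x == 1:
                n_dk[d, z] += 1; n_kw[z, w] += 1; n_k[z] += 1
            else:
                n_noise_w[w] += 1; n_noise += 1
            c_d[d, x] += 1
            doc_assign.append([x, z])
        assign.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                x, z = assign[d][i]
                # Remove the current token from all counts.
                c_d[d, x] -= 1
                if x == 1:
                    n_dk[d, z] -= 1; n_kw[z, w] -= 1; n_k[z] -= 1
                else:
                    n_noise_w[w] -= 1; n_noise -= 1
                # Unnormalised posterior over (content word, topic k) for each feature topic k ...
                p_topics = (
                    (c_d[d, 1] + gamma[1])
                    * (n_dk[d] + alpha) / (c_d[d, 1] + K * alpha)
                    * (n_kw[:, w] + beta) / (n_k + V * beta)
                )
                # ... and over (noise word, shared noise topic).
                p_noise = (c_d[d, 0] + gamma[0]) * (n_noise_w[w] + beta) / (n_noise + V * beta)
                p = np.append(p_topics, p_noise)
                choice = rng.choice(K + 1, p=p / p.sum())
                # Add the token back under its new assignment.
                if choice < K:
                    x, z = 1, choice
                    n_dk[d, z] += 1; n_kw[z, w] += 1; n_k[z] += 1
                else:
                    x, z = 0, 0
                    n_noise_w[w] += 1; n_noise += 1
                c_d[d, x] += 1
                assign[d][i] = [x, z]

    # Posterior mean estimates of the word distributions and topic mixtures.
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    phi_noise = (n_noise_w + beta) / (n_noise + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return theta, phi, phi_noise
```

At each sweep the sampler jointly re-draws the indicator and the topic for every token, so words that do not help discriminate topics tend to migrate to the shared noise topic; the returned theta, phi, and phi_noise estimate the per-document topic mixtures, the feature-topic word distributions, and the noise-topic word distribution, respectively.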