Topic models have attracted increasing attention in natural language processing. In this field, a topic is regarded as a probability distribution over terms. Topic models extract sets of semantically related topics from document-level term co-occurrence information, and can map documents from the term space into the topic space, yielding low-dimensional representations of documents. Starting from Latent Semantic Indexing (LSI), the origin of topic models, this paper introduces and analyzes Probabilistic Latent Semantic Indexing (pLSI), LDA, and other milestone works in the development of topic models, with an emphasis on the relationships among them. As a probabilistic generative model, LDA can easily be extended into other probabilistic models. This paper gives a rough categorization of the models derived from LDA and briefly introduces representative models of each category. The two most important sets of parameters in a topic model are the per-topic term distributions and the per-document topic distributions; the paper analyzes the use of the expectation-maximization (EM) algorithm for estimating these parameters, which helps to give a deeper understanding of the connections among the works in the development of topic models.
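As a minimal sketch of the quantities referred to above (the notation $K$, $D$, $\alpha$, $\beta$, $\varphi_k$, $\theta_d$, $z_{d,n}$, $w_{d,n}$ is introduced here for illustration and is not taken from the paper), the LDA generative process and the two parameter sets can be written as:

\begin{align*}
  \varphi_k &\sim \mathrm{Dirichlet}(\beta), && k = 1,\dots,K && \text{(per-topic term distributions)}\\
  \theta_d  &\sim \mathrm{Dirichlet}(\alpha), && d = 1,\dots,D && \text{(per-document topic distributions)}\\
  z_{d,n}   &\sim \mathrm{Multinomial}(\theta_d), && && \text{(topic assignment of token $n$ in document $d$)}\\
  w_{d,n}   &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}), && && \text{(observed term)}
\end{align*}

Under this view, EM-style estimation alternates between inferring the latent topic assignments $z_{d,n}$ given the current parameters (E-step) and re-estimating $\{\varphi_k\}$ and $\{\theta_d\}$ (M-step); in pLSI this is plain EM, while the Dirichlet priors in LDA call for approximate variants such as variational EM.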