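This paper presents the characteristics of, and important recent advances in, topic modeling of large-scale document network data. Topic modeling methods have attracted wide interest worldwide and have driven the development of many important data mining, computer vision, and computational biology applications, including automatic text summarization, information retrieval, information recommendation, topic detection and tracking, natural scene understanding, human action recognition, and microarray gene expression analysis. The paper focuses on the four major characteristics of document network data and the corresponding topic models. Document network data contain dynamic, higher-order, multiplex, and distributed structures, whereas earlier topic models capture only some of these structures. The paper discusses modeling all four structural characteristics in a unified way within the framework of a three-dimensional Markov model, and analyzes the feasibility of combining three-dimensional Markov models with type-2 fuzzy systems for distributed computing with words and for topic modeling applications. Beyond structure modeling of document network data, it also discusses several machine learning algorithms for energy minimization in three-dimensional Markov models.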
This paper reviews important advances made over the past decade in topic modeling of large-scale document network data. Interest in topic modeling is worldwide, and the technique underpins a number of practical text mining, computer vision, and computational biology systems used for text summarization, information retrieval, information recommendation, topic detection and tracking, natural scene understanding, human motion categorization, and microarray gene expression analysis. The main focus of this review is on recent advances in topic modeling techniques for document network data. We introduce the four major characteristics of document network data and the current state-of-the-art topic models, describing what they are, what has been accomplished, and what remains to be done. Document network data contain dynamic, higher-order, multiplex, and distributed structures. Prior work on topic models has focused on modeling only some of these structures for topic detection and tracking. To handle all four document network structures, we discuss a three-dimensional Markov model that captures dynamic, higher-order, multiplex, and distributed structures within a unified framework. We also discuss the integration of three-dimensional Markov models with type-2 fuzzy logic systems for distributed computing with words. Besides document network structure modeling, we discuss inference and parameter estimation methods for three-dimensional Markov models, formulated as energy minimization.