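In recent years, topic detection has been widely applied in text mining and natural language processing, and structural modeling of topics is its foundation. To model multi-granularity topics in text streams, we propose a topic structure model based on a semantic hierarchy tree. The model exploits the characteristics of domain ontologies by mapping topics one-to-one to the ontology. Drawing on probability theory, it represents the concepts in the concept set as leaf nodes of the topic tree, and every node in a layer is a multinomial distribution over the nodes in the layer below, which makes the model better suited to describing the multi-granularity topic structure of text streams. To facilitate construction of the spatial structure of topics, we propose methods for computing topic similarity and event relevance. At the end of the paper, we design experiments that construct topic trees on real news text-stream data. The experimental results show that the structure model captures the rich multi-granularity spatial semantic features of topics.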
Topic detection has been widely used in text mining and NLP, and topic structure modeling is its basis. In this paper, we propose a semantic hierarchical topic structure model to describe multi-granularity topic structure. The model exploits the characteristics of domain ontology, with each concept in the ontology mapped to a topic. The concepts in the concept list are represented as leaf nodes of the topic tree, and each node in a layer can be treated as a multinomial mixture distribution over the nodes in the layer below. This structure adapts readily to the multi-granularity topic structure of real-world text streams. Experiments show that the structure model reflects the rich multi-granularity semantic features of topics.
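
As a rough illustration of the tree described above, the following minimal Python sketch stores, at every internal node, a multinomial distribution over its children and pushes probability mass down to the leaf concepts. The names TopicNode, validate, and leaf_distribution, as well as the toy topics and weights, are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch covers neither the topic similarity nor the event relevance computation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TopicNode:
    """Hypothetical topic-tree node; leaves stand for ontology concepts."""
    name: str
    # Each entry pairs a child node with its multinomial weight.
    children: List[Tuple["TopicNode", float]] = field(default_factory=list)

    def add_child(self, child: "TopicNode", weight: float) -> "TopicNode":
        self.children.append((child, weight))
        return child

    def is_leaf(self) -> bool:
        return not self.children


def validate(node: TopicNode, tol: float = 1e-9) -> None:
    """Check that the child weights of every internal node sum to 1."""
    if node.is_leaf():
        return
    total = sum(weight for _, weight in node.children)
    if abs(total - 1.0) > tol:
        raise ValueError(f"weights under {node.name!r} sum to {total}, not 1")
    for child, _ in node.children:
        validate(child, tol)


def leaf_distribution(node: TopicNode, mass: float = 1.0) -> Dict[str, float]:
    """Multiply weights along each root-to-leaf path to get leaf probabilities."""
    if node.is_leaf():
        return {node.name: mass}
    result: Dict[str, float] = {}
    for child, weight in node.children:
        for leaf, p in leaf_distribution(child, mass * weight).items():
            result[leaf] = result.get(leaf, 0.0) + p
    return result


# Toy example with made-up topics and weights (assumptions, not paper data).
root = TopicNode("sports")
football = root.add_child(TopicNode("football"), 0.6)
tennis = root.add_child(TopicNode("tennis"), 0.4)
football.add_child(TopicNode("goal"), 0.7)
football.add_child(TopicNode("referee"), 0.3)
tennis.add_child(TopicNode("serve"), 1.0)

validate(root)
dist = leaf_distribution(root)
```

In this toy tree, the coarse topic "sports" induces a distribution over the leaf concepts "goal", "referee", and "serve", while the intermediate nodes "football" and "tennis" give the same content at a finer granularity. Keeping weights on the parent-to-child edges is one possible design choice; it makes each layer a distribution over the layer below, as the abstract describes.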