Topic models have been widely used to discover the latent topics of documents. Most existing methods represent topics with words or phrases; however, the resulting topics lack deep semantic information and are poorly interpretable. This paper proposes representing topics with structured events. On the one hand, events carry richer semantics than words or phrases; on the other hand, a set of related events can explain and distinguish different topics more reasonably. Using events as the basic unit, however, introduces a sparsity problem that makes topic sampling harder. To address it, we propose two topic models based on the Biterm Topic Model (BTM) that incorporate semantic knowledge about events into the topic generation process in two different ways. The first model exploits the natural clustering effect of the Generalized Pólya Urn (GPU) model to increase the probability that semantically similar events are assigned to the same topic. The second model introduces an indicator variable for each biterm and uses semantic knowledge to resolve the topic assignment of the two events within the same biterm. We evaluate the models directly with two metrics, topic coherence and KL divergence, and also externally by using the resulting topic representations in a text classification task. The experimental results show that the proposed models effectively alleviate event sparsity at two levels: co-occurrence and semantic relatedness. Compared with topic representations based on words or phrases, the semantic information contained in event structures improves topic quality and makes the topic representations more readable and more discriminative.
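To make the first model's idea concrete, the following is a minimal sketch of how a Generalized Pólya Urn style update might interact with biterms of events. It is an illustration under stated assumptions, not the paper's implementation: events are represented as hypothetical (subject, predicate, object) tuples, the `related` mapping stands in for whatever semantic similarity resource is used, the `promotion` weight of 0.5 is arbitrary, and all biterms are placed in a single topic instead of being sampled with Gibbs sampling.

```python
from collections import defaultdict
from itertools import combinations


def extract_biterms(events):
    """Return every unordered pair of events from one short document."""
    return list(combinations(events, 2))


def gpu_increment(topic_event_counts, topic, event, related, promotion=0.5):
    """GPU-style count update (illustrative).

    Assigning `event` to `topic` adds a full count for the event itself and
    a fractional `promotion` count for each semantically related event, so
    similar events become more likely to share the topic.
    """
    topic_event_counts[topic][event] += 1.0
    for other in related.get(event, ()):
        topic_event_counts[topic][other] += promotion


# Hypothetical events for one document, as (subject, predicate, object).
doc = [
    ("company", "acquire", "startup"),
    ("company", "buy", "firm"),
    ("court", "approve", "merger"),
]

# Assumed semantic relatedness: the first two events are near-synonymous.
related = {doc[0]: [doc[1]], doc[1]: [doc[0]]}

# counts[topic][event] holds the (possibly fractional) topic-event counts.
counts = defaultdict(lambda: defaultdict(float))

biterms = extract_biterms(doc)
for first, second in biterms:
    # For illustration only, every biterm is assigned to topic 0.
    gpu_increment(counts, 0, first, related)
    gpu_increment(counts, 0, second, related)
```

In this toy run the two related events each accumulate a higher count in topic 0 than the unrelated third event, even though all three events occur in the same number of biterms. That is the promotion effect the GPU scheme relies on to counter sparsity.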