In the era of web-scale big data, research on topic evolution in text streams is mostly based on classical probabilistic topic models. Their bag-of-words assumption leads to a loss of topic semantics, and they analyze evolution in batch (retrospective) mode. To address both problems, an online incremental topic evolution algorithm based on feature ontology was proposed. Firstly, a feature ontology was built from word co-occurrence and the general-purpose ontology base WordNet, and topics in the text stream were modeled with it. Secondly, a text stream topic matrix construction algorithm was proposed to realize online incremental topic evolution analysis. Finally, based on this matrix, an algorithm for constructing the text stream topic ontology evolution graph was proposed; topic similarity was computed from the sub-graph similarity of the feature ontologies, yielding the evolution patterns of topics in the text stream over time. Experiments on scientific literature showed that the proposed algorithm reduced time complexity to O(nK + N), outperforming classical probabilistic topic evolution models in this respect, while its satisfaction was comparable to that of the traditional sliding-window-based online Latent Dirichlet Allocation (LDA) model. Because the proposed method introduces ontology and annotates semantic relations, it can present the semantic features of topics graphically and build the topic evolution graph online and incrementally on that basis, which gives it advantages in semantic interpretability and topic visualization.
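The idea of linking topics across time slices by sub-graph similarity can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the abstract does not specify the similarity measure, so the sketch assumes Jaccard overlap of co-occurrence edges, omits the WordNet semantic-relation labels, and uses hypothetical names (`build_cooccurrence_graph`, `edge_jaccard`, `link_topics`) and hypothetical thresholds (`min_count`, `threshold`).

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(docs, min_count=2):
    """Build a simple topic graph from tokenized documents.

    An edge (a, b), with a < b, is kept when the two words co-occur in at
    least `min_count` documents. This stands in for the feature ontology;
    WordNet relation labels are deliberately left out (assumption).
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered word pair once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}


def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of two edge sets (assumed similarity measure)."""
    if not edges_a and not edges_b:
        return 0.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


def link_topics(prev_topics, curr_topics, threshold=0.3):
    """Connect topics of the previous slice to similar topics of the current one.

    Only the previous and current slices are compared, so each new slice is
    processed incrementally. Returns (prev_index, curr_index, similarity).
    """
    links = []
    for i, prev in enumerate(prev_topics):
        for j, curr in enumerate(curr_topics):
            score = edge_jaccard(prev, curr)
            if score >= threshold:
                links.append((i, j, score))
    return links
```

Under these assumptions, each new time slice only needs its topic graphs compared with those of the preceding slice, which is what lets an evolution graph grow incrementally rather than being recomputed in batch.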