传统的主题检测方法以统计理论为基础,忽略了数据本身蕴涵的语义,带来了偏差严重、与样本数据高度相关等缺点。针对以上缺点,面向文本流数据,提出一种基于特征本体的主题检测方法。首先构建文本特征本体;其次,将较为复杂的文本特征本体看做是由若干主题组成的连通图,然后将主题连通图分解成单边图集合;再次,将主题相似度计算问题转换为单边图贡献度和图相似度的计算问题;最后,对每一批新文本集检测是否有新主题,从而使得主题的个数随着时间的推移而增加。在科技文献和新闻语料上进行实证研究,结果发现阈值δ参数决定文本流中新主题出现的频率,且实验结果同经典主题模型基本保持一致。除此之外,同传统的方法相比,提出的方法能更好地支持主题的语义表示,且适用于流数据,能增量实现主题检测,在应用上具有更大的优势。
Traditional topic detection methods are mainly based on statistical theory and ignore the semantics inherent in the data, which leads to shortcomings such as serious bias and a strong dependence on the sample data. To address these shortcomings, this paper proposes a topic detection approach for text streams based on a text feature ontology. First, a text feature ontology is built. Second, the relatively complex text feature ontology is viewed as a connected graph composed of several topics, and each topic's connected graph is decomposed into a set of single-edge graphs. Third, the topic similarity computation is cast as the computation of single-edge graph contribution and graph similarity. Finally, each new batch of texts is checked for new topics, so that the number of topics grows over time. An empirical study on scientific literature and news corpora shows that the threshold parameter δ determines how frequently new topics appear in the text stream, and that the results are largely consistent with those of classical topic models. In addition, compared with traditional methods, the proposed approach better supports the semantic representation of topics, is suitable for streaming data, and can detect topics incrementally (online), which gives it greater advantages in applications.
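
The incremental detection loop summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: edges are plain term co-occurrence pairs rather than relations from a real feature ontology, the "contribution" of a single-edge graph is taken to be its normalized weight, and graph similarity is the sum of the smaller contributions over shared edges. The function names (`single_edge_graphs`, `contributions`, `similarity`, `detect_new_topics`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def single_edge_graphs(terms):
    """Decompose a text into weighted single-edge graphs.

    Assumption: an edge is an unordered pair of co-occurring terms.
    A real feature ontology would supply semantically typed edges.
    """
    unique = sorted(set(terms))
    return Counter(frozenset(pair) for pair in combinations(unique, 2))


def contributions(edges):
    """Assumed contribution of each edge: its share of the total weight."""
    total = sum(edges.values())
    return {e: w / total for e, w in edges.items()} if total else {}


def similarity(a, b):
    """Assumed graph similarity: overlap of edge contributions, in [0, 1]."""
    ca, cb = contributions(a), contributions(b)
    return sum(min(ca[e], cb[e]) for e in ca.keys() & cb.keys())


def detect_new_topics(batch, topics, delta):
    """Process one batch of texts; return how many new topics were created.

    `topics` is a list of edge Counters and is updated in place, so the
    number of topics can only grow as more batches arrive.
    """
    created = 0
    for terms in batch:
        doc = single_edge_graphs(terms)
        scores = [similarity(doc, topic) for topic in topics]
        best = max(range(len(topics)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= delta:
            topics[best].update(doc)   # absorb the text into the closest topic
        else:
            topics.append(doc)         # nothing is similar enough: new topic
            created += 1
    return created
```

In this sketch a larger δ makes a text less likely to match an existing topic, so new topics appear more often, which mirrors the role of the threshold parameter reported in the abstract. Calling `detect_new_topics` once per arriving batch with the same `topics` list gives the incremental behaviour.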