A probabilistic topic model is a statistical generative model that automatically extracts a set of topics from a collection of documents and represents each document as a probabilistic mixture of those topics. The topics discovered in this way capture significant semantic information about the documents and have been applied widely in many fields. Building on the probabilistic topic model, this paper proposes a novel approach to hierarchical text categorization. The approach first extracts a set of topics using Gibbs sampling and then computes a topic-based similarity between each test document and each class. Experimental results on the 20 Newsgroups dataset show that the approach clearly outperforms a support vector machine classifier.
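The two steps named in the abstract, topic extraction by Gibbs sampling and topic-based similarity between a test document and each class, can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the topic model, the similarity measure, or how class representations are built. The sketch assumes LDA with a collapsed Gibbs sampler, cosine similarity, class profiles computed as the mean topic mixture of the training documents, and topics fitted on training and test documents together. It is also flat rather than hierarchical. The function names (`gibbs_lda`, `class_profiles`, `classify`), the hyperparameters, and the toy corpus are all hypothetical.

```python
import math
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (assumed model, not confirmed by the paper).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns one topic mixture (list of n_topics probabilities) per document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens of word w in topic k
    topic_total = [0] * n_topics                              # all tokens in topic k
    assign = []                                               # current topic of every token

    # Random initial topic assignment.
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the full conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic mixture.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_profiles(thetas, labels):
    """Average the topic mixtures of the training documents in each class."""
    grouped = {}
    for theta, label in zip(thetas, labels):
        grouped.setdefault(label, []).append(theta)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(theta, profiles):
    """Return the class whose topic profile is most similar to theta."""
    return max(profiles, key=lambda label: cosine(theta, profiles[label]))


# Toy corpus: word ids 0-2 occur only in "sports" documents, 3-5 only in "tech".
train_docs = [[0, 1, 2, 0, 1], [1, 2, 2, 0], [3, 4, 5, 3, 4], [4, 5, 5, 3]]
train_labels = ["sports", "sports", "tech", "tech"]
test_doc = [3, 5, 4, 4]

# Assumption: topics are fitted on the training and test documents together.
thetas = gibbs_lda(train_docs + [test_doc], n_topics=2, vocab_size=6)
profiles = class_profiles(thetas[:-1], train_labels)
print(classify(thetas[-1], profiles))
```

A hierarchical variant could apply the same `classify` step at each level of a class tree, descending into the best-matching child at every node. That extension is likewise an assumption rather than something the abstract spells out.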