Classical probabilistic topic models discover the latent topic information of documents through word co-occurrence and are widely used in text clustering and classification. In recent years, with the successful application of word embeddings and neural network models to natural language processing, neural-network-based text categorization has become the research mainstream, and the Convolutional Neural Network (CNN) has become one of the mainstream models for text categorization. This paper compares CNN with the probabilistic topic models PLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) on text topic categorization and shows the superiority of CNN on this task. On this basis, we use the CNN model to extract a feature vector for each document, which we name the Convolutional Semantic Feature (CSF). To make the document feature vector capture topic information better, the CSF and the latent topic vector of the document are each normalized to eliminate the difference in their magnitudes and are then fused, yielding a more effective document representation. Experimental results show that, compared with a probabilistic topic model or a CNN model alone, the new feature vector significantly improves the F1 score on text topic categorization.
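The normalize-then-fuse step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which normalization or fusion operator is used, so the L2 normalization and the concatenation below are assumptions, as are the function names `l2_normalize` and `fuse_features` and the toy vectors.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (assumed normalization scheme)."""
    v = np.asarray(v, dtype=float)
    # eps guards against division by zero for an all-zero vector.
    return v / max(np.linalg.norm(v), eps)


def fuse_features(csf, topic_vec):
    """Normalize each vector separately, then concatenate them.

    Concatenation is an assumed fusion operator; the abstract only says
    that the two vectors are combined after normalization.
    """
    return np.concatenate([l2_normalize(csf), l2_normalize(topic_vec)])


# Hypothetical inputs: a CNN feature vector with large magnitudes and a
# topic distribution (for example from PLSA or LDA) that sums to 1.
csf = np.array([12.0, -3.5, 7.25, 0.0])
topic_vec = np.array([0.7, 0.2, 0.1])

mixed = fuse_features(csf, topic_vec)
```

After normalization both parts have unit length, so neither dominates a downstream classifier purely because of its scale; the fused vector `mixed` here has dimension 4 + 3 = 7.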