Latent Dirichlet allocation (LDA) does not take the nature of its input into account: it assigns a topic label to every word in the original input space. Because words that play no useful role receive topic assignments as well, the estimated topic distribution becomes imprecise and classification accuracy drops. To address this shortcoming, this paper proposes LSI_LDA, a feature dimensionality reduction method that combines latent semantic indexing (LSI) with LDA. First, LSI maps the original term space into a latent semantic space. Second, the key features of the original feature set are selected according to their semantic relations. Finally, the LDA model is sampled and fitted on the resulting smaller, more pertinent subset of the original input. In text classification experiments on the Fudan University Chinese corpus, the proposed method improves classification accuracy by 1.50% over LDA alone. The experimental results indicate that LSI_LDA achieves better performance in text categorization.
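The three steps described in the abstract (LSI projection, feature selection, LDA on the reduced input) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how features are scored from their semantic relations, so the sketch assumes TF-IDF weighting followed by a truncated SVD, and ranks each term by the norm of its loading in the latent space. The function names `lsi_select_terms` and `lda_gibbs`, the toy vocabulary, and all hyperparameters are hypothetical, and the LDA step is a plain collapsed Gibbs sampler chosen for brevity.

```python
import numpy as np

def lsi_select_terms(X, k, n_keep):
    """Pick n_keep term indices from a term-document count matrix X (terms x docs).

    Assumed criterion: TF-IDF weighting, rank-k SVD (LSI), then score each
    term by the norm of its singular-value-weighted loading.
    """
    n_docs = X.shape[1]
    df = np.count_nonzero(X, axis=1)              # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))      # terms in every document get weight 0
    U, s, _ = np.linalg.svd(X * idf[:, None], full_matrices=False)
    k = min(k, len(s))
    scores = np.linalg.norm(U[:, :k] * s[:k], axis=1)
    return np.sort(np.argsort(-scores)[:n_keep])

def lda_gibbs(docs, n_terms, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of term ids.

    Returns the document-topic distribution theta (docs x topics).
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))         # document-topic counts
    nkw = np.zeros((n_topics, n_terms))           # topic-term counts
    nk = np.zeros(n_topics)                       # tokens per topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                       # remove the current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_terms * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                       # record the resampled topic
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    theta = ndk + alpha
    return theta / theta.sum(axis=1, keepdims=True)

# Toy term-document counts (hypothetical data): rows are terms, columns are documents.
vocab = ["stock", "market", "bank", "goal", "match", "team", "the", "misc"]
X = np.array([
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [1, 2, 3, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 2],
    [0, 0, 0, 1, 2, 3],
    [4, 4, 4, 4, 4, 4],   # occurs in every document, carries no topical signal
    [1, 0, 0, 0, 0, 0],   # rare noise term
])

kept = lsi_select_terms(X, k=2, n_keep=6)         # steps 1 and 2: LSI, then selection
X_red = X[kept]
# Step 3: rebuild each document as a token list over the reduced vocabulary.
docs = [np.repeat(np.arange(len(kept)), X_red[:, j]).tolist() for j in range(X.shape[1])]
theta = lda_gibbs(docs, n_terms=len(kept), n_topics=2)
print([vocab[i] for i in kept])
print(theta.round(2))
```

In a full pipeline the rows of `theta` would serve as low-dimensional document features for a classifier; the abstract does not name the classifier used, so that step is left out here.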