LSI_LDA: A Hybrid Feature Dimensionality Reduction Method

ISSN: 1001-3695
Journal: Application Research of Computers (《计算机应用研究》)
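Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: School of Software, Liaoning Technical University, Huludao, Liaoning 125105
Funding: National Natural Science Foundation of China, Young Scientists Fund (61401185); General Scientific Research Project of the Education Department of Liaoning Province (12013133)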

Keywords: text categorization, feature dimensionality reduction, latent semantic indexing (LSI), latent Dirichlet allocation (LDA)

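Abstract (translated from Chinese):

LDA does not take the characteristics of the input data into account: it assigns a topic label to every word in the original input space, and because it also assigns topics to non-contributing words, the resulting topic distribution is imprecise. To address this shortcoming, a feature dimensionality reduction method combining LSI and LDA is proposed. LSI is first used to map the original word space to a semantic space; the key features in the original feature set are then selected according to their semantic relations; finally, the LDA model is sampled and fitted on a smaller, more on-topic subset of documents. In text classification on the Fudan University Chinese corpus, the new method improves classification accuracy by 1.50% over the LDA model used alone. The experiments show that the proposed LSI_LDA model achieves better classification performance in text categorization.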

English abstract:

The LDA method does not take the input space into account effectively when assigning a topic label to each word in the original space. Because the original input contains non-action terms, the topic distribution is distorted and classification accuracy is reduced. To remedy this, the paper proposes a new LSI_LDA algorithm. First, an LSI model maps the input space to the latent semantic space. Second, the key features are extracted according to their semantic relations. Finally, the LDA model is run on a simpler and more pertinent space. On the Fudan University corpus, the proposed method improves classification accuracy by 1.50% over LDA alone. This experimental result shows that LSI_LDA performs better in text categorization.
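
The three steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how features are chosen "according to their semantic relations", so the sketch assumes tf-idf weighting and ranks each term by the norm of its coordinates in a rank-k latent space. The helper names (`tfidf_matrix`, `lsi_term_scores`, `lda_gibbs`), the toy corpus, and all parameter values are hypothetical. A small power iteration stands in for a real truncated SVD, and a basic collapsed Gibbs sampler stands in for a production LDA.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs, vocab):
    """Docs x terms matrix of tf * log(N / df); a term in every doc gets 0."""
    df = Counter(w for doc in docs for w in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[w] * math.log(len(docs) / df[w]) for w in vocab])
    return rows


def lsi_term_scores(A, k, iters=100, seed=0):
    """Assumed criterion: norm of each term's coordinates in a rank-k LSI space.

    The top-k right singular vectors come from power iteration on A^T A
    with deflation, a simple stand-in for a truncated SVD.
    """
    rng = random.Random(seed)
    n_terms = len(A[0])
    basis, sq = [], [0.0] * n_terms
    for _ in range(k):
        v = [rng.random() for _ in range(n_terms)]
        for _ in range(iters):
            Av = [sum(a * b for a, b in zip(row, v)) for row in A]
            w = [sum(row[j] * x for row, x in zip(A, Av)) for j in range(n_terms)]
            for u in basis:  # remove directions that were already extracted
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        Av = [sum(a * b for a, b in zip(row, v)) for row in A]
        sigma_sq = sum(x * x for x in Av)  # squared singular value
        basis.append(v)
        sq = [s + sigma_sq * x * x for s, x in zip(sq, v)]
    return [math.sqrt(s) for s in sq]


def lda_gibbs(docs, vocab_size, n_topics, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # words per topic
    z = []
    for d, doc in enumerate(docs):  # random initial topic assignments
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, t in zip(doc, z[d]):
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + vocab_size * beta) for j in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [[(ndk[d][j] + alpha) / (len(doc) + n_topics * alpha)
             for j in range(n_topics)] for d, doc in enumerate(docs)]


# Toy corpus (hypothetical): two themes plus a filler word in every document.
docs = [
    ["the", "team", "won", "the", "match", "goal"],
    ["the", "coach", "praised", "the", "team", "goal"],
    ["the", "bank", "raised", "the", "loan", "rate"],
    ["the", "market", "cut", "the", "loan", "rate"],
]
vocab = sorted({w for doc in docs for w in doc})

# Step 1: map terms into the latent semantic space and score them.
scores = lsi_term_scores(tfidf_matrix(docs, vocab), k=2)
# Step 2: keep the highest-scoring terms (8 is an arbitrary cut-off).
selected = {w for _, w in sorted(zip(scores, vocab), reverse=True)[:8]}
reduced = [[w for w in doc if w in selected] for doc in docs]
# Step 3: fit LDA on the reduced vocabulary only.
index = {w: i for i, w in enumerate(sorted(selected))}
theta = lda_gibbs([[index[w] for w in doc] for doc in reduced], len(index), n_topics=2)
```

On this toy corpus, `the` occurs in every document, so its tf-idf weight is zero and it is never selected. Which of the remaining terms survive depends entirely on the assumed scoring rule, and the rows of `theta` would normally be passed to a classifier as document features.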
