In text mining, the Vector Space Model (VSM) is commonly used to represent text features. The resulting text feature matrix has very high dimensionality, which makes subsequent processing computationally expensive, so the matrix should be reduced to a lower-dimensional space before mining. Latent Semantic Analysis (LSA), Concept Indexing (CI), Non-negative Matrix Factorization (NMF), and Random Projection (RP) are four effective dimension reduction methods. After analyzing the meaning of each method's reduced space and its computational complexity, the differences among the methods are compared through text clustering experiments. The experiments show that these methods not only reduce the dimensionality of the text feature space effectively, but also reveal, to varying degrees, the semantic relations between documents and terms, thereby improving the efficiency and accuracy of text mining.
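To make the idea concrete, the following is a minimal sketch of one of the methods named above, Latent Semantic Analysis, implemented as a truncated SVD of a term-document matrix. The toy matrix and the target rank k=2 are illustrative assumptions, not data or parameters from the paper.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents,
# entries = term-frequency counts (illustrative values only).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # target dimensionality of the reduced semantic space (assumed)

# LSA: factor A = U * diag(s) * Vt and keep only the top-k singular triples.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each column of docs_k is a document represented in the k-dimensional
# latent semantic space; similar documents move closer together here.
docs_k = np.diag(s[:k]) @ Vt[:k, :]
print(docs_k.shape)  # 4 documents, each reduced from 4 term dims to 2
```

Clustering would then run on the columns of `docs_k` instead of the original high-dimensional columns of `A`, which is what makes the downstream computation cheaper.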