This paper introduces the principles of a Chinese text classification system. For feature extraction, it combines the document frequency (DF) method with latent semantic analysis (LSA). The DF method is applied first to filter out terms with low DF values, which reduces the sparsity of the text matrix. LSA is then used to analyze the semantic relationships among words and to reduce the influence of synonyms and polysemous words. Together, the two steps improve the speed and accuracy of text classification. Experimental results show that this dimensionality reduction method achieves good results.
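The two-stage reduction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the documents have already been segmented into word tokens, and it uses raw term counts with a truncated SVD from NumPy as the LSA step. The function names (`df_filter`, `term_doc_matrix`, `lsa_project`), the `min_df` threshold, the latent dimension `k`, and the toy English tokens are all hypothetical choices made for the example.

```python
import numpy as np

def df_filter(docs, min_df=2):
    """Return the sorted terms that occur in at least min_df documents."""
    df = {}
    for tokens in docs:
        for term in set(tokens):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, n in df.items() if n >= min_df)

def term_doc_matrix(docs, vocab):
    """Build a term-by-document count matrix restricted to vocab."""
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(docs):
        for term in tokens:
            if term in index:             # terms removed by the DF filter are skipped
                X[index[term], j] += 1.0
    return X

def lsa_project(X, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    # Rows of the result are documents; columns are the k strongest latent dimensions.
    return (s[:k, None] * Vt[:k]).T

# Toy corpus of pre-segmented documents (assumed input format).
docs = [
    ["stock", "market", "rise"],
    ["shares", "market", "fall"],
    ["team", "game", "win"],
    ["team", "game", "loss"],
]

vocab = df_filter(docs, min_df=2)   # stage 1: drop low-DF terms
X = term_doc_matrix(docs, vocab)    # smaller term-document matrix
Z = lsa_project(X, k=2)             # stage 2: LSA document vectors
```

In this sketch the DF filter shrinks the vocabulary before the SVD is computed, which is where the speed benefit would come from. The rows of `Z` could then be passed to any classifier, which the abstract does not specify.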