The quality of a feature selection method directly affects text categorization performance. Traditional feature selection algorithms choose features globally, which ignores the influence of local features on classification and can even leave many training documents without any features. To address this, this paper builds on traditional methods that mainly consider the global features of the document collection, additionally takes into account the contribution rate of a word to an individual document, and, in combination with the ALOFT method, proposes a feature selection algorithm that unifies global and local information (GLFS). Experimental results on the Reuters and Fudan document collections show that the proposed algorithm guarantees that every document has a feature word while also improving classification performance. Finally, the paper discusses how feature weights are determined; after the feature weights are recalculated, classification performance improves considerably.
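As a rough illustration of the idea of combining a global score with a per-document contribution, the following is a minimal sketch and not the paper's GLFS formula, which the abstract does not state. It assumes document frequency as the global score and a term's share of the document's tokens as its local contribution; the function names `global_scores` and `select_features` and the toy documents are hypothetical. For each document it keeps the term with the highest combined score, so no document is left without a feature.

```python
from collections import Counter


def global_scores(docs):
    """Document frequency per term (an assumed stand-in for a global score)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def select_features(docs):
    """Keep, for each document, the term maximizing global score x local share."""
    g = global_scores(docs)
    selected = set()
    for doc in docs:
        if not doc:
            continue  # an empty document has no candidate terms
        tf = Counter(doc)
        n = len(doc)
        # Local contribution (assumed): fraction of the document's tokens
        # taken up by the term. Sorting makes tie-breaking deterministic.
        best = max(sorted(tf), key=lambda t: g[t] * tf[t] / n)
        selected.add(best)
    return selected


# Toy tokenized documents, purely illustrative.
docs = [
    ["oil", "price", "oil", "oil"],
    ["wheat", "grain", "wheat", "export"],
    ["market", "stock", "price"],
]
features = select_features(docs)
```

In this toy run a purely global ranking by document frequency would favor "price" alone, leaving the second document with no feature; the per-document pass also keeps "oil" and "wheat".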