The degree of independence between a feature and each document category in a text collection reflects how representative that feature is, and feature selection for text categorization is the process of choosing highly representative features that improve categorization performance. Based on this principle, this paper proposes two new feature selection methods for text categorization, DHChi2 and EIBA, and combines the two in a reasonable way. Experimental results show that applying independence theory to feature selection for text categorization helps improve categorization performance.
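The abstract does not define either proposed method, so the following is only a minimal sketch of the underlying idea: scoring a term by how far it departs from independence with a category. It uses the standard Pearson chi-square statistic on a 2x2 term/category contingency table as a stand-in, not the paper's own formulas. The function names `chi_square` and `rank_features`, and the choice to take the maximum score over categories, are assumptions made for illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 term/category table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A zero marginal means the table carries no dependence information.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_features(docs, labels):
    """Rank terms by their maximum chi-square score over all categories.

    docs is a list of token sets, labels the matching list of categories.
    Taking the maximum over categories is an illustrative assumption.
    """
    categories = set(labels)
    vocab = set().union(*docs) if docs else set()
    scores = {}
    for term in vocab:
        best = 0.0
        for cat in categories:
            n11 = n10 = n01 = n00 = 0
            for doc, label in zip(docs, labels):
                has_term = term in doc
                in_cat = label == cat
                if has_term and in_cat:
                    n11 += 1
                elif has_term:
                    n10 += 1
                elif in_cat:
                    n01 += 1
                else:
                    n00 += 1
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A term that appears in every document of one category and in none of the others receives the highest score, while a term spread evenly across categories scores zero, which matches the intuition that low independence from a category signals a representative feature.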