Traditional feature selection methods are poorly suited to skewed text corpora, in which the number of samples differs greatly between classes: most samples belong to a majority class and far fewer to a minority class. Because these methods select features without considering the relative distribution of each class, the vast majority of the selected features come from the majority class. The classifier therefore favors the majority class and tends to misclassify minority-class samples, which directly degrades classification performance. This paper first analyzes the characteristics of skewed text corpora and identifies two important factors that influence feature selection on such data: the category distribution and the category difference of a feature term. The category distribution factor reflects the difference in a term's category frequency across the whole dataset, and the category difference factor reflects the difference in a term's relative document frequency between classes. Based on these two factors, a new feature selection function that is particularly suited to skewed text categorization, Relative Category Difference (RCD), is constructed. Comparative experiments against traditional feature selection methods show that RCD achieves better results on skewed text categorization.
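The category difference factor can be illustrated with a small sketch. This is not the RCD function, whose definition is not given in this abstract. It only shows, under assumed names (`relative_doc_freq`, `category_difference`) and on an invented toy corpus, how normalizing document frequency by class size lets a minority-class term stand out even though its raw document frequency is low.

```python
from collections import Counter

def relative_doc_freq(docs, labels):
    """Return {label: {term: fraction of that class's documents containing the term}}."""
    class_sizes = Counter(labels)
    counts = {label: Counter() for label in class_sizes}
    for tokens, label in zip(docs, labels):
        counts[label].update(set(tokens))  # count each term once per document
    return {
        label: {term: n / class_sizes[label] for term, n in term_counts.items()}
        for label, term_counts in counts.items()
    }

def category_difference(docs, labels, major, minor):
    """Illustrative score only (an assumption, not RCD): the absolute gap in
    relative document frequency between two classes, per term."""
    rdf = relative_doc_freq(docs, labels)
    vocab = set(rdf[major]) | set(rdf[minor])
    return {t: abs(rdf[major].get(t, 0.0) - rdf[minor].get(t, 0.0)) for t in vocab}

# Hypothetical skewed corpus: 8 majority-class documents, 2 minority-class documents.
docs = (
    [["market", "stock"]] * 6
    + [["market", "bank"]] * 2
    + [["virus", "market"], ["virus", "vaccine"]]
)
labels = ["finance"] * 8 + ["health"] * 2

# Raw document frequency over the whole corpus, for comparison.
raw_df = Counter(term for tokens in docs for term in set(tokens))
scores = category_difference(docs, labels, "finance", "health")

for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term:8s} raw_df={raw_df[term]}  rel_diff={scores[term]:.2f}")
```

In this toy corpus, a ranking by raw document frequency places the majority-class terms first, while the class-normalized difference ranks the minority-class term "virus" highest. This is the kind of imbalance the two factors are meant to address.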