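To improve the accuracy and stability of text classification on unbalanced data, a joint feature selection method based on category weighting and variance statistics was proposed. First, considering how the number of documents in a category affects feature selection, a category-weighting strategy was given to strengthen the features of small categories. Second, building on an examination of how well features discriminate between categories, a category variance statistics strategy was designed to highlight features that carry rich category information. Finally, the two strategies were fused into a new joint feature selection algorithm. Experiments on two unbalanced corpora, Reuters-21578 and the Fudan University corpus, show that the algorithm is effective and, in particular, that its classification performance on small categories is far better than that of popular general-purpose algorithms such as IG, CHI and DFICF.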
To improve the accuracy and stability of text classification on unbalanced datasets, a feature selection method based on a category-weighting strategy and a variance statistics strategy was proposed. First, larger weights were assigned to rare categories, so that the features characterizing rare categories were strengthened and the performance on rare categories could be improved. Then, a variance statistics method was presented to highlight features that carry rich category information. Finally, based on the two strategies, a new feature selection algorithm combined with Information Gain (IG) and the χ²-statistic (CHI) was developed. Experiments on the Reuters-21578 corpus and the Fudan corpus, both unbalanced datasets, show that the new algorithm achieves better MicroF1 and MacroF1 than IG, CHI and DFICF.
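The two strategies summarized above can be illustrated with a toy sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration only: the log-inverse category weight, the use of the per-category document rate as the quantity whose variance is taken, the restriction to positively associated categories, and the product used to combine the two parts. The function names `chi_square` and `joint_scores` are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def chi_square(a, b, c, d):
    """Standard CHI statistic for a 2x2 term/category table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def joint_scores(docs, labels):
    """Score each term; docs is a list of term sets, labels the categories."""
    n = len(docs)
    cat_size = Counter(labels)
    # ASSUMPTION: rare categories get larger weights via log(1 + N / size).
    weight = {cat: math.log(1 + n / size) for cat, size in cat_size.items()}

    df = Counter()      # document frequency of each term
    df_cat = Counter()  # document frequency of each (term, category) pair
    for terms, label in zip(docs, labels):
        for term in terms:
            df[term] += 1
            df_cat[(term, label)] += 1

    scores = {}
    for term in df:
        weighted = []  # weighted CHI for positively associated categories
        rates = []     # fraction of each category's docs containing the term
        for cat, size in cat_size.items():
            a = df_cat[(term, cat)]
            b = df[term] - a
            c = size - a
            d = n - size - b
            rates.append(a / size)
            if a * d > b * c:  # ASSUMPTION: keep positive association only
                weighted.append(weight[cat] * chi_square(a, b, c, d))
        # ASSUMPTION: combine the two strategies as a simple product.
        scores[term] = max(weighted, default=0.0) * pvariance(rates)
    return scores


# Toy unbalanced corpus: 6 "earn" documents, 2 "grain" documents.
docs = [{"the", "profit"}] * 6 + [{"the", "wheat"}] * 2
labels = ["earn"] * 6 + ["grain"] * 2
scores = joint_scores(docs, labels)
```

In this toy corpus the term "the" occurs in every document, so its per-category rates have zero variance and it scores zero. "wheat" and "profit" separate the categories equally well, but "wheat" ranks higher because it marks the rare category, which is the effect the category-weighting strategy is meant to produce.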