本文以UCI数据库为研究样本,分析logistic模型对不同程度非平衡数据的敏感性。研究表明:①数据非平衡程度越高,logistic回归对稀有类的识别能力越差。②相对于其他修正方法,OSS方法的改进效果不显著且不稳定;相对于复杂抽样,简单抽样修正结果更优。③AUC值不适宜于非平衡数据条件下的模型选择,因为在非平衡数据条件下,它不能有效区分四种修正方法的优劣,而且修正前后的差异亦不能辨。
Using datasets from the UCI database as the study sample, this paper analyzes the sensitivity of the logistic model to data with different degrees of imbalance. The results show that: (1) the higher the degree of imbalance, the poorer the ability of logistic regression to identify the rare class; (2) compared with the other correction methods, the improvement from the OSS method is neither significant nor stable, and simple sampling yields better corrections than complex sampling; (3) the AUC is not suitable for model selection on imbalanced data, because in that setting it cannot effectively distinguish among the four correction methods, nor can it reveal the difference between results before and after correction.