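To mitigate the effect of class imbalance on the performance of prediction models, this paper proposes CBUE (cluster-based undersampling ensemble method), an undersampling ensemble method based on clustering. The majority class is clustered, and N subsets of the majority class are selected with replacement according to the distribution of the clustering result (that is, the proportion of instances in each cluster); each of the N subsets is combined with all minority-class instances to form N new training sets. N classifiers are trained on the N training sets, and a new ensemble classifier is built that predicts new data by majority vote. CBUE is evaluated on NASA datasets using balance, G-mean, and AUC as evaluation metrics. The experimental results show that, in most cases, the method outperforms five classical baseline methods (ROS, RUS, SMOTE, RF, and NB).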
To alleviate the impact of the class imbalance problem on the performance of prediction models, a cluster-based undersampling ensemble method (CBUE) was proposed. The majority class was clustered. N subsets of the majority class were selected with replacement according to the distribution of the clustering result, which reflected the size ratio of every cluster. Each of the N subsets was united with all minority instances to compose N new training sets. N classifiers were trained on the N training sets, and a new ensemble classifier was constructed that predicted new data by majority vote. NASA datasets were used as evaluation datasets, and balance, G-mean, and AUC were taken as evaluation metrics. Experimental results show that the method is superior to five classical baseline methods (ROS, RUS, SMOTE, RF, and NB) in most cases.
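
The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the base learner, or the subset size, so the sketch assumes plain k-means, a nearest-centroid base classifier, and subsets roughly as large as the minority class. All function names (`kmeans`, `draw_subset`, `train_cbue`, `predict`) are hypothetical.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed choice); returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels

def draw_subset(clusters, size, rng):
    """Draw a majority subset with replacement, cluster by cluster.

    Each cluster contributes a quota proportional to its share of the
    majority class (at least one instance), so the subset size can
    deviate slightly from `size` because of rounding.
    """
    total = sum(len(members) for members in clusters)
    subset = []
    for members in clusters:
        quota = max(1, round(size * len(members) / total))
        subset.extend(rng.choices(members, k=quota))  # with replacement
    return subset

def train_cbue(majority, minority, n_models=5, k=3, seed=0):
    """Train n_models base classifiers, one per balanced training set."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, rng)
    clusters = [[p for p, lab in zip(majority, labels) if lab == c]
                for c in range(k)]
    clusters = [members for members in clusters if members]
    models = []
    for _ in range(n_models):
        # Assumption: each subset is about as large as the minority class.
        subset = draw_subset(clusters, len(minority), rng)
        # Nearest-centroid base learner: one centroid per class.
        models.append((centroid(subset), centroid(minority)))
    return models

def predict(models, x):
    """Majority vote of the base classifiers; 1 = minority, 0 = majority."""
    votes = [1 if sq_dist(x, mino) < sq_dist(x, majo) else 0
             for majo, mino in models]
    return Counter(votes).most_common(1)[0][0]
```

Calling `train_cbue(majority, minority)` with two lists of equal-length numeric tuples returns the ensemble, and `predict(models, x)` classifies a new point. An odd `n_models` avoids tied votes in the binary case. Fixing a per-cluster quota, rather than sampling uniformly from the whole majority class, is what makes every subset reflect the cluster distribution.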