Existing classification algorithms are usually biased when mining imbalanced data: on data sets that contain very few instances of the positive class, classification and prediction performance on the positive class (usually the more important class) is worse than on the negative class. To address this, a classification method for imbalanced data is proposed. The method separates the two classes of instances with a single hypersphere at the maximum separation ratio, and introduces two parameters that control the upper bounds of the two classes' error rates respectively. This not only improves classification and prediction performance on imbalanced data sets, but also greatly narrows the range over which the parameters must be selected. Experiments were conducted on real UCI data sets, with the area under the ROC curve used as the evaluation measure for comparison; the results confirm the effectiveness of the method.
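To make the evaluation measure concrete, the following is a minimal sketch of how the area under the ROC curve can be computed for a hypersphere-style scorer on a small imbalanced toy set. It is not the proposed method: the sphere center is simply taken as the mean of the positive instances rather than obtained from the maximum-separation-ratio optimization, the positive class is assumed (for illustration only) to be the one enclosed by the sphere, and the data, the radius, and the names `sphere_center`, `score`, and `auc` are all hypothetical.

```python
import math

# Toy imbalanced data (hypothetical): 3 positive vs. 8 negative instances.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
negatives = [(3.0, 3.0), (4.0, 0.0), (0.0, 4.0), (0.5, 0.5),
             (5.0, 5.0), (-3.0, 0.0), (2.0, -3.0), (4.0, 4.0)]

def sphere_center(points):
    """Assumption: use the mean of the positive instances as the center.
    The abstract's method would instead solve an optimization problem."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def score(x, center):
    """Higher score means 'more positive': negative distance to the center."""
    return -math.dist(x, center)

def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney statistic)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

center = sphere_center(positives)
pos_scores = [score(x, center) for x in positives]
neg_scores = [score(x, center) for x in negatives]
auc_value = auc(pos_scores, neg_scores)

# Per-class error rates at one fixed (hypothetical) radius.
radius = 1.0
pos_error = sum(-s > radius for s in pos_scores) / len(positives)
neg_error = sum(-s <= radius for s in neg_scores) / len(negatives)

print(f"AUC = {auc_value:.3f}")
print(f"positive-class error = {pos_error:.3f}, "
      f"negative-class error = {neg_error:.3f}")
```

Because AUC is computed from the ranking of scores rather than from a single threshold, it does not depend on the class ratio in the way overall accuracy does, which is why it is a common choice for comparing classifiers on imbalanced data. The two per-class error rates are reported separately here only to show the quantities that the method's two parameters are said to bound.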