In classification problems on imbalanced data sets, one or several classes contain relatively few samples, so standard classifiers tend to favor the majority classes and minority class samples are easily misclassified. The synthetic minority over-sampling technique (SMOTE) is a commonly used over-sampling method for preprocessing imbalanced data. By synthesizing minority class samples, SMOTE balances the distribution of samples across classes, which effectively mitigates the imbalance and in turn improves classification accuracy on imbalanced data sets. However, SMOTE uses every minority class sample to synthesize new samples, which makes it somewhat indiscriminate. In classification, samples near the class boundary often play a more important role in the decision and deserve more attention. Based on these considerations, this paper proposes an improved over-sampling method, the distance borderline synthetic minority over-sampling technique (DBSMOTE). DBSMOTE first identifies boundary samples according to the distance between minority class samples and majority class samples, and then synthesizes new minority samples on the boundary sample set. Theoretical analysis and experimental results show that DBSMOTE is effective.
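The two steps named in the abstract (distance-based boundary selection, then synthesis on the boundary set) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not state the exact selection rule, so the quantile cut-off on the distance to the nearest majority sample, the restriction of neighbors to the boundary set, and all function and parameter names (`select_boundary`, `smote_interpolate`, `boundary_oversample`, `quantile`, `k`) are assumptions made for illustration. The synthesis step is the standard SMOTE interpolation between a sample and one of its nearest neighbors.

```python
import numpy as np


def nearest_majority_distance(X_min, X_maj):
    """Euclidean distance from each minority sample to its closest majority sample."""
    diffs = X_min[:, None, :] - X_maj[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def select_boundary(X_min, X_maj, quantile=0.5):
    """Keep the minority samples that lie closest to the majority class.

    ASSUMPTION: the quantile threshold is illustrative only; the abstract
    says boundary samples are chosen by distance but gives no exact rule.
    """
    d = nearest_majority_distance(X_min, X_maj)
    return X_min[d <= np.quantile(d, quantile)]


def smote_interpolate(X_seed, n_new, k=5, rng=None):
    """SMOTE-style synthesis: x_new = x + gap * (neighbor - x), gap in [0, 1)."""
    rng = np.random.default_rng(rng)
    n = len(X_seed)
    if n < 2:
        raise ValueError("need at least two seed samples to interpolate")
    k = min(k, n - 1)
    # Pairwise distances among seeds; exclude each sample from its own neighbors.
    dist = np.linalg.norm(X_seed[:, None, :] - X_seed[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a random seed and one of its k nearest neighbors for each new sample.
    base = rng.integers(0, n, size=n_new)
    pick = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_seed[base] + gap * (X_seed[pick] - X_seed[base])


def boundary_oversample(X_min, X_maj, n_new, quantile=0.5, k=5, rng=None):
    """Select boundary minority samples, then synthesize only from that set.

    ASSUMPTION: neighbors are drawn from the boundary set itself.
    """
    boundary = select_boundary(X_min, X_maj, quantile)
    return smote_interpolate(boundary, n_new, k=k, rng=rng)


if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    X_maj = data_rng.normal(0.0, 1.0, size=(200, 2))  # majority class
    X_min = data_rng.normal(2.0, 1.0, size=(20, 2))   # minority class
    X_new = boundary_oversample(X_min, X_maj, n_new=180, rng=0)
    print(X_new.shape)
```

Because each synthetic point is a convex combination of two boundary samples, the new samples stay within the region spanned by the boundary set, which concentrates the added minority mass near the class boundary rather than spreading it over the whole minority class as plain SMOTE does.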