In imbalanced data, both between-class and within-class imbalance are important causes of degraded classification performance. To improve classifier performance on imbalanced datasets, a hybrid sampling algorithm based on probability distribution estimation is proposed. The algorithm resamples each subclass according to the estimated probability distribution of the data, so that the distribution within each class is balanced. It also expands the potential decision region of the minority class and removes redundant information from the majority class, thereby improving the balance of the data from both global and local perspectives simultaneously. Experimental results show that the proposed algorithm improves the performance of traditional classification algorithms on imbalanced data.
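The general idea of density-guided hybrid sampling can be illustrated with a minimal sketch. It is not the proposed algorithm itself, whose details the abstract does not give. The sketch assumes a Gaussian kernel density estimate, a balancing target at the midpoint of the two class sizes, and no explicit subclass clustering; the function names `kde_density` and `hybrid_resample` and the `bandwidth` parameter are hypothetical. Minority points in sparse regions are oversampled more often, and the densest majority points are removed as redundant.

```python
import math
import random


def kde_density(points, bandwidth=1.0):
    """Gaussian kernel density estimate at each point, up to a constant factor."""
    densities = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(total / len(points))
    return densities


def hybrid_resample(minority, majority, bandwidth=1.0, seed=0):
    """Balance two classes by density-guided over- and undersampling.

    Assumes len(minority) <= len(majority); both are lists of equal-length tuples.
    """
    rng = random.Random(seed)
    # Assumed target: both classes end at the midpoint of their original sizes.
    target = (len(minority) + len(majority)) // 2

    # Undersampling: keep the `target` majority points with the lowest
    # estimated density, treating the densest points as redundant.
    maj_density = kde_density(majority, bandwidth)
    order = sorted(range(len(majority)), key=lambda i: maj_density[i])
    kept_majority = [majority[i] for i in order[:target]]

    # Oversampling: draw minority seeds with weight inversely proportional
    # to density, so sparse regions receive more synthetic points.
    min_density = kde_density(minority, bandwidth)
    weights = [1.0 / d for d in min_density]
    seeds = rng.choices(minority, weights=weights, k=target - len(minority))
    # Each synthetic point is its seed perturbed by Gaussian noise.
    synthetic = [tuple(x + rng.gauss(0.0, bandwidth) for x in s) for s in seeds]
    return minority + synthetic, kept_majority
```

With 4 minority and 12 majority points, for example, the sketch returns 8 points per class. A fuller treatment would first partition each class into subclasses and apply the sampling per subclass, as the abstract describes.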