Many real-world applications of machine learning suffer from class imbalance, in which one class has far fewer samples than the others. With imbalanced data, the decision boundary of a classifier tends to fit the majority class and neglect the minority class, so test samples from the minority class are misclassified as belonging to the majority class. To address this problem, this paper proposes a balanced graph-based semi-supervised learning method (BGSSL). The method introduces a class equilibrium factor term into the energy function, so that the class confidence values are not only as smooth as possible on the graph but also as balanced as possible across classes, which reduces the adverse effect of data imbalance. Statistical analysis of comparative experiments on 21 benchmark datasets shows that, on imbalanced data, BGSSL achieves significantly better classification performance (at a significance level of 0.05) than support vector machines (SVM) and other graph-based semi-supervised learning methods.
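To make the idea of confidence values that are both smooth on the graph and balanced across classes more concrete, the following is a minimal sketch and not the BGSSL algorithm itself: the abstract does not give the energy function, so the sketch swaps in a standard closed-form label propagation step followed by a post-hoc rescaling of class mass. The function names, the Gaussian affinity, and the parameters `sigma`, `alpha`, and `priors` are all assumptions made for illustration.

```python
import numpy as np

def rbf_affinity(X, sigma=1.0):
    """Dense Gaussian affinity matrix with a zero diagonal (assumed graph)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def propagate(W, Y, alpha=0.9):
    """Closed-form label propagation: solve (I - alpha * S) F = Y,
    where S is the symmetrically normalized affinity matrix."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, Y)

def balance_classes(F, priors):
    """Rescale each class column so its total confidence mass equals the
    desired prior. This is a post-hoc stand-in for a balancing term,
    not the equilibrium factor described in the abstract."""
    mass = F.sum(axis=0)
    return F * (priors / mass)

# Toy imbalanced problem: 20 majority points, 5 minority points,
# with a single labeled sample per class.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),  # majority cluster
    rng.normal(5.0, 0.3, size=(5, 2)),   # minority cluster
])
Y = np.zeros((25, 2))
Y[0, 0] = 1.0    # labeled majority sample
Y[20, 1] = 1.0   # labeled minority sample

F = propagate(rbf_affinity(X), Y)
F_bal = balance_classes(F, priors=np.array([0.5, 0.5]))
pred = F_bal.argmax(axis=1)
```

In this sketch the balancing happens after propagation, whereas the method summarized above places the equilibrium factor inside the energy function, so the two should not be expected to give the same results.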