To address class imbalance in the sample set and the high cost of labeling samples, a self-training cost-sensitive support vector machine with uncertainty-based sampling (SCU) is proposed. In uncertainty-based sampling, support vector data description is used to evaluate the uncertainty of each unlabeled sample, and the unlabeled samples with high uncertainty are selected for labeling. At the same time, the cost-sensitive support vector machine is trained with a self-training approach, and its cost parameters and kernel parameters are used to predict the class labels of unlabeled samples. Experimental results show that SCU effectively reduces both the average expected misclassification cost and the number of samples that need to be labeled.
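The misclassification-cost criterion named above can be illustrated with a minimal sketch. It assumes binary labels (+1 for the minority class, -1 for the majority class) and two fixed, hypothetical costs, `cost_fn` for a missed positive and `cost_fp` for a false alarm. The function name and the cost values are illustrative assumptions, and the exact definition used in the paper (for example, an average taken over a range of cost ratios) may differ.

```python
from typing import Sequence


def average_misclassification_cost(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    cost_fn: float,
    cost_fp: float,
) -> float:
    """Mean cost per sample for labels in {+1, -1}.

    cost_fn: assumed cost of predicting -1 for a true +1 (false negative).
    cost_fp: assumed cost of predicting +1 for a true -1 (false positive).
    Correct predictions cost nothing.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == -1:
            total += cost_fn
        elif t == -1 and p == 1:
            total += cost_fp
    return total / len(y_true)


def error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Plain fraction of wrong predictions, for comparison."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Imbalanced toy set: 2 positives, 8 negatives (illustrative data only).
y_true = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
always_negative = [-1] * 10                           # misses both positives
cost_aware = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]      # 3 false alarms

# Assumed costs: a missed positive is 5 times as expensive as a false alarm.
cost_a = average_misclassification_cost(y_true, always_negative, 5.0, 1.0)
cost_b = average_misclassification_cost(y_true, cost_aware, 5.0, 1.0)
print(error_rate(y_true, always_negative), cost_a)
print(error_rate(y_true, cost_aware), cost_b)
```

Under these assumed costs, the always-negative predictor has the lower error rate (0.2 against 0.3) but the higher average cost (1.0 against 0.3). This is the situation in which a cost-sensitive criterion is preferable to accuracy on a class-imbalanced set.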