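When applied to classification on imbalanced data sets, the traditional Gentle AdaBoost method usually relies on over-sampling to balance the data. This, however, tends to introduce singular samples that are hard to classify, which degrades classifier performance. To address this, an improved Gentle AdaBoost algorithm for classification on imbalanced data sets is proposed. Because easily classified samples carry small weights in traditional Gentle AdaBoost, a sample weight threshold is set during the iterative learning of the classifier, and only the low-weight samples of the minority class are duplicated. The resulting data set is then used to train the classifier and obtain the corresponding weak classifier. Repeating these steps balances the data set and, at the same time, yields the strong classifier. The whole procedure avoids the singular samples that over-sampling would otherwise introduce. Experiments demonstrate the effectiveness of the algorithm.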
When classifying unbalanced data sets, the traditional Gentle AdaBoost algorithm usually uses over-sampling to replicate minority-class samples and so balance the data. However, this approach introduces singular samples that are hard to classify and degrades the classification performance of the classifier. This paper therefore proposes an improved Gentle AdaBoost algorithm for the classification of unbalanced data sets. Since Gentle AdaBoost assigns large weights to misclassified samples during training, a weight threshold is set for the samples to be copied. Minority-class samples whose weights fall within the threshold range are copied, and the resulting data set is used to train the classifier and obtain the corresponding weak classifier. These steps are repeated so that the data set is balanced and the strong classifier is obtained at the same time. The whole process avoids introducing singular samples during over-sampling. Experiments demonstrate the validity of the algorithm.
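The iteration described above can be sketched as follows. This is only an illustrative reading of the abstract, not the paper's implementation. The abstract does not state the threshold rule, the number of copies per round, or the type of weak learner, so the sketch makes three assumptions: the threshold is a quantile of the current minority-class weights (the hypothetical parameter `quantile`), duplication stops once the classes are balanced, and the weak learners are regression stumps fitted by weighted least squares, as is common for Gentle AdaBoost. Labels are assumed to be in {-1, +1}, with at least one minority sample present.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted least-squares regression stump: (feature, split, left value, right value)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            wl, wr = w[left].sum(), w[~left].sum()
            # Each leaf predicts the weighted mean label of the samples it holds.
            a = (w[left] * y[left]).sum() / wl if wl > 0 else 0.0
            b = (w[~left] * y[~left]).sum() / wr if wr > 0 else 0.0
            err = (w * (y - np.where(left, a, b)) ** 2).sum()
            if err < best_err:
                best_err, best = err, (j, t, a, b)
    return best

def stump_predict(stump, X):
    j, t, a, b = stump
    return np.where(X[:, j] <= t, a, b)

def train(X, y, n_rounds=20, quantile=0.5, minority=1):
    """Gentle AdaBoost with threshold-based duplication of low-weight minority samples."""
    X, y = X.astype(float), y.astype(float)  # copies, so the caller's arrays are untouched
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(n_rounds):
        is_min = y == minority
        deficit = int((~is_min).sum() - is_min.sum())
        if deficit > 0:
            # Assumed threshold rule: a quantile of the current minority weights.
            thr = np.quantile(w[is_min], quantile)
            cand = np.flatnonzero(is_min & (w <= thr))
            # Copy the lowest-weight (easiest) samples first, never past balance.
            cand = cand[np.argsort(w[cand])][:deficit]
            X = np.vstack([X, X[cand]])
            y = np.concatenate([y, y[cand]])
            w = np.concatenate([w, w[cand]])
            w /= w.sum()
        stump = fit_stump(X, y, w)
        stumps.append(stump)
        # Gentle AdaBoost weight update: w <- w * exp(-y * f_m(x)), then renormalize.
        w *= np.exp(-y * stump_predict(stump, X))
        w /= w.sum()
    return stumps

def predict(stumps, X):
    """Strong classifier: sign of the summed weak-learner outputs."""
    F = sum(stump_predict(s, X) for s in stumps)
    return np.where(F >= 0, 1, -1)
```

Because only minority samples at or below the weight threshold are copied, hard (high-weight) samples, which are the likely singular ones, are never replicated. That is the property the abstract relies on.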