In real-world problems, the numbers of examples in different classes often differ greatly, and traditional machine learning methods have difficulty classifying minority-class examples correctly; when the minority class is important enough, this leads to a considerable loss. Learning from data with an imbalanced class distribution has therefore become a challenge currently facing machine learning. Inspired by the cascade model in computer vision, this paper proposes BalanceCascade, a classification method for imbalanced data. The method gradually shrinks the majority class so that the data set becomes increasingly balanced, and the sequence of classifiers trained during this process classifies new examples as an ensemble. Experimental results show that the method can effectively improve classification performance on imbalanced data, especially when that performance is severely affected by the imbalance of the data.
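The following Python sketch illustrates the idea summarized above; it is a minimal, illustrative implementation rather than the exact algorithm of the paper, and the base learner (a shallow decision tree), the number of nodes, and the function names are assumptions made here for concreteness.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_balance_cascade(X_maj, X_min, n_nodes=5, seed=0):
    # X_maj: feature matrix of the majority (negative) class
    # X_min: feature matrix of the minority (positive) class
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(n_nodes):
        # Undersample the majority class down to the size of the minority
        # class, so every node is trained on a roughly balanced set.
        size = min(len(X_min), len(X_maj))
        idx = rng.choice(len(X_maj), size=size, replace=False)
        X = np.vstack([X_maj[idx], X_min])
        y = np.hstack([np.zeros(size), np.ones(len(X_min))])
        clf = DecisionTreeClassifier(max_depth=3, random_state=seed).fit(X, y)
        classifiers.append(clf)
        # Keep only the majority examples the current node still mistakes for
        # minority ones; correctly rejected negatives are removed, shrinking
        # the majority class step by step.
        hard = clf.predict(X_maj) == 1
        if hard.any():
            X_maj = X_maj[hard]
    return classifiers

def predict_balance_cascade(classifiers, X):
    # The node classifiers form an ensemble; here a simple majority vote.
    votes = np.mean([clf.predict(X) for clf in classifiers], axis=0)
    return (votes >= 0.5).astype(int)

Removing the correctly rejected negatives after each node is what makes later nodes concentrate on the hard majority examples, mirroring how a detection cascade passes only the surviving negatives onward.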
In machine learning and data mining, many factors can influence the performance of a learning system in real-world applications. Class imbalance is one of them: the training examples of one class heavily outnumber those of another class, and classifiers generally have difficulty learning the concept of the minority class. In many applications the minority class is the more important one, so misclassifying it incurs a great loss.

The face detection problem exhibits severe class imbalance, and the huge number of negative examples greatly decreases the detection speed; the cascade structure was proposed to accelerate detection. A cascade is a classifier system composed of a sequence of n node classifiers. At the beginning, all training examples are available to train the first node classifier. Then all positive examples, but only a subset of the negative examples, are passed to the next node; the negatives correctly classified by the first node are discarded. This procedure repeats until all node classifiers are trained. A test example is passed to the next node if it is recognized as positive by the current node, and otherwise it is rejected immediately as negative. The learning goal of a cascade node classifier is therefore quite different from that of a usual classifier: every node aims at a high detection rate and only a moderate false alarm rate, yet the cascade as a whole can achieve both a high overall detection rate and a low overall false alarm rate.

Every time the training examples are passed to the next node, some negatives are discarded, so each node's training set contains fewer negatives than the previous node's. From the perspective of class imbalance, this means the training set becomes more balanced from node to node. In the early nodes of a cascade it is quite easy to achieve the learning goal, i.e., to train a classifier with a high detection rate and only a moderate false alarm rate. It becomes harder in deeper nodes, however, since the negative examples reaching these nodes are the false positives of all previous nodes and are much harder to separate from the positives.
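As a concrete illustration of the cascade decision rule, and of why a per-node goal of "high detection rate, moderate false alarm rate" is enough for the cascade as a whole, the Python sketch below assumes node classifiers with a scikit-learn style predict method; the node count and the per-node rates in the arithmetic are illustrative values, not results from the paper.

import numpy as np

def cascade_predict(node_classifiers, x):
    # A test example is rejected as negative by the first node that classifies
    # it as negative, and is accepted as positive only if every node accepts it.
    x = np.asarray(x).reshape(1, -1)
    for node in node_classifiers:
        if node.predict(x)[0] != 1:
            return 0   # rejected immediately as negative
    return 1           # recognized as positive by all nodes

# The overall rates are roughly the products of the per-node rates, e.g. with
# 10 nodes, a 0.99 detection rate and a 0.5 false alarm rate per node:
n_nodes, d_node, f_node = 10, 0.99, 0.5
overall_detection = d_node ** n_nodes      # about 0.90
overall_false_alarm = f_node ** n_nodes    # about 0.001

Because the rates multiply across nodes, each node can afford a moderate false alarm rate as long as its detection rate stays close to one, which is exactly the node-level learning goal described above.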