In real-world problems, the numbers of examples in different classes often differ greatly, and traditional machine learning methods have difficulty classifying minority-class examples correctly; when the minority class is important enough, this leads to a considerable loss. Learning from data with an imbalanced class distribution has therefore become a challenge currently facing machine learning. Inspired by the cascade model in computer vision, this paper proposes BalanceCascade, a classification method for imbalanced data. The method gradually shrinks the majority class so that the data set becomes increasingly balanced, and the sequence of classifiers trained during this process classifies new examples as an ensemble. Experimental results show that the method can effectively improve classification performance on imbalanced data, especially when that performance is severely affected by the imbalance of the data.
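The following Python sketch illustrates the idea summarized above; it is a minimal, illustrative implementation rather than the exact algorithm of the paper, and the base learner (a shallow decision tree), the number of nodes, and the function names are assumptions made here for concreteness.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_balance_cascade(X_maj, X_min, n_nodes=5, seed=0):
    # X_maj: feature matrix of the majority (negative) class
    # X_min: feature matrix of the minority (positive) class
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(n_nodes):
        # Undersample the majority class down to the size of the minority
        # class, so every node is trained on a roughly balanced set.
        size = min(len(X_min), len(X_maj))
        idx = rng.choice(len(X_maj), size=size, replace=False)
        X = np.vstack([X_maj[idx], X_min])
        y = np.hstack([np.zeros(size), np.ones(len(X_min))])
        clf = DecisionTreeClassifier(max_depth=3, random_state=seed).fit(X, y)
        classifiers.append(clf)
        # Keep only the majority examples the current node still mistakes for
        # minority ones; correctly rejected negatives are removed, shrinking
        # the majority class step by step.
        hard = clf.predict(X_maj) == 1
        if hard.any():
            X_maj = X_maj[hard]
    return classifiers

def predict_balance_cascade(classifiers, X):
    # The node classifiers form an ensemble; here a simple majority vote.
    votes = np.mean([clf.predict(X) for clf in classifiers], axis=0)
    return (votes >= 0.5).astype(int)

Removing the correctly rejected negatives after each node is what makes later nodes concentrate on the hard majority examples, mirroring how a detection cascade passes only the surviving negatives onward.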
In machine learning and data mining, many factors can influence the performance of a learning system in real-world applications. Class imbalance is one of them: the training examples of one class heavily outnumber those of another class, and classifiers generally have difficulty learning the concept of the minority class. In many applications the minority class is the more important one, so misclassifying it incurs a great loss.

The face detection problem exhibits severe class imbalance, and the huge number of negative examples greatly decreases the detection speed; the cascade structure was proposed to accelerate detection. A cascade is a classifier system composed of a sequence of n node classifiers. At the beginning, all training examples are available to train the first node classifier. Then all positive examples, but only a subset of the negative examples, are passed to the next node; the negatives correctly classified by the first node are discarded. This procedure repeats until all node classifiers are trained. A test example is passed to the next node if it is recognized as positive by the current node, and otherwise it is rejected immediately as negative. The learning goal of a cascade node classifier is therefore quite different from that of a usual classifier: every node aims at a high detection rate and only a moderate false alarm rate, yet the cascade as a whole can achieve both a high overall detection rate and a low overall false alarm rate.

Every time the training examples are passed to the next node, some negatives are discarded, so each node's training set contains fewer negatives than the previous node's. From the perspective of class imbalance, this means the training set becomes more balanced from node to node. In the early nodes of a cascade it is quite easy to achieve the learning goal, i.e., to train a classifier with a high detection rate and only a moderate false alarm rate. It becomes harder in deeper nodes, however, since the negative examples reaching these nodes are the false positives of all previous nodes and are much harder to separate from the positives.
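As a concrete illustration of the cascade decision rule, and of why a per-node goal of "high detection rate, moderate false alarm rate" is enough for the cascade as a whole, the Python sketch below assumes node classifiers with a scikit-learn style predict method; the node count and the per-node rates in the arithmetic are illustrative values, not results from the paper.

import numpy as np

def cascade_predict(node_classifiers, x):
    # A test example is rejected as negative by the first node that classifies
    # it as negative, and is accepted as positive only if every node accepts it.
    x = np.asarray(x).reshape(1, -1)
    for node in node_classifiers:
        if node.predict(x)[0] != 1:
            return 0   # rejected immediately as negative
    return 1           # recognized as positive by all nodes

# The overall rates are roughly the products of the per-node rates, e.g. with
# 10 nodes, a 0.99 detection rate and a 0.5 false alarm rate per node:
n_nodes, d_node, f_node = 10, 0.99, 0.5
overall_detection = d_node ** n_nodes      # about 0.90
overall_false_alarm = f_node ** n_nodes    # about 0.001

Because the rates multiply across nodes, each node can afford a moderate false alarm rate as long as its detection rate stays close to one, which is exactly the node-level learning goal described above.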