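Class overlapping is one of the bottleneck problems in data mining and machine learning. When class imbalance is also present, the situation becomes even more complicated. In view of this, building on the existing literature, this paper summarizes three learning algorithms for class overlapping and proposes a new one, the separating scheme; it also applies the support vector data description algorithm, for the first time, to identifying overlapping samples in real data, and systematically studies the class overlapping problem and its interaction with the class imbalance problem. Experiments on real-world data with five classifiers show that: 1) in most cases the separating scheme is the best-performing class overlapping learning algorithm; 2) the separating scheme is usually more effective for classifiers based on decision boundaries than for rule-based ones; 3) the separating scheme performs well on class imbalance problems, especially when the base classifier is a support vector machine. Finally, a theoretical analysis of the experimental results for support vector machines is given.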
Classification with class overlapping (CWCO) has long been regarded as one of the toughest yet most pervasive problems in the data mining and machine learning communities. When it is combined with the well-known class imbalance problem, the situation becomes even more complicated, and few works in the literature address this combination. To meet this challenge, in this paper we conduct a systematic study of the CWCO problem and its interrelationship with the class imbalance problem. Specifically, we first introduce the support vector data description (SVDD) algorithm for capturing overlapping objects, and then present three learning schemes and propose a separating scheme for solving the CWCO problem. Extensive experiments on various real-world data sets using five different classifiers show that the separating scheme: 1) performs the best among the four schemes for CWCO, 2) is more suitable for classifiers that use decision boundaries, and 3) performs well on class-imbalanced data, in particular with support vector machines (SVMs). Finally, we provide theoretical explanations for the superior performance of the separating scheme using SVMs.