Large-scale training sets usually contain many similar samples and a great deal of redundant information that is useless for building a classifier. Training on all samples not only increases training time but may also degrade generalization ability through over-fitting. To address this problem, this paper considers both the most representative samples and the nearest-to-boundary samples, and proposes a training-sample reduction and selection algorithm for SVM based on a modified weighted condensed nearest neighbor (CNN) rule and a close-to-boundary criterion. Recognizing the strong influence of valuable training samples on SVM classifier performance, the algorithm introduces subtractive clustering into the modified weighted CNN method to select the most representative samples for training, and on that basis applies the close-to-boundary criterion to pick boundary samples from random small pools to improve classification accuracy. Experimental results on the UCI and KDD Cup 1999 datasets show that the proposed algorithm effectively removes redundant information from large training sets and achieves better classification performance with fewer samples.
Large-scale training sets usually contain a large number of similar samples and much redundant information, which lengthens training time and harms generalization ability through over-fitting. To deal with this problem, a training sample selection algorithm for SVM based on a modified weighted condensed nearest neighbor (CNN) rule and a close-to-boundary criterion is proposed. Considering the significance of valuable training samples for SVM classification performance, the presented method combines the most representative samples with close-to-boundary samples: it uses the modified weighted CNN rule together with a subtractive clustering approach to select the most representative samples for training, and then applies the close-to-boundary criterion to select boundary samples from random small pools to improve classification accuracy. Experimental results on the UCI and KDD Cup 1999 datasets show that the proposed algorithm can eliminate the redundancy and achieve better classification performance with fewer samples.
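The two-stage selection described above can be sketched in code. This is a minimal illustration, not the paper's exact algorithm: it uses plain subtractive clustering (density potentials with revision radius `rb = 1.5 * ra`) in place of the modified weighted CNN rule, and picks the sample with the smallest SVM decision-function margin from each random small pool as the close-to-boundary sample. All parameter values (`ra`, pool size, sample counts) and the scikit-learn/NumPy usage are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage training-sample selection,
# assuming NumPy and scikit-learn; parameters are illustrative, not from the paper.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=4, random_state=0)

# --- Stage 1: subtractive clustering to pick the most representative samples ---
def subtractive_clustering(X, n_select=60, ra=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    potential = np.exp(-4.0 * d2 / ra**2).sum(axis=1)    # density potential of each point
    rb = 1.5 * ra                                        # revision radius
    chosen = []
    for _ in range(n_select):
        c = int(np.argmax(potential))                    # highest-potential point
        chosen.append(c)
        # suppress potential near the chosen centre so nearby points are not re-picked
        potential -= potential[c] * np.exp(-4.0 * d2[c] / rb**2)
    return np.array(chosen)

rep_idx = subtractive_clustering(X)

# --- Stage 2: close-to-boundary selection from random small pools ---
svm = SVC(kernel="rbf").fit(X[rep_idx], y[rep_idx])      # SVM on representative samples
remaining = np.setdiff1d(np.arange(len(X)), rep_idx)
boundary_idx = []
for _ in range(10):
    pool = rng.choice(remaining, size=30, replace=False)  # one random small pool
    margins = np.abs(svm.decision_function(X[pool]))      # distance to decision boundary
    boundary_idx.append(int(pool[np.argmin(margins)]))    # closest-to-boundary sample

# Final reduced training set: representative samples plus boundary samples
sel = np.concatenate([rep_idx, boundary_idx])
final = SVC(kernel="rbf").fit(X[sel], y[sel])
print(len(sel), "samples selected out of", len(X))
```

In this sketch the reduced set keeps 70 of the 600 samples; the boundary stage deliberately draws only from points not already chosen as representatives, mirroring the abstract's idea of adding informative boundary samples on top of the representative core.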