Rare-class classification has important applications in many fields. Because rare classes make up only a small proportion of the data, they are easily overlooked by a classifier. To address this, we propose a rare-class classification method based on clustering and Ripper. The method first applies one-pass clustering to the dataset and labels every cluster that accounts for less than 15% of the whole dataset as a minority class; the remaining clusters are treated as majority classes. The Ripper algorithm is then used to build classification models for the minority and majority classes separately, and the resulting rule sets are combined and adjusted to produce the final rule set for the whole dataset. Experiments on benchmark datasets from the UCI Machine Learning Repository show that the method based on one-pass clustering and Ripper classifies rare classes with high quality. The method can be applied to rare-class classification problems in real-world domains.
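The cluster-labelling step can be illustrated with a minimal sketch. The abstract does not specify the clustering details, so the version below assumes a common form of one-pass clustering: each point joins the nearest existing cluster if it lies within a distance threshold of that cluster's centroid, and otherwise starts a new cluster. The function names, the `radius` parameter, and the toy data are hypothetical; only the 15% threshold comes from the abstract. The Ripper modelling and rule-combination steps are not shown.

```python
import math


def one_pass_cluster(points, radius):
    """Single scan over the data (assumed variant of one-pass clustering).

    Each point joins the nearest cluster whose centroid is within `radius`;
    otherwise it opens a new cluster. Returns a list of index lists.
    """
    centroids = []  # running mean of each cluster
    members = []    # indices of the points in each cluster
    for i, p in enumerate(points):
        best, best_d = None, float("inf")
        for c, centroid in enumerate(centroids):
            d = math.dist(p, centroid)
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= radius:
            members[best].append(i)
            n = len(members[best])
            # Incremental update of the cluster mean with the new point.
            centroids[best] = tuple(
                m + (x - m) / n for m, x in zip(centroids[best], p)
            )
        else:
            centroids.append(tuple(p))
            members.append([i])
    return members


def split_by_cluster_share(members, total, threshold=0.15):
    """Label clusters holding less than `threshold` of the data as minority."""
    minority, majority = [], []
    for idx in members:
        target = minority if len(idx) / total < threshold else majority
        target.extend(idx)
    return sorted(minority), sorted(majority)


# Toy data (hypothetical): seven nearby points and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (0.05, 0.05), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0),
]
members = one_pass_cluster(points, radius=1.0)
minority_idx, majority_idx = split_by_cluster_share(members, len(points))
```

In this toy run the isolated point forms its own cluster with a share of 1/8 = 12.5%, which falls below the 15% threshold, so it is labelled minority. Under the method described, a separate Ripper model would then be trained on each of the two index sets and the resulting rule sets combined.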