Ensemble pruning selects a subset of the base classifiers to combine, aiming to improve the generalization of the ensemble and to reduce its prediction cost relative to the ensemble of all base classifiers. However, most existing ensemble pruning algorithms spend considerable time on classifier selection. Applying data mining techniques to ensemble pruning, this paper presents a fast pruning approach based on the FP-Tree (frequent pattern tree): CPM-EP (coverage based pattern mining for ensemble pruning). The algorithm organizes the predictions of all base classifiers on the validation set into a transaction database, which converts the ensemble pruning task into a transaction database processing problem. For each possible ensemble size k, CPM-EP first derives a refined transaction database and builds an FP-Tree to store it compactly; it then selects an ensemble of size k based on that FP-Tree. Among the ensembles obtained for all sizes, the one with the highest predictive accuracy on the validation set is the output of the algorithm. Experimental results show that CPM-EP attains superior generalization at very low computational cost: its selection time is about 1/19 that of GASEN and 1/8 that of Forward-Selection, its generalization is significantly better than that of the other methods compared, and the pruned ensembles contain relatively few base classifiers.
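The overall flow described above (build a transaction database from validation predictions, obtain one candidate ensemble per size k, and output the candidate with the highest validation accuracy) can be illustrated with a minimal sketch. Several points here are assumptions rather than details given in this abstract: each validation sample is encoded as one transaction holding the indices of the base classifiers that predict it correctly, ensembles are combined by majority vote, and the per-size selection step is replaced by a simple greedy coverage heuristic (`greedy_select`, a hypothetical stand-in) instead of the refined transaction database and FP-Tree that CPM-EP actually uses.

```python
from collections import Counter


def build_transactions(predictions, labels):
    """One transaction per validation sample: the indices of the base
    classifiers that predict that sample correctly (assumed encoding)."""
    return [
        frozenset(c for c, preds in enumerate(predictions) if preds[i] == y)
        for i, y in enumerate(labels)
    ]


def majority_vote_accuracy(predictions, labels, members):
    """Validation accuracy of the sub-ensemble `members` under majority vote
    (assumed combination rule; ties go to the earliest member's vote)."""
    correct = 0
    for i, y in enumerate(labels):
        votes = Counter(predictions[c][i] for c in members)
        if votes.most_common(1)[0][0] == y:
            correct += 1
    return correct / len(labels)


def greedy_select(transactions, n_classifiers, k):
    """Stand-in for the FP-Tree-based selection (NOT the CPM-EP procedure):
    repeatedly add the classifier contained in the most transactions that
    no chosen classifier covers yet. Requires k <= n_classifiers."""
    chosen = []
    uncovered = list(transactions)
    for _ in range(k):
        candidates = [c for c in range(n_classifiers) if c not in chosen]
        # Once every transaction is covered, count over all transactions.
        pool = uncovered or transactions
        best = max(candidates, key=lambda c: sum(c in t for t in pool))
        chosen.append(best)
        uncovered = [t for t in uncovered if best not in t]
    return chosen


def prune(predictions, labels):
    """Try every ensemble size k and keep the candidate with the best
    validation accuracy, as in the final step described in the abstract."""
    transactions = build_transactions(predictions, labels)
    n = len(predictions)
    best_members, best_acc = None, -1.0
    for k in range(1, n + 1):
        members = greedy_select(transactions, n, k)
        acc = majority_vote_accuracy(predictions, labels, members)
        if acc > best_acc:
            best_members, best_acc = members, acc
    return best_members, best_acc


# Toy validation set: 4 samples, 4 base classifiers (rows = classifiers).
labels = [0, 1, 1, 0]
predictions = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]
members, accuracy = prune(predictions, labels)
print(members, accuracy)
```

On this toy input the sketch keeps three of the four classifiers and drops the one that is wrong on every validation sample. The example is intended only to show the shape of the computation, not the cost or accuracy behavior reported for CPM-EP.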