Current decision tree pruning algorithms seldom consider how the noise level of the training set affects the model, and traditional memory-resident classification algorithms have difficulty processing massive data. To address these two problems, an imprecise probability error pruning algorithm (IEP) based on the Hadoop platform was proposed and applied to the C4.5 algorithm. During pruning, IEP treats the training set used to build the tree as noisy and uses the number of misclassified instances estimated with imprecise probabilities as the pruning criterion, which reduces the influence of unreliable training data on the model. On the Hadoop platform, C4.5-IEP was implemented as a MapReduce program based on file splitting, which enhances its ability to process large-scale data and gives it good scalability.