In big-data settings, random forest classification on imbalanced datasets tends to neglect minority-class samples and to run inefficiently. To address both problems, this paper proposes a random forest classification algorithm based on sensitivity degree for the Hadoop environment. The algorithm borrows methods from feature selection in text classification and is parallelized on the Hadoop cloud computing platform using the MapReduce programming model. Experiments compare the proposed algorithm with the traditional random forest classification algorithm on imbalanced data. The results show that the proposed algorithm significantly improves on the performance of the traditional random forest classification algorithm, and that it is efficient and easy to scale.