How to extract valuable information from massive data more quickly, efficiently, and at lower cost has become a new challenge for data mining technology. After studying the characteristics of the Hadoop platform and the C4.5 decision tree algorithm, this paper introduces cloud computing concepts into decision tree learning: C4.5 is parallelized on the Hadoop platform, and the MapReduce model is used to address the problem of mining massive data. The new algorithm is then validated on a golf-playing data set. The experimental results show that, for massive data, the Hadoop-based decision tree algorithm markedly improves data mining efficiency and offers good scalability. To some extent, it also resolves the heavy computation and long tree-construction time that C4.5 incurs when processing massive data.
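As a rough illustration of how the C4.5 split criterion can be expressed in MapReduce terms, the sketch below counts (attribute, value, class) triples in a map step, sums them in a reduce step, and derives the gain ratio from the summed counts. It is a single-process stand-in written in plain Python, not the paper's implementation and not Hadoop API code. The function names (`map_phase`, `reduce_phase`, `gain_ratio`) are hypothetical, and the rows assume that the golf data set is the classic 14-row textbook weather table, reduced here to two attributes.

```python
from collections import Counter
from math import log2

# Assumption: the classic 14-row "play golf" table, reduced to
# (outlook, windy, play). The paper's actual data may differ.
ROWS = [
    ("sunny", "false", "no"), ("sunny", "true", "no"),
    ("overcast", "false", "yes"), ("rainy", "false", "yes"),
    ("rainy", "false", "yes"), ("rainy", "true", "no"),
    ("overcast", "true", "yes"), ("sunny", "false", "no"),
    ("sunny", "false", "yes"), ("rainy", "false", "yes"),
    ("sunny", "true", "yes"), ("overcast", "true", "yes"),
    ("overcast", "false", "yes"), ("rainy", "true", "no"),
]
ATTRS = ["outlook", "windy"]


def map_phase(rows, attr_names):
    """Map: emit ((attribute, value, class), 1) for each attribute of each row."""
    for row in rows:
        *values, label = row
        for name, value in zip(attr_names, values):
            yield (name, value, label), 1


def reduce_phase(pairs):
    """Reduce: sum the emitted counts per key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def entropy(freqs):
    """Shannon entropy (base 2) of a list of frequencies."""
    freqs = [f for f in freqs if f]
    total = sum(freqs)
    return -sum(f / total * log2(f / total) for f in freqs)


def gain_ratio(counts, attr):
    """C4.5 gain ratio of `attr`, computed only from the reduced counts."""
    by_value = {}  # attribute value -> Counter(class -> count)
    for (name, value, label), n in counts.items():
        if name == attr:
            by_value.setdefault(value, Counter())[label] += n
    class_totals = Counter()
    for class_counts in by_value.values():
        class_totals.update(class_counts)
    total = sum(class_totals.values())
    info = entropy(list(class_totals.values()))
    cond = sum(
        sum(c.values()) / total * entropy(list(c.values()))
        for c in by_value.values()
    )
    split_info = entropy([sum(c.values()) for c in by_value.values()])
    return (info - cond) / split_info if split_info else 0.0


counts = reduce_phase(map_phase(ROWS, ATTRS))
best = max(ATTRS, key=lambda a: gain_ratio(counts, a))
print(best, round(gain_ratio(counts, best), 4))
```

Because the gain ratio depends only on the summed counts, the map step can run independently on each data split and the reduce step only has to add the partial counts together. That independence is what makes this part of tree construction a natural fit for parallel execution.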