To address the surge in data volume brought by the big data era, this study builds on the random decision tree algorithm. By analyzing the probability estimates of a single tree and of multiple trees, it applies unsupervised locality-sensitive hashing (LSH) functions to sensitive classification of big data. For distributed data mining, dense data types are handled with hyperplane hashing, which reduces the space of possible hyperplanes and adds coefficients. Sparse data types are handled by combining SimHash, which generates random vectors indirectly, with FastHash, which maps integers to bitmaps. Finally, simulations on 8 small data sets and 6 large data sets on the Spark platform show that the improved algorithm does not need to construct many deep trees. The experiments also examine the scalability of the improved algorithm on clusters configured with different numbers of nodes.
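As an illustration of the hyperplane hashing idea mentioned above, the following minimal sketch hashes a dense vector by recording on which side of each random hyperplane it falls. It is a generic random-hyperplane (SimHash-style) LSH under stated assumptions, not the implementation evaluated in this work: the function names `make_hyperplanes`, `hyperplane_hash`, and `hamming`, the Gaussian sampling of hyperplane normals, and the 16-bit code length are all hypothetical choices made for the example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals of dimension dim (assumed Gaussian)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hyperplane_hash(vector, hyperplanes):
    """Return an integer code with one bit per hyperplane.

    A bit is 1 when the vector lies on the non-negative side of the hyperplane.
    """
    code = 0
    for normal in hyperplanes:
        dot = sum(w * x for w, x in zip(normal, vector))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(num_bits=16, dim=4, seed=42)
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]       # same direction as a, different length
c = [-1.0, -2.0, -3.0, -4.0]   # opposite direction to a

code_a = hyperplane_hash(a, planes)
code_b = hyperplane_hash(b, planes)
code_c = hyperplane_hash(c, planes)
```

Because each bit depends only on the sign of a dot product, vectors pointing in the same direction receive identical codes, while vectors at a larger angle differ in more bits. In a random-decision-tree setting, such bits could plausibly serve as random split tests at internal nodes, although that mapping is an assumption of this sketch rather than a detail stated in the abstract.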