Traditional decision tree classifiers rely on hand-crafted features to classify data, and they run into processing-speed and resource-allocation bottlenecks when handling massive astronomical data. To address both problems, this paper combines the strong feature-learning capability of deep learning with the efficient data processing of Spark and proposes a parallel deep neural decision tree algorithm based on the Spark platform, which is then applied to the astronomical star/galaxy classification problem. The results show that the algorithm scales well: adding compute nodes to the Spark cluster reduces the training time of the classification model and increases its capacity to process massive astronomical data. Moreover, because it learns the representation of the input data and the final classifier jointly, it achieves higher classification accuracy on the star/galaxy classification problem than a traditional decision tree.