In recent years, deep learning algorithms have risen to prominence among data mining methods because of their adaptability, high accuracy, and complex structure, yet they have rarely been applied in astroinformatics. Star/galaxy classification in the Sloan Digital Sky Survey (SDSS) commonly achieves high accuracy on bright source sets but low accuracy on faint source sets. To address this problem, we introduce two relatively recent results from deep learning, the stacked denoising autoencoder (SDA) neural network and dropout fine-tuning, which improve robustness and noise resistance. From the SDSS photometric data with spectroscopic measurements, we randomly drew bright and faint source sets from Data Release 7 (DR7) and Data Release 12 (DR12) and preprocessed them. We then sampled each bright and faint source set without replacement to obtain the corresponding training and testing sets. Using these training sets, we trained separate SDA models for the bright and faint sources of DR7 and DR12. We compared the SDA results on the DR12 testing sets with those of Library for Support Vector Machines (LibSVM), the J48 decision tree (J48), Logistic Model Trees (LMT), Support Vector Machine (SVM), Logistic Regression, and Decision Stump, and we compared the SDA results on the DR7 testing sets with those of six decision tree algorithms. The simulations show that SDA clearly outperforms the other algorithms on the faint source sets of both SDSS-DR7 and the more recent SDSS-DR12. In particular, when the completeness function (CP) is used as the evaluation metric, SDA improves accuracy on the SDSS-DR7 faint source set by about 15% relative to the decision tree algorithms.
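As an illustration of two steps named above, the following minimal Python sketch draws disjoint training and testing sets by sampling without replacement and computes a completeness score. It is not the authors' code. The function names, the seed, and the set sizes are hypothetical, and the abstract does not define the completeness function, so the sketch assumes the usual meaning: the fraction of objects truly belonging to a class that the classifier assigns to that class.

```python
import random


def split_without_replacement(items, n_train, n_test, seed=0):
    """Draw disjoint training and testing sets from `items`.

    Hypothetical helper: the sizes and the seed are illustrative only.
    """
    items = list(items)
    if n_train + n_test > len(items):
        raise ValueError("not enough items for the requested split")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no object appears twice.
    drawn = rng.sample(items, n_train + n_test)
    return drawn[:n_train], drawn[n_train:]


def completeness(true_labels, predicted_labels, target):
    """Fraction of objects of true class `target` predicted as `target`.

    Assumed definition; the source does not spell out the formula.
    """
    total = sum(1 for t in true_labels if t == target)
    if total == 0:
        raise ValueError("no objects of the target class")
    hits = sum(
        1
        for t, p in zip(true_labels, predicted_labels)
        if t == target and p == target
    )
    return hits / total


if __name__ == "__main__":
    train, test = split_without_replacement(range(100), n_train=70, n_test=30)
    print(len(train), len(test), set(train).isdisjoint(test))

    truth = ["galaxy", "galaxy", "star", "galaxy"]
    guess = ["galaxy", "star", "star", "galaxy"]
    print(completeness(truth, guess, "galaxy"))
```

Under this assumed definition, completeness is computed per class, so a separate value would be reported for stars and for galaxies on each of the bright and faint testing sets.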