Data classification is an important problem in data mining, but current data mining work faces the challenge of large sample sizes and high-dimensional data. Selecting an effective subset of features reduces the dimensionality of the data and provides the basis for subsequent classification. Currently popular feature selection methods adapt poorly to high-dimensional data, and their accuracy is limited. This paper therefore proposes a feature selection method based on the t-test and the elastic net. The basic idea is to use t-tests to measure how strongly each feature differs between classes, then to analyze the features with the largest differences using an elastic net regression model, and finally to obtain the feature subset from the shrinkage of the regression coefficients and the misclassification rate. Experiments confirm that the method performs well in terms of accuracy, stability, and time cost.
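The two-stage idea described above (a t-test filter followed by elastic net shrinkage) can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything here is an assumption made for illustration: the Welch form of the t statistic, the `top_k` cutoff, the penalty settings `lam` and `alpha`, the coordinate-descent solver, the synthetic two-class data, and the use of the training misclassification rate as a simple check. The function names `welch_t`, `elastic_net`, and `select_features` are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def soft_threshold(z, g):
    """Shrink z toward zero by g (the L1 part of the penalty)."""
    return math.copysign(max(abs(z) - g, 0.0), z)


def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent minimizing
    (1/2n)||y - Xb||^2 + lam * (alpha*||b||_1 + (1-alpha)/2 * ||b||_2^2).
    Columns of X and y are assumed to be centered (no intercept)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of feature j with the partial residual
            rho = sum(c * (r + c * b[j]) for c, r in zip(col, resid)) / n
            norm = sum(c * c for c in col) / n
            new = soft_threshold(rho, lam * alpha) / (norm + lam * (1 - alpha))
            delta = new - b[j]
            if delta:
                resid = [r - c * delta for c, r in zip(col, resid)]
                b[j] = new
    return b


def select_features(X, y, top_k=4, lam=0.1, alpha=0.5):
    """Return (selected feature indices, training misclassification rate)."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    # Step 1: rank features by |t| between the two classes (labels 0 and 1)
    t = [welch_t([v for v, c in zip(col, y) if c == 1],
                 [v for v, c in zip(col, y) if c == 0]) for col in cols]
    kept = sorted(range(p), key=lambda j: -abs(t[j]))[:top_k]
    # Step 2: fit the elastic net on the centered, pre-filtered features
    y_bar = mean(y)
    means = [mean(cols[j]) for j in kept]
    Xc = [[row[j] - m for j, m in zip(kept, means)] for row in X]
    yc = [v - y_bar for v in y]
    b = elastic_net(Xc, yc, lam, alpha)
    # Step 3: keep features whose coefficients were not shrunk to zero
    selected = sorted(j for j, bj in zip(kept, b) if bj != 0.0)
    # Misclassification rate of the thresholded linear fit (assumed rule)
    pred = [int(sum(bj * v for bj, v in zip(b, row)) + y_bar > 0.5)
            for row in Xc]
    err = sum(q != c for q, c in zip(pred, y)) / len(y)
    return selected, err


# Synthetic data (assumption): features 0 and 1 differ between classes,
# features 2..7 are pure noise.
random.seed(0)
y = [0] * 20 + [1] * 20
X = [[random.gauss(3 * c, 1), random.gauss(-3 * c, 1)] +
     [random.gauss(0, 1) for _ in range(6)] for c in y]
selected, err = select_features(X, y, top_k=4)
print("selected features:", selected, "training error:", err)
```

In this sketch the t-test acts as a cheap filter that cuts the number of candidate features before the more expensive regression step, and the L1 part of the elastic net penalty then drives the coefficients of uninformative features to exactly zero. In practice the penalty strength would more plausibly be chosen by comparing held-out misclassification rates across several values of `lam`, which is omitted here for brevity.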