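Based on a study of the small-sample, high-dimensionality, and noise and redundancy problems in tumor gene data, a Wrappers-type feature selection method is proposed. Samples are grouped by label, with samples that share a label placed in one matrix. The correlation matrix of each of the two gene matrices is computed to find sets of highly correlated features, and the correlation matrices are then analyzed together to find the set of features that are correlated with a specified feature in all classes. The feature with the highest Score is selected from this set, yielding a feature subset with the best feature combination and shrinking the feature space. Tests on three tumor data sets verify that the method achieves good classification results.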
To address the small sample sizes, high dimensionality, and noise typical of tumor gene data, a Wrappers feature selection method based on supervised correlation is proposed. The training samples are partitioned by label into two gene matrices, and the correlation matrix of each gene matrix is computed to extract sets of highly correlated features. The correlation matrices are then analyzed jointly to find, for a specified feature, the subset of features correlated with it in all classes. From each such subset, the feature with the highest score is selected to form the final optimal feature subset, reducing the feature dimension. Experimental results on three tumor data sets show that the proposed algorithm achieves good classification performance.
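
The procedure summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions: the abstract does not define the Score, so a simple Fisher-like score (the hypothetical `fisher_like_score`) stands in for it, whereas a wrapper method would presumably score features by classifier performance. The correlation cutoff `threshold=0.8`, the use of absolute Pearson correlation, the 0/1 label encoding, and the greedy visiting order are likewise illustrative choices.

```python
import numpy as np


def fisher_like_score(X, y):
    """Hypothetical per-feature score (assumption, not from the paper):
    gap between the two class means divided by the summed class spreads.
    Assumes binary labels encoded as 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
    spread = a.std(axis=0) + b.std(axis=0) + 1e-12  # avoid division by zero
    return gap / spread


def select_features(X, y, threshold=0.8):
    """Keep one high-scoring feature from each group of features that are
    strongly correlated with it in every class; return selected indices."""
    classes = np.unique(y)
    # One absolute correlation matrix per class (features are columns).
    corrs = [np.abs(np.corrcoef(X[y == c], rowvar=False)) for c in classes]
    # A pair of features is treated as redundant only if its correlation
    # reaches the threshold in all classes simultaneously.
    redundant = np.all([c >= threshold for c in corrs], axis=0)

    scores = fisher_like_score(X, y)
    remaining = np.ones(X.shape[1], dtype=bool)
    selected = []
    for j in np.argsort(-scores):  # visit features from highest score down
        if not remaining[j]:
            continue
        selected.append(int(j))
        # Drop feature j and every remaining feature redundant with it.
        remaining &= ~redundant[j]
        remaining[j] = False
    return sorted(selected)
```

Under these assumptions, calling `select_features(X, y)` on a samples-by-genes matrix `X` with label vector `y` returns the indices of the retained genes. Visiting features in descending score order means each retained feature is the highest-scoring member among the still-unassigned features correlated with it in both classes.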