Objective To develop an efficient gene feature extraction method that mitigates the information loss caused by subjective threshold setting in conventional noise-gene elimination. Methods Data were taken from the public acute leukemia gene expression dataset published by Golub et al. Noise genes were eliminated with a relatively relaxed threshold so that more genes were retained. Two-dimensional principal component analysis (2D-PCA) was then applied to the selected genes to extract secondary features, and a support vector machine (SVM) based classifier was used for classification. Results The proposed approach yielded 90 secondary features and a classification accuracy of 100.00%. Compared with classification performed directly on the primary features, classification accuracy improved by 2.78% to 8.35%. Conclusion It is feasible to extract effective features of relatively low dimension by appropriately increasing the number of selected genes.
Key words: gene; feature extraction; secondary feature; support vector machine; classification
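The secondary feature extraction step can be illustrated with a minimal 2D-PCA sketch. This is not the authors' implementation: the abstract does not say how many genes were retained, how each sample's expression vector was arranged into a matrix, or how many projection axes were kept. The 900 retained genes, the 30 x 30 arrangement, the 3 projection axes (chosen only so that 30 x 3 = 90 features, matching the reported count), the random data, and the function names `fit_2dpca` and `transform_2dpca` are all assumptions made for illustration.

```python
import numpy as np


def fit_2dpca(matrices, n_components):
    """Return a (cols, n_components) projection from samples of shape (n, rows, cols)."""
    centered = matrices - matrices.mean(axis=0)
    # Image scatter matrix G = (1/n) * sum_i (A_i - mean)^T (A_i - mean), shape (cols, cols).
    scatter = np.einsum("irc,ird->cd", centered, centered) / len(matrices)
    # eigh returns eigenvalues in ascending order; keep the largest n_components.
    eigvals, eigvecs = np.linalg.eigh(scatter)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


def transform_2dpca(matrices, projection):
    """Project each sample matrix (Y_i = A_i X) and flatten it to one feature vector."""
    return (matrices @ projection).reshape(len(matrices), -1)


# Synthetic stand-in for the selected genes (assumed sizes, random values).
rng = np.random.default_rng(0)
n_samples, n_genes = 72, 900          # 900 retained genes is a hypothetical figure
expression = rng.normal(size=(n_samples, n_genes))

# Assumed arrangement: each sample's gene vector reshaped into a 30 x 30 matrix.
matrices = expression.reshape(n_samples, 30, 30)

projection = fit_2dpca(matrices, n_components=3)
features = transform_2dpca(matrices, projection)
print(features.shape)  # (72, 90)
```

In a real experiment the projection would be fitted on training samples only and then applied to held-out samples. The resulting feature vectors could then be passed to an SVM classifier, for example scikit-learn's `SVC`.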