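Glioma is a serious intracranial tumor characterized by high recurrence, high mortality, and a low cure rate. Identifying glioma-related feature genes from gene microarray data can provide a useful reference for clinical diagnosis and biomedical research on the disease. For glioma data, the authors propose an integrated, random-forest-like method for feature gene selection. First, supervised singular value decomposition is applied to reduce the dimensionality of the data and coarsely select candidate genes; second, a random-forest-like feature selection method is applied to choose the feature genes. Experimental results show that the method adapts well to different classifiers and that its classification rate is clearly better than that of the compared methods. More importantly, 39 of the top 50 selected feature genes are closely associated with glioma or with biological processes of tumor cells. This confirms that the method not only maintains a high classification rate but also ensures that the selected feature genes are biologically meaningful, making it feasible and practical.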
Glioma is a serious intracranial tumor with high relapse and mortality rates. With advances in microarray technology, gene biomarkers have the potential to provide more accurate and objective cancer diagnosis. In this study, the authors propose an integrated method based on random forest and singular value decomposition (SVD) for gene selection. First, a supervised SVD analysis is applied to reduce data dimensionality and select candidate genes. Second, a semi-random forest based method is applied to select biomarkers from the candidate genes. Experimental results show that the method is classifier-independent and compares favorably with state-of-the-art methods. More importantly, of the top 50 selected genes, 39 have been shown to be directly or indirectly connected to glioma.
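
The two-stage procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no details of the supervised SVD or of the semi-random forest, so both stages below are assumptions. Stage 1 ranks genes by their loadings on the singular components most correlated with the class labels, and stage 2 substitutes a standard random forest from scikit-learn for the semi-random forest variant. The function names (`make_toy_data`, `svd_coarse_select`, `forest_select`), all parameter values, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_toy_data(n_per_class=30, n_genes=500, n_informative=10, shift=3.0, seed=0):
    """Synthetic two-class 'expression matrix' (samples x genes).
    Only the first n_informative genes differ between the classes."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, n_genes))
    X[y == 1, :n_informative] += shift
    return X, y


def svd_coarse_select(X, y, n_candidates, n_components=2):
    """Stage 1 (assumed form of 'supervised SVD'): keep the singular
    components whose sample scores correlate most with the class labels,
    then rank genes by their weighted absolute loadings on them."""
    Xc = X - X.mean(axis=0)                        # center each gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    yc = y - y.mean()
    # Absolute correlation between each component's sample scores and y
    # (the columns of U have unit norm).
    corr = np.abs(U.T @ yc) / np.linalg.norm(yc)
    top = np.argsort(corr)[::-1][:n_components]    # most label-related components
    gene_score = (corr[top] * s[top]) @ np.abs(Vt[top])
    return np.argsort(gene_score)[::-1][:n_candidates]


def forest_select(X, y, candidates, n_genes, seed=0):
    """Stage 2 (stand-in): rank the candidate genes by the impurity-based
    importances of a standard random forest and keep the top n_genes."""
    forest = RandomForestClassifier(n_estimators=300, random_state=seed)
    forest.fit(X[:, candidates], y)
    order = np.argsort(forest.feature_importances_)[::-1][:n_genes]
    return candidates[order]


X, y = make_toy_data()
candidates = svd_coarse_select(X, y, n_candidates=50)   # coarse selection
selected = forest_select(X, y, candidates, n_genes=10)  # final feature genes
print(sorted(selected.tolist()))
```

On real microarray data, the candidate and final gene counts would need to be tuned, and the selected genes would have to be validated with held-out samples and biological evidence, as the study does for its top 50 genes.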