Objective The rise of modern genomics, proteomics and metabolomics has produced large volumes of high-dimensional omics data. The central tasks in analyzing such data are classifying samples and selecting biologically meaningful biomarkers. To address these tasks, we applied boosting, a machine learning method widely regarded as effective, to high-dimensional data and examined the conditions under which it is applicable and how well it performs. Methods Through repeated iterations, boosting combines weak base classifiers (decision trees) into a strong classifier. Simulations were used to examine and verify classification performance and the measurement of variable importance in the presence of a large number of non-differential (noise) variables, and real gene expression data were used to further assess performance in practice. Results In the simulations, the combined model built from boosting and decision trees classified with high accuracy and showed some resistance to interference from noise variables, while also evaluating variable importance effectively. With all genes retained, classification of real colon cancer gene expression data was also very good, and the features selected by boosting provided clues for the discovery of pathogenic genes in colon cancer. Conclusion Boosting models built on decision trees can be applied effectively to discriminant classification of high-dimensional data and to the evaluation of variable importance. Boosting shows many potential advantages for small-sample, noisy, high-dimensional problems. Compared with other methods currently in use, it has distinct characteristics when dealing with high-dimensional data of complex structure, such as fast computation, broad applicability and relatively easy software implementation. It is therefore a method worth recommending and studying further.
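
To make the procedure described in the Methods concrete, the following is a minimal sketch of boosting with tree-based weak learners on simulated data containing many noise variables. It is not the authors' implementation. Several choices are assumptions made only for illustration: the discrete AdaBoost variant, one-split decision stumps standing in for decision trees, the simulation settings (60 samples, 200 variables, 3 informative variables with a mean shift of 3), and an importance measure defined as the share of total stump weight assigned to each variable. All function names are hypothetical.

```python
import math
import random

def simulate(n=60, p=200, n_informative=3, shift=3.0, seed=0):
    """Two balanced classes; only the first n_informative variables differ in mean (assumed design)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        if label == 1:
            for j in range(n_informative):
                row[j] += shift
        X.append(row)
        y.append(label)
    return X, y

def stump_predict(x, j, thr, pol):
    """Decision stump: predict pol if x[j] > thr, otherwise -pol."""
    return pol if x[j] > thr else -pol

def best_stump(X, y, w, orders):
    """Return (weighted error, variable, threshold, polarity) of the best stump.
    Assumes the weights w sum to 1 and that values within a variable are distinct."""
    best = (1.0, 0, -math.inf, 1)
    base = sum(wi for wi, yi in zip(w, y) if yi == -1)  # error of "always +1"
    for j, order in enumerate(orders):
        err, thr = base, -math.inf
        for k, i in enumerate(order):
            # Candidates: the stump "+1 if x > thr" and its mirror image.
            for e, pol in ((err, 1), (1.0 - err, -1)):
                if e < best[0]:
                    best = (e, j, thr, pol)
            # Move sample i to the "x <= thr" side for the next candidate.
            err += w[i] if y[i] == 1 else -w[i]
            if k + 1 < len(order):
                thr = 0.5 * (X[i][j] + X[order[k + 1]][j])
    return best

def adaboost(X, y, rounds=20):
    """Discrete AdaBoost: reweight samples so later stumps focus on earlier mistakes."""
    n, p = len(X), len(X[0])
    # Sort sample indices once per variable so each stump search is a linear sweep.
    orders = [sorted(range(n), key=lambda i, j=j: X[i][j]) for j in range(p)]
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w, orders)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, j, thr, pol))
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, thr, pol))
             for wi, yi, x in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(x, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

def importance(model, p):
    """Share of total stump weight per variable (an illustrative measure, not the paper's)."""
    imp = [0.0] * p
    for a, j, _, _ in model:
        imp[j] += a
    total = sum(imp)
    return [v / total for v in imp]

X, y = simulate()
model = adaboost(X, y)
train_acc = sum(predict(model, x) == yi for x, yi in zip(X, y)) / len(y)
imp = importance(model, len(X[0]))
top3 = sorted(range(len(imp)), key=lambda j: -imp[j])[:3]
print("training accuracy:", train_acc)
print("top-ranked variables:", top3)
```

In this sketch, variables 0 to 2 are the informative ones, so a useful importance ranking should place them near the top even though the remaining 197 variables are pure noise. Training accuracy alone overstates performance; a held-out set or cross-validation would be needed to estimate the classification error on new samples.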