Application of the Boosting Method in High-Dimensional Data Analysis
  • Journal: Chinese Journal of Hospital Statistics (中国医院统计). 2011. 18(1): 1-5
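  • Classification: R331.143 [Medicine and Health: Human Physiology; Medicine and Health: Basic Medicine]
  • Author affiliation: [1] Statistics Teaching and Research Section, School of Public Health, Harbin Medical University, Harbin 150081, Heilongjiang Province
  • Funding: National Natural Science Foundation of China (30872185)
  • Related project: Statistical feature extraction and data analysis methods for dynamic metabolomic fingerprint profiles
Chinese abstract (translated):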

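Objective: With the rise of modern genomics, proteomics, and metabolomics research, large volumes of high-dimensional omics data have been generated. An important task in analyzing such data is to classify samples and to select feature markers with biological significance. To address this problem, this study applied boosting, a method currently recognized as performing well, to high-dimensional data analysis, and examined the conditions under which boosting is applicable to high-dimensional data and how well it performs. Methods: Through repeated iterations, boosting combines base weak classifiers (decision trees) into a strong classifier. Simulation experiments were used to study and verify classification performance and variable-importance measurement in the presence of a large number of non-differential variables, and real gene expression data were used to further assess performance in practice. Results: The simulations showed that the ensemble model built by boosting with decision trees classified with high accuracy and had some resistance to interference from noise variables. While classifying, it also evaluated variable importance effectively. With all genes retained, classification of real colon cancer gene expression data was highly satisfactory and provided clues for the discovery of colon cancer pathogenic genes in medical research. Conclusion: The boosting ensemble classification model built on decision trees can be applied effectively to discriminant classification and variable-importance evaluation for high-dimensional data. Boosting shows many potential advantages for high-dimensional problems with small samples and heavy noise. Compared with other methods in current use, for high-dimensional data with complex structure boosting has distinct characteristics of its own, such as fast computation, broader applicability, and relatively easy software implementation, and it is a method worth recommending and studying further.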

English abstract:

Objective: High-dimensional omics data are generated along with the rise of modern genomics, proteomics and metabonomics experiments. The primary task in analyzing high-dimensional omics data is to classify the samples and select biologically significant biomarkers. We adopted boosting, a well-recommended machine learning method, to analyze high-dimensional data, and discussed the conditions and effects of applying boosting to high-dimensional data. Methods: Through multiple iterations, boosting turns weak classifiers (decision trees) into a strong one. The effect of the classifier was tested on simulations and on real gene expression data. Results: Simulations showed that models constructed by boosting performed well even when the amount of noise increased. While classifying, boosting also evaluated the importance of variables effectively. With all genes retained, similar results were obtained from real gene expression data of colon cancer, and the features selected by boosting provided important clues for the discovery of pathogenic genes in colon cancer. Conclusion: Boosting models can be used effectively for classifying high-dimensional data and evaluating the importance of variables. Compared with other methods in current use, boosting shows many potential advantages when dealing with complicated high-dimensional data, such as rapid computation, wide applicability and easy programming. Therefore, boosting is a recommended method that merits further study.
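
Illustrative sketch (not from the paper): this record does not give the authors' simulation design, boosting variant, or software. The Python below is a minimal sketch of the general idea only, assuming AdaBoost with one-split decision stumps as the weak learner and a variable-importance score defined as each variable's share of the total stump weight. The data generator, sample sizes, effect size, and all function names are hypothetical choices made for illustration.

```python
import math
import random

def make_data(n, n_noise, seed=0):
    """Simulate n samples: variable 0 carries the class signal, the rest are pure noise.
    (Assumed design for illustration; not the paper's simulation.)"""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        row = [1.5 * label + rng.gauss(0, 1)]              # informative variable
        row += [rng.gauss(0, 1) for _ in range(n_noise)]   # non-differential variables
        X.append(row)
        y.append(label)
    return X, y

def stump_predict(row, j, t, sign):
    """One-split rule: predict `sign` if variable j exceeds threshold t, else -sign."""
    return sign if row[j] > t else -sign

def fit_stump(X, y, w):
    """Return (error, variable, threshold, sign) with the lowest weighted error.
    Assumes the weights w sum to 1, so the flipped rule has error 1 - err."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            err = sum(wi for row, yi, wi in zip(X, y, w)
                      if stump_predict(row, j, t, 1) != yi)
            for sign, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, j, t, sign)
    return best

def adaboost(X, y, rounds):
    """Discrete AdaBoost: reweight samples each round and collect weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # stump weight
        model.append((alpha, j, t, sign))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(row, j, t, s) for a, j, t, s in model)
    return 1 if score >= 0 else -1

def importance(model, n_features):
    """Share of the total stump weight (alpha) assigned to each variable."""
    imp = [0.0] * n_features
    for a, j, _, _ in model:
        imp[j] += a
    total = sum(imp)
    return [v / total for v in imp]

if __name__ == "__main__":
    X, y = make_data(n=60, n_noise=20, seed=1)
    model = adaboost(X, y, rounds=10)
    acc = sum(predict(model, row) == yi for row, yi in zip(X, y)) / len(X)
    imp = importance(model, len(X[0]))
    top = sorted(range(len(imp)), key=lambda j: -imp[j])[:3]
    print("training accuracy:", acc)
    print("top variables by importance:", [(j, round(imp[j], 3)) for j in top])
```

The sketch reports training accuracy only. Assessing the kind of small-sample, high-noise performance the abstract describes would require held-out or cross-validated error, which is omitted here for brevity.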
