Objective To investigate methods and strategies for genetic risk prediction based on genome-wide association study (GWAS) data for lung cancer, comparing three prediction methods: weighted genetic risk score (wGRS), support vector machine (SVM) and random forest (RF). Methods The Nanjing and Beijing subsamples of the lung cancer GWAS data served as the training set and the testing set, respectively. Under two strategies, the full predictive set (FS) and the best predictive subset (BS), we compared the prediction accuracy of the three methods across different linkage disequilibrium (LD) structures and initial screening significance levels (α). Results Under a high-LD structure, the prediction accuracy of wGRS rose as -log(α) increased. RF and SVM were less sensitive to LD structure than wGRS, but all three methods predicted more accurately under a low-LD structure (r² < 0.2) than under a high-LD structure. With wGRS, BS performed slightly better than FS; with SVM, BS was close to but slightly worse than FS; with RF, BS was markedly worse than FS. Conclusion Pruning SNPs by LD structure and choosing an appropriate screening level α can improve the accuracy of genetic risk prediction; under these conditions, wGRS outperformed SVM and RF.
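The screening, LD-pruning and scoring steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: genotypes are coded as 0/1/2 risk-allele counts, the wGRS weights are per-SNP log odds ratios estimated on the training set, LD is measured as the squared Pearson correlation between genotype vectors, and pruning keeps SNPs greedily in order of ascending p-value. The function names (`r_squared`, `select_snps`, `wgrs`) and all toy values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined; treated as 0 (assumption).
        return 0.0
    return sxy * sxy / (sxx * syy)


def select_snps(genotypes, p_values, alpha, r2_max=0.2):
    """Keep SNPs with p < alpha, then greedily LD-prune by ascending p-value.

    A candidate is dropped if its r^2 with any already-kept SNP is >= r2_max.
    """
    columns = list(zip(*genotypes))  # one tuple of allele counts per SNP
    candidates = sorted(
        (j for j, p in enumerate(p_values) if p < alpha),
        key=lambda j: p_values[j],
    )
    kept = []
    for j in candidates:
        if all(r_squared(columns[j], columns[k]) < r2_max for k in kept):
            kept.append(j)
    return sorted(kept)


def wgrs(genotype_row, weights, snps):
    """Weighted genetic risk score: sum over SNPs of weight * risk-allele count."""
    return sum(weights[j] * genotype_row[j] for j in snps)


# Toy data (hypothetical): 6 individuals x 4 SNPs.
genotypes = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [2, 2, 1, 2],
    [0, 0, 1, 0],
    [1, 1, 2, 1],
    [2, 2, 1, 2],
]
p_values = [1e-6, 1e-4, 1e-3, 0.2]   # training-set association p-values (made up)
log_or = [0.5, 0.4, -0.3, 0.1]       # training-set log odds ratios (made up)

kept = select_snps(genotypes, p_values, alpha=0.01, r2_max=0.2)
scores = [wgrs(row, log_or, kept) for row in genotypes]
# SNP 1 duplicates SNP 0 (r^2 = 1) and SNP 3 fails the alpha screen,
# so kept == [0, 2].
```

In this sketch, a larger -log(α) corresponds to a smaller `alpha` and hence fewer candidate SNPs, while a lower `r2_max` enforces a lower-LD set of predictors; in the study's design the resulting scores would be evaluated on the separate testing set.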