Objective: To study how to screen variables quickly and effectively in order to build an accurate and reliable logistic regression prediction model; how to reliably assess the model's generalization ability (that is, its predictive performance) when the sample is small; and, when the data are obtained by separate sampling, how to adjust the model for oversampling so that the adjusted results can be used to predict the probability of disease occurrence in the general population.
Methods: Using data on type 2 diabetes complicated by peripheral neuropathy as an example, we combined the best subset method with the Akaike Information Criterion (AIC) to screen variables quickly and conveniently, and used Monte Carlo simulated sampling (specifically, 10 to 100 repetitions of 3- to 10-fold stratified cross-validation) to assess and compare the generalization ability of the models.
Results: The logistic regression model built by combining the best subset method with AIC had an accuracy of 79.6% and an area under the ROC curve of 0.8802. Under stratified cross-validation, its generalization ability was superior to that of models built with conventional variable screening methods. The prior probabilities were used to adjust the posterior probabilities for oversampling, so that the adjusted results can be used to predict the probability of disease occurrence in the general population.
Conclusion: When building a logistic regression prediction model, researchers should try several variable screening and modeling strategies as far as the data at hand allow. With small samples, stratified cross-validation can be used to obtain a reliable assessment of generalization ability. When the sample is obtained by separate sampling and the aim of the study is a predictive model, the posterior probabilities should be adjusted using the prior probabilities.
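The adjustment of posterior probabilities by prior probabilities mentioned above can be illustrated with a minimal sketch. It assumes the commonly used prior correction for separately sampled data, in which the odds predicted by the model are rescaled by the ratio of the population prior odds to the sample prior odds; the abstract does not state the exact formula the authors used, so this is one plausible form rather than their implementation. The function names and the prevalence values (0.5 in the sample, 0.1 in the population) are hypothetical and chosen only for illustration.

```python
import math


def adjust_posterior(p_sample, sample_prev, population_prev):
    """Rescale a posterior probability from an oversampled data set.

    p_sample        -- probability predicted by the model fitted on the sample
    sample_prev     -- proportion of cases in the (oversampled) sample
    population_prev -- assumed proportion of cases in the target population
    """
    if not (0.0 < sample_prev < 1.0 and 0.0 < population_prev < 1.0):
        raise ValueError("prevalences must lie strictly between 0 and 1")
    if not 0.0 <= p_sample <= 1.0:
        raise ValueError("p_sample must lie between 0 and 1")
    # Weight each class by (population prior) / (sample prior).
    w_case = population_prev / sample_prev
    w_control = (1.0 - population_prev) / (1.0 - sample_prev)
    numerator = p_sample * w_case
    return numerator / (numerator + (1.0 - p_sample) * w_control)


def intercept_offset(sample_prev, population_prev):
    """Equivalent correction on the logit scale: the amount added to the intercept."""
    population_logit = math.log(population_prev / (1.0 - population_prev))
    sample_logit = math.log(sample_prev / (1.0 - sample_prev))
    return population_logit - sample_logit


# Hypothetical values: a balanced case-control sample, 10% prevalence in the population.
adjusted = adjust_posterior(0.8, sample_prev=0.5, population_prev=0.1)
```

Only the intercept of the logistic model is affected by this correction; the slope coefficients are unchanged, which is why the same result is obtained by adding `intercept_offset(...)` to the fitted intercept before computing the predicted probabilities.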