To improve prediction accuracy in quantitative structure-activity relationship (QSAR) studies, a novel nonlinear combination forecasting method was developed that uses support vector regression (SVR) for nonlinear screening of molecular structure descriptors and k-nearest-neighbor groups for nonlinear combination. First, with minimum mean squared error (MSE) as the criterion, the descriptors are screened nonlinearly by SVR under leave-one-out cross-validation through multiple rounds of last-place elimination, which yields the optimal kernel function and the corresponding retained descriptors. Second, based on the Euclidean distances between the retained-descriptor vectors of the test sample and the training samples, the double leave-one-out predictions of sub-models built on different k-nearest-neighbor groups are used to characterize the heterogeneity of the sample set. Third, again with minimum MSE as the criterion, the neighbor-group sub-models are screened nonlinearly by SVR under leave-one-out through multiple rounds of last-place elimination, which yields the optimal kernel function and the corresponding retained sub-models. Finally, a combination forecast is made from the retained sub-models by double leave-one-out. Validation on a QSAR example, the toxicity of substituted anilines and phenols to Daphnia magna Straus, showed that the new method had the highest prediction accuracy among all reference models and captured the nonlinear relationship between the descriptors and compound toxicity in finer detail. The method offers structural risk minimization, nonlinearity, and suitability for small samples; it effectively overcomes overfitting, the curse of dimensionality, and local minima; it screens descriptors and sub-models nonlinearly, combines predictions nonlinearly, and selects the optimal kernel function and its parameters automatically; and it shows strong generalization ability and high prediction accuracy. It therefore has broad application prospects in QSAR research.
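The screening and neighbor-group steps can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the SVR sub-model is replaced by a plain mean over the k nearest training samples, the stopping rule (stop when no removal lowers the leave-one-out MSE) is an assumption, and all function names and the toy data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_group_predict(train_X, train_y, query, k):
    """Predict from the k-nearest-neighbor group of `query`.
    Assumption: a neighbor mean stands in for the SVR sub-model."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def loo_predictions(X, y, k, keep):
    """Leave-one-out predictions using only the descriptor columns in `keep`."""
    preds = []
    for i in range(len(X)):
        train_X = [[row[j] for j in keep] for t, row in enumerate(X) if t != i]
        train_y = [v for t, v in enumerate(y) if t != i]
        query = [X[i][j] for j in keep]
        preds.append(knn_group_predict(train_X, train_y, query, k))
    return preds

def loo_mse(X, y, k, keep):
    """Leave-one-out mean squared error for a given descriptor subset."""
    preds = loo_predictions(X, y, k, keep)
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def backward_eliminate(X, y, k):
    """Multi-round elimination: each round drops the descriptor whose
    removal gives the lowest LOO MSE, and stops when no removal helps."""
    keep = list(range(len(X[0])))
    best = loo_mse(X, y, k, keep)
    while len(keep) > 1:
        trials = [(loo_mse(X, y, k, [j for j in keep if j != d]), d)
                  for d in keep]
        mse, drop = min(trials)
        if mse > best:
            break
        best = mse
        keep.remove(drop)
    return keep, best

# Toy data (hypothetical): column 0 carries the signal, column 1 is noise.
X = [[0.0, 5.0], [1.0, 0.0], [2.0, 4.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

keep, best = backward_eliminate(X, y, k=2)
# One column of LOO predictions per neighbor-group size k; in the method
# described above, such columns would feed the second screening stage.
submodel_columns = {k: loo_predictions(X, y, k, keep) for k in (1, 2, 3)}
print(keep, best)
```

In a fuller version each neighbor group would train its own SVR and the kernel would be chosen by the same MSE criterion; the sketch only shows how the leave-one-out loop, the Euclidean neighbor groups, and the elimination rounds fit together.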