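A quantitative structure-activity relationship (QSAR) study was carried out on 89 benzoisothiazole and benzothiazine non-nucleoside inhibitors of hepatitis C virus (HCV) NS5B polymerase. Two feature selection methods, genetic algorithm combined with partial least squares (GA-PLS) and linear stepwise regression analysis (LSRA), were used to select optimal descriptor subsets, from which multiple linear regression and partial least squares linear regression models were built. A genetic algorithm coupled with support vector machine (GA-SVM) method was also tried for the first time to build nonlinear support vector machine regression models for the descriptor subsets chosen by each of the two feature selection methods. The models built with all three machine learning methods gave fairly satisfactory predictions. The three QSAR models built with the six LSRA-selected descriptors gave test-set correlation coefficients of 0.958-0.962, with GA-SVM giving the best prediction accuracy (0.962). The three QSAR models built with the seven GA-PLS-selected descriptors gave test-set correlation coefficients of 0.918-0.960, with the partial least squares regression model performing best (0.960). This work provides an effective method for predicting the biological activity of HCV inhibitors, and the method can also be extended to other similar QSAR studies.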
The quantitative structure-activity relationship (QSAR) approach was used to predict the activity of 89 non-nucleoside inhibitors of hepatitis C virus (HCV) NS5B polymerase built on two scaffolds, benzoisothiazole and benzothiazine. Two feature selection methods, linear stepwise regression analysis (LSRA) and genetic algorithm-partial least squares (GA-PLS), were used to select descriptor subsets for linear QSAR models built by multiple linear regression and partial least squares (PLS) regression. The genetic algorithm-support vector machine (GA-SVM) approach was used for the first time to build nonlinear models with the six LSRA-selected and the seven GA-PLS-selected descriptors. The three QSAR models built with the six LSRA-selected descriptors gave correlation coefficients of 0.958-0.962 for the test set, of which the GA-SVM model gave the highest prediction accuracy (0.962). The three QSAR models built with the seven GA-PLS-selected descriptors gave correlation coefficients of 0.918-0.960 for the test set, of which the PLS model was the best (0.960). All of the investigated models gave satisfactory predictions, and the approach can be extended to other similar QSAR studies.
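The test-set correlation coefficients quoted above compare observed and predicted activities. The abstract does not say which coefficient was computed, so the following is only a minimal sketch that assumes a Pearson correlation coefficient. The function name `pearson_r` and the activity values are hypothetical and are not taken from the study.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_obs * var_pred)


# Made-up activity values for illustration only (NOT data from the study).
observed_test = [5.1, 6.3, 7.0, 7.8, 8.4]
predicted_test = [5.4, 6.0, 7.2, 7.5, 8.6]

r_test = pearson_r(observed_test, predicted_test)
print(f"test-set r = {r_test:.3f}")
```

Under this assumption, each model (multiple linear regression, PLS, or SVM) would be scored by passing its predictions for the held-out compounds as `predicted_test`.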