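Research on protein function is fundamental to understanding the mysteries of life, and machine learning techniques are already widely applied in this field. Using the support vector machine (SVM) method, we construct a general-purpose platform for predicting protein functional sites. The platform first extracts non-homologous protein sequences and then encodes features for these sequences (including basic sequence information, physicochemical properties, structural information, and sequence conservation). The encoded samples serve as training data for the SVM, and training yields evaluation metrics such as sensitivity, specificity, the Matthews correlation coefficient, accuracy, and the ROC curve. After repeated testing, the SVM model with the best evaluation metrics is selected and can then be used to predict functional sites on protein sequences. Beyond predicting protein functional sites, the platform can also be applied to the prediction and analysis of disease-associated single nucleotide polymorphisms (SNPs), protein domain prediction, and the analysis of interactions between biomolecules.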
Research on protein function is fundamental to unraveling the mysteries of life, and machine learning is widely used in this field. This paper constructs a general platform that uses a support vector machine (SVM) to predict protein functional sites. First, the platform extracts non-homologous protein sequences and encodes their features, which include basic sequence information, physicochemical properties, structural information, and sequence conservation. It then trains an SVM on the encoded dataset and reports sensitivity, specificity, the Matthews correlation coefficient, accuracy, and the ROC curve. Finally, the best-performing model is selected and used to predict functional sites in protein sequences whose sites are unknown. The platform can also be applied to disease-associated SNP analysis, protein domain prediction, and biomolecular interaction analysis.
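As a rough illustration of two steps in the workflow described above, the following minimal sketch encodes a sequence window around a candidate site and computes the threshold-based evaluation metrics from predicted labels. It is not the platform's implementation: the one-hot window encoding, the window size, and the names `encode_window` and `evaluate` are assumptions made for this example, and the abstract's physicochemical, structural, and conservation features are not modeled here. The SVM training step and the ROC curve are omitted.

```python
import math

# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_window(sequence, center, half_width=2):
    """One-hot encode the residues in a window centered on `center`.

    Hypothetical encoding for illustration only. Positions that fall
    outside the sequence, and non-standard residues, become all zeros.
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features


def evaluate(y_true, y_pred):
    """Compute sensitivity, specificity, accuracy, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 when a metric is undefined (zero denominator).
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

In a full pipeline, the vectors from `encode_window` would be passed to an SVM implementation (for example, scikit-learn's `SVC`), and `evaluate` would be applied to its predictions on held-out sequences to compare candidate models.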