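The nearest-neighbor algorithm is used to fill in missing values in homology modeling, thereby extending the scale at which traditional homology modeling methods can be applied. In sequence-structure alignment, insertions (deletions) of amino acids cause missing data, and traditional homology modeling cannot handle these parts of the structure. By combining the nearest-neighbor algorithm, the expectation-maximization method, and principal component analysis, the main information on protein structural evolution is extracted and a low-dimensional sampling space for conserved protein structures is constructed. Compared with standard principal component analysis, this method can use more evolutionary information, cover more regions that carry genetic information, and construct a larger-scale protein sampling space. The accuracy of the sampling space is evaluated by the root mean square deviation between the target protein structure and its projection onto the sampling space. The method is applied to 33 protein superfamilies, and the results show that the accuracy of the enlarged sampling space reaches the experimental precision of X-ray protein structure determination and meets the needs of subsequent protein structure research.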
The k-nearest neighbor (KNN) algorithm is proposed to impute missing values in protein comparative modeling. These missing values are caused by insertions and deletions in the multiple structural alignments. KNN imputation is combined with the expectation-maximization (EM) technique and principal component analysis (PCA) to extract evolutionary deformation information and to construct low-dimensional sampling spaces for the conserved cores of amino acid backbones. Compared with standard PCA, this method uses more evolutionary information and includes more core residues in the study. As a consequence, the sampling spaces are greatly enlarged. The quality of a sampling space is evaluated by the root mean square deviation (RMSD) between the target structure and its projection onto the sampling space. Applications to a set of 33 representative and well-studied superfamilies show that the accuracy of the enlarged sampling spaces is on the same level as that of the standard PCA space. This implies that the sampling spaces obtained are suitable for further applications in protein structural research.
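As an illustration of the imputation and evaluation steps summarized above, the following is a minimal Python sketch, not the authors' implementation. It assumes that each aligned structure is stored as a flat list of backbone coordinates with `None` marking positions lost to insertions or deletions, that the distance between two structures is the root mean square difference over coordinates observed in both, and that a gap is filled with the mean of the `k` nearest donors. The function names `knn_impute` and `rmsd`, the data layout, and the default `k` are all hypothetical choices; the abstract does not specify them.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries using the mean of the k nearest rows (assumed scheme).

    Distance between two rows is the root mean square difference over the
    columns observed in both. Only rows that have the missing column
    observed can act as donors.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common columns, distance undefined
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no donor available, leave the gap as None
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


def rmsd(a, b):
    """RMSD between two flat coordinate lists (x1, y1, z1, x2, ...)."""
    n_atoms = len(a) // 3
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n_atoms)


# Toy data: four "structures" with three coordinates each, one gap.
structures = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [1.2, 1.9, 3.2],
]
completed = knn_impute(structures, k=2)
```

In this toy example the gap in the second row is filled from the first and fourth rows, which are its two closest neighbors on the observed coordinates. In the workflow the abstract describes, a completed matrix of this kind would then be passed to the EM and PCA steps, and `rmsd` would compare a target structure with its projection onto the resulting sampling space.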