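Automatic classification of protein structures is an important approach for exploring the relationship between protein structure and function. First, the three-dimensional spatial structure of a protein fold is mapped to a two-dimensional distance matrix, and the distance matrix is treated as a gray-level image. Then, based on the gray-level histogram and the gray-level co-occurrence matrix, a computationally simple feature extraction method for fold structure is proposed, yielding low-dimensional features that reflect the characteristics of the fold structure. The cause of the isolated peak at gray level zero in the histogram is further explained, and the structural meaning of the gray-level distribution, the different angles, and the pixel distances in the co-occurrence matrix features is analyzed in depth. Finally, the method is applied to the classification of 27 fold classes, reaching an accuracy of 71.95% on the independent test set and 78.94% under 10-fold cross-validation on all data. Comparison with several sequence-based and structure-based fold recognition methods shows that this method not only has low-dimensional, compact features but also requires no complex classification system, and that it recognizes multiple fold classes effectively and efficiently.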
One of the most important research aims is to understand the relationship between protein structure and function, and automatic classification of protein structure has become one of the major approaches to this aim. However, extracting compact and effective features that characterize protein structure remains a challenge. In this paper, the 3-D tertiary structure of a protein fold was mapped to a 2-D distance matrix, which can be further regarded as a gray-level image. Next, based on the gray-level histogram and the gray-level co-occurrence matrix (CoM), a feature extraction method for fold structure with low computational cost was presented, and a low-dimensional feature vector with definite structural properties was obtained. Furthermore, the origin of the histogram peak at gray level 0 was explained, and the structural meanings of the gray-level distribution, the various angles, and the pixel distances of the CoM were analyzed in detail. Finally, the presented feature extraction method was validated on the classification of 27 types of folds and compared with several feature extraction methods based on sequence or structure. The presented method achieved an accuracy of 71.95% on the independent test set, with 5-fold cross-validation (5-CV) used to select the parameters of the support vector machine (SVM), and 78.94% under 10-fold cross-validation (10-CV) on the combined training and test data. The results show that the presented method classifies multiple types of folds effectively and efficiently, and that it not only offers low-dimensional, compact features but also needs no complicated classifier system.
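The pipeline summarized above (coordinates, distance matrix, gray-level image, histogram and co-occurrence features) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the quantization, so the linear mapping of distance to gray level, the number of levels, the pixel offsets, and the three co-occurrence statistics (contrast, energy, homogeneity) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates
    (for example one C-alpha atom per residue; an assumption)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def to_gray(dmat, levels=8):
    """Quantize distances linearly into gray levels 0..levels-1.
    The linear mapping is an assumption; the paper may use another one."""
    max_dist = max(max(row) for row in dmat)
    if max_dist == 0:
        return [[0 for _ in row] for row in dmat]
    return [[min(levels - 1, int(d / max_dist * levels)) for d in row]
            for row in dmat]


def histogram(img, levels):
    """Normalized gray-level histogram of the image."""
    counts = [0] * levels
    for row in img:
        for g in row:
            counts[g] += 1
    total = sum(counts)
    return [c / total for c in counts]


def glcm(img, levels, dx, dy):
    """Normalized symmetric co-occurrence matrix for pixel offset (dx, dy).
    The offset encodes both the angle and the pixel distance."""
    n_rows, n_cols = len(img), len(img[0])
    m = [[0.0] * levels for _ in range(levels)]
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                i, j = img[r][c], img[r2][c2]
                m[i][j] += 1.0
                m[j][i] += 1.0  # count each pair in both directions
    total = sum(sum(row) for row in m)
    if total == 0:
        return m
    return [[v / total for v in row] for row in m]


def glcm_features(p):
    """Contrast, energy and homogeneity of a normalized co-occurrence matrix."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    energy = sum(v * v for _, _, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)
    return contrast, energy, homogeneity


if __name__ == "__main__":
    # Toy input: four points on a line, not real protein data.
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    img = to_gray(distance_matrix(coords), levels=4)
    print(histogram(img, 4))
    print(glcm_features(glcm(img, 4, dx=1, dy=0)))
```

In this sketch a feature vector for one fold could be formed by concatenating the histogram with the co-occurrence statistics computed for several offsets, for example `(1, 0)`, `(1, 1)`, `(0, 1)` and `(-1, 1)` for the four usual angles. Which statistics, angles, and pixel distances the paper actually uses is not stated in the abstract.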