Identifying coding regions in DNA sequences is an important step in gene identification. Because the volume of gene sequence data is large, many statistical prediction algorithms generalize poorly and run slowly. In this paper, a Takagi-Sugeno model of DNA sequences is built on the observation that coding and non-coding regions differ in nucleotide composition. First, a clustering algorithm based on the fuzzy likelihood function determines the number of fuzzy partitions of the system. Then, taking the number of clusters as the number of rules, the corresponding locally linear Takagi-Sugeno fuzzy model is built. Finally, the consequent parameters of the model are identified by least squares (LS). The algorithm not only locates coding regions but also identifies the position of the first nucleotide of each codon, which is important for accurate translation into a protein sequence and for the study of protein structure. The algorithm is simple and efficient. Simulation results show that it is well suited to coding-region prediction and compares favorably with existing coding-region prediction tools.
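The final step, least-squares identification of the Takagi-Sugeno consequent parameters, can be illustrated with a minimal sketch. It is not the authors' implementation: the rule centers are fixed by hand instead of coming from fuzzy-likelihood clustering, the Gaussian membership functions and shared width `sigma` are assumptions, and the two-dimensional inputs are synthetic stand-ins for nucleotide-composition features. All function names are hypothetical.

```python
import numpy as np

def memberships(X, centers, sigma):
    """Normalized Gaussian firing strength of each rule for each sample."""
    # Squared distance from every sample to every rule center: shape (n, c).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def design_matrix(X, W):
    """Stack the membership-weighted regressors [x, 1] of all rules."""
    n = X.shape[0]
    Xe = np.hstack([X, np.ones((n, 1))])  # append bias column: (n, d + 1)
    # Rule i contributes W[:, i] * [x, 1]; result has shape (n, c * (d + 1)).
    return (W[:, :, None] * Xe[:, None, :]).reshape(n, -1)

def fit_consequents(X, y, centers, sigma):
    """Global least-squares estimate of the linear consequent parameters."""
    Phi = design_matrix(X, memberships(X, centers, sigma))
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta.reshape(centers.shape[0], X.shape[1] + 1)

def predict(X, centers, sigma, theta):
    """Model output: membership-weighted sum of the local linear models."""
    Phi = design_matrix(X, memberships(X, centers, sigma))
    return Phi @ theta.ravel()

# Synthetic demonstration (assumed data, not DNA): two rules, two features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
centers = np.array([[0.25, 0.25], [0.75, 0.75]])  # assumed cluster centers
sigma = 0.3                                       # assumed membership width
true_theta = np.array([[1.0, -2.0, 0.5], [-1.0, 0.5, 2.0]])
y = predict(X, centers, sigma, true_theta)        # noise-free targets

theta = fit_consequents(X, y, centers, sigma)
y_hat = predict(X, centers, sigma, theta)
```

Because the antecedent memberships are held fixed, the model output is linear in the consequent parameters, so a single call to `np.linalg.lstsq` suffices. In the method described above, the number of rules would come from the clustering step rather than being chosen by hand.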