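Precisely locating the translation initiation site (TIS, i.e., the 5' end of a gene) is a key problem in prokaryotic gene prediction, and genomic GC content and the diversity of translation initiation mechanisms are important factors limiting the current level of TIS prediction. By integrating complex information on genome structure (including GC content, the sequence flanking the TIS and upstream regulatory signals, sequence coding potential, and operon structure), a mathematical statistical model characterizing translation initiation mechanisms is developed, and a new TIS prediction algorithm, MED-StartPlus, is designed on that basis. MED-StartPlus is systematically compared with and evaluated against similar methods, namely RBSfinder, GS-Finder, MED-Start, TiCo, and Hon-yaku. Tests were carried out on two kinds of data sets: the 14 currently known gene data sets with confirmed TISs, and gene data sets with known functions from 300 species. The results show that the prediction accuracy of MED-StartPlus exceeds that of similar methods overall. In particular, MED-StartPlus has a clear advantage for genomes with high GC content and for genomes with complex translation initiation mechanisms.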
Accurate prediction of the translation initiation site (TIS) is an important problem in prokaryotic genome annotation. However, existing methods still struggle to predict TISs in genomes spanning a wide range of GC content. Moreover, existing methods have not yet been comprehensively evaluated, so the reliability of their predictions remains largely an open question. This paper presents MED-StartPlus, a new algorithm that predicts TISs in prokaryotic genomes across a wide range of GC content. It models the nucleotide composition bias, the regulatory motifs upstream of the TIS, the sequence patterns around the TIS, and the operon structure. Tests on hundreds of reliable data sets, comprising genes whose TISs are experimentally confirmed or whose functions are annotated, show that the new method achieves high overall accuracy in TIS prediction. Compared with existing TIS predictors, the method performs better overall, especially for genomes that are GC-rich or have complex initiation mechanisms. A potential application of the method to improving the TIS annotations deposited in public databases is also proposed.
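
To make the prediction problem concrete: an open reading frame usually contains several in-frame start codons, and a TIS predictor must choose among them. The following minimal Python sketch only enumerates those candidates and computes GC content. It is not the MED-StartPlus algorithm, whose scoring model is not specified here. The function names `gc_content` and `candidate_starts` are hypothetical, and the use of ATG, GTG, and TTG as the start codon set is an assumption based on the commonly used prokaryotic start codons.

```python
# Illustrative sketch only; not the MED-StartPlus implementation.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed common prokaryotic start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def candidate_starts(seq, stop_index):
    """List 0-based positions of in-frame start codons for one gene.

    stop_index is the position of the first base of the gene's stop codon.
    The scan walks upstream one codon at a time and ends at the previous
    in-frame stop codon or at the beginning of the sequence.
    """
    seq = seq.upper()
    candidates = []
    pos = stop_index - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return sorted(candidates)


if __name__ == "__main__":
    toy = "TAAATGAAAGTGCCCTTGGGGTAA"  # toy sequence, not real genomic data
    print(candidate_starts(toy, stop_index=21))
    print(round(gc_content(toy), 3))
```

Each position returned by `candidate_starts` is a possible TIS for the same gene. A predictor of the kind described above would then rank these candidates using signals such as the upstream regulatory motifs and the sequence pattern around each candidate.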