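With the continuing development of genome sequencing technology, large numbers of microbial genome sequences can be determined accurately in a short time. To further explore genome structure and function, genome annotation algorithms based on sequence features and homology have been widely applied to newly sequenced species. However, owing to limited genome sequencing quality and the relatively low accuracy of the algorithms themselves, existing genome annotations contain a considerable proportion of pseudogenes and annotation errors, especially errors in annotated protein N-termini. To compensate for these shortcomings, transcriptome profiling technologies built on microarrays or RNA-seq and proteome profiling technologies built on tandem mass spectrometry can measure the transcription and translation products of genes precisely and with high throughput, thereby providing experimental validation of predicted gene structures. However, the large amount of non-coding RNA in prokaryotic cells introduces contaminating data into transcriptome measurements and limits their use for genome annotation. By comparison, proteomics based on tandem mass spectrometry can identify a large number of proteins in an organism within a short time, enabling the verification and even correction of annotated genes. It has become an important basis for genome annotation and re-annotation and has given rise to the new research field of "proteogenomics". This paper first introduces traditional genome annotation algorithms based on sequence prediction and homology comparison and points out their shortcomings. On this basis, considering the technical characteristics of transcriptomics and proteomics, it analyzes the advantages of proteomics for prokaryotic genome annotation and summarizes the current progress of large-scale proteogenomic studies. Finally, from an informatics perspective, it identifies the problems that currently arise when proteomic data are used for genome re-annotation, together with corresponding solutions, and discusses future directions for proteogenomics.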
With the rapid development of genome sequencing technologies, a large number of prokaryotic genomes have been sequenced in recent years. To further investigate the structure and function of genomes, genome annotation algorithms based on sequence and homology features have been widely applied to newly sequenced genomes. However, gene annotations that rely only on genomic information are prone to errors, such as incorrect N-termini and pseudogenes. It is even harder to produce reasonable annotations when the genome sequencing quality is poor. Transcriptomics, based on technologies such as microarrays and RNA-seq, and proteomics, based on tandem mass spectrometry (MS/MS), have been widely used to identify gene products with high throughput and high sensitivity, providing powerful tools for the verification and correction of annotated genomes. Compared with transcriptomics, proteomics can generate a list of the proteins of genes expressed in a sample or cell without interference from non-coding RNA, making proteogenomics an important basis for genome annotation in prokaryotes. In this paper, we first describe the traditional genome annotation algorithms and point out their shortcomings. We then summarize the advantages of proteomics in genome annotation and review the progress of proteogenomics in prokaryotes. Finally, we discuss the challenges in data analysis, potential solutions, and future directions for proteogenomics.