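RNA-Seq data from next-generation sequencing have transformed the decoding of eukaryotic transcriptomes. Their base-level resolution makes it possible to annotate an existing genome using RNA-Seq as the sole data source. Likewise, RNA-Seq can be used to validate existing annotations of splice junctions, exons, and transcripts. This paper therefore proposes using RNA-Seq data to evaluate existing genome annotation databases: based on RNA-Seq alignment information, it defines specificity and sensitivity metrics at the gene, transcript, exon, splice-junction, and base levels, and uses them to evaluate the completeness and accuracy of a genome annotation database. Within this framework, 5 representative human genome annotation databases were evaluated using 1.1 billion RNA-Seq reads from 16 human tissues, and a combined accurate annotation database for the human body was built from the results. In addition, the existing rhesus macaque genome annotation database was evaluated and found to be substantially incomplete, with annotation accuracy well below that of the human databases. This evaluation framework enables comprehensive, fast, and efficient assessment and validation of the completeness and accuracy of genome annotations for any species.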
RNA-Seq has brought a breakthrough in decoding eukaryotic transcriptomes. With resolution down to the nucleotide level, RNA-Seq can serve as the sole data source for annotating a whole genome. Similarly, RNA-Seq can be used to validate annotated splice junctions, exons, and transcripts. This study therefore proposes a scheme for evaluating the accuracy (specificity) and completeness (sensitivity) of genome annotation databases at the gene, transcript, exon, splice-junction, and nucleotide levels, using RNA-Seq datasets as the only resource. The scheme was applied to assess 5 widely used human genome annotation databases with 1.1 billion high-quality RNA-Seq reads from 16 human tissues. Accurately annotated transcripts were collected from the 5 databases to build combined databases of accurately annotated transcripts for each of the 16 tissues and for the whole human body. Furthermore, assessment of the current rhesus annotation database showed that it is far from complete and less accurate than the human annotations. An RNA-Seq analysis pipeline was built to enable rapid and efficient assessment of genome annotations for various organisms across the whole transcriptome. The pipeline can be downloaded from http://code.google.com/p/genome-annotation-assessment-pipeline/downloads/.
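As an illustration of the kind of measure described above, the following is a minimal sketch at the splice-junction level. It is not the pipeline's actual code. It assumes the convention common in gene-structure evaluation: specificity (accuracy) is the fraction of annotated junctions supported by RNA-Seq alignments, and sensitivity (completeness) is the fraction of RNA-Seq-observed junctions present in the annotation. The function name `assess_junctions`, the `Junction` tuple layout, and the example coordinates are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical junction representation: (chromosome, intron start, intron end).
Junction = Tuple[str, int, int]


def assess_junctions(annotated: Set[Junction],
                     observed: Set[Junction]) -> Tuple[float, float]:
    """Return (specificity, sensitivity) for a set of annotated junctions.

    Assumed definitions (not taken from the paper):
      specificity = supported annotated junctions / all annotated junctions
      sensitivity = supported annotated junctions / all observed junctions
    """
    supported = annotated & observed  # junctions found in both sets
    specificity = len(supported) / len(annotated) if annotated else 0.0
    sensitivity = len(supported) / len(observed) if observed else 0.0
    return specificity, sensitivity


# Toy data: 3 annotated junctions, 4 junctions observed in RNA-Seq alignments.
annotated = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)}
observed = {("chr1", 100, 200), ("chr1", 300, 400),
            ("chr1", 500, 600), ("chr1", 700, 800)}

specificity, sensitivity = assess_junctions(annotated, observed)
print(f"specificity={specificity:.3f} sensitivity={sensitivity:.3f}")
```

In this toy example two of the three annotated junctions are supported, and two of the four observed junctions are annotated. The same set-based comparison could in principle be repeated for exons or transcripts, whereas a nucleotide-level version would need interval overlap rather than exact matching.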