With the rapid development of next-generation sequencing technologies, massive amounts of sequence data will give biologists a tremendous resource for mining gene information. An important task in such data mining is the functional annotation of sequences, and the most important annotation method is Gene Ontology (GO) annotation. Using bioinformatics methods and software tools, this study assembled a large-scale GO annotation pipeline (LSGAP) for EST sequences. The pipeline combines software such as BLAST, B2g4pipe and Wego with commonly used protein databases such as Swissprot, Nr or Interpro. Users submit EST sequences in FASTA format and ultimately obtain visualized GO classification statistics charts, which show directly how the genes are involved in different processes. To verify the accuracy of LSGAP, the EST sequences of the eastern oyster (Crassostrea virginica) published in 2007 were run through the pipeline, and the results showed that the GO analysis was accurate and effective. Compared with other GO annotation software such as Blast2go (graphical user interface) and GoBlast, LSGAP has several advantages: it runs BLAST locally, does not require downloading many GO-related databases, places low demands on hardware, and takes less time. LSGAP is therefore an effective tool for researchers mining gene function.
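The last stage of a pipeline like this, turning per-sequence GO annotations into classification statistics, can be illustrated with a minimal Python sketch. The row layout (sequence ID, GO ID, GO category), the sample data, and the function names `summarize_go` and `percent_of_annotated` are assumptions made for illustration; they are not the actual output format of B2g4pipe, the input format of Wego, or the authors' implementation.

```python
from collections import defaultdict

# Hypothetical annotation rows: (sequence ID, GO ID, GO category).
# This layout is an assumption for illustration only; it is not the
# real output of B2g4pipe or the real input format of Wego.
ANNOTATIONS = [
    ("EST001", "GO:0008152", "biological_process"),
    ("EST001", "GO:0005515", "molecular_function"),
    ("EST002", "GO:0008152", "biological_process"),
    ("EST002", "GO:0008152", "biological_process"),  # duplicate row
    ("EST003", "GO:0016020", "cellular_component"),
]


def summarize_go(rows):
    """Count distinct sequences per (category, GO term) pair."""
    seqs = defaultdict(set)
    for seq_id, go_id, category in rows:
        # A set ignores repeated rows for the same sequence and term.
        seqs[(category, go_id)].add(seq_id)
    return {key: len(ids) for key, ids in seqs.items()}


def percent_of_annotated(rows):
    """Express each count as a percentage of all annotated sequences."""
    total = len({seq_id for seq_id, _, _ in rows})
    return {key: 100.0 * n / total for key, n in summarize_go(rows).items()}


if __name__ == "__main__":
    # Print one tab-separated line per term, grouped by category.
    for (category, go_id), n in sorted(summarize_go(ANNOTATIONS).items()):
        print(f"{category}\t{go_id}\t{n}")
```

Counting distinct sequences rather than raw rows keeps a sequence that is annotated twice with the same term from inflating the statistics; the percentages are taken over all sequences that received at least one GO term.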