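Objective To fill the missing fragment in a known assembled sequence by combining bioinformatics with experimental methods, and to obtain the full-length coding sequence of the Schistosoma japonicum aldehyde dehydrogenase gene. Methods Expressed sequence tag (EST) sequences of aldehyde dehydrogenase were extracted by bioinformatics methods from the publicly released S. japonicum transcriptome database and compared with homologous genes of other species by multiple sequence alignment, in order to find sequences matching the N-terminus and the C-terminus, respectively, of the same protein amino acid sequence. Primers were designed, and the missing fragment in the middle of the full-length gene was amplified by RT-PCR and sequenced. The full-length gene sequence was then obtained, and the physicochemical properties of the protein were analyzed. Results Eight candidate EST fragments of the S. japonicum aldehyde dehydrogenase gene were found. For one pair of these (the amino acid sequences corresponding to AAW27891 and AAW27047, with about 80 amino acids missing between them), blastn comparison predicted that they are two fragments of the same gene. Forward and reverse primers were designed from this pair of EST sequences; PCR amplification, sequencing of the amplified product, and bioinformatic identification of the sequence recovered the missing nucleotide sequence, whose size roughly agreed with the prediction (430 bp). The fragments were combined into one complete coding sequence of the S. japonicum aldehyde dehydrogenase gene (submitted to GenBank under accession number EF503564). The ORF is 1596 bp long and encodes 531 amino acids; the theoretical relative molecular mass of the encoded protein is Mr 57330.7 and the pI is 7.94. Amino acids 290-297 of this sequence match the aldehyde dehydrogenase signature pattern [LIVMFGA]-E-[LIMSTAC]-[GS]-G-[KNLM]-[SADN]-[TAPFV]. Conclusion By combining bioinformatics with experimental methods, the missing fragment in a known assembled sequence can be filled and the full-length coding sequence of the S. japonicum aldehyde dehydrogenase gene can be obtained.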
Objective To obtain the full coding sequence of Schistosoma japonicum aldehyde dehydrogenase by filling the gap between partial aldehyde dehydrogenase sequences. Method Putative sequence fragments of S. japonicum aldehyde dehydrogenase were extracted from the transcriptome database with bioinformatics tools, by multiple sequence alignment with homologous sequences from other species. Primers were designed according to the EST sequences matching the N-terminus and the C-terminus, respectively, and the gap fragment was amplified by RT-PCR and sequenced. The full gene sequence was finally obtained by combining the two existing EST sequences with the amplified sequence. The physicochemical parameters of the new sequence were analyzed with bioinformatics software. Result Eight EST sequences of S. japonicum were predicted to be partial sequences of aldehyde dehydrogenase. Two of these (AAW27891 and AAW27047) were predicted to represent the N-terminus and the C-terminus of one protein, respectively. The gap between them was estimated at about 80 amino acids from the multiple sequence alignment. Primers flanking the gap were designed from the known EST sequences of AAW27891 and AAW27047. The gap sequence between AAW27891 and AAW27047 was obtained by RT-PCR, sequenced, and confirmed with bioinformatics software. The full sequence of aldehyde dehydrogenase was reassembled by filling in the gap sequence. The reassembled coding sequence was submitted to GenBank under accession number EF503564. The coding sequence contains an intact ORF of 1 596 bp encoding a deduced 531 amino acids. Bioinformatic analysis of the new amino acid sequence gave a deduced molecular weight of 57 330.7 and a pI of 7.94. The aldehyde dehydrogenase pattern [LIVMFGA]-E-[LIMSTAC]-[GS]-G-[KNLM]-[SADN]-[TAPFV] was found at positions 290-297 of the new sequence.
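The two checks reported in the Result (an ORF of 1 596 bp giving 531 amino acids, and a match to the bracketed signature pattern) can be illustrated with a minimal sketch. It is not the software used in the study: the function names are hypothetical, the peptide in the usage example is an invented toy string rather than the EF503564 sequence, and the residue count assumes that the ORF length includes one stop codon.

```python
import re

# Signature pattern quoted in the abstract, written in PROSITE-style notation.
ALDH_PATTERN = "[LIVMFGA]-E-[LIMSTAC]-[GS]-G-[KNLM]-[SADN]-[TAPFV]"


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a regular expression.

    Only handles patterns made of single residues and bracketed residue
    classes joined by '-', which is all the quoted pattern needs.
    """
    return "".join(pattern.split("-"))


def find_motif(protein: str, pattern: str = ALDH_PATTERN):
    """Return (start, end, match) tuples using 1-based inclusive positions.

    re.finditer reports non-overlapping matches only.
    """
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(protein)]


def orf_to_residue_count(orf_bp: int) -> int:
    """Number of encoded residues, assuming the ORF ends with one stop codon."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf_bp // 3 - 1


if __name__ == "__main__":
    toy_peptide = "MKTLELGGKSPAA"  # invented example, not the real protein
    print(find_motif(toy_peptide))
    print(orf_to_residue_count(1596))
```

Run against the deduced protein sequence, `find_motif` would be expected to report the 290-297 hit described above; the toy peptide here only demonstrates the position convention.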
Conclusion The gap between two partial nucleotide sequences is filled and the full coding sequence of