Using Perl and a MySQL database, we built an analysis platform for small RNA high-throughput sequencing data and, taking four rice small RNA-Seq datasets as examples, describe the data-processing methods and workflow in detail. With the MSU 6.1 rice genome as the reference, we constructed a database of whole-genome structure and known ncRNA loci for that release; combined with our Perl scripts, it supports detailed positioning of small RNAs on the genome and the associated statistics. We also extracted the expression features of known pre-miRNAs from the database and used them to design a new miRNA discovery method. The method screens out a large number of candidate new miRNAs, and its hit rate for known miRNAs can reach 98%. Given the diversity of rice small RNAs, we also discuss how to distinguish miRNAs from endo-siRNAs in small RNA-Seq data. The platform is simple and efficient: it uses the database as its storage and query medium, supports the analysis of reads that map to multiple loci, and yields flexible and varied statistical results. The same approach can be used to build small RNA data analysis platforms for other model species. As high-throughput sequencing becomes more widespread, the method offers practical guidance for small and medium-sized laboratories that wish to establish their own data analysis platforms.
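The database-backed positioning of reads on annotated loci can be illustrated with a minimal sketch. It is not the platform's actual code: it uses Python with the standard-library sqlite3 module in place of Perl and MySQL, and the table names (`features`, `hits`), column names, toy coordinates, and the rule of splitting a read's copy number evenly across its mapped loci are all assumptions made for illustration.

```python
import sqlite3


def build_db():
    """Create an in-memory database with hypothetical annotation and hit tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE features (
            chrom TEXT, start_pos INTEGER, end_pos INTEGER, kind TEXT, name TEXT
        );
        CREATE TABLE hits (
            read_seq TEXT, copies INTEGER, chrom TEXT,
            start_pos INTEGER, end_pos INTEGER
        );
        """
    )
    return conn


def load_toy_data(conn):
    """Insert invented toy records; none of these values come from the paper."""
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?, ?, ?)",
        [
            ("Chr1", 100, 200, "miRNA", "toy-mir-1"),
            ("Chr2", 500, 900, "tRNA", "toy-trna-1"),
        ],
    )
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
        [
            # A read with 10 copies that maps to a single locus.
            ("TGACAGAAGAGAGTGAGCAC", 10, "Chr1", 120, 139),
            # A read with 4 copies that maps to two loci (a multi-hit read).
            ("GGGGATGTAGCTCAGATGGT", 4, "Chr2", 600, 619),
            ("GGGGATGTAGCTCAGATGGT", 4, "Chr1", 5000, 5019),
        ],
    )


def summarize_by_kind(conn):
    """Sum read copies per feature kind, splitting multi-hit reads evenly.

    Assumption: a read's copies are divided by the number of loci it maps to.
    A hit that overlaps several features would be counted once per feature.
    """
    rows = conn.execute(
        """
        SELECT COALESCE(f.kind, 'unannotated'),
               SUM(1.0 * h.copies / n.loci)
        FROM hits AS h
        JOIN (SELECT read_seq, COUNT(*) AS loci
              FROM hits GROUP BY read_seq) AS n
          ON n.read_seq = h.read_seq
        LEFT JOIN features AS f
          ON f.chrom = h.chrom
         AND h.start_pos <= f.end_pos
         AND h.end_pos >= f.start_pos
        GROUP BY COALESCE(f.kind, 'unannotated')
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    connection = build_db()
    load_toy_data(connection)
    for kind, weighted in sorted(summarize_by_kind(connection).items()):
        print(kind, weighted)
```

Keeping every mapped locus as its own row is what lets a single query handle multi-hit reads; changing the `GROUP BY` expression (for example, to the feature name or the chromosome) would give a different statistical view of the same data without re-mapping.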