In center star multiple sequence alignment, determining the center sequence requires computing the distance and score between every pair of input sequences, and the resulting high time complexity makes this step slow. To avoid this cost, a strategy is proposed that jointly considers the k-mers generated from each sequence and the number of times each k-mer occurs in each sequence to decide which k-mers to select for assembly; the selected k-mers are then assembled to form the center sequence. In the pairwise alignment step, the idea of searching for the largest similar substrings of the two sequences is adopted, which improves the accuracy of the center star alignment algorithm to a certain extent. The improved algorithm was then parallelized and implemented on Spark. Experiments were run on several groups of normal human mitochondrial data using Spark's Yarn-Client mode, and the shortcomings in the algorithm's performance and the directions for improvement were analyzed.
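The two building blocks named in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not state the k-mer selection rule or the assembly procedure, so the code only gathers the statistics such a rule would need (total occurrences of each k-mer and the number of sequences containing it). The exact longest common substring is used as a simplified stand-in for the "largest similar substring" search. The function names `kmers`, `kmer_statistics`, and `longest_common_substring` are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_statistics(sequences, k):
    """Collect per-k-mer statistics over all input sequences.

    Returns two Counters:
      total[kmer]    -> occurrences of kmer summed over all sequences
      presence[kmer] -> number of sequences that contain kmer at least once
    How these counts are combined to pick k-mers for assembly is not
    specified in the abstract and is left out of this sketch.
    """
    total = Counter()
    presence = Counter()
    for seq in sequences:
        counts = Counter(kmers(seq, k))
        total.update(counts)            # add occurrence counts
        presence.update(counts.keys())  # add 1 per sequence containing the k-mer
    return total, presence


def longest_common_substring(a, b):
    """Exact longest common substring by dynamic programming.

    Assumption: exact matching stands in for the paper's notion of
    'similar' substrings, which the abstract does not define.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    seqs = ["ACGTAC", "CGTACG", "ACGTTT"]
    total, presence = kmer_statistics(seqs, 3)
    print(total.most_common(2))
    print(longest_common_substring(seqs[0], seqs[1]))
```

In a Spark setting, the per-sequence counting in `kmer_statistics` is the part that would naturally be distributed across sequences and then merged by k-mer; the sketch keeps it sequential for clarity.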