This paper presents Umap, a new algorithm for mapping short sequencing reads to a reference genome. Umap is built on the idea of extending core fragments step by step, and it uses the overlap information between short reads to place them. It first finds all reads that occur exactly once in the reference genome, called unique reads. Taking the unique reads as a starting point, it then extends them with a greedy algorithm driven by the overlaps between reads, and in doing so determines the exact positions of the remaining non-unique reads. Experiments show that Umap maps short reads more accurately than existing short-read mapping algorithms and maps more of them: the proportion of mapped reads reaches 71%. Exploiting the overlap information that inherently exists among short reads allows them to be mapped to the reference genome more accurately and reduces ambiguous matches.
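The two stages described above can be illustrated with a minimal Python sketch. It is not the Umap implementation, whose details the abstract does not give. It assumes exact string matching, treats identical read sequences as a single read, and measures overlap as the shared interval on the reference. The names `find_all`, `overlap_length`, `map_reads` and the `min_overlap` threshold are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_all(reference: str, read: str) -> List[int]:
    """Return every start position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def overlap_length(start_a: int, len_a: int, start_b: int, len_b: int) -> int:
    """Length of the shared interval of two placements (0 if they are disjoint)."""
    return max(0, min(start_a + len_a, start_b + len_b) - max(start_a, start_b))


def map_reads(reference: str, reads: List[str], min_overlap: int = 3) -> Dict[str, int]:
    """Place unique reads first, then greedily resolve non-unique reads.

    Assumption: a non-unique read is assigned to the candidate position that
    overlaps an already placed read by at least `min_overlap` bases.
    """
    candidates = {read: find_all(reference, read) for read in set(reads)}

    # Stage 1: reads with exactly one occurrence are the unique reads.
    placed = {r: pos[0] for r, pos in candidates.items() if len(pos) == 1}
    pending = {r: pos for r, pos in candidates.items() if len(pos) > 1}

    # Stage 2: greedy extension. In each round, commit the single
    # (read, position) pair with the largest overlap against any placed read.
    while pending:
        best: Optional[Tuple[int, str, int]] = None
        for read, positions in pending.items():
            for pos in positions:
                for anchor, anchor_pos in placed.items():
                    ov = overlap_length(pos, len(read), anchor_pos, len(anchor))
                    if ov >= min_overlap and (best is None or ov > best[0]):
                        best = (ov, read, pos)
        if best is None:
            break  # the remaining reads stay ambiguous
        _, read, pos = best
        placed[read] = pos
        del pending[read]

    return placed


if __name__ == "__main__":
    reference = "GGACGTTCATTACGTTAGC"
    # "GTTCAT" occurs once; "ACGTT" occurs twice; "AAAAA" does not occur.
    print(map_reads(reference, ["GTTCAT", "ACGTT", "AAAAA"]))
```

In this toy input, the read `ACGTT` matches the reference at two positions, and only one of them overlaps the unique read `GTTCAT`, so the sketch assigns it there. Reads with no anchoring overlap are left unplaced, and reads that do not occur in the reference are never placed.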