针对BLAST等软件在生物数据库中搜索DNA分子序列时,不能兼顾时间开销和搜索敏感性的问题,提出一种基于关键字树的多种子搜索算法。首先将查询序列分割成多个种子并将它们构建成一棵关键字树;然后利用Aho—Corasick算法在数据库中搜索,找到每个种子的所有完全匹配;最后检查种子匹配密度大的区域,确定其是否是查询序列的近似出现。实验表明算法兼顾了时间开销和搜索的敏感性,而且能发现基因序列中的移位现象.
Time consuming and sensitivity is incompatible when searching DNA database with BLAST. We propose a new searching algorithm based on keywords tree and multiple seeds. Firstly, query sequence is divided into a number of seeds, which are built into a keywords tree. Then all the seeds are sought in database by Aho-Corasick algorithm. At last it will be checked in the regions where seeds are found frequently. The novel algorithm improves both the time cost and the sensitivity. Furthermore, it can deal with the transform in the genome sequences. Experiments have proved our description.