随着后基因组时代的到来,建立生物数据库并且在其上开发各种分析工具进行数据分析和挖掘,已经成为了生物学研究的一种新方法。目前国际上流行的通过序列比对搜索相似序列的方法主要是针对短的序列,将这样的方法应用于大规模基因组序列时搜索速度很慢。针对基因组序列搜索的特点,从提高序列搜索效率出发,提出了一种新的、速度更快的搜索方法,其核心是通过序列特征的分析和比较搜索相似序列。在此基础上,建立了基于特征的序列数据库搜索系统,并利用序列的碱基关联性特征搜索人类基因组序列,结果表明,新搜索方法具有较高的命中率,并且搜索速度非常快,适合于大规模基因组序列的搜索。
With the arrival of the post-genome era, building biological databases and developing analysis tools on top of them for data analysis and mining has become a new approach to biological research. Current alignment-based sequence search methods are designed mainly for short biological sequences; applied to large-scale genome sequences, they are slow. To make genome sequence searching more efficient, we propose a new, faster search method whose core idea is to find similar sequences by analyzing and comparing sequence features. On this basis, we developed a feature-based nucleotide sequence database search system and used the base-to-base correlation feature to search human genome sequences. The results indicate that the proposed method achieves a high hit ratio at very high speed, and that the system is well suited to searching large-scale genome sequences.
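The feature-based search idea can be illustrated with a minimal sketch. The abstract does not define the base-to-base correlation feature, so the code assumes one common formulation: for each ordered base pair (i, j), sum over gaps l = 1..K the term P_ij(l) * log2(P_ij(l) / (P_i * P_j)), which gives a 16-dimensional vector per sequence. The function names `bbc_feature` and `search`, the parameter `max_gap`, the Euclidean distance, and the in-memory dictionary standing in for the database are all hypothetical choices for illustration, not the system described above.

```python
import math
from itertools import product

BASES = "ACGT"


def bbc_feature(seq, max_gap=4):
    """Return a 16-dimensional base-to-base correlation vector (assumed definition).

    For each ordered pair of bases (a, b), accumulate over gaps 1..max_gap
    the term P_ab(gap) * log2(P_ab(gap) / (P_a * P_b)).
    """
    seq = seq.upper()
    n = len(seq)
    counts = {b: seq.count(b) for b in BASES}
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 16
    freq = {b: counts[b] / total for b in BASES}

    feature = []
    for a, b in product(BASES, repeat=2):
        t = 0.0
        for gap in range(1, max_gap + 1):
            pairs = n - gap
            if pairs <= 0:
                break
            hits = sum(1 for i in range(pairs) if seq[i] == a and seq[i + gap] == b)
            joint = hits / pairs
            if joint > 0:  # joint > 0 implies both single-base frequencies are > 0
                t += joint * math.log2(joint / (freq[a] * freq[b]))
        feature.append(t)
    return feature


def search(query, database, top_k=3):
    """Rank database sequences by feature distance to the query (smaller is closer).

    `database` maps a sequence name to its sequence string. A real system would
    precompute and store the feature vectors instead of recomputing them per query.
    """
    q = bbc_feature(query)
    scored = [(name, math.dist(q, bbc_feature(s))) for name, s in database.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


if __name__ == "__main__":
    db = {
        "same": "ACGTACGTAGCTAGCTAGGATCC",
        "other": "AAAAAAAAAATTTTTTTTTT",
    }
    for name, dist in search("ACGTACGTAGCTAGCTAGGATCC", db):
        print(name, round(dist, 4))
```

Comparing fixed-length feature vectors costs the same regardless of sequence length once the features are stored, which is the kind of saving over alignment that a feature-based search relies on.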