In genetic sequences, many viruses do not simply replicate themselves verbatim; instead, sequence fragments are inserted or deleted between adjacent characters, so each occurrence appears in a slightly different form. Retrieving these viruses from sequence data is therefore an important research task. This paper presents a more general problem, pattern matching with arbitrary-length wildcards (PMAW), in which a pattern may contain multiple wildcard constraints, and each constraint may range either between two integer bounds or from an integer lower bound to infinity. Given a sequence S and a pattern P with arbitrary-length wildcards, the goal is to find all occurrences of P in S and the matching positions of each occurrence, under the condition that no two occurrences share the same position in S. To solve the problem efficiently, two bit-parallel matching algorithms are proposed: MOTW (Method of Occurrence Then Window) and MWTO (Method of Window Then Occurrence). With a minor modification, MWTO can also satisfy a global length constraint. Experimental results verify the correctness of the proposed algorithms and show that they achieve better time performance than related pattern matching algorithms.
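To make the problem statement concrete, the following is a minimal sketch of PMAW as described above. It is a naive greedy backtracking baseline, not the MOTW or MWTO algorithm and not bit-parallel. The function name `find_occurrences`, the representation of a pattern as a string of characters plus a list of `(min, max)` gaps, and the `max_span` parameter (standing in for a global length constraint) are all assumptions made for illustration. Because it accepts the first occurrence found from each start position, it does not guarantee the maximum number of occurrences.

```python
from typing import List, Optional, Tuple

# A gap (lo, hi) allows between lo and hi wildcard characters between two
# consecutive pattern characters; hi=None means "up to infinity".
Gap = Tuple[int, Optional[int]]


def find_occurrences(
    s: str,
    chars: str,
    gaps: List[Gap],
    max_span: Optional[int] = None,  # hypothetical global length constraint
) -> List[Tuple[int, ...]]:
    """Greedily collect occurrences of the gapped pattern in s such that
    no position of s is used by more than one occurrence."""
    if not chars:
        return []
    assert len(gaps) == len(chars) - 1, "need one gap per adjacent character pair"
    used = [False] * len(s)

    def extend(j: int, pos: int, limit: int) -> Optional[List[int]]:
        # chars[j] is matched at index pos; try to match chars[j+1:].
        if j == len(chars) - 1:
            return [pos]
        lo, hi = gaps[j]
        first = pos + 1 + lo
        last = limit if hi is None else min(limit, pos + 1 + hi)
        for nxt in range(first, last + 1):
            if not used[nxt] and s[nxt] == chars[j + 1]:
                tail = extend(j + 1, nxt, limit)
                if tail is not None:
                    return [pos] + tail
        return None

    result: List[Tuple[int, ...]] = []
    for start in range(len(s)):
        if used[start] or s[start] != chars[0]:
            continue
        # Last index an occurrence starting here may reach.
        limit = len(s) - 1
        if max_span is not None:
            limit = min(limit, start + max_span - 1)
        occ = extend(0, start, limit)
        if occ is not None:
            for p in occ:
                used[p] = True  # positions may not be shared
            result.append(tuple(occ))
    return result


# Pattern: A, 0 to 1 wildcards, C, at least 1 wildcard, T
print(find_occurrences("AACCGTT", "ACT", [(0, 1), (1, None)]))
# → [(0, 2, 5), (1, 3, 6)]
```

In the usage line, the second gap `(1, None)` is an arbitrary-length wildcard with lower bound 1 and no upper bound, and the two reported occurrences use disjoint positions of the sequence. Passing `max_span=5` restricts each occurrence to a window of five positions, which changes the result to a single occurrence `(1, 2, 5)`.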