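Multi-sequence pattern mining with wildcards plays an important role in fields such as text retrieval, network security, and the biological sciences. Mining multi-sequence patterns gives a thorough understanding of the relationships among sequences, which is of practical significance in each of these fields. In existing work, the scale of the mining task grows exponentially as the length of the multi-sequence set increases. This paper studies the following problem: given multiple sequences s1,…,sn, a support threshold, and gap constraints, mine from the multi-sequence all frequent sequential patterns whose number of occurrences is no less than the given support threshold, with the requirement that the positions at which any two adjacent elements of a pattern occur in the sequence satisfy the user-defined gap constraints. An effective algorithm, M-OneOffMine, is designed, in which the occurrences of a pattern in the sequence satisfy the one-off condition. Experimental results on biological DNA sequences show that M-OneOffMine achieves better time performance than related sequential pattern mining algorithms.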
Mining multi-sequential patterns with gap constraints is an important research task in many domains, such as text retrieval, network security, and biological science. In previous work, the scale of the mining task grows exponentially as the length of the multi-sequence increases, and those algorithms could mine only patterns of limited length. Given the sequences s1,…,sn, a support threshold, and gap constraints, we aim to discover the frequent patterns whose support in the multiple sequences is no less than the given threshold. A pattern P contains flexible wildcards, and the number of wildcards between any two successive elements of P must satisfy the user-specified gap constraints. In this paper, we design an efficient mining algorithm, named M-OneOffMine, that satisfies the one-off condition, under which each character in the given sequence can be used at most once across all occurrences of a pattern. Experiments on DNA sequences show that M-OneOffMine has better time performance than the related algorithms. The time and space complexities of M-OneOffMine are O(kmnlw) and O(k(l+n)), respectively, where m is the number of frequent patterns, k is the number of element sequences, n is the length of the pattern, l is the length of the multiple sequence, and w is the flexibility of the gap constraint.
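To make the gap constraint and the one-off condition concrete, the following Python sketch counts occurrences of a single pattern under both restrictions. It is not the M-OneOffMine algorithm: it is a simple leftmost-first greedy count, so it may return fewer occurrences than the best possible selection. The function names, the [min_gap, max_gap] parameterisation of the gap constraint, and the choice to sum supports over the sequences are all assumptions made for illustration.

```python
from typing import List, Optional, Set


def find_occurrence(seq: str, pattern: str, start: int, min_gap: int,
                    max_gap: int, used: Set[int]) -> Optional[List[int]]:
    """Return the positions of one occurrence that begins at `start`, or None.

    Between two successive pattern elements there must be between
    min_gap and max_gap wildcard positions, and no position in `used`
    may be reused (the one-off condition).
    """
    if start in used or seq[start] != pattern[0]:
        return None

    def extend(positions: List[int]) -> Optional[List[int]]:
        if len(positions) == len(pattern):
            return positions
        last = positions[-1]
        wanted = pattern[len(positions)]
        lo = last + 1 + min_gap
        hi = min(last + 1 + max_gap, len(seq) - 1)
        for pos in range(lo, hi + 1):
            if pos not in used and seq[pos] == wanted:
                found = extend(positions + [pos])
                if found is not None:
                    return found
        return None  # no valid continuation from this partial match

    return extend([start])


def one_off_support(seq: str, pattern: str, min_gap: int, max_gap: int) -> int:
    """Greedy, leftmost-first count of one-off occurrences in one sequence."""
    if not pattern:
        return 0
    used: Set[int] = set()
    count = 0
    for start in range(len(seq)):
        occ = find_occurrence(seq, pattern, start, min_gap, max_gap, used)
        if occ is not None:
            used.update(occ)  # each character is consumed at most once
            count += 1
    return count


def multi_sequence_support(seqs: List[str], pattern: str,
                           min_gap: int, max_gap: int) -> int:
    """Assumption: the support over several sequences is the sum of the
    per-sequence one-off supports."""
    return sum(one_off_support(s, pattern, min_gap, max_gap) for s in seqs)
```

For example, with the pattern "AT" and a gap of 0 to 1 wildcards, the sequence "AAT" yields only one occurrence under this sketch, because the single T is consumed by the first match and cannot be shared with the second A.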