相似字符串匹配过滤算法因其适合大库查找而被广泛应用,为通过提高过滤算法的过滤效率加快匹配速度,提出一种基于匹配区域特征的过滤算法.该算法将模式串和文本串分割成固定长度为kq+1的逻辑块,并从各块中提取了2个新的匹配区域特征:q-gram命中的均匀性和q—gram有效命中的区域性.新算法利用这些新特征优化了传统过滤标准,提高了算法的过滤效率;并改进了QUASAR中基于分块策略的过滤区确定方案.实验结果表明,新算法与改进前相比有效地加快了匹配速度,尤其在误差率较小时改进效果更佳.
Approximate string matching is a basic problem in computer science. It is widely used in various areas. The aim of this study is to improve the speed of approximate string matching. Filter algorithm for approximate string matching is discussed because it is suitable for large-scale text searching. A novel filter algorithm based on match-region features is presented. Firstly, a q-gram index is used to preproeess text. Secondly, both pattern and text are logically divided into blocks of fixed size kq+1, and then new match-region features are extracted from blocks, and the algorithm optimizes the fundamental q-gram filtration criterion by the new features. Finally, the improved method of choosing filtration-region based on QUASAR's block addressing is used for filtration. The experimental results demonstrate that the algorithm achieves higher matching speed than that of QUASAR's block addressing by way of improving filtration efficiency. In particular, its matching speed is much faster under low error rate. Experiments also reveal the relationship between matching speed and error rate of new algorithm. These results suggest that the algorithm is useful in a system for approximate string matching with low error rate. It is also powerful for long pattern approximate string matching on the condition of fixed edit distance k.