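String similarity search is a fundamental operation in many applications, such as data cleaning, spell checking, bioinformatics, and information integration. With the explosive growth of data, large-scale string data have become increasingly common, and modern information systems widely use strings as a form of data representation. Most existing methods that support string similarity search are based on in-memory q-gram inverted indexes. On large string collections they consume an unacceptable amount of memory, and when the data volume is too large, memory runs out and queries cannot be processed at all. The existing external-memory inverted index, Behm-Index, supports only a few filters in the filtering phase of a query and cannot effectively reduce query I/O cost. This paper proposes LPA-Index, an external-memory inverted index that supports the length filter and the positional filter and that effectively reduces query I/O cost by selecting which inverted lists to use at query time. Experimental results show that, compared with Behm-Index, the best-performing existing external-memory index, LPA-Index substantially reduces query I/O cost and achieves shorter query response times.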
String similarity search is fundamental to various applications, such as data cleaning, spell checking, bioinformatics, and information integration, since users in these applications tend to issue fuzzy queries because of input errors in both query strings and data strings. With the rapid growth of data size, big string datasets have become ubiquitous, and almost every modern information system stores data in string format. String similarity search should therefore be well supported in modern information systems. Memory-based q-gram inverted indexes fail to support string similarity search over big string datasets because of the memory constraint: they no longer work once the data size exceeds the memory size. The existing external-memory method, Behm-Index, supports only the length filter and the prefix filter. This paper proposes LPA-Index to reduce I/O cost and thereby improve query response time. It is a disk-resident index, so the data size it can handle is not limited by the memory size. LPA-Index supports both the length filter and the positional filter, which reduce the number of query candidates efficiently, and it selectively reads inverted lists during query processing for better I/O performance. Experimental results on both real datasets and a synthetic dataset demonstrate the efficiency of LPA-Index and its advantages over the existing state-of-the-art disk index, Behm-Index, with regard to I/O cost and query response time.
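To make the two filters named above concrete, the following is a minimal in-memory sketch of a positional q-gram inverted index queried under an edit-distance threshold. It is not LPA-Index itself: the disk layout and the inverted-list selection strategy are not described in this abstract and are not modeled here. The function names (`qgrams`, `build_index`, `search`), the parameters `q` and `tau`, and the use of the standard q-gram count bound to decide candidacy are all assumptions made for illustration.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the positional q-grams of s as (gram, start_position) pairs."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the list of (string_id, position) where it occurs."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, pos in qgrams(s, q):
            index[gram].append((sid, pos))
    return index


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(strings, index, query, q, tau):
    """Return ids of strings within edit distance tau of query."""

    def length_ok(sid):
        # Length filter: the lengths may differ by at most tau.
        return abs(len(strings[sid]) - len(query)) <= tau

    # Count bound (assumed here): each edit destroys at most q query grams,
    # so at least this many query grams must survive in any answer.
    threshold = len(query) - q + 1 - q * tau
    if threshold <= 0:
        # The bound cannot prune anything; fall back to the length filter only.
        candidates = [sid for sid in range(len(strings)) if length_ok(sid)]
    else:
        counts = defaultdict(int)
        for gram, qpos in qgrams(query, q):
            matched = set()
            for sid, pos in index.get(gram, ()):
                # Positional filter: a surviving gram shifts by at most tau.
                if length_ok(sid) and abs(pos - qpos) <= tau:
                    matched.add(sid)
            for sid in matched:
                counts[sid] += 1
        candidates = [sid for sid, c in counts.items() if c >= threshold]

    # Verification step: remove the false positives the filters let through.
    return sorted(sid for sid in candidates
                  if edit_distance(strings[sid], query) <= tau)
```

In this sketch the filters only shrink the candidate set; the final edit-distance check decides the answer. A disk-resident index such as the one proposed would additionally have to decide which inverted lists are worth reading at all, which is where the I/O savings claimed above would come from.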