Efficient query processing over nested datasets has become an important task in Web data retrieval. Unlike traditional information retrieval, managing nested data requires storing both the data and its structure, which makes queries over such datasets inefficient; keeping precise (exact-match) queries efficient is especially challenging. Combining columnar storage with inverted indexing, this paper defines UPath, a unique path that expresses the location of a data value within a nested record, and presents UniHash, a new index structure that supports precise queries on nested datasets. It then gives algorithms for generating the UPath of each data value and for building the UniHash index on the MapReduce framework. A comparison with XPath-based retrieval verifies the effectiveness of the UniHash index structure. Experimental results show that storing nested data in columnar form and indexing it with UniHash significantly improves the efficiency of precise queries.
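The abstract does not spell out how UPath or UniHash are defined, so the following is only an illustrative sketch of the general idea it describes: give every leaf value in a nested record a unique location path, then keep a hash-based inverted index from values to those locations so that an exact-match query becomes a single lookup. The function names (`iter_unique_paths`, `build_value_index`, `exact_lookup`) and the tuple-based path encoding are assumptions made for this example, not the paper's definitions, and the sketch runs in a single process rather than on MapReduce.

```python
from collections import defaultdict


def iter_unique_paths(node, prefix=()):
    """Yield (path, value) for every leaf value in a nested record.

    Hypothetical path encoding: a tuple of dict keys and list positions.
    This is a stand-in for the paper's UPath, not its actual definition.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_unique_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for position, child in enumerate(node):
            yield from iter_unique_paths(child, prefix + (position,))
    else:
        # Leaf value: the accumulated prefix identifies its location.
        yield prefix, node


def build_value_index(records):
    """Build a hash-based inverted index: value -> [(record_id, path), ...]."""
    index = defaultdict(list)
    for record_id, record in enumerate(records):
        for path, value in iter_unique_paths(record):
            index[value].append((record_id, path))
    return index


def exact_lookup(index, value):
    """Return every location at which the exact value occurs."""
    return index.get(value, [])
```

With such an index, a precise query is answered by one hash lookup instead of a traversal of every record, which is the kind of saving the abstract attributes to indexing; how the actual UniHash structure is laid out over columnar storage is not stated here.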