东篱科研大数据发现系统（DRDS）

位置：成果数据库 > 期刊 > 期刊详情页

基于指令级并行的倒排索引压缩算法

ISSN号：1000-1239
期刊名称：《计算机研究与发展》
时间：0
分类：TP301.6[自动化与计算机技术—计算机系统结构;自动化与计算机技术—计算机科学与技术]
作者机构：[1]北京大学网络与信息系统研究所,北京100871, [2]淘宝(中国)软件有限公司,杭州312000, [3]北京理工大学,北京100081
相关基金：国家“九七三”重点基础研究发展计划基金项目（2014CB340400）；国家自然科学基金项目（61272340）；江苏未来网络创新研究院项目-云服务数字资源搜索（BY2013095-4-02）

作者：闫宏飞[1], 张旭东[1], 单栋栋[2], 毛先领[3], 赵鑫[1]

关键词：单指令多数据流, 倒排索引, 压缩, 整数编码, 信息检索, single instruction multiple data（SIMD）, inverted index, compression, integer encoding, information retrieval

中文摘要：

文本信息数量的快速增长给传统的信息检索技术带来了新的挑战.搜索引擎通常使用倒排索引来高效地处理查询.为了减少存储开销和加快访问速度，倒排索引通常被压缩存储.因此，如何选择一个高性能的压缩算法对高效查询处理是非常有必要的.在已有倒排链压缩算法PackedBinary和PForDelta的基础上，利用CPU的超标量特性和SIMD向量指令集，将其压缩和解压缩中的关键步骤并行化，提出了2种指令级并行压缩算法SIMD-PB和SIMD-PFD.基于GOV2和ClueWeb09B两个公开数据集的实验表明，SIMD-PB和SIMD-PFD算法在压缩率不变的情况下，压缩和解压缩速度比现有的压缩算法均有非常明显的提升.其中解压缩速度比起目前最好的倒排链压缩算法，最高能提升17%.此外，实验表明算法在较长的倒排链、较大的压缩块单位上有更好的解压缩性能.

英文摘要：

The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

同期刊论文项目