A Distributed Learning-to-Hash Method for Approximate Nearest Neighbor Search

ISSN: 0254-4164
Journal: Chinese Journal of Computers (《计算机学报》)
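Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Software, Tsinghua University, Beijing 100084; [2] Tsinghua National Laboratory for Information Science and Technology (preparatory), Beijing 100084
Funding: Special Fund for Big Data Science and Technology of the Tsinghua National Laboratory for Information Science and Technology; National Natural Science Foundation of China (61325008, 61502265); China Postdoctoral Science Foundation Special Funding (2015T80088)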

Authors: Wen Qingfu (文庆福)[1], Wang Jianmin (王建民)[1,2], Zhu Han (朱晗)[1], Cao Yue (曹越)[1], Long Mingsheng (龙明盛)[1,2]

Keywords: approximate nearest neighbor search, learning to hash, high-dimensional indexing, distributed computing, Spark

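Chinese abstract (translated):

Approximate nearest neighbor search is an important technique in information retrieval. As unstructured data such as text, images, and video grows rapidly in scale, fast and accurate querying over massive high-dimensional data is a problem that any large-scale data processing must face. Hashing, one of the key methods for approximate nearest neighbor search, can compress high-dimensional data at a high ratio while preserving data similarity. Previously proposed hashing methods mostly target centrally stored data and therefore have difficulty handling data stored in a distributed manner. This paper proposes SparkPQ, a distributed learning-to-hash method based on product quantization, and implements the algorithm on the Spark distributed computing framework. Building on the traditional product quantization method, the paper first gives a formal definition of the distributed product quantization model. The authors then design a distributed matrix partitioned by rows and columns, use a distributed K-Means algorithm to solve the model and train the codebooks, and use the trained codebook model to encode and index the distributed data. Finally, the paper builds a complete approximate nearest neighbor search system that substantially reduces storage and computation overhead and speeds up queries while maintaining high retrieval accuracy. Experiments on relatively large image retrieval datasets verify the correctness and scalability of the method.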
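The product quantization idea that the abstract builds on can be illustrated on a single machine. The following is a minimal sketch in plain Python, not the authors' SparkPQ implementation: the function names (`train_codebooks`, `encode`, `search`), the parameters `m` (number of subspaces) and `k` (codewords per subspace), and the random toy data are all assumptions made for illustration. It trains one k-means codebook per subspace, stores each vector as `m` small integers, and answers a query with per-subspace lookup tables.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: group points by their nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        # Update step: move each centroid to the mean of its group.
        for j, g in enumerate(groups):
            if g:  # an empty group keeps its previous centroid
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def split(vec, m):
    """Split a vector into m contiguous sub-vectors of equal length."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def train_codebooks(data, m, k):
    """Train one k-means codebook per subspace (hypothetical helper)."""
    parts = [split(v, m) for v in data]
    return [kmeans([p[i] for p in parts], k, seed=i) for i in range(m)]

def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def search(query, codes, codebooks, topk=1):
    """Asymmetric distance search: the query itself is not compressed."""
    subs = split(query, len(codebooks))
    # Lookup table: distance from each query sub-vector to every codeword.
    table = [[sq_dist(s, w) for w in cb] for s, cb in zip(subs, codebooks)]
    scored = [(sum(table[i][c] for i, c in enumerate(code)), idx)
              for idx, code in enumerate(codes)]
    return [idx for _, idx in sorted(scored)[:topk]]

# Toy usage on random 8-dimensional data (illustrative only).
rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(data, m=4, k=16)
codes = [encode(v, codebooks) for v in data]
print(search(data[0], codes, codebooks, topk=3))
```

With `m=4` and `k=16`, each 8-dimensional vector is stored as four integers below 16, which is the kind of compression the abstract refers to. The returned neighbors are approximate because distances are computed to codewords rather than to the original vectors.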

English abstract:

Approximate nearest neighbor (ANN) search is an important technique in information retrieval. With the rapid growth of unstructured data such as text, images, and videos, performing efficient and accurate search over large-scale, high-dimensional data has become an unavoidable problem. As a key approach to ANN search, hashing compresses high-dimensional data substantially while preserving similarity. Previous hashing methods usually target centrally stored data and therefore cannot readily handle distributed data. This paper proposes SparkPQ, a distributed learning-to-hash method based on Product Quantization (PQ), implemented in the Spark distributed computing framework. Building on the seminal PQ method, we first give a formal definition of the distributed Product Quantization model. We then design a distributed matrix partitioned by rows and columns and apply a distributed K-Means algorithm to solve the SparkPQ model and train a codebook. We encode and index the distributed database with the trained codebook model. Finally, we build an integrated ANN search system that substantially reduces storage and computation costs and speeds up queries while maintaining high search accuracy. A comprehensive empirical study on large-scale image retrieval datasets validates the effectiveness and scalability of the proposed distributed learning-to-hash method.
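The distributed K-Means step mentioned in the abstract can be pictured as each data partition computing partial sums locally, with only those small summaries being combined centrally. The sketch below simulates this in plain Python with a list of lists standing in for partitions. It is an assumption-laden illustration, not the paper's Spark code: `partial_stats`, `merge`, and `distributed_kmeans` are hypothetical names, and the abstract does not say how SparkPQ actually organizes this computation.

```python
from functools import reduce

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def partial_stats(partition, centroids):
    """'Map' side: per-centroid coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        counts[j] += 1
        for t, x in enumerate(p):
            sums[j][t] += x
    return sums, counts

def merge(a, b):
    """'Reduce' side: add two (sums, counts) summaries element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts

def distributed_kmeans(partitions, centroids, iters=10):
    """Lloyd iterations in which only summaries leave each partition."""
    centroids = [list(c) for c in centroids]
    for _ in range(iters):
        stats = [partial_stats(part, centroids) for part in partitions]
        sums, counts = reduce(merge, stats)
        # A centroid with no assigned points keeps its previous position.
        centroids = [[s / n for s in row] if n else old
                     for row, n, old in zip(sums, counts, centroids)]
    return centroids

# Two simulated partitions, each holding one tight cluster.
partitions = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 10.0], [10.0, 11.0]]]
print(distributed_kmeans(partitions, [[0.0, 0.0], [10.0, 10.0]]))
# prints [[0.0, 0.5], [10.0, 10.5]]
```

The design point is that the amount of data exchanged per iteration depends on the number of centroids and the dimensionality, not on the number of points held in each partition.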

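Journal information

Chinese Journal of Computers (《计算机学报》)
Peking University Core Journal (2011 edition)

Supervising body: Chinese Academy of Sciences
Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
Editor-in-chief: Sun Ninghui (孙凝晖)
Address: No. 6 Kexueyuan South Road, Zhongguancun, Beijing
Postal code: 100190
Email: cjc@ict.ac.cn
Telephone: 010-62620695

International standard serial number (ISSN): 0254-4164
Domestic unified serial number (CN): 11-1826/TP
Postal distribution code: 2-833

Awards:
"Double Effect" journal of the China Journal Phalanx (中国期刊方阵"双效"期刊)

Indexed in domestic and international databases:
Mathematical Reviews, online edition (USA); abstract and citation database (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); China Science and Technology Core Journals (China); Peking University Core Journals (China), 2000, 2004, 2008, 2011, and 2014 editions

Citation count: 48433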