对于高维空间的近邻查找问题,位置敏感哈希(LSH)在查询代价和磁盘空间利用上有着出色表现。在传统分析模型下,LSH被视作随机算法,唯一不确定因素就是哈希函数的选择。研究中将这种模型下得到的碰撞概率称为基于哈希函数的碰撞概率。在本文中,使用了不同的分析模型对LSH作了理论分析。此工作的出发点有2个:1)在现有的分析模型下,用户为了达到理论的效果,必须对每个查询点产生随机的数据结构,这在实际应用中是不现实的。2)用户所关心的性能指标是随机查询点在一个数据结构上的期望碰撞概率。基于此,本篇论文即推导了在汉明距离下,随机点对在任意单个哈希函数上的碰撞概率。研究将此模型下推导出的碰撞概率称为基于随机查询的碰撞概率。同时也一并证明了在汉明空间中,2种碰撞概率完全相同。
Locality-sensitive hashing (LSH) offers good asymptotic performance bounds on query cost and space consumption for similarity search in high-dimensional spaces. In the traditional analysis model, LSH is regarded as a randomized algorithm in which the only source of uncertainty is the random choice of hash functions. The paper calls the collision probability obtained under this model the hash-function-based collision probability. This paper analyzes LSH theoretically under a different model. The motivation is twofold: 1) under the existing analysis model, achieving the theoretical performance requires generating a new random data structure for each query, which is unaffordable in practice; 2) the performance metric that practitioners care about is the expected success probability of a random query over a single randomly generated data structure. To this end, the paper analytically derives, for Hamming distance, the probability that a random pair of data points collides under any single hash function. The paper calls the collision probability derived under this model the random-input-based collision probability. The paper also proves that, in Hamming space, these two collision probabilities are exactly equal.
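The difference between the two probabilities can be illustrated with a small exhaustive check. The sketch below is not the paper's derivation: it assumes the standard bit-sampling family for Hamming space, h_i(v) = v[i], and the function names `hash_function_based` and `random_input_based` are hypothetical. The first function fixes a pair of points and averages over all hash functions; the second fixes one hash function and averages over all pairs at Hamming distance r.

```python
from fractions import Fraction
from itertools import combinations, product


def hash_function_based(x, y):
    """Collision probability of a FIXED pair (x, y) when the hash function
    h_i(v) = v[i] is chosen uniformly at random (assumed bit-sampling family)."""
    d = len(x)
    hits = sum(1 for i in range(d) if x[i] == y[i])
    return Fraction(hits, d)


def random_input_based(d, r, i):
    """Collision probability under a FIXED hash function h_i when the pair
    (x, y) is drawn uniformly from all pairs at Hamming distance exactly r."""
    hits = total = 0
    for x in product((0, 1), repeat=d):
        for flips in combinations(range(d), r):
            # Build y by flipping exactly the r coordinates listed in `flips`.
            y = tuple(b ^ (k in flips) for k, b in enumerate(x))
            total += 1
            if x[i] == y[i]:
                hits += 1
    return Fraction(hits, total)


if __name__ == "__main__":
    d, r = 8, 3
    x = (0,) * d
    y = (1,) * r + (0,) * (d - r)  # differs from x in exactly r coordinates
    print("hash-function-based:", hash_function_based(x, y))
    print("random-input-based :", [random_input_based(d, r, i) for i in range(d)])
```

Under this assumed family both quantities equal 1 - r/d for every choice of the fixed hash function, which is consistent with the equality the paper states for Hamming space. Exact enumeration with `Fraction` is used instead of random sampling so that the comparison is not affected by sampling noise; it is only practical for small d.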