Computing similarity is a fundamental problem in information retrieval. Estimating the similarity of two, three, or more sets has wide application in near-duplicate document detection, word association, clustering, and data cleaning. This paper focuses on estimating three-way resemblance with connected bit Minwise Hash. As an efficient and accurate method for similarity estimation, connected bit Minwise Hash reduces the number of comparisons several-fold and thereby improves performance. An unbiased estimator of the three-way resemblance based on connected bit Minwise Hash is derived theoretically. The experiments analyze precision, recall, and efficiency. The results show that with sample size k = 500 and similarity threshold R0 = 0.8, both the precision and the recall of the algorithm exceed 95%, while the CPU running time is only 50% of that required by the b-bit Minwise Hash three-way estimator.
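For readers unfamiliar with the underlying idea, the following is a minimal sketch of three-way resemblance estimation with plain (full-value) minwise hashing, where the resemblance of sets A, B, C is taken to be |A ∩ B ∩ C| / |A ∪ B ∪ C|. It is not the connected bit or b-bit estimator of this paper, whose formulas are not given in the abstract. The function names, the linear hash family `(a*x + b) mod p`, and the example sets are all assumptions made for illustration; under ideal random permutations the fraction of positions where all three minima coincide is an unbiased estimate of the resemblance, and the linear family only approximates that ideal.

```python
import random

# Assumption: a large Mersenne prime as modulus for the illustrative hash family.
PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME, a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, params):
    """Return the k minimum hash values of a non-empty set of ints in [0, PRIME)."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_three_way(sig1, sig2, sig3):
    """Fraction of sample positions where all three minima agree."""
    matches = sum(1 for x, y, z in zip(sig1, sig2, sig3) if x == y == z)
    return matches / len(sig1)


def exact_three_way(s1, s2, s3):
    """Exact three-way resemblance |A & B & C| / |A | B | C|."""
    return len(s1 & s2 & s3) / len(s1 | s2 | s3)


if __name__ == "__main__":
    # Hypothetical data: three overlapping sets drawn from random 32-bit ids.
    rng = random.Random(42)
    ids = rng.sample(range(2**32), 1100)
    A, B, C = set(ids[:1000]), set(ids[50:1050]), set(ids[100:1100])

    params = make_hash_params(k=500)  # k = 500 mirrors the sample size in the abstract
    sigs = [minhash_signature(s, params) for s in (A, B, C)]
    print("exact    :", exact_three_way(A, B, C))
    print("estimate :", estimate_three_way(*sigs))
```

Storing full hash values as above costs far more space than keeping only a few low-order bits per sample, which is the motivation for b-bit style schemes; the price is that matches on truncated bits can occur by chance, so those schemes need a corrected estimator rather than the simple match fraction used here.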