A similarity join is an operation that, over a given data set, uses a given similarity function to measure the similarity between data items and finds all pairs whose similarity is not less than a given threshold. With the continuous development of Internet and mobile applications, the volume of data is growing explosively, and analyzing such massive data requires substantial computing power; similarity joins have therefore become one of the active research topics in big data processing. The processing capacity of traditional single-core computer platforms can hardly meet the computational requirements of massive data processing. To improve computational efficiency and performance, multi-threaded parallel programming on multi-core platforms, which exploits the advantages of the multi-core architecture, has become a trend both for low-cost personal parallel computing and for the development of multi-core technology. Therefore, to improve the efficiency of similarity joins by making full use of the multi-core features of modern architectures and of multi-threading, an improved method for parallelizing similarity joins is proposed. Experimental results show that the method substantially improves efficiency.
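To make the definition concrete, the following is a minimal sketch of a threshold-based similarity self-join whose work is split across threads. It is not the method proposed in this paper, only an illustration under stated assumptions: records are sets, the similarity function is Jaccard similarity, and the names `jaccard`, `join_chunk`, and `similarity_self_join` are hypothetical. Note also that in CPython the global interpreter lock limits the speedup of CPU-bound threads, so the sketch shows how the pair comparisons can be partitioned rather than the performance a native multi-core implementation would reach.

```python
from concurrent.futures import ThreadPoolExecutor


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (assumed similarity function)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def join_chunk(records, rows, threshold):
    """Compare every record whose index is in `rows` against all later records."""
    pairs = []
    for i in rows:
        for j in range(i + 1, len(records)):
            # Keep the pair when its similarity is not less than the threshold.
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


def similarity_self_join(records, threshold, workers=4):
    """Return all index pairs (i, j), i < j, with similarity >= threshold."""
    n = len(records)
    # Row i needs n - 1 - i comparisons, so rows are dealt out round-robin
    # to give each worker a similar amount of work.
    chunks = [range(k, n, workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda rows: join_chunk(records, rows, threshold), chunks)
        return sorted(pair for part in parts for pair in part)


if __name__ == "__main__":
    data = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {7, 8}]
    print(similarity_self_join(data, threshold=0.75))
```

The round-robin split is one simple way to balance the triangular comparison workload; block partitioning or a work queue would be equally valid choices.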