In big data environments, approximately duplicated records affect the accuracy of statistical analysis results, so such records need to be filtered out. This paper reviews the current research on approximately duplicated record detection and, on that basis, proposes an attribute-weighting approach: attributes are assigned weights, and records are sorted and grouped according to those weights. Because the values of some fields have a one-to-one correspondence and therefore carry the same weight, the concept of synonymous attributes is introduced; some synonymous attributes are removed from the original dataset to shrink it and improve the efficiency of duplicate detection. A method for judging approximately duplicated records is then given. To address the challenge that large datasets pose for duplicate detection, the big dataset is split into a number of small datasets and processed with the MapReduce mechanism: records are grouped by the values of the higher-weight attributes and partitioned into several map tasks that are processed separately. Experimental results show that this method effectively improves the efficiency of approximately duplicated record detection.
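As a rough illustration of the approach summarized above (not the paper's actual implementation), the following Python sketch groups records by the value of an assumed highest-weight attribute, mimicking the map-side split, and then compares records only within each group using a weighted sum of per-field similarities. The field weights, blocking key, and similarity threshold are illustrative assumptions.

```python
from itertools import combinations
from difflib import SequenceMatcher

# Illustrative attribute weights and threshold (assumed, not from the paper);
# higher-weight fields contribute more to the overall record similarity.
FIELD_WEIGHTS = {"name": 0.4, "address": 0.3, "phone": 0.2, "email": 0.1}
DUPLICATE_THRESHOLD = 0.85
BLOCKING_FIELD = "name"  # assumed highest-weight attribute used for grouping


def field_similarity(a: str, b: str) -> float:
    """Character-level similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * field_similarity(r1.get(f, ""), r2.get(f, ""))
               for f, w in FIELD_WEIGHTS.items())


def map_phase(records):
    """Map step: emit (group key, record) so that only records sharing the
    key derived from the highest-weight attribute are compared later."""
    for rec in records:
        yield rec[BLOCKING_FIELD][:3].lower(), rec  # value prefix as a coarse key


def reduce_phase(groups):
    """Reduce step: pairwise comparison inside each group only."""
    duplicates = []
    for group in groups.values():
        for r1, r2 in combinations(group, 2):
            if record_similarity(r1, r2) >= DUPLICATE_THRESHOLD:
                duplicates.append((r1, r2))
    return duplicates


if __name__ == "__main__":
    records = [
        {"name": "Zhang Wei", "address": "12 Renmin Rd", "phone": "138000", "email": "zw@x.com"},
        {"name": "Zhang Wei", "address": "12 Renmin Road", "phone": "138000", "email": "zw@x.com"},
        {"name": "Li Na", "address": "8 Hubin St", "phone": "139111", "email": "ln@x.com"},
    ]
    groups = {}
    for key, rec in map_phase(records):
        groups.setdefault(key, []).append(rec)
    for r1, r2 in reduce_phase(groups):
        print("possible duplicate:", r1["name"], "<->", r2["name"])
```

In a real MapReduce deployment the map and reduce phases above would run as distributed tasks, with the framework shuffling records that share a group key to the same reducer; the single-process version here only illustrates how grouping by a high-weight attribute limits pairwise comparisons.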