A Method for Detecting Approximately Duplicated Records Based on Same-Level Fields
  • ISSN: 1003-6970
  • Journal: 《软件》 (Software)
  • Classification: TP393 [Automation and Computer Technology — Computer Application Technology; Automation and Computer Technology — Computer Science and Technology]
  • Author affiliation: [1] School of Computer Science and Technology, Zhoukou Normal University, Zhoukou 466001, Henan, China
  • Funding: National Natural Science Foundation of China Youth Program (61103143); Zhoukou Normal University Youth Research Fund (zknuc0215)
Author: 殷秀叶 [1]
Chinese Abstract:

In big-data environments, approximately duplicated records affect the accuracy of statistical analysis, so they must be filtered out. This paper reviews the current state of research on detecting approximately duplicated records and, on that basis, proposes an attribute-weighting approach: attributes are assigned weights, and records are sorted and grouped according to those weights. When weighting attributes, the values of some fields are in one-to-one correspondence and therefore carry identical weights; for these, the concept of synonymous attributes is introduced, and some synonymous attributes are excluded from the original dataset to shrink it and improve the efficiency of duplicate detection. Finally, a method for judging approximately duplicated records is given. To address the challenge that large datasets pose for duplicate detection, the large dataset is split into several small datasets and processed using the MapReduce mechanism: the dataset is grouped by the values of the attributes with the largest weights and divided into several map tasks, each processed separately. Experimental results show that the method effectively improves the efficiency of approximately duplicated record detection.
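The attribute-weighting and synonymous-attribute ideas in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the field names, weights, similarity function, and the 0.85 threshold are all assumptions for demonstration.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity of two field values in [0, 1] (simple string ratio)."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def prune_synonymous(fields, synonymous_groups):
    """Keep one representative per synonymous-attribute group.

    Fields whose values stand in one-to-one correspondence carry the
    same information, so all but the first field in each group are
    dropped, shrinking the dataset before comparison.
    """
    drop = set()
    for group in synonymous_groups:
        drop.update(group[1:])
    return [f for f in fields if f not in drop]


def record_similarity(r1, r2, weights):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f])
               for f, w in weights.items()) / total


def is_duplicate(r1, r2, weights, threshold=0.85):
    """Judge two records as approximate duplicates above a threshold."""
    return record_similarity(r1, r2, weights) >= threshold
```

For example, with weights `{"name": 0.7, "city": 0.3}`, two records whose names differ by a single typo but share the same city still score above the threshold and are flagged as approximate duplicates.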

English Abstract:

The accuracy of data statistical analysis is affected by approximately duplicated records in big-data environments, so such records need to be filtered. We introduce the current research on approximately duplicated records and propose the weighted-attribute idea, weighting the attributes and grouping records according to the weights. Considering that the values of some fields stand in one-to-one correspondence, we propose the concept of synonymous attributes and exclude some synonymous attributes from the original dataset to reduce its size and improve the efficiency of detecting approximately duplicated records. Finally, a method for judging approximately duplicated records is given. Considering the challenge that big datasets pose for duplicate-record detection, big datasets are split into a number of small datasets. Taking full advantage of the MapReduce processing mechanism, the big dataset is grouped according to the values of the attributes with larger weights and then divided into a number of map tasks to be processed separately. Experiments show that this method can effectively improve the detection efficiency for approximately duplicated records.
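The MapReduce-style splitting described in the abstract can be sketched in plain Python: records are grouped (the "map" step) by the value of the highest-weight blocking attribute, and each group is then scanned for duplicate pairs independently, as a separate map task would be. The function names, blocking field, weights, and threshold are illustrative assumptions; a production version would run on an actual MapReduce framework such as Hadoop.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def map_phase(records, blocking_field):
    """Split the dataset into groups keyed by the value of the
    highest-weight attribute; each group becomes one map task."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[blocking_field]].append(rec)
    return groups


def reduce_phase(group, weights, threshold=0.85):
    """Within one group, compare record pairs and emit approximate
    duplicates; records in different groups are never compared."""
    total = sum(weights.values())
    pairs = []
    for r1, r2 in combinations(group, 2):
        score = sum(w * similarity(r1[f], r2[f])
                    for f, w in weights.items()) / total
        if score >= threshold:
            pairs.append((r1, r2))
    return pairs


def detect_duplicates(records, blocking_field, weights):
    """Run the map phase, then the reduce phase on each group."""
    duplicates = []
    for group in map_phase(records, blocking_field).values():
        duplicates.extend(reduce_phase(group, weights))
    return duplicates
```

Because pairwise comparison only happens inside each group, the quadratic comparison cost applies per group rather than to the whole dataset, which is the efficiency gain the abstract attributes to splitting by high-weight attribute values.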

Journal Information
  • 《软件:教学》
  • Supervising organization: China Association for Science and Technology
  • Sponsors: Chinese Institute of Electronics; Tianjin Institute of Electronics
  • Editor-in-chief: 胡锦华
  • Address: P.O. Box 3105, Beijing
  • Postal code: 100044
  • Email: rjjxzz@126.com
  • Telephone: 010-56174511
  • ISSN: 1003-6970
  • CN: 12-9203/TP
  • Indexed in: Index Copernicus (Poland)
  • Citation count: 305