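To clean up similar and duplicate redundant data in the Web big data environment and to reduce the time and cost of data storage and management, a method for cleaning approximately duplicate data in Web big data is proposed. Web data are first preprocessed, and an implementation of the similarity hash (SimHash) algorithm is then presented to compute the similarity between data items. For similar data items that meet a given threshold, one item and its replica are retained, and the remaining items store only the address of the retained item. The method was tested on the Hadoop platform with Web data from multiple websites, and the experimental results show that it achieves good accuracy and data reduction.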
To clean approximately duplicate data in Web big data and to reduce the time and cost of data storage and management, a cleaning method for approximately duplicate cross-source data based on Web big data, called ADDCWBD, is proposed. Web data are first preprocessed. An implementation of the SimHash algorithm is then presented to compute the similarity between data items. For data items whose similarity satisfies the threshold, one item and its copy are saved, while the others save only its address. The method was applied to Web data from multiple portal sites on the Hadoop platform. The experimental results verify its accuracy and show a good data reduction rate.
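The SimHash-and-threshold step described above can be illustrated with a minimal single-machine sketch. This is not the ADDCWBD implementation: the tokenizer, the 64-bit fingerprint built from MD5, the Hamming-distance threshold `max_distance`, and the function names `simhash`, `hamming` and `deduplicate` are all assumptions made for illustration, and the preprocessing, replica storage and Hadoop distribution mentioned in the abstract are omitted.

```python
import hashlib
import re


def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for real preprocessing)."""
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """Return a 64-bit SimHash fingerprint of the text (unweighted tokens)."""
    bits = 64
    weights = [0] * bits
    for token in tokenize(text):
        # Use the first 8 bytes of MD5 as a 64-bit hash of the token.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(records, max_distance=3):
    """Keep one record per group of near-duplicates.

    records maps an address (hypothetical identifier) to its text.
    Returns (kept, pointers): kept maps retained addresses to fingerprints,
    pointers maps each removed address to the address of the retained record.
    """
    kept = {}
    pointers = {}
    for address, text in records.items():
        fp = simhash(text)
        for kept_address, kept_fp in kept.items():
            if hamming(fp, kept_fp) <= max_distance:
                pointers[address] = kept_address  # store only the address
                break
        else:
            kept[address] = fp
    return kept, pointers
```

The pairwise comparison inside `deduplicate` is quadratic in the number of retained records, which is acceptable for a small illustration only. The threshold value of 3 bits is an arbitrary choice here, not a value taken from the paper.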