Detecting approximately duplicate records is an important task in data cleaning. This paper studies how to identify approximately duplicate records when the data set grows dynamically while the data schema and matching rules remain unchanged. To address the low accuracy and poor efficiency of the existing clustering-tree-based algorithm, an improved algorithm is proposed. It uses a rank-based weighting method to assign a weight to each attribute and to reduce the attribute set, clusters similar records by building a clustering tree, and introduces an additional threshold that cuts unnecessary similarity comparisons, which improves both efficiency and accuracy. Experiments demonstrate the effectiveness of the algorithm, and directions for further research are outlined.
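The abstract does not give the algorithm's details, so the following Python sketch is only an illustration of three ideas it names: rank-based attribute weights, attribute reduction, and a threshold that stops a comparison early. The rank-sum weighting formula, the use of `difflib.SequenceMatcher` for field similarity, the flat leader-style clustering that stands in for the clustering tree, and all function names are assumptions made for this example, not the authors' method.

```python
from difflib import SequenceMatcher


def rank_sum_weights(ranks):
    """Map attribute ranks (1 = most important) to weights that sum to 1.

    Assumption: rank-sum weighting; the paper's exact formula is not given.
    """
    n = len(ranks)
    total = n * (n + 1) / 2
    return {attr: (n - rank + 1) / total for attr, rank in ranks.items()}


def reduce_attributes(weights, keep):
    """Keep the `keep` heaviest attributes and renormalize their weights."""
    top = sorted(weights, key=weights.get, reverse=True)[:keep]
    scale = sum(weights[attr] for attr in top)
    return {attr: weights[attr] / scale for attr in top}


def similarity(rec_a, rec_b, weights, threshold):
    """Weighted field similarity with an early exit.

    Returns None as soon as the pair can no longer reach `threshold`,
    which skips the remaining field comparisons.
    """
    score = 0.0
    remaining = 1.0
    for attr in sorted(weights, key=weights.get, reverse=True):
        w = weights[attr]
        score += w * SequenceMatcher(None, rec_a[attr], rec_b[attr]).ratio()
        remaining -= w
        if score + remaining < threshold:
            return None
    return score if score >= threshold else None


def assign(record, clusters, weights, threshold):
    """Add `record` to the first cluster whose representative is similar
    enough; otherwise start a new cluster.

    Assumption: a flat list of clusters, each represented by its first
    record, used here in place of the paper's clustering tree.
    """
    for cluster in clusters:
        if similarity(cluster[0], record, weights, threshold) is not None:
            cluster.append(record)
            return
    clusters.append([record])
```

Because `assign` processes one record at a time, newly arriving records can be added to existing clusters without reclustering the whole data set, which mirrors the dynamically growing setting the abstract describes. Comparing the heaviest attributes first lets a clearly dissimilar pair be rejected after a single field comparison.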