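To automate data quality assessment, this paper proposes QDC (Quantify Dimensions within Context), a method for quantifying data quality within a context scope. Data quality can be measured by the gap between the data and the "perfect representation" of the corresponding entity. Because the perfect representation is difficult or costly to obtain, the paper proposes that, when multiple data sources are available, it can be replaced within the context scope by the "closest approximation" obtained through voting, which fixes the reference standard for quality assessment. The paper also proposes using the information entropy measure from information theory to unify the quality dimensions of different data types into a common metric. As an automated data quality assessment method, QDC not only gives accurate scores for the accuracy and completeness dimensions but is also computationally efficient.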
To quantify data quality dimensions automatically in a multi-source environment, this paper proposes a novel approach called Quantify Dimensions within Context (QDC). Data quality can be gauged by the discrepancy between a data view and the perfect representation of the entity it describes. Because the perfect representation of an entity is difficult to obtain, the paper proposes approximating it within the available context and quantifying the quality dimensions within that context scope. By borrowing the concept of entropy from information theory, a measure can be given uniformly for different types of data. In this way, the two most important quality dimensions, accuracy and completeness, are properly quantified. The QDC approach gives objective scores and rankings in a cooperative multi-source environment and avoids laborious human interaction. As an automatic quality rating solution, it is particularly suited to large-scale datasets. Theoretical analysis and experiments show that the approach performs well for quality rating.
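The abstract names two ingredients: a voted approximation of the perfect representation within the context, and an entropy-based measure. The Python sketch below illustrates how these might fit together; it is not the paper's QDC algorithm. The sample records, the function names, the use of simple majority voting, and the specific accuracy and completeness formulas are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical context: three sources describing the same entity (None = missing).
sources = {
    "A": {"city": "Beijing", "zip": "100084"},
    "B": {"city": "Beijing", "zip": None},
    "C": {"city": "Peking", "zip": "100084"},
}
attributes = ["city", "zip"]


def vote_reference(values):
    """Most frequent non-missing value, used as a stand-in for the perfect value."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0] if present else None


def vote_entropy(values):
    """Shannon entropy in bits of the non-missing votes (0 means full agreement)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    n = len(present)
    return sum(-(c / n) * log2(c / n) for c in Counter(present).values())


def completeness(record):
    """Assumed definition: share of attributes that the source fills in."""
    filled = [a for a in attributes if record.get(a) is not None]
    return len(filled) / len(attributes)


def accuracy(record, reference):
    """Assumed definition: share of filled attributes that match the voted reference."""
    filled = [a for a in attributes if record.get(a) is not None]
    if not filled:
        return 0.0
    return sum(record[a] == reference[a] for a in filled) / len(filled)


# Build the voted reference and the per-attribute disagreement across the context.
reference = {a: vote_reference([r.get(a) for r in sources.values()]) for a in attributes}
disagreement = {a: vote_entropy([r.get(a) for r in sources.values()]) for a in attributes}
scores = {name: (accuracy(r, reference), completeness(r)) for name, r in sources.items()}

for name, (acc, comp) in scores.items():
    print(f"{name}: accuracy={acc:.2f} completeness={comp:.2f}")
```

In this toy context, source B is incomplete but agrees with the vote wherever it has a value, while source C is complete but deviates on one attribute. The entropy of the votes shows how much the sources disagree on each attribute, which hints at how reliable the voted reference is.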