In Web data integration, multiple data sources often describe the same entity with conflicting values. Resolving these conflicts and discovering the true values helps improve the quality of data integration and supports the construction of high-quality knowledge bases. Existing methods for resolving single-truth data conflicts have three limitations: their source quality measures are inadequate, they cannot distinguish between a source's missing data (false negatives) and its false values (false positives), and they cannot handle high-order copying among sources, such as transitive copying and co-copying. This paper therefore uses recall and false positive rate to measure source quality, and proposes a truth discovery algorithm that handles complex data copying among sources. Experimental results on three real-world data sets and on synthetic data sets show that the proposed algorithm effectively reduces the bias in truth computation caused by the copying of erroneous data and improves the precision of truth discovery.
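To make the two source quality measures concrete, the following is a minimal sketch of how recall and false positive rate could be computed for each source once the true values are known. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the function name `source_quality`, the toy sources `s1` to `s3`, and the convention of treating every (object, value) pair claimed by any source as a candidate fact are all hypothetical choices made for this example, and the standard definitions recall = TP / (TP + FN) and false positive rate = FP / (FP + TN) are assumed.

```python
from collections import defaultdict


def source_quality(claims, truth):
    """Return {source: (recall, false_positive_rate)}.

    claims: dict mapping source -> {object: claimed value} (single-truth setting,
            so each source gives at most one value per object).
    truth:  dict mapping object -> true value.
    Assumption for this sketch: the candidate facts for an object are the true
    value plus every value that any source claimed for that object.
    """
    candidates = defaultdict(set)
    for obj, value in truth.items():
        candidates[obj].add(value)
    for per_source in claims.values():
        for obj, value in per_source.items():
            if obj in truth:
                candidates[obj].add(value)

    quality = {}
    for source, per_source in claims.items():
        tp = fn = fp = tn = 0
        for obj, values in candidates.items():
            for value in values:
                asserted = obj in per_source and per_source[obj] == value
                is_true = truth[obj] == value
                if is_true and asserted:
                    tp += 1
                elif is_true:
                    fn += 1  # true value not provided: missing data
                elif asserted:
                    fp += 1  # false value provided
                else:
                    tn += 1  # false value correctly not provided
        recall = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        quality[source] = (recall, fpr)
    return quality


# Toy data (hypothetical): s2 gives one wrong value and omits one object,
# s3 omits two objects but never gives a wrong value.
truth = {"book1": "A", "book2": "B", "book3": "C"}
claims = {
    "s1": {"book1": "A", "book2": "B", "book3": "C"},
    "s2": {"book1": "A", "book2": "X"},
    "s3": {"book1": "A"},
}

if __name__ == "__main__":
    for source, (recall, fpr) in source_quality(claims, truth).items():
        print(f"{source}: recall={recall:.2f}, false positive rate={fpr:.2f}")
```

In this toy data, `s2` and `s3` have the same recall, yet only `s2` asserts a false value, so their false positive rates differ. A single accuracy-style score would not separate a source that merely omits data from one that provides wrong data, which is the distinction the pair of measures is meant to capture.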