In Deep Web data mining, different data sources often provide conflicting data. Resolving these conflicts to obtain the correct values, a process called data fusion, is a key problem in data integration. We propose a data fusion method that takes the dependence between data sources into account. The method uses Bayesian analysis to determine the dependence between sources, and we design an algorithm that iteratively detects dependence and fuses data. We further extend the model by considering the accuracy of data sources and the similarity between attribute values. Experiments on real data crawled from the Web show that the method notably improves the accuracy of data fusion and scales to a large number of data sources.
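The iterative idea summarized above can be illustrated with a much-simplified sketch. It is not the authors' algorithm: it covers only the loop that fuses values and then re-estimates source accuracy, and it omits the Bayesian dependence detection and the value-similarity extension entirely. The function name `fuse`, the parameters `n_false` and `init_accuracy`, and the log-odds vote weight are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fuse(claims, n_false=10, init_accuracy=0.8, max_iter=20, tol=1e-6):
    """Accuracy-weighted voting, iterated until source accuracies stabilize.

    claims: dict mapping source name -> dict mapping object -> claimed value.
    n_false: assumed number of distinct wrong values per object (hypothetical).
    Returns (truth, accuracy): the fused value per object and the estimated
    accuracy per source.
    """
    accuracy = {s: init_accuracy for s in claims}
    truth = {}
    for _ in range(max_iter):
        # Step 1: each source votes for its values with a log-odds weight.
        # The weight formula is an assumption, not taken from the abstract.
        scores = defaultdict(lambda: defaultdict(float))
        for s, observed in claims.items():
            a = min(max(accuracy[s], 0.01), 0.99)  # clamp to avoid log(0)
            weight = math.log(n_false * a / (1 - a))
            for obj, val in observed.items():
                scores[obj][val] += weight

        # Step 2: the fused value for each object is the highest-scoring one.
        truth = {obj: max(vals, key=vals.get) for obj, vals in scores.items()}

        # Step 3: re-estimate each source's accuracy against the fused values.
        new_accuracy = {}
        for s, observed in claims.items():
            if observed:
                hits = sum(1 for o, v in observed.items() if truth[o] == v)
                new_accuracy[s] = hits / len(observed)
            else:
                new_accuracy[s] = init_accuracy

        # Stop once no accuracy estimate moves by more than tol.
        change = max(
            (abs(new_accuracy[s] - accuracy[s]) for s in claims), default=0.0
        )
        accuracy = new_accuracy
        if change < tol:
            break
    return truth, accuracy


claims = {
    "s1": {"a": "1", "b": "2", "c": "3"},
    "s2": {"a": "1", "b": "2", "c": "3"},
    "s3": {"a": "1", "b": "7", "c": "3"},
}
truth, accuracy = fuse(claims)
```

In this toy input, `s3` disagrees with the other two sources on object `b`, so it ends up with a lower estimated accuracy and less voting weight. A method that also models source dependence, as the abstract describes, would additionally discount votes from sources judged to be copying from one another; this sketch treats all sources as independent.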