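The Web has become a vast ocean of information, with that information scattered across different data sources. Different data sources often provide conflicting attribute values for the same object entity. Finding the true values among these conflicting attribute values is known as the truth discovery problem. Object attributes can be divided into single-valued and multi-valued attributes according to the number of values they take, and most existing truth discovery algorithms are effective mainly for single-valued attributes. To address truth discovery for multi-valued attributes, a multi-truth discovery method, MTruths, is proposed. The method transforms the multi-truth discovery problem into an optimization problem whose objective is to maximize the weighted sum of similarities between the truths of each object and the observed values provided by each data source. For solving the object truths, two methods are proposed to find the optimal truth list: an enumeration-based method and a greedy algorithm. Unlike existing methods, MTruths directly obtains multiple truths for an object. Finally, experiments on two real-world data sets, books and movies, show that both MTruths implementations are more accurate, and the greedy algorithm more efficient, than existing truth discovery methods.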
The Web has become a massive information repository in which information is scattered across different data sources. It is common for different data sources to provide conflicting information for the same entity. Finding the truths from conflicting information is called the truth finding problem. According to the number of attribute values, object attributes can be divided into two categories: single-valued attributes and multiple-valued attributes. Most existing truth finding work is designed for single-valued attributes. In this paper, a method called MTruths is proposed to resolve the truth finding problem for multiple-valued attributes. We model the problem as an optimization problem. The objective is to maximize the total weighted similarity between the truths and the observations provided by data sources. In the truth finding process, two methods are proposed to find the optimal solution: an enumeration algorithm and a greedy algorithm. Experiments on two real data sets show that the correctness of our approach and the efficiency of the greedy algorithm outperform the existing state-of-the-art techniques.
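The optimization view described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, how source weights are obtained, or the details of either search procedure, so everything below is an assumption made for illustration: Jaccard similarity between value sets, fixed source weights supplied by the caller, and hypothetical function names (`objective`, `enumerate_truths`, `greedy_truths`). It is not the MTruths implementation itself.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two value sets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def objective(truth, observations, weights):
    """Weighted sum of similarities between a candidate truth set and each
    source's observed value set. `weights` maps source -> assumed fixed weight."""
    return sum(weights[s] * jaccard(truth, obs) for s, obs in observations.items())


def enumerate_truths(observations, weights):
    """Exhaustive search: score every subset of the claimed values."""
    candidates = sorted(set().union(*observations.values()))
    best = frozenset()
    best_score = objective(best, observations, weights)
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            score = objective(subset, observations, weights)
            if score > best_score:
                best, best_score = frozenset(subset), score
    return best, best_score


def greedy_truths(observations, weights):
    """Greedy search: repeatedly add the single value that raises the
    objective the most, and stop when no addition improves it."""
    candidates = set().union(*observations.values())
    truth = set()
    score = objective(truth, observations, weights)
    while candidates:
        pick, pick_score = None, score
        for v in sorted(candidates):
            s = objective(truth | {v}, observations, weights)
            if s > pick_score:
                pick, pick_score = v, s
        if pick is None:
            break
        truth.add(pick)
        candidates.remove(pick)
        score = pick_score
    return frozenset(truth), score


# Made-up example: three sources list the authors of one book.
observations = {"s1": {"A", "B"}, "s2": {"A", "B", "C"}, "s3": {"A"}}
weights = {"s1": 0.9, "s2": 0.6, "s3": 0.3}
print(enumerate_truths(observations, weights))
print(greedy_truths(observations, weights))
```

On this toy input both searches select the author set {A, B}. The enumeration scores every subset of the claimed values, so its cost grows exponentially with the number of distinct values, whereas the greedy variant evaluates at most a quadratic number of candidate sets; that trade-off is consistent with the efficiency claim made for the greedy algorithm, although the greedy result is not guaranteed to be optimal in general.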