MTruths: A Multi-Truth Finding Method for Web Information

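ISSN: 1000-1239
Journal: Journal of Computer Research and Development (《计算机研究与发展》)
Date: 0
Classification: TP311 [Automation and Computer Technology - Computer Software and Theory; Automation and Computer Technology - Computer Science and Technology]
Author affiliations: [1] School of Information, Renmin University of China, Beijing 100872; [2] Department of Educational Technology, Capital Normal University, Beijing 100048; [3] School of Information Engineering, Beijing Institute of Fashion Technology, Beijing 100029
Funding: National Natural Science Foundation of China (61379050, 91224008, 61502279); National High Technology Research and Development Program of China (863 Program) (2013AA013204); Specialized Research Fund for the Doctoral Program of Higher Education (20130004130001); Research Funds of Renmin University of China (11XNL010)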

Keywords: truth finding, data conflicting, single-valued attributes, multi-valued attributes, quality of data sources

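Abstract (translated from the Chinese):

The Web has become a vast ocean of information, with content scattered across different data sources. Different data sources often provide conflicting attribute values for the same object entity. Finding the true values among these conflicting attribute values is known as the truth finding problem. Object attributes can be divided, by the number of values they take, into single-valued and multi-valued attributes, and most existing truth finding algorithms are effective mainly for single-valued attributes. To address truth finding for multi-valued attributes, a multi-truth finding method called MTruths is proposed. It transforms the multi-truth finding problem into an optimization problem whose objective is to maximize the weighted sum of similarities between each object's truths and the observed values provided by each data source. To solve for an object's truths, two methods are proposed for finding the optimal truth list: an enumeration-based method and a greedy algorithm. Unlike existing methods, MTruths can directly obtain multiple truths for an object. Finally, experiments on two real-world data sets, books and movies, show that both implementations of MTruths are more accurate than existing truth finding methods and that the greedy algorithm is also more efficient.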

English abstract:

The Web has become a massive information repository in which information is scattered across different data sources. It is common for different data sources to provide conflicting information about the same entity. The problem of finding the truths among conflicting information is called the truth finding problem. According to the number of attribute values, object attributes can be divided into two categories: single-valued attributes and multiple-valued attributes. Most existing truth finding work is designed for single-valued attributes. In this paper, a method called MTruths is proposed to resolve the truth finding problem for multiple-valued attributes. We model the problem as an optimization problem. The objective is to maximize the total weighted similarity between the truths and the observations provided by data sources. In the truth finding process, two methods are proposed to find the optimal solution: an enumeration algorithm and a greedy algorithm. Experiments on two real data sets show that the correctness of our approaches and the efficiency of the greedy algorithm outperform existing state-of-the-art techniques.
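
The optimization view described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or how source weights are obtained, so the sketch assumes Jaccard similarity between value sets and treats the source weights as fixed inputs. All names (`jaccard`, `objective`, `enumerate_truths`, `greedy_truths`) and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two value sets (an assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def objective(truths, observations, weights):
    """Weighted sum of similarities between a candidate truth set
    and the value set claimed by each source."""
    return sum(weights[s] * jaccard(truths, vals)
               for s, vals in observations.items())


def enumerate_truths(observations, weights):
    """Exhaustive search over all subsets of the claimed values."""
    candidates = sorted(set().union(*observations.values()))
    best = frozenset()
    best_score = objective(best, observations, weights)
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            score = objective(combo, observations, weights)
            if score > best_score:
                best, best_score = frozenset(combo), score
    return best


def greedy_truths(observations, weights):
    """Repeatedly add the value that most increases the objective;
    stop when no remaining value improves it."""
    candidates = set().union(*observations.values())
    truths = set()
    score = objective(truths, observations, weights)
    while candidates:
        gains = {v: objective(truths | {v}, observations, weights)
                 for v in sorted(candidates)}
        best_v = max(sorted(gains), key=gains.get)
        if gains[best_v] <= score:
            break
        truths.add(best_v)
        candidates.remove(best_v)
        score = gains[best_v]
    return truths


# Hypothetical example: three sources list the authors of one book.
observations = {"s1": {"A", "B"}, "s2": {"A", "B", "C"}, "s3": {"A"}}
weights = {"s1": 0.9, "s2": 0.6, "s3": 0.3}  # assumed source quality
print(enumerate_truths(observations, weights))
print(greedy_truths(observations, weights))
```

On this toy input both searches select the values A and B. The enumeration examines every subset of the claimed values, so its cost grows exponentially with the number of distinct values, while the greedy variant evaluates at most a quadratic number of candidate sets. That difference is one plausible reading of the efficiency advantage the abstract reports for the greedy algorithm.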
