A Duplicate Record Identification Model for Deep Web Data Sources

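Journal: Acta Electronica Sinica (电子学报)
Date: 0
Pages: 275-281
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110004, China
Funding: National Natural Science Foundation of China (No. 60973012, No. 60693139); National High Technology Research and Development Program of China (863 Program) (No. 2008AA01Z146, No. 2009AA01Z131)
Related project: Research on key technologies for multi-mode querying and data integration in dataspaces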

Keywords: duplicate records, duplicate records identification, deep web, data cleaning, data extraction

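Abstract (translated from the Chinese):

Duplicate records are distinct records that describe the same real-world entity. Records extracted from different Deep Web data sources in the same domain usually contain many duplicates, so this paper studies the identification of semi-structured duplicate records. Given a known global schema and known mappings between the global schema and the query interface of each Deep Web data source, a duplicate record identification model is proposed. Working on semi-structured data extracted from the Deep Web, the model uses query probing to determine which attributes the extracted data match, analyzes the extracted instance data to determine attribute importance, and combines multiple similarity estimators and multiple algorithms to compute the similarity between records and thereby identify duplicates. Experiments show that the model is feasible and effective in the Deep Web setting.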

English abstract:

Duplicate records are different records that describe the same entity in the real world. Because records extracted from different Deep Web sources in the same domain usually contain duplicates, this paper focuses on duplicate record identification. A duplicate record identification model is proposed on the basis of a known global schema and the known relationship between the global schema and the interface attributes of each Deep Web data source. Based on the semi-structured data extracted from Deep Web data sources, the attributes that these data match are annotated using a query probing method, and the importance of the global schema attributes is determined by analyzing the extracted instance data. Moreover, multiple estimators and multiple similarity algorithms are adopted to identify the duplicates. The experimental results show that the duplicate record identification model is feasible and efficient.
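
The abstract describes scoring record pairs by combining attribute importance with several similarity estimators. The following is a minimal sketch of that general idea only, not the paper's model: the two estimators, the attribute weights, the renormalisation over shared attributes, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercased, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty values are treated as identical
    return len(ta & tb) / len(ta | tb)


def record_similarity(r1, r2, weights, estimators):
    """Weighted average of per-attribute similarities.

    Attributes missing from either record are skipped, and the
    remaining weights are renormalised (an assumption of this sketch).
    """
    total = 0.0
    weight_sum = 0.0
    for attr, w in weights.items():
        if attr not in r1 or attr not in r2:
            continue
        total += w * estimators[attr](r1[attr], r2[attr])
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def is_duplicate(r1, r2, weights, estimators, threshold=0.8):
    """Flag a pair as duplicate when its score reaches the threshold."""
    return record_similarity(r1, r2, weights, estimators) >= threshold


# Hypothetical attribute weights and estimator assignment.
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}
ESTIMATORS = {
    "title": edit_similarity,
    "author": token_jaccard,
    "year": lambda a, b: 1.0 if a == b else 0.0,
}

if __name__ == "__main__":
    a = {"title": "Deep Web Data Integration", "author": "Li Wei", "year": "2009"}
    b = {"title": "Deep web data integration.", "author": "Wei Li", "year": "2009"}
    print(record_similarity(a, b, WEIGHTS, ESTIMATORS))
```

In this sketch a higher weight makes an attribute count more toward the final score, which is one simple way to use attribute importance; the paper may derive and apply importance differently.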
