A Pattern-Based Entity Resolution Algorithm

ISSN: 0254-4164
Journal: Chinese Journal of Computers (《计算机学报》)
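Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] Institute for Data Science and Engineering, Software Engineering Institute, East China Normal University, Shanghai 200062
Funding: National Basic Research Program of China (973 Program) (2012CB316203); National Natural Science Foundation of China (61370101, 61321064); Key Scientific Research Innovation Project of the Shanghai Municipal Education Commission (14ZZ045)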

Keywords: data integration, data cleaning, entity resolution, edit distance, string similarity

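Abstract (translated from the Chinese):

Entity resolution is a key step in data integration and data cleaning. It aims to find, within large datasets, the records that describe the same entity. There are currently two basic approaches. The first is exhaustive entity resolution, which compares all records in the dataset pairwise and then merges similar records to obtain the set of records describing each entity. However, its computational complexity is high (O(n^2), where n is the size of the dataset), which makes it hard to apply to large datasets. The second is blocking-based entity resolution, which applies a specific blocking function (such as a hash function or a sliding-window technique) to place similar records in the same block and then compares pairwise only the records within each block. This significantly reduces running time but loses some accuracy, because records describing the same entity may not be assigned to the same block. This paper proposes a pattern-based entity resolution algorithm. It merges similar records into record sets and tries to generate a record pattern for each set, then compares the patterns pairwise to produce a bound that determines whether the corresponding record sets need a further exact comparison to decide whether they belong to the same entity. Compared with the first approach, the method effectively filters out records that cannot be similar, which avoids comparing all records pairwise and significantly reduces the time complexity. Compared with the second approach, it loses no accuracy. Experiments on real and synthetic datasets confirm the efficiency and effectiveness of the new method.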
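The trade-off between the two existing approaches can be illustrated with a small sketch. It is not taken from the paper: the sample records, the distance threshold, and the first-letter blocking key are all assumptions chosen for illustration, and edit distance is used as the similarity measure only because it appears in the keywords.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def exhaustive_matches(records, max_dist):
    """Exhaustive approach: compare every pair, O(n^2) comparisons."""
    return {(a, b) for a, b in combinations(records, 2)
            if edit_distance(a, b) <= max_dist}


def blocked_matches(records, max_dist, block_key):
    """Blocking approach: compare only records that share a block key."""
    blocks = {}
    for record in records:
        blocks.setdefault(block_key(record), []).append(record)
    found = set()
    for members in blocks.values():
        found |= exhaustive_matches(members, max_dist)
    return found


# Illustrative data (assumed, not from the paper).
records = ["jon smith", "john smith", "kon smith", "mary jones"]
full = exhaustive_matches(records, max_dist=1)
# Hypothetical blocking function: the first character of the record.
blocked = blocked_matches(records, max_dist=1, block_key=lambda r: r[0])
missed = full - blocked
```

Here `missed` contains the pair ("jon smith", "kon smith"): the two strings differ by a single substitution, but the typo falls on the blocking key, so they land in different blocks and are never compared.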

English abstract:

As a critical step in data integration and data cleaning, entity resolution (ER) aims to identify groups of records that refer to the same real-world entity. There are currently two typical approaches. One is exhaustive entity resolution, which compares all record pairs to determine the entity each record belongs to. However, its complexity (O(n^2), where n is the size of the dataset) is too high for large datasets. The other is blocking-based entity resolution, which maps similar records to the same block using a specific method (e.g., a hash function or a sliding window), so that only records in the same block need to be compared. This method improves efficiency but sacrifices effectiveness, since some records that refer to the same entity may not fall in the same block. In this paper we propose pattern-based entity resolution, which represents similar records by a record pattern and generates a bound by comparing record patterns. With this bound, we can decide whether the records corresponding to two patterns need to be compared precisely to verify whether they refer to the same entity. In this way, we can both dramatically accelerate entity resolution by filtering out dissimilar records and ensure its correctness. Experiments on real and synthetic datasets show the efficiency and effectiveness of our method.
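The abstract does not say how record patterns or the bound are constructed, so the following is only a sketch of the general filter-then-verify idea under a stand-in assumption. Each group of records is summarized by the range of its string lengths (a hypothetical "pattern", not the paper's), and the gap between two length ranges gives a lower bound on the edit distance of any cross-group pair. When the bound exceeds the threshold, the whole group pair is skipped without losing any match.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def length_pattern(group):
    """Stand-in 'pattern' (an assumption): (min, max) string length."""
    lengths = [len(record) for record in group]
    return min(lengths), max(lengths)


def lower_bound(pattern_p, pattern_q):
    """Edit distance is at least the length difference, so the gap between
    the two length ranges bounds every cross-group pair from below."""
    (lo_p, hi_p), (lo_q, hi_q) = pattern_p, pattern_q
    return max(0, lo_p - hi_q, lo_q - hi_p)


def cross_matches(group_a, group_b, max_dist):
    """Return (matching pairs, number of exact comparisons performed)."""
    bound = lower_bound(length_pattern(group_a), length_pattern(group_b))
    if bound > max_dist:
        return set(), 0  # filtered: no cross pair can be within max_dist
    matches, compared = set(), 0
    for a, b in product(group_a, group_b):
        compared += 1
        if edit_distance(a, b) <= max_dist:
            matches.add((a, b))
    return matches, compared


# Illustrative groups (assumed, not from the paper).
short_names = ["ibm", "i.b.m"]
long_names = ["international business machines", "intl business machines"]
near_names = ["ibn", "i b m"]

skipped = cross_matches(short_names, long_names, max_dist=1)
verified = cross_matches(short_names, near_names, max_dist=1)
```

In this sketch, `skipped` involves no exact comparisons because the length ranges are too far apart, while `verified` falls back to pairwise comparison. Because the bound never overestimates the true distance, the filter discards only pairs that could not have matched, which mirrors the abstract's claim of speeding up resolution without sacrificing correctness.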

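Projects associated with this journal paper

Fundamental Research on Networked Cyber-Physical Computing

Journal papers: 10

Research on Key Techniques for Integrity Constraints in Data Quality Management

Journal papers: 10

Journal papers from the same projects

ProMiner: A System-Property-Driven Bidirectional Consistency Checking Framework

Benchmarks for Data Management Systems: From Traditional Databases to Emerging Big Data

Complex Network Analysis of Java Application Systems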

Reciprocal Transformations of Two Camassa–Holm Type Equations

Dual Hierarchies of a Multi-Component Camassa–Holm System

A UTP semantic model for Orc language with execution status and fault handling

Symmetry Analysis and Conservation Laws to the (2+1)-Dimensional Coupled Nonlinear Extension of the Reaction-Diffusion Equation

Rogue-wave pair and dark-bright-rogue wave solutions of the coupled Hirota equations

Resultant Elimination via Implicit Equation Interpolation

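Benchmarks for Data Management Systems: From Traditional Databases to Emerging Big Data

An Urban Population Mobility Analysis System Based on Mobile Phone Big Data

How to Objectively Evaluate the Performance of In-Memory Databases

Population Mobility Analysis Based on Mobile Phone Trajectory Data

Anomaly Detection in Trajectory Big Data: Research Progress and System Framework

A Data Repair Method Based on Functional Dependencies and Conditional Constraints

Discovering Important Locations from Massive Low-Quality Mobile Phone Trajectory Data

Approximate ER-Topk Query Processing over Uncertain Data Streams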

MapReduce-based entity matching with multiple blocking functions

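Journal information

Chinese Journal of Computers (《计算机学报》)
Peking University Core Journal (2011 edition)

Supervising organization: Chinese Academy of Sciences
Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
Editor-in-chief: Sun Ninghui (孙凝晖)
Address: No. 6 Kexueyuan South Road, Zhongguancun, Beijing
Postal code: 100190
Email: cjc@ict.ac.cn
Phone: 010-62620695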

International standard serial number: ISSN 0254-4164
Domestic unified serial number: CN 11-1826/TP
Postal distribution code: 2-833

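Awards:
"Double Effect" journal of the China Journal Matrix (中国期刊方阵)

Indexed in the following domestic and international databases:
Mathematical Reviews, online edition (USA); Scopus abstract and citation database (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); Chinese Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)

Citation count: 48433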