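Entity resolution is a key step in data integration and data cleaning; it aims to identify, within large datasets, the records that describe the same entity. There are currently two basic approaches. The first is exhaustive entity resolution, which compares every pair of records in the dataset and then merges similar records to obtain the set of records describing each entity. However, its computational complexity is high (O(n^2), where n is the size of the dataset), which makes it hard to apply to large datasets. The second is blocking-based entity resolution, which calls a specific blocking function (e.g., a hash function or a sliding-window technique) to place similar records in the same block and then compares pairwise only the records within the same block. This approach significantly reduces running time but loses some accuracy, because some records describing the same entity may not be assigned to the same block. This paper proposes a pattern-based entity resolution algorithm that merges similar records into record sets and tries to generate a corresponding record pattern for each set. It then compares the patterns pairwise to produce a bound, which determines whether the corresponding record sets need a further exact comparison to decide whether they belong to the same entity. Compared with the first approach, this method effectively filters out records that cannot be similar, which avoids comparing all records pairwise and significantly reduces the time complexity; compared with the second approach, it loses no accuracy. Experimental results on real and synthetic datasets verify the efficiency and effectiveness of the new method.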
As a critical step in data integration and data cleaning, entity resolution (ER) aims to identify groups of records that refer to the same real-world entity. Two typical approaches currently exist. The first is exhaustive entity resolution, which compares all record pairs to determine the entity each record belongs to. However, its complexity (O(n^2), where n is the size of the dataset) is too high for large datasets. The second is blocking-based entity resolution, which maps similar records to the same block using a specific method (e.g., a hash function or a sliding window), so that only records in the same block need to be compared. This approach improves efficiency but sacrifices effectiveness, since some records that refer to the same entity may not fall in the same block. In this paper, we propose a pattern-based entity resolution method that represents similar records by a record pattern and generates a bound by comparing record patterns. With this bound, we can decide whether the records corresponding to two patterns need to be compared precisely to verify whether they refer to the same entity. In this way, we can both dramatically accelerate entity resolution by filtering out dissimilar records and ensure its correctness. Experiments on real and synthetic datasets demonstrate the efficiency and effectiveness of our method.
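The trade-off described above can be illustrated with a small Python sketch. It is not the pattern-based algorithm of this paper, whose pattern and bound definitions are not given in the abstract. The token-set Jaccard similarity, the first-character blocking key, the size-ratio upper bound, and all function names are assumptions chosen only for illustration. The third function stands in for the general idea of a safe filter: a pair is skipped only when an upper bound on its similarity already falls below the threshold, so no true match is lost.

```python
from itertools import combinations

# Hypothetical toy records; records 0 and 1 describe the same entity.
RECORDS = [
    "john smith new york",
    "smith john new york",
    "jane doe boston",
    "jane doe boston ma",
    "acme",
]


def tokens(record):
    """Represent a record as a set of lowercase word tokens (an assumption)."""
    return frozenset(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def exhaustive_pairs(records, threshold):
    """Exhaustive ER: compute the similarity of every pair, O(n^2) comparisons."""
    toks = [tokens(r) for r in records]
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(toks[i], toks[j]) >= threshold
    }


def blocking_pairs(records, threshold):
    """Blocking-based ER: compare only records sharing a block key.

    The block key here is the first character of the record, a deliberately
    naive choice that shows how true matches can land in different blocks.
    """
    toks = [tokens(r) for r in records]
    blocks = {}
    for i, record in enumerate(records):
        blocks.setdefault(record[:1].lower(), []).append(i)
    matches = set()
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if jaccard(toks[i], toks[j]) >= threshold:
                matches.add((i, j))
    return matches


def bound_filtered_pairs(records, threshold):
    """Skip a pair only when an upper bound on its similarity is too low.

    For token sets A and B, Jaccard(A, B) <= min(|A|, |B|) / max(|A|, |B|),
    so pruning on this bound never discards a true match.
    Returns the matches and the number of exact comparisons performed.
    """
    toks = [tokens(r) for r in records]
    matches, compared = set(), 0
    for i, j in combinations(range(len(records)), 2):
        small, large = sorted((len(toks[i]), len(toks[j])))
        upper_bound = small / large if large else 1.0
        if upper_bound < threshold:
            continue  # cannot reach the threshold, no exact comparison needed
        compared += 1
        if jaccard(toks[i], toks[j]) >= threshold:
            matches.add((i, j))
    return matches, compared
```

With a threshold of 0.6 on `RECORDS`, the exhaustive version finds both matching pairs, the blocking version misses the pair whose records start with different characters, and the bound-filtered version returns the same matches as the exhaustive one while performing fewer exact comparisons.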