Entity matching, also known as record matching, is a key technique in data integration and data cleaning. Typical applications include matching products across different websites and matching bibliographic records between the DBLP (Digital Bibliography & Library Project) and Scholar digital libraries. Data quality defects that are widespread in real data, such as erroneous values, missing values, and diverse representations of the same entity, make entity matching challenging. Popular entity matching algorithms fall into three categories: rule-based, probabilistic, and learning-based. In e-commerce data, descriptions of the same product may differ greatly. For entity matching problems with this much representational diversity, concise and effective matching rules usually do not exist, and training accurate classification models is also difficult. To address this issue, this paper proposes an automatic entity matching method based on outlier detection, denoted ODetec. ODetec first computes the similarity of each record pair on the matching attributes and maps the pairs to points in a feature space. It then estimates the outlier distance of each pair in that space. Finally, it ranks the pairs by outlier distance and extracts those that satisfy the matching constraints. In addition, ODetec uses principal component analysis (PCA) to transform multiple correlated matching features into mutually orthogonal principal components. This removes the conditional independence assumption between attributes required by the Fellegi-Sunter model, giving ODetec better matching quality and broader applicability. Extensive experiments on real datasets verify the effectiveness of ODetec.
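The three steps described above (pair similarities, outlier distance, constrained extraction) can be illustrated with a minimal Python sketch. It is not the ODetec implementation: the abstract does not specify the similarity measure, the outlier-distance definition, or the matching constraints, so everything below is an assumption made for illustration. The sketch uses `difflib.SequenceMatcher` as the similarity, the standardized distance from the centroid of all pairs as the outlier score, and a greedy one-to-one constraint. It also standardizes each feature independently instead of applying PCA, so it does not handle correlated features. All function names and sample records are hypothetical.

```python
from difflib import SequenceMatcher
from math import sqrt
from statistics import mean, pstdev


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(left, right, attrs):
    """Map each record pair (i, j) to a point of per-attribute similarities."""
    return {
        (i, j): [similarity(l[a], r[a]) for a in attrs]
        for i, l in enumerate(left)
        for j, r in enumerate(right)
    }


def outlier_distances(points):
    """Score each pair by its standardized distance from the centroid.

    Assumption: most pairs are non-matches, so matching pairs lie far
    from the centroid on the high-similarity side.
    Returns {pair: (distance, above_centroid)}.
    """
    columns = list(zip(*points.values()))
    mu = [mean(c) for c in columns]
    sd = [pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    scores = {}
    for pair, vec in points.items():
        dist = sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(vec, mu, sd)))
        scores[pair] = (dist, sum(vec) > sum(mu))
    return scores


def extract_matches(scores):
    """Greedy one-to-one extraction (an assumed matching constraint)."""
    used_left, used_right, matches = set(), set(), []
    ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    for (i, j), (_dist, above) in ranked:
        # Keep only pairs on the high-similarity side whose records are unused.
        if above and i not in used_left and j not in used_right:
            matches.append((i, j))
            used_left.add(i)
            used_right.add(j)
    return matches


# Hypothetical product records from two sources.
left = [
    {"title": "Canon EOS 80D DSLR Camera", "brand": "Canon"},
    {"title": "Dell XPS 13 Laptop", "brand": "Dell"},
    {"title": "Sony WH-1000XM4 Headphones", "brand": "Sony"},
]
right = [
    {"title": "canon eos 80d dslr camera body", "brand": "canon"},
    {"title": "dell xps 13 laptop 9310", "brand": "dell"},
    {"title": "sony wh1000xm4 wireless headphones", "brand": "sony"},
]

points = pair_features(left, right, ["title", "brand"])
matches = extract_matches(outlier_distances(points))
print(sorted(matches))
```

Without a distance threshold, this sketch accepts every unblocked pair that lies above the centroid, so a practical variant would probably also require a minimum outlier distance before accepting a pair.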