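As XML documents are increasingly used for information exchange and integration, their data quality problems have attracted attention. Entity resolution is a key technique for solving the problems that poor data quality causes. When entity resolution is applied to XML data, the most critical operation is the matching of entity data objects. To overcome the shortcomings of existing methods and to detect duplicate objects efficiently on massive XML data, this paper proposes an efficient XML duplicate-object detection method based on the entity-describe-attribute technique. The method refers to all tag attributes and nodes collectively as attributes and uses entities to describe those attributes. By constructing an attribute-node table, it quickly finds all entity objects that are identical on some attribute and then compares them to decide whether they are duplicates. The advantage of the method is that it does not need to compare all entity objects with one another; it only compares the nodes stored at the same position in the attribute-node table, which saves a great deal of time. In addition, the proposed Max-Merge algorithm clusters all similar objects while taking into account both the transitivity and the independence of similar objects, which greatly improves the precision and recall of the algorithm.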
As XML is used more and more widely for data exchange and integration, XML data quality issues have become a cause for concern. Entity Resolution (ER) is critical to overcoming the problems caused by poor data quality. When ER is applied to XML data, the crucial operator is Object Matching (OM). To overcome the drawbacks of current methods and to perform entity resolution efficiently and effectively on massive XML data sets, this paper presents an object matching method based on an entity-describe-attribute (EDA) rule. The EDA method uses entities to describe their attributes. By constructing an attribute-node table, it compares only those objects that have one or several attributes in common. The MaxMerge algorithm is then proposed, which clusters the duplicates efficiently and effectively.
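The blocking idea behind the attribute-node table can be illustrated with a minimal Python sketch. It is not the paper's implementation: the sample data, the Jaccard similarity measure, the 0.5 threshold, and all function names are assumptions made for illustration. Because the abstract does not specify how MaxMerge works, the clustering step here is a plain union-find (transitive closure) used as a stand-in.

```python
from collections import defaultdict
from itertools import combinations
import xml.etree.ElementTree as ET


def describe(elem):
    """Flatten one object into a set of (attribute name, value) pairs.
    XML attributes and child elements are both treated as attributes."""
    pairs = {(name, value.strip().lower()) for name, value in elem.attrib.items()}
    for child in elem:
        if child.text and child.text.strip():
            pairs.add((child.tag, child.text.strip().lower()))
    return pairs


def build_attribute_node_table(descriptions):
    """Map each (name, value) pair to the ids of the objects that carry it."""
    table = defaultdict(list)
    for obj_id, pairs in enumerate(descriptions):
        for pair in pairs:
            table[pair].append(obj_id)
    return table


def candidate_pairs(table):
    """Only objects listed under the same table entry are ever compared."""
    seen = set()
    for obj_ids in table.values():
        seen.update(combinations(sorted(obj_ids), 2))
    return seen


def jaccard(a, b):
    """Assumed similarity measure: overlap of the two attribute sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


def cluster_duplicates(descriptions, threshold=0.5):
    """Stand-in for MaxMerge: merge matching pairs with union-find."""
    parent = list(range(len(descriptions)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    table = build_attribute_node_table(descriptions)
    for i, j in candidate_pairs(table):
        if jaccard(descriptions[i], descriptions[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = defaultdict(list)
    for obj_id in range(len(descriptions)):
        groups[find(obj_id)].append(obj_id)
    return sorted(groups.values())


# Hypothetical sample data, not taken from the paper.
SAMPLE = """<db>
  <movie year="1999"><title>The Matrix</title><director>Wachowski</director></movie>
  <movie year="1999"><title>The Matrix</title><director>Wachowskis</director></movie>
  <movie year="1995"><title>Heat</title><director>Michael Mann</director></movie>
  <movie year="1999"><title>Fight Club</title><director>David Fincher</director></movie>
</db>"""

descriptions = [describe(m) for m in ET.fromstring(SAMPLE).iter("movie")]
clusters = cluster_duplicates(descriptions)
print(clusters)  # [[0, 1], [2], [3]]
```

In this example only three of the six possible pairs are compared, because the third object shares no attribute value with any other object and so never appears in a common table entry. A pure transitive closure such as this one can chain loosely related objects into one cluster, which is presumably the kind of effect that a method balancing transitivity against independence is meant to limit.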