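The entity extraction problem over XML is the task of extracting, from XML data, the data entities that describe a physical entity in the real world. Using XML queries to represent entities, and drawing on the entity semantics carried by key rules, this paper presents the key-based entity extraction (KEE) method for solving the entity extraction problem over XML. KEE uses query relaxation to generate a set of candidate queries for extracting entities automatically and then, based on a similarity measure, selects from the candidates the set of queries suitable for extracting entities. As a concrete implementation of KEE, the SharingEE algorithm uses normalized query relaxation to reduce redundancy among the candidate queries, and uses automaton-based query processing to share intermediate results across multiple candidate queries, thereby reducing computation cost. Experiments on real and synthetic data verify the efficiency and effectiveness of the algorithm. The results show that KEE solves the entity extraction problem well and scales to large data.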
This paper proposes KEE, a method for solving the entity extraction problem over XML data, which is an important step in identifying entities in XML data. Guided by XML keys and using relaxation and verification techniques, KEE provides a rule-based solution to the entity extraction problem with the following characteristics. First, by using an XML query language, KEE provides a condensed representation of an entity, whose size may grow very large as the data scales up. Second, KEE requires only one example location to indicate the user's interest and uses relaxation to discover other similar locations automatically. Third, by adjusting the example given to KEE, users can specify the entity locations they are interested in and control the locations that KEE discovers. In addition, an efficient implementation of KEE is provided, which shares computation across queries by extending earlier automaton techniques for XML query evaluation. Experimental results on both synthetic and real data show that KEE provides an effective and efficient solution to the entity extraction problem.
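The relax-then-select idea described above can be illustrated with a minimal Python sketch. It is not the KEE or SharingEE implementation: the two relaxation rules (wildcard substitution and step deletion), the Jaccard similarity over child-tag sets, the 0.8 threshold, the function names, and the sample document are all assumptions made for illustration. The sketch also evaluates each candidate independently, without the automaton-based sharing of intermediate results.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the paper's experiments.
DOC = """<db>
  <book><title>T1</title><author><name>A</name><email>a@x</email></author></book>
  <article><author><name>B</name><email>b@x</email></author></article>
</db>"""


def relax(steps):
    """Return relaxed path queries for a location given as a list of tag names.

    Assumed rules (illustrative only):
      1. replace one step with the wildcard '*';
      2. delete one non-final step, turning a child axis into a descendant axis.
    """
    candidates = []
    for i in range(len(steps)):
        candidates.append("./" + "/".join(steps[:i] + ["*"] + steps[i + 1:]))
    for i in range(len(steps) - 1):
        prefix = "/".join(steps[:i])
        suffix = "/".join(steps[i + 1:])
        candidates.append(("./" + prefix if prefix else ".") + "//" + suffix)
    return candidates


def signature(node):
    """Describe a node by the set of its child tag names (assumed similarity basis)."""
    return {child.tag for child in node}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select(root, steps, threshold=0.8):
    """Keep relaxed queries whose results resemble the example location on average."""
    example = root.find("./" + "/".join(steps))
    if example is None:
        raise ValueError("example location does not occur in the document")
    target = signature(example)
    chosen = []
    for query in relax(steps):
        nodes = root.findall(query)
        if not nodes:
            continue
        score = sum(jaccard(target, signature(n)) for n in nodes) / len(nodes)
        if score >= threshold:
            chosen.append(query)
    return chosen


root = ET.fromstring(DOC)
# The single example location: the author element under book.
selected = select(root, ["book", "author"])
print(selected)
```

On this sample, the candidate `./book/*` is rejected because it also matches the `title` element, whose child tags differ from those of the example. The two retained candidates return the same nodes, which is a small instance of redundancy among candidate queries.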