Schema extraction is important for querying and optimizing semi-structured data and for integrating heterogeneous data. Combining the concepts of a label path and the target set of a label path, this paper proposes a new method for extracting a minimized schema from semi-structured data based on the OEM (Object Exchange Model) model, and gives two theorems concerning the computation of label-path target sets and support. The basic idea of the algorithm is as follows: relying on the two theorems, it traverses the data breadth-first from the top down, computing in turn the target set and support of the last label of each label path; target sets whose labels have higher support are mapped to their corresponding schema nodes first. For the same semi-structured data instance, the schema extracted by this algorithm is smaller than those produced by other algorithms, and the execution time is shorter. The algorithm applies to schema extraction from both hierarchical OEM data and OEM data containing cycles.
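The breadth-first, top-down computation of target sets can be illustrated with a minimal sketch. It is not the paper's algorithm: the two theorems and the exact definition of support are not stated in this abstract, so the sketch assumes that the support of a label path is simply the size of its target set. It also maps each distinct target set to one schema node, which yields a deterministic summary rather than a guaranteed minimal schema. The names `extract_schema`, `node_of`, and the sample graph are hypothetical.

```python
from collections import deque


def extract_schema(graph, root):
    """Breadth-first sketch of target-set based schema extraction.

    graph maps an object id to a list of (label, child id) edges (an OEM-like
    graph, possibly with cycles). Returns the target-set-to-schema-node map,
    the schema edges, and the support of each explored label path.
    ASSUMPTION: support = number of objects in the target set.
    """
    root_set = frozenset([root])
    node_of = {root_set: 0}   # target set -> schema node id
    edges = set()             # (parent node, label, child node)
    support = {}              # label path (tuple of labels) -> support
    queue = deque([((), root_set)])

    while queue:
        path, targets = queue.popleft()

        # Group the children of every object in the target set by edge label.
        by_label = {}
        for obj in targets:
            for label, child in graph.get(obj, []):
                by_label.setdefault(label, set()).add(child)

        # Handle higher-support target sets first (ties broken by label).
        ordered = sorted(by_label.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        for label, children in ordered:
            child_set = frozenset(children)
            new_path = path + (label,)
            support[new_path] = len(child_set)
            if child_set not in node_of:
                # A new target set becomes a new schema node and is expanded later.
                node_of[child_set] = len(node_of)
                queue.append((new_path, child_set))
            # A target set seen before is reused, so cycles terminate.
            edges.add((node_of[targets], label, node_of[child_set]))

    return node_of, edges, support


# Hypothetical sample data with a cycle: c1 -staff-> p1 -employer-> c1.
sample = {
    "root": [("person", "p1"), ("person", "p2"), ("company", "c1")],
    "p1": [("name", "n1"), ("employer", "c1")],
    "p2": [("name", "n2")],
    "c1": [("name", "n3"), ("staff", "p1")],
}
node_of, edges, support = extract_schema(sample, "root")
print(support[("person",)], len(node_of), len(edges))
```

Because a target set that has already been mapped is never expanded again, the traversal terminates on data containing cycles; the number of schema nodes is bounded by the number of distinct target sets.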