随着XML数据流的广泛应用,从挖掘XML数据流中发现知识具有重要的理论与应用价值.相比其他频繁模式挖掘,大型XML文档与数据流的频繁子树挖掘面I临困难:XML数据流不可能整体在内存解析;对XML数据流分段挖掘必须考虑XML数据的半结构化特征等.针对上述问题,提出数据流分页频繁子树挖掘模型Tmlist.Tmlist对XML数据流进行分页,管理跨页节点及频繁候选子树的跨页增长,逐页挖掘频繁子树;频繁候选子树的增长根据根节点层次由浅至深地在最右路径加入频繁候选节点,避免以低层次为根子树的重复性递归增长;对频繁候选子树采用子树拓扑序列和最右路径共同标识,子树的增长不需要对子树前缀进行匹配,省去前缀节点存储与匹配开销;以页面最小支持度对频繁候选子树按页筛选,子树按页面衰减度衰减支持度、剪枝.Tmlist在可控误差范围内降低频繁子树挖掘的空间消耗,提高内存利用率和挖掘效率.
With the widespread use of XML data stream, discovering knowledge from it becomes important. Compared with other frequent pattern mining, mining frequent subtree over large-scale XML documents and unlimited growing XML data stream is facing difficulties, data steam can not be resolved in memory as a whole, and mining partitioned XML data stream must be considered semi- structured characteristics of XML data, etc. Inspired by this fact, Tmlist is proposed for mining frequent subtrees over paging XML data stream. Tmlist pages XML data stream, manages cross-page nodes and frequent candidate subtrees growing across page, and mines frequent subtrees page-by- page. Frequent candidate subtrees grow by inserting frequent candidate nodes in their rightmost path according to the level of their roots, avoiding the repeated recursive growth of the subtrees rooted by the low-level nodes. A subtree is represented by the topologic sequence of its rightmost path, which avoids the prefix match for the increment of subtrees, so the storing and matching cost for the prefix nodes is cut. Frequent candidate subtrees are selected according to the page minimum support, the support of frequent subtrees is decayed and branches are pruned according to the decaying factor. Accordingly, Tmlist reduces the memory cost of mining frequent subtrees in the limit of error and improves memory utilization and mining efficiency.