Mining global sequential patterns in a distributed environment often generates too many candidate sequences, which increases network communication cost. To address this problem, a fast algorithm for mining global sequential patterns in distributed systems, Fast Mining of Global Sequential Pattern (FMGSP), is proposed. The algorithm compresses the local frequent sequential patterns obtained at each site into a lexicographic sequence tree, which avoids transmitting repeated sequence prefixes. Because the node sequences in the merged tree are regular and simple, a pruning strategy named Item Extension and Sequence Extension (I/S-E) pruning is presented. It effectively reduces the number of candidate sequences and the volume of network traffic, so that global sequential patterns are generated quickly. Theoretical analysis and experiments show that FMGSP achieves superior performance and is especially effective at mining global sequential patterns from large data sets.
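To make the prefix-compression idea concrete, the following is a minimal Python sketch of storing local sequential patterns in a lexicographic sequence tree. It is not the FMGSP implementation: the names `to_steps`, `Node`, `build_tree`, and `count_nodes`, the encoding of a pattern as sequence-extension ("S") and item-extension ("I") steps, and the sample patterns with their support counts are all assumptions made for illustration. The sketch covers only the tree construction and does not attempt the I/S-E pruning, whose details the abstract does not give.

```python
from typing import Dict, List, Optional, Tuple

# A pattern is a sequence of non-empty itemsets, e.g. (("a", "b"), ("c",)) for <(a b)(c)>.
Pattern = Tuple[Tuple[str, ...], ...]
# A step is ("S", item) for a sequence extension (item starts a new itemset)
# or ("I", item) for an item extension (item joins the last itemset).
Step = Tuple[str, str]


def to_steps(pattern: Pattern) -> List[Step]:
    """Encode a pattern as the path of S/I steps that reaches it from the root."""
    steps: List[Step] = []
    for itemset in pattern:
        items = sorted(itemset)  # fixed item order gives one canonical path
        steps.append(("S", items[0]))
        steps.extend(("I", item) for item in items[1:])
    return steps


class Node:
    """Tree node; `support` is set only where a stored pattern ends."""

    def __init__(self) -> None:
        self.children: Dict[Step, "Node"] = {}
        self.support: Optional[int] = None


def build_tree(patterns: Dict[Pattern, int]) -> Node:
    """Insert every pattern so that shared prefixes are stored once."""
    root = Node()
    for pattern, support in patterns.items():
        node = root
        for step in to_steps(pattern):
            node = node.children.setdefault(step, Node())
        node.support = support
    return root


def count_nodes(node: Node) -> int:
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical local patterns of one site, with made-up support counts.
local_patterns: Dict[Pattern, int] = {
    (("a",),): 9,
    (("a",), ("b",)): 6,
    (("a",), ("b",), ("c",)): 4,
    (("a", "b"),): 5,
    (("a", "b"), ("c",)): 3,
}

tree = build_tree(local_patterns)
flat_steps = sum(len(to_steps(p)) for p in local_patterns)
print(flat_steps, count_nodes(tree))  # prints "11 5"
```

In this toy case, sending the five patterns one by one would transmit 11 steps, while the tree holds the same information in 5 nodes because the prefix <(a)> and its extensions are stored only once. That sharing is the saving the abstract attributes to the lexicographic sequence tree.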