Because most graph mining algorithms rely on frequent subgraphs, frequent subgraph mining has become an active research topic in data mining. Many efficient frequent subgraph mining algorithms have been proposed, and gSpan is currently recognized as the best of them. On compound datasets, however, the special structure of chemical compounds can be exploited to optimize gSpan further. Prior work has used the symmetry of molecular structures and the uneven distribution of atom types to propose new optimization strategies that improve gSpan's performance. Given the importance of gSpan in graph mining and in data mining as a whole, this paper designs and implements the gSpan algorithm. It also adopts the optimization strategies from reference [4] to further improve gSpan's running efficiency on compound datasets.