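Based on granular space theory, this paper studies a minimum spanning tree classification algorithm based on normalized distance. First, drawing on the properties of intra-class and inter-class deviation and building on the existing granular space generation algorithm, the minimum spanning tree and a new optimal clustering index are introduced; a minimum spanning tree classification algorithm based on normalized distance is given, and an optimal clustering model is established. Second, the model is applied to the HA and NA protein sequences of 898 avian influenza viruses, downloaded from NCBI, that date from 1902 to 2015 and are now confirmed to be able to infect humans. The viruses fall into 8 subtypes: H5N1, H5N2, H7N2, H7N3, H7N7, H9N2, H10N7, and the recent H7N9. Using the nearest-to-center principle, running the minimum spanning tree classification algorithm selects 6 representative virus sequences and yields the optimal hierarchical structure. Finally, analysis of the experimental results shows that factors such as regional differences between outbreaks and the time of outbreak have had an important influence on the variation of avian influenza viruses. These results agree with existing research, indicating that the proposed minimum spanning tree classification algorithm is effective. For the problem of finding the optimal clustering on a granular space, the minimum spanning tree classification algorithm has lower complexity than the original algorithm. These conclusions provide a new method for information processing based on big data.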
Based on granular space theory, a minimum spanning tree classification algorithm based on a normalized metric is proposed. First, building on the existing representation and generation algorithm of granular space, the minimum spanning tree and a new optimal clustering index derived from the intra-class and inter-class deviations are introduced, and an optimal clustering model is established. Second, 898 avian influenza viruses containing both HA and NA proteins, drawn from 8 subtypes (H5N1, H5N2, H7N2, H7N3, H7N7, H9N2, H10N7 and H7N9), are used as the experimental database. These viruses occurred around the world between 1902 and 2015 and can infect humans. Based on the characteristics of the data set, a first run of the algorithm divides the 898 viruses into two classes, which contain 842 and 56 closely related viral sequences, respectively. Given the complexity of the evolutionary tree structure, a representative virus is selected for each class of the optimal clustering so that the new method can be studied and discussed more effectively. To study the nature of avian influenza viruses further, the algorithm is then run again on each of the two classes separately. Following the nearest-to-center principle, 6 representative viruses are selected and a phylogenetic tree is constructed. Finally, comparing the results with those in the literature shows that the variation of these avian influenza viruses is closely related to the region and time of the outbreak. These results are consistent with previous studies, indicating that the algorithm is effective. The minimum spanning tree classification algorithm also has lower complexity than the original algorithm in finding the optimal clustering. These conclusions provide a new approach to information processing based on big data.
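The pipeline summarized in the abstract (normalize the distances, build a minimum spanning tree, split it into classes, pick the member nearest the center of each class) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the number of classes `k` is passed in by hand rather than chosen by the paper's optimal clustering index, normalization is assumed to mean division by the largest distance, and "nearest the center" is approximated by the medoid (the member with the smallest total distance to the rest of its class).

```python
def normalize(dist):
    """Scale a symmetric distance matrix into [0, 1] (assumed: divide by the maximum)."""
    peak = max(max(row) for row in dist)
    if peak == 0:
        return [row[:] for row in dist]
    return [[d / peak for d in row] for row in dist]


def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense matrix; returns a list of (weight, i, j) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((dist[u][parent[u]], parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def cut_into_classes(n, edges, k):
    """Drop the k-1 heaviest tree edges and return the connected components."""
    kept = sorted(edges)[: max(len(edges) - (k - 1), 0)]
    label = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for _, i, j in kept:
        label[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def representative(members, dist):
    """Medoid: the member with the smallest total distance to its class."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


# Toy data (not from the paper): five points on a line at 0, 1, 2, 10, 11.
points = [0, 1, 2, 10, 11]
raw = [[abs(a - b) for b in points] for a in points]
norm = normalize(raw)
classes = cut_into_classes(len(points), minimum_spanning_tree(norm), k=2)
reps = [representative(c, norm) for c in classes]
print(classes, reps)
```

On the toy data the single long tree edge between positions 2 and 10 is removed, so the two groups of nearby points become separate classes. In the setting of the abstract, the input matrix would instead hold pairwise distances between protein sequences, and the choice of `k` would come from the clustering index rather than being fixed in advance.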