To reveal the fundamental relationship between the molecular variation of H1N1 viruses and influenza epidemics, this paper proposes a new method for constructing the evolutionary tree of H1N1 influenza viruses. First, based on 22,455 HA protein sequences and 16,444 NA protein sequences of H1N1 influenza viruses collected worldwide from 1902 to 2013, a similarity index between protein sequences is constructed from the inner product of their feature vectors. Then, coarse-grained similarity information is extracted from the data system by applying similarity-based complete-graph clustering. Finally, the evolutionary trees of the HA and NA protein sequences of H1N1 influenza viruses are constructed with a structure clustering method based on fuzzy proximity relations, and the corresponding algorithm is studied. The experimental results show that the variation of H1N1 viruses is closely related not only to the time of outbreak but also to the regions in which the viruses are distributed and the geographical distances between those regions: the closer two regions are, the more similar the evolution of the viruses that broke out in them. The new method, which is based on large-scale data processing, can therefore effectively reveal the evolutionary relationships of influenza viruses, and it lays a foundation for further study of their variation, evolution, and prediction.
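The pipeline outlined in the abstract (feature vectors, inner-product similarity, clustering on a fuzzy proximity relation) can be illustrated with a minimal Python sketch. It is not the authors' algorithm, and several choices are assumptions made only for illustration: the feature vector is taken to be the amino-acid composition of a sequence, the similarity is the normalized inner product (cosine) of those vectors, and the tree is obtained from the max-min transitive closure of the similarity matrix followed by λ-cuts, which is a standard fuzzy clustering technique standing in for the paper's structure clustering method. The complete-graph coarse-graining step is not reproduced. All function names (`composition_vector`, `similarity_matrix`, `transitive_closure`, `lambda_cut_clusters`) are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Assumed feature vector: relative frequency of each amino acid."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(counts.sum(), 1.0)


def similarity_matrix(seqs):
    """Normalized inner product (cosine) between feature vectors.

    The result is reflexive and symmetric, i.e. a fuzzy proximity relation.
    """
    X = np.array([composition_vector(s) for s in seqs])
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sequences
    U = X / norms
    S = U @ U.T
    np.fill_diagonal(S, 1.0)
    return np.clip(S, 0.0, 1.0)


def transitive_closure(S):
    """Max-min transitive closure, turning the proximity relation
    into a fuzzy equivalence relation."""
    R = S.copy()
    while True:
        # (R o R)[i, j] = max over k of min(R[i, k], R[k, j])
        composed = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        R_next = np.maximum(R, composed)
        if np.allclose(R_next, R):
            return R_next
        R = R_next


def lambda_cut_clusters(R, lam):
    """Cluster labels at threshold lam; R must be a fuzzy equivalence."""
    n = R.shape[0]
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] == -1:
            for j in range(n):
                if R[i, j] >= lam:
                    labels[j] = next_label
            next_label += 1
    return labels


# Toy sequences, not real HA or NA data.
toy = ["AAAA", "AAAC", "WWWW"]
R = transitive_closure(similarity_matrix(toy))
for lam in (0.5, 0.9, 0.99):
    print(lam, lambda_cut_clusters(R, lam))
```

Sweeping λ from 1 down to 0 merges clusters step by step, and the sequence of merges can be read as a tree. The dense three-dimensional broadcast in `transitive_closure` needs memory cubic in the number of sequences, so a data set of the size reported here would require a blocked or sparse variant.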