All amino acid sequences of the 10 constituent proteins of influenza virus dated from 1902 to 2013 were obtained from the National Center for Biotechnology Information (NCBI) database and analyzed with big-data programming in MATLAB. Using the detailed HP model together with the CGR-walk model, every protein sequence was converted into a numerical data series, and the time series model ARFIMA(p, d, q) was then fitted to all of the series. The fitted models were used to analyze how the sequences of the 10 proteins changed over roughly the past 80 years and to forecast their trend over the next 10 years. The analysis indicates that the model predicts the variation tendency of influenza virus well, which offers a reference for big-data analysis of influenza virus protein sequences and for predicting influenza outbreaks.
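The conversion pipeline summarized above can be illustrated with a minimal Python sketch (the study itself used MATLAB, and none of this code comes from it). Several things here are assumptions rather than the authors' method: the four-group residue classification is the one commonly used for the detailed HP model in the literature, the assignment of groups to corners of the unit square is hypothetical, and all function names are illustrative. The step that turns the CGR points into the one-dimensional CGR-walk series is not specified in the abstract and is left out. The last two functions show only the fractional differencing operator $(1-B)^d$ that underlies ARFIMA(p, d, q); estimating p, d and q is not shown.

```python
# Four-group classification assumed for the detailed HP model.
GROUPS = {
    "nonpolar": "AILMFPWV",
    "negative": "DE",
    "uncharged": "NCQGSTY",
    "positive": "RHK",
}

# Hypothetical corner assignment on the unit square (an assumption).
CORNERS = {
    "nonpolar": (0.0, 0.0),
    "negative": (0.0, 1.0),
    "uncharged": (1.0, 1.0),
    "positive": (1.0, 0.0),
}

RESIDUE_TO_CORNER = {
    aa: CORNERS[group] for group, letters in GROUPS.items() for aa in letters
}


def cgr_points(sequence):
    """Chaos game representation: move halfway toward each residue's corner."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for aa in sequence.upper():
        corner = RESIDUE_TO_CORNER.get(aa)
        if corner is None:
            continue  # skip ambiguous or unknown residue codes such as X
        x, y = (x + corner[0]) / 2, (y + corner[1]) / 2
        points.append((x, y))
    return points


def frac_diff_weights(d, n):
    """First n coefficients of the expansion of (1 - B)^d."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply (1 - B)^d to a series, truncating at the start of the data."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]
```

With d = 1 the operator reduces to ordinary first differencing, which gives a quick sanity check; a fractional d between 0 and 0.5 is the case in which ARFIMA describes a stationary long-memory series.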