Data files from meteorological satellites and radars often range from tens of megabytes to more than a thousand megabytes. Classifying such files by extension is merely a convention: it is not based on any property of the data itself and is therefore unreliable to some extent. Statistical analysis can reveal encoding characteristics of typical meteorological data, but analyzing the byte-value spectrum of an entire file is inefficient, so a fast and accurate method for classifying and identifying large files is needed. Building on existing file classification methods, this paper analyzes the statistical characteristics of the byte frequency distribution (BFD) of typical meteorological data and uses them as classification features. Applying self-similarity theory, it adaptively determines the length and starting offset of the file segment to sample, proposes a fingerprint model based on the minimum characteristic file block, and designs a fast, self-similarity-based identification algorithm for large data files. Experiments show that the algorithm greatly reduces the time needed to classify large files while maintaining the precision and recall of data type identification.
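To make the notion of a byte frequency distribution concrete, the following is a minimal Python sketch of computing a BFD over a single block of a file and matching it against reference fingerprints. It is an illustration only, not the authors' algorithm: the function names, the Euclidean distance measure, and the caller-supplied `offset` and `length` are all assumptions, and the self-similarity-based selection of the block length and origin described in the abstract is not reproduced here.

```python
import math
from collections import Counter


def byte_frequency_distribution(data: bytes) -> list[float]:
    """Return the relative frequency of each byte value 0..255 in data."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def read_block(path: str, offset: int, length: int) -> bytes:
    """Read one block instead of the whole file (offset/length are hypothetical inputs)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def bfd_distance(p: list[float], q: list[float]) -> float:
    """Euclidean distance between two BFDs (an assumed similarity measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify_block(block: bytes, fingerprints: dict[str, list[float]]) -> str:
    """Return the name of the reference fingerprint closest to the block's BFD."""
    bfd = byte_frequency_distribution(block)
    return min(fingerprints, key=lambda name: bfd_distance(bfd, fingerprints[name]))
```

In this sketch, a caller would build `fingerprints` from known sample files of each data type and then pass `read_block(path, offset, length)` to `classify_block`, so that only a small part of a large file is read.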