To address the low efficiency of HDFS in handling small spatio-temporal files, this paper starts from the correlation between users' access patterns and the attributes of the accessed data. User access streams are treated as request sequences over data files; each request is parameterized by the spatial and temporal attributes of its data, and feature extraction then yields a new feature sequence. The PrefixSpan sequential pattern mining algorithm is applied to these sequences to find the feature templates of users under different access modes, and the related files are merged accordingly. Experimental results show that this merging strategy effectively reduces NameNode memory usage and response time and improves read efficiency.
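The pipeline summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the parameterization in `to_feature` (a grid cell plus a time-window index), the sample sessions, and the support threshold are all assumptions made for illustration, and the PrefixSpan routine is a simplified variant for sequences of single items rather than itemsets.

```python
from collections import defaultdict


def to_feature(lon, lat, day, cell_deg=1.0, window_days=30):
    """Hypothetical parameterization: map a request to (grid x, grid y, time window)."""
    return (int(lon // cell_deg), int(lat // cell_deg), day // window_days)


def prefixspan(sequences, min_support):
    """Return {pattern: support} for every frequent sequential pattern."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start offset) pairs, i.e. the
        # suffixes that remain after matching the current prefix.
        counts = defaultdict(int)
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] += 1  # count each item once per sequence
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            next_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue  # this sequence does not extend the pattern
                next_projected.append((sid, pos + 1))
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Made-up access sessions: each request is (longitude, latitude, day number).
sessions = [
    [(116.3, 39.9, 3), (116.7, 39.2, 10), (117.4, 39.5, 12)],
    [(116.1, 39.5, 5), (117.9, 39.1, 20)],
    [(116.9, 39.8, 8), (115.2, 40.3, 40), (117.2, 39.7, 25)],
]

# Feature sequences obtained from the raw request sequences.
sequences = [[to_feature(*req) for req in session] for session in sessions]

# Frequent patterns shared by all three sessions.
patterns = prefixspan(sequences, min_support=3)

# Patterns of length two or more suggest files that tend to be read together
# and are therefore candidates for merging.
merge_candidates = [p for p in patterns if len(p) >= 2]
```

In this sketch, files whose features fall in the same frequent pattern would be placed in one merged file, so that a single NameNode entry serves requests that tend to arrive together. How the paper maps feature templates back to concrete files is not stated in the abstract, so that step is left out.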