The Hadoop Distributed File System (HDFS) is designed to store large files; storing massive numbers of small files in it severely consumes NameNode memory and degrades system performance, and small files are also ill-suited to parallel processing and analysis with the MapReduce framework. In addition, the multi-dimensional metadata attached to small files needs to be stored and indexed in a reasonable way to support queries. To address these problems, this paper proposes a small-file management scheme based on a multi-dimensional column index that supports concurrent file upload, download, and delete operations and provides flexible file retrieval over multiple query dimensions. The proposed small-file merging scheme significantly reduces the number of files on HDFS. Experimental comparison shows that, in terms of query efficiency over small-file metadata, the proposed multi-dimensional index outperforms HBase while maintaining file-transfer throughput.
The Hadoop Distributed File System (HDFS) is designed to manage large files; storing massive numbers of small files in HDFS incurs high memory usage on the NameNode and is also inefficient for parallel processing with MapReduce. On the other hand, the metadata of these small files also needs to be stored and indexed efficiently to achieve fast query performance. To address these problems, we present an efficient approach to storing massive numbers of small files in HDFS by merging small files into large DataFiles. Our approach supports concurrent file upload, download, and delete operations, and in particular supports querying on multi-dimensional search conditions. The experimental results show that our approach outperforms HBase in querying massive numbers of small files, while ensuring upload/download throughput.