To simplify the implementation of the file system and support streaming access to very large data sets, HDFS sacrifices random access to files, yet many real-world applications require random access. Based on an in-depth analysis of the HDFS read and write paths, a random data access method for HDFS is proposed. The core idea is to add a local data access interface to the Datanode, so that a user program can directly read the block files stored on the Datanode and write data into the Datanode's block storage directory. The first replica of a file is produced directly by the user program; the remaining replicas are generated by copying the first replica after it has been written. In addition, permission management is added for blocks: the file replicas stored on a Datanode belong to the user, and when the permissions of a file change in the namespace, the permissions of its corresponding blocks change accordingly. Tests show that read performance improves by about 10% and write performance by more than 20%, and under high concurrency write performance can be improved by up to 2.5 times.
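As a rough illustration of the local-access idea described above, the sketch below shows how a user program might read an arbitrary byte range directly from a block file kept on the local Datanode, instead of going through the streaming read path. It is a minimal sketch only: the block directory path, the block file naming, and the block id used here are assumptions for illustration, not the actual HDFS on-disk layout or the paper's interface.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch: random access to a block file stored locally on a Datanode.
// The directory and file name below are hypothetical placeholders.
public class LocalBlockReader {

    // Hypothetical local block storage directory (assumption, not the real layout).
    private static final String BLOCK_DIR = "/data/dfs/dn/current/finalized";

    /** Read `length` bytes starting at `offset` from the block file with the given id. */
    public static byte[] readRange(long blockId, long offset, int length) throws IOException {
        Path blockFile = Paths.get(BLOCK_DIR, "blk_" + blockId); // assumed naming convention
        try (RandomAccessFile raf = new RandomAccessFile(blockFile.toFile(), "r")) {
            raf.seek(offset);              // random access: jump straight to the requested offset
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        // Example: read 4 KB starting at the 1 MB offset of a (hypothetical) block.
        byte[] data = readRange(1073741825L, 1L << 20, 4096);
        System.out.println("Read " + data.length + " bytes");
    }
}
```

A direct local read of this kind skips the Datanode's socket-based transfer protocol, which is consistent with the reported read-side gains; the write path described in the abstract would analogously write the first replica into the local block directory and rely on replication for the remaining copies.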