On the Hadoop Distributed File System (HDFS), the presence of large numbers of small files drives up the energy cost of running MapReduce programs. To address this problem, an energy-consumption model of a Hadoop node cluster is established and analyzed, proving that on the Hadoop platform there exists an optimal file size that minimizes the energy cost of program execution. On this basis, a strategy for determining the optimal file size is proposed that draws on the marginal analysis theory of economics and accounts for both energy cost and access cost. The strategy performs a cost-benefit calculation for merging the small files stored on HDFS, merging them into files of the cost-optimal size so as to obtain the best return. Experiments confirm the existence of an energy-optimal data block size and demonstrate the soundness and effectiveness of combining cost and benefit under marginal analysis theory to determine the data block size.
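To make the marginal-analysis decision rule concrete, the following is a minimal sketch, not the paper's actual model: the cost functions energyCost and accessCost and their parameters are assumed for illustration only. It scans candidate merged-file sizes and stops where the marginal energy saving from growing the file no longer exceeds the marginal increase in access cost, which is where total cost is minimized.

```java
// Illustrative sketch of optimal-file-size selection by marginal analysis.
// The cost functions below are assumed shapes, not the paper's model:
// smaller merged files mean more map tasks and higher startup energy,
// while larger merged files make accessing one original small file costlier.
public final class OptimalFileSize {

    // Hypothetical energy cost of processing data merged into files of
    // `sizeMB`: a per-task overhead term plus a size-proportional term.
    static double energyCost(double sizeMB) {
        return 500.0 / sizeMB + 0.002 * sizeMB; // assumed model
    }

    // Hypothetical access cost: grows with merged file size.
    static double accessCost(double sizeMB) {
        return 0.05 * sizeMB; // assumed linear penalty
    }

    public static void main(String[] args) {
        double step = 1.0;   // candidate sizes in 1 MB increments
        double best = 1.0;
        for (double s = 1.0; s < 1024.0; s += step) {
            // Marginal benefit: energy saved by growing the file by one step.
            double marginalBenefit = energyCost(s) - energyCost(s + step);
            // Marginal cost: extra access cost incurred by the same step.
            double marginalCost = accessCost(s + step) - accessCost(s);
            // Stop growing when marginal benefit no longer exceeds it.
            if (marginalBenefit <= marginalCost) {
                best = s;
                break;
            }
            best = s + step;
        }
        System.out.printf("Cost-optimal merged file size: %.0f MB%n", best);
    }
}
```

With these assumed cost shapes the loop halts near 98 MB, the point where the derivative of total cost (energy plus access) changes sign; any calibrated cost model can be substituted without changing the decision rule.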