Under Hadoop's default data placement strategy, restoring data from a remote DataNode when the local replicas become unavailable incurs substantial data transfer time, and the random selection of DataNodes for storage can degrade load balancing. To address these problems, this paper proposes an improved data placement strategy. The strategy computes a scheduling evaluation value for each DataNode from its network distance and data load, and uses that value to select the most suitable DataNode for the remote replica, thereby achieving both balanced data placement and good data transfer performance. The strategy was implemented on the Hadoop platform. The results show that, compared with the default strategy, it improves the load balancing of data placement and reduces the time needed to place replicas.
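The selection step described above can be illustrated with a minimal sketch. The abstract does not give the actual formula for the scheduling evaluation value, so everything below is an assumption made for illustration: the weighted-sum form, the weight `alpha`, the normalization of distance by the largest candidate distance, the use of disk utilization as "data load", and the names `Node`, `evaluation_value`, and `choose_remote_node`. It is not the authors' implementation and does not use any Hadoop API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a candidate DataNode (not a Hadoop class)."""
    name: str
    distance: int        # assumed network distance from the writer, e.g. hop count
    used_bytes: int      # data currently stored on the node
    capacity_bytes: int  # total storage capacity of the node


def evaluation_value(node: Node, max_distance: int, alpha: float = 0.5) -> float:
    """Return an assumed evaluation value; lower means a better candidate.

    The value is a weighted sum of normalized network distance and storage
    utilization. The form and the weight alpha are illustrative assumptions.
    """
    if node.capacity_bytes <= 0:
        raise ValueError(f"node {node.name} has no capacity")
    norm_distance = node.distance / max_distance if max_distance else 0.0
    load = node.used_bytes / node.capacity_bytes
    return alpha * norm_distance + (1.0 - alpha) * load


def choose_remote_node(candidates: list[Node], alpha: float = 0.5) -> Node:
    """Pick the candidate with the smallest evaluation value."""
    if not candidates:
        raise ValueError("no candidate DataNodes")
    max_distance = max(n.distance for n in candidates)
    return min(candidates, key=lambda n: evaluation_value(n, max_distance, alpha))
```

With `alpha` near 1 the choice is driven by network distance, which favors transfer time; with `alpha` near 0 it is driven by stored data volume, which favors load balancing. A random choice among remote nodes, as in the default strategy, takes neither factor into account.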