Hadoop has become the de facto standard for big data processing in today's cloud computing environments. However, cloud platforms are large-scale, highly complex, and dynamic, which makes failures common, and these failures affect jobs running on Hadoop. Although Hadoop has built-in failure detection and recovery mechanisms, scheduled jobs can still fail because the load on different nodes in the cloud changes over time. To address this problem, we propose a self-responsive, failure-aware detection and scheduling method. Based on the differing load capacities of nodes in a heterogeneous environment, the method classifies servers as fast or slow nodes, dispatches jobs to suitable nodes, and adjusts task scheduling decisions to prevent task failures as far as possible. Finally, the method is compared experimentally with the basic scheduler under the Hadoop framework. The results show that it reduces the job failure rate by up to 19%, shortens job execution time, and also reduces CPU and memory usage.
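The fast/slow node judgment mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the load metric or the decision rule, so the `Node` fields, the averaged load score, and the mean-load threshold below are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_load: float  # fraction of CPU in use, 0.0 to 1.0 (assumed metric)
    mem_load: float  # fraction of memory in use, 0.0 to 1.0 (assumed metric)

    @property
    def load(self) -> float:
        # Hypothetical combined score: plain average of the two fractions.
        return (self.cpu_load + self.mem_load) / 2


def classify(nodes):
    """Split a non-empty node list into (fast, slow) around the mean load."""
    mean_load = sum(n.load for n in nodes) / len(nodes)
    fast = [n for n in nodes if n.load <= mean_load]
    slow = [n for n in nodes if n.load > mean_load]
    return fast, slow


def pick_node(nodes):
    """Return the least loaded fast node as the target for the next task.

    The fast list is never empty for a non-empty input, because at least
    one node always sits at or below the mean load.
    """
    fast, _ = classify(nodes)
    return min(fast, key=lambda n: n.load)


# Illustrative cluster snapshot; the values are made up for the example.
nodes = [
    Node("node-a", cpu_load=0.90, mem_load=0.80),
    Node("node-b", cpu_load=0.20, mem_load=0.30),
    Node("node-c", cpu_load=0.50, mem_load=0.40),
]
fast, slow = classify(nodes)
print("fast:", [n.name for n in fast])
print("slow:", [n.name for n in slow])
print("next task goes to:", pick_node(nodes).name)
```

A real scheduler would refresh these measurements continuously, for example from node heartbeats, and would likely also weigh each node's past task failures. The sketch only shows the shape of the decision: measure load, label nodes, and steer new tasks away from the slow ones.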