Application-level checkpointing is one of the most widely used and mature fault-tolerance techniques on homogeneous systems, but its use on heterogeneous systems is still in its infancy: there is as yet no rigorous, well-founded implementation scheme or configuration method tailored to the architecture and fault model of heterogeneous systems. Motivated by this observation, and building on the architecture and programming model of CUDA heterogeneous systems, this paper analyzes how CUDA programs execute on CPUs and GPUs and proposes an asynchronous execution mechanism for application-level checkpointing on heterogeneous systems. Based on this mechanism, we study the problem of optimal checkpoint placement on heterogeneous systems and design an optimization scheme. Finally, three example applications on the CUDA platform are used to validate the feasibility and practicality of the technique and to evaluate its performance. The results show that the proposed asynchronous execution mechanism for application-level checkpointing on CPU-GPU heterogeneous systems is effective: compared with a checkpointing mechanism in which the CPU and GPU execute synchronously, it is more flexible to configure and offers a larger optimization space. The checkpoint placement optimization method built on this mechanism also effectively reduces checkpointing overhead and thereby achieves higher fault-tolerance performance.
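To make the idea of asynchronous checkpoint execution more concrete, the sketch below shows one possible way to overlap host-side checkpoint I/O with an asynchronously launched CUDA kernel. It is only an illustrative interpretation of the mechanism described above, not the paper's implementation; the helper save_host_checkpoint() and the file name checkpoint.bin are hypothetical.

// Sketch: overlap host-side checkpointing with asynchronous GPU work.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

// Hypothetical helper: persist the host-side state needed for restart.
static void save_host_checkpoint(const std::vector<float> &state, int step) {
    FILE *fp = fopen("checkpoint.bin", "wb");
    if (!fp) return;
    fwrite(&step, sizeof(step), 1, fp);
    fwrite(state.data(), sizeof(float), state.size(), fp);
    fclose(fp);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int step = 0; step < 4; ++step) {
        // The kernel launch returns immediately; the GPU computes in the background.
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 1.01f);

        // While the GPU is busy, the CPU writes the last consistent
        // host-side state to disk, hiding the checkpoint latency.
        save_host_checkpoint(h, step);

        // Synchronize and refresh the host copy so the next checkpoint
        // reflects the state just computed on the GPU.
        cudaStreamSynchronize(stream);
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}

In this sketch the checkpoint only captures host-visible state; how much device state must be copied back, and how often, is exactly the kind of placement decision the optimization scheme above is concerned with.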