As parallel programs on high-performance computers use more and more nodes, the probability that a node fails while a program is running also increases. For long-running programs, the ability to survive fail-stop node failures is therefore especially important. The parallel multigrid algorithm (MG) is widely used to compute numerical solutions of the systems of partial differential equations (PDEs) that arise in large-scale engineering and physics problems. To make the MG algorithm fault tolerant, we propose FT-MG, a fault-tolerant parallel multigrid algorithm based on fault-tolerant MPI. Experimental results show that FT-MG gives the MG algorithm fault tolerance while introducing only a small overhead.
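As a rough illustration of the opening claim, suppose each of the $N$ nodes fails independently with the same probability $p$ during a run. This independence model, and the symbols $N$, $p$, and $P_{\text{fail}}$, are assumptions made here for illustration and are not taken from the experiments. The probability that at least one node fails is then:

```latex
\[
  P_{\text{fail}}(N) = 1 - (1 - p)^{N}, \qquad 0 < p < 1,
\]
\[
  P_{\text{fail}}(N) \approx N p \quad \text{when } N p \ll 1 .
\]
```

Under this model $P_{\text{fail}}(N)$ grows monotonically with $N$. For example, with the hypothetical values $p = 10^{-4}$ and $N = 1000$, $P_{\text{fail}} \approx 0.095$, so roughly one run in ten would see a node failure.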