This paper addresses runtime error detection for MPI parallel programs and proposes REDReP, a novel error detection approach based on redundant processes. REDReP detects data errors caused by hardware faults while an MPI program is running. The paper first introduces the basic idea of REDReP, then discusses several key issues, and finally presents experimental results, which show that REDReP incurs low error-detection overhead.