As integrated-circuit technology scales to nanometer feature sizes, soft errors have become an increasingly serious threat to chip reliability. Beyond the sheer number of soft errors, processors also face variability in the soft error rate (SER): process variation and changes in the operating environment, such as voltage, temperature, and location, mean that a system's error rate does not stay constant but changes over time. Checkpointing is the main fault-tolerance mechanism, and its overhead depends strongly on the checkpoint interval; most existing methods determine that interval by assuming a constant SER. When the SER varies, a self-adaptive checkpoint (SACP) scheme, which predicts the system's error rate and keeps the checkpoint interval close to its optimum, can reduce checkpoint overhead significantly compared with a fixed interval. The achievable gain, however, depends on how strongly the error rate actually varies, so this paper studies how the form and amplitude of SER variation affect checkpoint overhead. First, based on how temperature, voltage, location, and other factors influence soft errors, we build an SER variation model whose parameters include variation amplitude and duration. Second, using this model, we simulate the performance improvement that self-adaptive checkpointing can obtain in the ideal case. Third, we propose a method that predicts the SER from the history of the most recent errors, and use it to verify what self-adaptive checkpointing achieves in practice. Experimental results show that the proposed method yields a practical performance improvement when the variation amplitude is above 3× and its duration is more than 12.5%.
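To make the idea of matching the checkpoint interval to a changing error rate concrete, the following is a minimal Python sketch under stated assumptions, not the method of this paper. It assumes Young's first-order approximation for the optimal interval, sqrt(2C/λ), where C is the cost of one checkpoint and λ is the error rate; the abstract does not say which interval formula is used. The sliding-window estimator is a hypothetical stand-in for a history-based SER predictor, and all names (`young_interval`, `WindowedRateEstimator`, `adaptive_interval`) are illustrative.

```python
import math
from collections import deque


def young_interval(checkpoint_cost, error_rate):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C / lambda).

    Assumption: this approximation is used for illustration only.
    """
    return math.sqrt(2.0 * checkpoint_cost / error_rate)


class WindowedRateEstimator:
    """Hypothetical predictor: estimate the error rate from recent error times."""

    def __init__(self, window_size, initial_rate):
        # Only the most recent `window_size` error timestamps are kept.
        self.times = deque(maxlen=window_size)
        self.initial_rate = initial_rate

    def record_error(self, timestamp):
        self.times.append(timestamp)

    def rate(self):
        # Fall back to the initial rate until two errors have been observed.
        if len(self.times) < 2:
            return self.initial_rate
        span = self.times[-1] - self.times[0]
        if span <= 0:
            return self.initial_rate
        # (n - 1) inter-arrival gaps were observed over `span` time units.
        return (len(self.times) - 1) / span


def adaptive_interval(estimator, checkpoint_cost):
    """Recompute the checkpoint interval from the current rate estimate."""
    return young_interval(checkpoint_cost, estimator.rate())


if __name__ == "__main__":
    est = WindowedRateEstimator(window_size=4, initial_rate=0.01)
    for t in (0, 100, 200, 300):       # quiet phase: one error per 100 time units
        est.record_error(t)
    print(adaptive_interval(est, checkpoint_cost=2.0))
    for t in (310, 320, 330, 340):     # burst phase: one error per 10 time units
        est.record_error(t)
    print(adaptive_interval(est, checkpoint_cost=2.0))
```

In this sketch a tenfold rise in the estimated error rate shortens the interval by a factor of sqrt(10), which illustrates why the benefit of adapting depends on how large the variation is and how long it lasts. The window size trades responsiveness against noise in the estimate.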