As process technology scales, microprocessors face an increasingly serious threat from soft errors. This paper proposes two soft-error-tolerant execution models for chip multiprocessors: Dual Core Redundancy (DCR) and Triple Core Redundancy (TCR). DCR runs two identical copies of a thread on two redundant cores, separated by a certain slack, and a store instruction can commit only after its results have been compared. Each core is extended with a hardware context saving and recovery mechanism, so that a soft error can be recovered by re-executing from the last context saving point. The context saving points are chosen to help hide the latency of saving the context, and a special mechanism guarantees that loads return the same data during re-execution as during the original execution. TCR masks soft errors by running the same thread on three different cores. Once a soft error is detected, TCR can be reconfigured dynamically to mask the core corrupted by the error. Experimental results show that, compared with the traditional soft error recovery execution model CRTR, DCR and TCR reduce the inter-core communication bandwidth demand by 57.5% and 54.2%, respectively. When a soft error is detected, re-execution in DCR incurs a 5.2% performance overhead, while reconfiguration in TCR incurs a 1.3% overhead. Fault injection experiments show that DCR recovers from 99.69% of soft errors, while TCR masks all SEU (Single Event Upset) faults.
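The store-comparison step in DCR and the masking step in TCR can be illustrated with a small software model. This is only a sketch under stated assumptions: the abstract does not describe the comparison hardware, so the `Store` record, `dcr_check`, and `tcr_vote` are hypothetical names, and the sketch assumes each store is compared by address and value and that TCR uses simple majority voting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Store:
    """Hypothetical record of one store instruction's outcome."""
    address: int
    value: int


def dcr_check(leading: Store, trailing: Store) -> bool:
    """Assumed DCR rule: the store may commit only if both copies agree.

    A False result stands for "soft error detected; re-execute from the
    last context saving point".
    """
    return leading == trailing


def tcr_vote(stores: Sequence[Store]) -> Tuple[Optional[Store], List[int]]:
    """Assumed TCR rule: majority-vote the stores from the redundant cores.

    Returns the winning store (or None if no two cores agree) and the
    indices of the cores that disagree with the winner, which are the
    candidates to be masked by reconfiguration.
    """
    winner, votes = Counter(stores).most_common(1)[0]
    if votes < 2:
        # No majority: every core is suspect in this simplified model.
        return None, list(range(len(stores)))
    faulty = [i for i, s in enumerate(stores) if s != winner]
    return winner, faulty
```

For example, `tcr_vote([Store(16, 7), Store(16, 7), Store(16, 3)])` returns `Store(16, 7)` as the winner and flags core index 2 as the disagreeing core, whereas `dcr_check` on two mismatching stores returns `False` and would trigger recovery rather than masking.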