As semiconductor technology advances, more than 60% of the die area of a chip multiprocessor may be devoted to on-chip cache. Shrinking feature sizes and falling supply voltages make on-chip caches more vulnerable to faults than before, including recoverable soft errors and unrecoverable erratic-bit failures. Conventional fault-tolerance techniques mainly protect a single cache module. In a cache hierarchy with hundreds or thousands of modules, however, the probability that one or more modules fail is considerable even when the fault probability of each individual module is low. This paper proposes the Scalable Address Mapping (SAM) method, which supports efficient and flexible mapping of the cacheable address space and improves the reliability of the last-level cache (LLC). By reconstructing the LLC address space, SAM keeps the system operating correctly as long as at least one functional LLC module remains on chip. SAM applies to both shared and cluster cache organizations. We also propose a reorganization algorithm that adaptively changes the region (cluster) size in cluster organizations: as the number of faulty LLC modules increases, SAM adjusts the number of regions in the LLC to relieve capacity pressure. Simulation results show that SAM keeps the system functionally correct under a variety of cache fault scenarios while average performance degrades gracefully.
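Two ideas in the abstract can be made concrete with a small sketch. The first function computes the chance that at least one of N modules is faulty, assuming (our assumption, not stated in the abstract) that modules fail independently with the same probability. The other two illustrate one possible way to redirect addresses away from faulty LLC modules. The function names, the round-robin redirection policy, the line-interleaved home selection, and the 64-byte line size are all hypothetical choices made for illustration; they are not the SAM mechanism itself, which the abstract does not specify.

```python
def prob_any_module_faulty(p_module, n_modules):
    """Probability that at least one of n_modules is faulty.

    Assumes independent faults with identical per-module probability.
    """
    return 1.0 - (1.0 - p_module) ** n_modules


def build_module_map(n_modules, faulty):
    """Map each logical LLC module index to a functional physical module.

    Hypothetical policy: a healthy module keeps its own index; each faulty
    module is redirected round-robin over the healthy modules.
    """
    healthy = [m for m in range(n_modules) if m not in faulty]
    if not healthy:
        # The abstract requires at least one functional LLC module.
        raise ValueError("no functional LLC module left")
    mapping = []
    redirected = 0
    for m in range(n_modules):
        if m in faulty:
            mapping.append(healthy[redirected % len(healthy)])
            redirected += 1
        else:
            mapping.append(m)
    return mapping


def home_module(addr, mapping, line_bytes=64):
    """Select the LLC module for an address using line interleaving
    (an assumed baseline mapping), then apply the fault-aware remap."""
    logical = (addr // line_bytes) % len(mapping)
    return mapping[logical]
```

With a per-module fault probability of 0.001 and 256 modules, `prob_any_module_faulty(0.001, 256)` is roughly 0.23, which is the effect the abstract describes: a low per-module probability still yields a considerable system-level one. Even with all but one module marked faulty, `build_module_map` still returns a valid mapping, mirroring the claim that one functional module is enough to keep the system running, at the cost of capacity.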