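Using a write-back policy in the cache greatly reduces the consumption of on-chip network and memory bandwidth, which is especially important for on-chip many-core architectures (more than 16 cores). Unlike the directory-based or bus-based write-invalidate and write-update protocols commonly used in multi-core systems, this paper presents an on-chip implementation of the scope consistency memory model together with a cache coherence protocol based on hardware locks. It also proposes keeping write masks in the L1 cache to record the byte positions that have been updated locally within a cache block, which resolves the cache coherence problem caused by false sharing under a write-back policy. The paper further proposes two new methods for reducing the storage overhead of the masks. First, store instructions of 1 to 3 bytes, which occur rarely in programs, are treated as write-through, and one write-mask bit is kept per 4 bytes in L1; this reduces the chip area overhead of the write masks to 27.9% of that of the byte-granularity design. Second, a multi-way write-mask cache is designed whose number of entries is 12.5% of the total number of L1 cache blocks; this reduces the area overhead to 17.7% of that of the byte-granularity design with no loss of performance. The many-core platform that was built, Godson-T, adopts the scope consistency memory model and uses write masks to implement a hybrid write-back/write-through cache policy (write-through inside critical sections, write-back outside them). Three SPLASH-2 programs and two biological computing programs are used for the evaluation. The results show that, relative to a fully write-through policy, the hybrid write-back policy generally achieves a performance improvement of more than 24% in the 32-thread and 64-thread configurations, performs slightly better than a fully write-back policy, and loses no performance when the two space-saving methods are applied.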
A write-back cache policy can greatly reduce the bandwidth consumed by write operations, which is particularly beneficial in many-core architectures. CMPs normally use a write-invalidate or write-update cache protocol, such as directory-based MESI, which scales poorly and is complex. As an alternative, the authors implement scope consistency (with a lock-based cache coherence protocol) on chip and add a write mask to each cache line of the L1 Dcache to record the locations of written bytes, which solves the false sharing problem. Two methods are proposed to further reduce the write-mask storage overhead. First, store instructions of 1, 2, or 3 bytes are given the write-through property, so that one write-mask bit covers every 4 bytes of data; this reduces the chip area of the write mask to 27.9% of the original byte-granularity design. Second, a write-mask buffer is designed whose entry count is 12.5% of the total number of Dcache blocks, which reduces the area overhead to 17.7% of the original without performance loss. On the 64-core Godson-T platform, which uses scope consistency, write masks are used to implement a hybrid WB/WT cache policy: write-through inside a scope, where data races are possible, and write-back outside a scope, where there are none. Three SPLASH-2 programs and two biological programs are evaluated. The results show a performance improvement of more than 24% over a fully write-through policy, with no performance loss under the two storage optimizations.
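To make the role of the write mask concrete, the following is a minimal software sketch, not the Godson-T hardware design. It assumes a 32-byte cache line and uses hypothetical names (`CachedLine`, `write_back_masked`, `write_back_whole_line`, `demo`) that do not come from the paper. Two cores write disjoint bytes of the same line (false sharing). Writing back only the masked bytes preserves both updates, whereas writing back the whole line lets the later write-back overwrite the earlier one with stale data.

```python
LINE_SIZE = 32  # bytes per cache line; an assumed value for illustration only


class CachedLine:
    """One core's private L1 copy of a line plus a per-byte write mask."""

    def __init__(self, memory_line):
        self.data = bytearray(memory_line)      # private copy of the line
        self.mask = [False] * len(memory_line)  # True = byte written locally

    def store(self, offset, payload):
        """Write payload into the local copy and mark those bytes dirty."""
        for i, value in enumerate(payload):
            self.data[offset + i] = value
            self.mask[offset + i] = True


def write_back_masked(line, memory_line):
    """Write back only the locally written bytes, then clear the mask."""
    for i, dirty in enumerate(line.mask):
        if dirty:
            memory_line[i] = line.data[i]
    line.mask = [False] * len(line.mask)


def write_back_whole_line(line, memory_line):
    """Naive write-back: overwrite the whole line, including stale bytes."""
    memory_line[:] = line.data


def demo(write_back):
    """Two cores update disjoint bytes of one line, then both write back."""
    memory = bytearray(LINE_SIZE)
    core0 = CachedLine(memory)
    core1 = CachedLine(memory)
    core0.store(0, b"\x11\x11\x11\x11")  # core 0 writes bytes 0..3
    core1.store(4, b"\x22\x22\x22\x22")  # core 1 writes bytes 4..7
    write_back(core0, memory)
    write_back(core1, memory)
    return bytes(memory[:8])


if __name__ == "__main__":
    print(demo(write_back_masked).hex())      # both updates survive
    print(demo(write_back_whole_line).hex())  # core 0's update is lost
```

The 4-byte-granularity optimization described in the abstract would correspond to keeping one mask entry per aligned 4-byte word instead of per byte, with narrower stores bypassing the mask through write-through; that variant is not modeled here.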