Cache memory has become an essential part of modern processors, bridging the growing speed gap between the CPU and main memory. Cache thrashing is one of the most severe problems affecting cache performance. Research shows that traditional hardware-only cache management, with its fixed policies and coherence protocols, cannot match the diverse memory access patterns found in programs, and consequently causes severe cache thrashing. The authors present the data-object oriented cache (DOOC), a novel software/hardware cooperative cache. DOOC dynamically allocates isolated cache segments to different data objects; moreover, the capacity, associativity, block size, and coherence protocol of each segment can be reconfigured dynamically by software, so that varied cache segments match diverse data access patterns. The design and implementation of DOOC are discussed in detail, together with the supporting compiler techniques and a data-object oriented prefetching mechanism. The hardware cost of DOOC is estimated with CACTI and with a sample FPGA implementation based on the LEON3 processor; the results show that DOOC is hardware-efficient. The performance of DOOC is then evaluated on both single-core and multi-core platforms through software simulation. On the single-core platform, fifteen kernel benchmarks extracted from scientific applications, multimedia programs, and database management routines are tested: compared with a traditional cache, DOOC reduces the miss rate by 44.98% on average (93.02% at most) and achieves an average speedup of 1.20 (2.36 at most). On a four-core platform running the OpenMP version of NPB, DOOC reduces the miss rate by 49.69% on average (73.99% at most).
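The intuition behind segment isolation can be illustrated with a small simulation. The sketch below is not the paper's implementation: it is a hypothetical direct-mapped cache model in which two data objects whose blocks map to the same sets thrash a shared cache, while DOOC-style isolated segments (same total number of sets, partitioned per object) let each object keep its blocks resident.

```python
# Hypothetical sketch, not the DOOC hardware: a direct-mapped cache model
# comparing one shared index space against per-object segments.
BLOCK = 16  # assumed block size in bytes

def misses(trace, num_sets, segment_of=None, sets_per_segment=None):
    """Count misses for a trace of (object, address) accesses.

    With segment_of given, each object indexes only its own range of sets
    (a DOOC-style isolated segment); otherwise all objects share all sets.
    """
    tags = {}
    miss = 0
    for obj, addr in trace:
        block = addr // BLOCK
        if segment_of is None:
            idx = block % num_sets                      # shared index space
        else:
            idx = segment_of[obj] * sets_per_segment + block % sets_per_segment
        if tags.get(idx) != (obj, block):               # conflict or cold miss
            miss += 1
            tags[idx] = (obj, block)
    return miss

# Two arrays, A and B, accessed alternately; their blocks collide in the
# shared cache, so every access evicts the other object's block.
trace = [(o, a) for a in range(0, 1024, BLOCK) for o in ("A", "B")] * 2

shared = misses(trace, num_sets=128)
segmented = misses(trace, num_sets=128,
                   segment_of={"A": 0, "B": 1}, sets_per_segment=64)
print(shared, segmented)  # the shared cache thrashes; the segments do not
```

Both configurations have 128 sets in total; only the partitioning differs. In the shared cache the alternating accesses miss every time, while with isolated segments only the cold misses of the first pass remain, which is the kind of thrashing reduction the miss-rate figures above quantify.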