Simultaneous multithreading (SMT) is a latency-tolerant architecture that typically employs a shared L2 cache and can execute multiple instructions from multiple threads in each cycle, which increases the pressure on the memory hierarchy. This paper studies the partitioning of a shared cache among multiple concurrently executing threads in an SMT processor, in particular the fairness of cache sharing and its relation to throughput. The conventional LRU policy implicitly partitions the shared cache on a demand basis, giving more cache space to the threads with higher demand; this unfair management can lead to problems such as thread starvation and priority inversion. An adaptive runtime partitioning (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the partitioning metric and uses a dynamic partitioning algorithm to optimize it; the algorithm is easy to implement and requires little profiling. In hardware, classical monitors collect each thread's stack-distance information at a storage overhead of less than 0.25%. Experimental results show that, compared with LRU-based cache partitioning, ARP improves the fairness of a 2-way SMT processor by a factor of 2.26 while improving throughput by 14.75% on average.
Simultaneous multithreading is a latency-tolerant architecture that usually employs a shared L2 cache. It can execute multiple instructions from multiple threads each cycle, thus increasing the pressure on the memory hierarchy. In this paper, the problem of partitioning a shared cache between multiple concurrently executing threads in the SMT architecture, especially the issue of fairness in cache sharing and its relation to throughput, is studied. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand. LRU manages the cache unfairly and can lead to serious problems such as thread starvation and priority inversion. An adaptive runtime partition (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the metric of cache partitioning and uses a dynamic partitioning algorithm to optimize fairness. The dynamic partitioning algorithm is easy to implement and requires little or no profiling. Meanwhile, it uses a classical monitor circuit to collect the stack-distance information of each thread, at a storage overhead of less than 0.25%. The evaluation shows that, on average, ARP improves the fairness of a 2-way SMT processor by a factor of 2.26 while increasing throughput by 14.75%, compared with LRU-based cache partitioning.
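To make the profiling idea concrete, the sketch below simulates the stack-distance bookkeeping the abstract alludes to. It is a minimal software model, not the paper's hardware monitor: the function names are hypothetical, the fairness proxy (minimizing the gap in predicted miss counts between two threads) is a simplification of whatever metric ARP actually optimizes, and the exhaustive search stands in for the paper's dynamic partitioning algorithm.

```python
from collections import defaultdict

def stack_distance_histogram(trace):
    """Build an LRU stack-distance histogram for one thread's
    trace of cache-line addresses. Distance d means d distinct
    lines were touched since the last access to this line;
    'inf' counts cold misses (first-time accesses)."""
    stack = []                      # LRU stack, most recent at index 0
    hist = defaultdict(int)
    for addr in trace:
        if addr in stack:
            hist[stack.index(addr)] += 1
            stack.remove(addr)
        else:
            hist['inf'] += 1        # cold miss
        stack.insert(0, addr)       # promote to most-recently-used
    return hist

def predicted_misses(hist, ways):
    """Misses this thread would suffer with `ways` cache ways:
    every access whose stack distance is >= ways misses in LRU."""
    return hist['inf'] + sum(n for d, n in hist.items()
                             if d != 'inf' and d >= ways)

def fair_partition(hist_a, hist_b, total_ways):
    """Pick the way split that minimizes the gap between the two
    threads' predicted miss counts (a simple fairness proxy)."""
    best_ways, best_gap = None, None
    for w in range(1, total_ways):
        gap = abs(predicted_misses(hist_a, w) -
                  predicted_misses(hist_b, total_ways - w))
        if best_gap is None or gap < best_gap:
            best_ways, best_gap = w, gap
    return best_ways
```

The histograms are the key property being exploited: one pass over a trace predicts the miss count for *every* candidate partition size, so the partitioning decision can be re-evaluated at run time without re-profiling.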