As semiconductor process scaling makes wire delay increasingly significant, distributed, tiled processor architectures have become attractive. In a conventional stream processor, the control signals issued by the stream controller must travel over long wires and therefore suffer long wire delays. Each cluster in a conventional stream processor comprises many functional units, and because communication among clusters is centrally controlled, the inter-cluster communication network scales poorly with respect to wire delay. This paper proposes a tiled stream processor architecture (TPA-PD) that connects tiled components through distributed networks, avoiding long wire delays in the delivery of control signals. At the kernel level, TPA-PD adopts a dataflow-like execution model, explicit data graph execution, in which the dependences between instructions are statically encoded in the instructions themselves. This replaces the centralized inter-cluster communication of conventional stream processors with dynamically issued, distributed communication, which makes the architecture easier to scale. The paper describes the new execution model, the instruction set, the microarchitecture, and the mapping of the stream programming model onto TPA-PD. Using a cycle-accurate simulator, the authors analyze the hardware and software factors that affect kernel-level execution time. Across eight benchmarks, TPA-PD achieves an average speedup of 20% over a conventional stream processor.
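To make the kernel-level execution model concrete, the following is a minimal sketch of dataflow-style firing in which each instruction names its consumers directly, as in explicit data graph execution. It is not the TPA-PD instruction set: the `Instr` class, the `(consumer index, operand slot)` target encoding, the two-operand restriction, and the helper names `run` and `example_block` are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple
import operator


@dataclass
class Instr:
    """Hypothetical two-operand instruction with statically encoded targets."""
    op: Callable[[int, int], int]
    # (consumer instruction index, operand slot) pairs, fixed at compile time.
    targets: List[Tuple[int, int]]
    # Operand slots filled at run time as producers deliver values.
    operands: Dict[int, int] = field(default_factory=dict)


def run(block: List[Instr], inputs: List[Tuple[int, int, int]]) -> Dict[int, int]:
    """Fire each instruction once both operands have arrived.

    `inputs` holds (instruction index, operand slot, value) triples.
    Returns the values of instructions that have no targets (block outputs).
    """
    ready: deque = deque()
    outputs: Dict[int, int] = {}

    def deliver(idx: int, slot: int, value: int) -> None:
        ins = block[idx]
        ins.operands[slot] = value
        if len(ins.operands) == 2:  # both operands present: ready to issue
            ready.append(idx)

    for idx, slot, value in inputs:
        deliver(idx, slot, value)

    while ready:
        idx = ready.popleft()
        ins = block[idx]
        value = ins.op(ins.operands[0], ins.operands[1])
        if not ins.targets:
            outputs[idx] = value
        # The result is forwarded point to point; no central controller
        # decides where it goes.
        for t_idx, t_slot in ins.targets:
            deliver(t_idx, t_slot, value)
    return outputs


def example_block() -> List[Instr]:
    """Encodes (a + b) * (c - d) as three instructions."""
    return [
        Instr(operator.add, targets=[(2, 0)]),
        Instr(operator.sub, targets=[(2, 1)]),
        Instr(operator.mul, targets=[]),
    ]


result = run(example_block(), [(0, 0, 3), (0, 1, 4), (1, 0, 10), (1, 1, 5)])
```

Here the add and subtract instructions issue as soon as their inputs arrive and forward their results straight to the multiply, so the issue order is determined by operand arrival rather than by a centralized sequencer.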