A Data-Flow-Like Driven Tiled Stream Processor Architecture and Its Programming Model

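ISSN: 1000-1239
Journal: 计算机研究与发展 (Journal of Computer Research and Development)
Date: 0
Pages: 1643-1653
Language: Chinese
Classification: TP302 [Automation and Computer Technology / Computer System Architecture; Automation and Computer Technology / Computer Science and Technology]
Author affiliations: [1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027; [2] Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190
Funding: National Natural Science Foundation of China, Key Program (60633040); National Natural Science Foundation of China (60736012); National Basic Research Program of China ("973" Program) (2005CB321601); National "863" Program, Major Project (2006AA01A102); National High Technology Research and Development Program of China ("863" Program) (2009AA01Z106); Ministry of Education-Intel Special Research Fund for Information Technology (MOE-INTEL-08-07)
Related project: Research on Hyper-Parallel Computer Architecture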

Keywords: wire delay, stream processor, tiled, data-flow-like driven, processor architecture

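Abstract (translated from the Chinese):

Given the wire-delay problem brought by advances in semiconductor process technology, distributed, tiled processor architectures have become attractive. In conventional stream processors, the control signals issued by the stream controller suffer long wire delays as they propagate. The arithmetic clusters of a conventional stream processor consist of many functional units, and because communication among clusters is centrally controlled, the inter-cluster communication network scales poorly with wire delay. This paper proposes a tiled stream processor architecture (TPA-PD) that connects tiled components through a distributed network, avoiding long wire delays in control-signal propagation. At the kernel level, TPA-PD uses a data-flow-like execution model, namely explicit data graph execution: dependences between instructions are statically encoded in the instructions, and the centralized inter-cluster communication of conventional stream processors is replaced by dynamically issued, distributed communication, which eases architectural scaling. The paper explains the new execution model, the instruction set, and the mapping of the stream programming model onto the new architecture. Using a cycle-accurate simulator, the experiments analyze the hardware and software factors that affect kernel-level execution time; across 8 benchmarks, TPA-PD achieves an average speedup of 20% over a conventional stream processor.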

English abstract:

With wire delay increasing as process technology advances, distributed and tiled processor architectures are becoming increasingly attractive. In a conventional stream processor, the control signals dispatched by the stream controller face increasing wire delay. Each cluster in a conventional stream processor consists of a variety of functional units, and the centralized communication architecture among clusters scales poorly with wire delay. This paper introduces a tiled stream processor architecture (TPA-PD), in which a distributed network connects the tiled components to address the increasing wire delay of the control signals. At the kernel level, a data-flow-like driven execution model, namely explicit data graph execution, is employed: dependence relations are encoded in the instructions, and the centralized communication model among clusters is converted into a dynamically dispatched, distributed communication model that is wire-delay scalable. The instruction set, the microarchitecture, and the mapping of the stream programming model onto TPA-PD are described. Finally, the authors use a cycle-accurate simulator to analyze the factors that affect kernel-level execution time; TPA-PD achieves an average 20% speedup over a traditional stream processor across eight benchmarks.
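
The abstract describes an execution model in which dependences are encoded in the instructions themselves and instructions are dispatched dynamically. As a rough illustration of that general idea only, the Python sketch below models a block of binary instructions where each instruction names its consumers directly and fires once both operands have arrived. It is not the TPA-PD instruction set or microarchitecture: the names `Instr`, `OPS`, `build_block`, and `run_block`, the two-operand restriction, and the example expression are all assumptions made for this sketch.

```python
from collections import deque
from dataclasses import dataclass, field
import operator

# Hypothetical opcode table for this sketch (not taken from the paper).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass
class Instr:
    op: str        # opcode name, looked up in OPS
    targets: list  # statically encoded consumers: (instr_index, operand_slot)
    operands: dict = field(default_factory=dict)  # slot -> value, filled at run time


def build_block():
    """Dataflow graph for (a + b) * (c - d); instruction 2 consumes 0 and 1."""
    return [
        Instr("add", targets=[(2, 0)]),
        Instr("sub", targets=[(2, 1)]),
        Instr("mul", targets=[]),  # no consumers: its result is a block output
    ]


def run_block(instrs, inputs):
    """Fire each instruction when both operands have arrived.

    inputs is a list of (instr_index, operand_slot, value) triples.
    Returns (outputs, firing_order).
    """
    ready = deque()
    outputs = {}
    order = []

    def deliver(idx, slot, value):
        ins = instrs[idx]
        ins.operands[slot] = value
        if len(ins.operands) == 2:  # every instruction here is binary
            ready.append(idx)

    for idx, slot, value in inputs:
        deliver(idx, slot, value)

    while ready:
        idx = ready.popleft()
        ins = instrs[idx]
        result = OPS[ins.op](ins.operands[0], ins.operands[1])
        order.append(idx)
        if not ins.targets:
            outputs[idx] = result
        for t_idx, t_slot in ins.targets:
            # Forward the result straight to the consumer named in the instruction.
            deliver(t_idx, t_slot, result)

    return outputs, order


if __name__ == "__main__":
    # a=1, b=2, c=7, d=3
    print(run_block(build_block(), [(0, 0, 1), (0, 1, 2), (1, 0, 7), (1, 1, 3)]))
```

In this sketch no central controller decides the order: the firing order follows from when operands arrive, so delivering the inputs of instruction 1 first makes it fire before instruction 0 while the final result stays the same.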
