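Traditional dataflow architectures use multiple contexts to hide the latency of instructions waiting for their source operands, but this approach only partly improves the utilization of the execution units in a dataflow processor. For typical scientific applications such as Stencil, FFT, and matrix multiplication, the execution-unit utilization of traditional dataflow architectures remains low. The kernels of scientific computing generally apply the same operations to different data, and these operations can execute in parallel because the data have no direct dependences on one another. Traditional dataflow architectures target general-purpose computing and usually use loops to apply the same operations to different data. The iterations of these loops execute one after another, so traditional dataflow architectures do not exploit the parallelism of scientific computing to improve performance. As a result, when handling these regular scientific applications, traditional dataflow architectures fail to reconcile the dataflow computing model with the characteristics of scientific computing, even though dataflow computing is well suited to regular computations of this kind. Based on these characteristics, this paper proposes a dataflow architecture optimization for scientific computing: the loop-in-pipeline optimization. It exploits the blocking and parallel-processing characteristics of scientific computing to improve the context control logic of traditional dataflow architectures: loops are implemented through hardware self-iteration, and the context-switching logic is pipelined so that contexts enter the execution-unit array in a pipelined fashion, which raises the utilization of the compute units. Instruction mapping algorithms designed for traditional dataflow architectures no longer suit the architecture after the loop-in-pipeline optimization. By analyzing the characteristics of the optimized architecture, this paper further proposes an improved instruction mapping algorithm: the LBC (Load Balance Centric) instruction mapping algorithm. LBC maps all instructions in the dataflow graph in depth-first order; for each instruction it computes the cost of every position in the execution-unit array and takes the position with the minimum cost as the best mapping position. LBC centers on load balance across the execution units and handles fixed-point and floating-point instructions separately, ensuring that the fixed-point and floating-point components of the execution units …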
Traditional dataflow architectures use multiple contexts to hide the latency of instructions waiting for their operands. However, multiple contexts only partly improve the utilization of the function units in the processing elements of a dataflow architecture. For typical scientific applications such as stencil, FFT, and matrix multiplication, these architectures remain inefficient because function-unit utilization stays low. The kernels of scientific applications usually perform the same operations on different data. Because the data are usually independent of each other, these operations can be performed in parallel. Traditional dataflow architectures are general purpose: they usually implement the same operations on different data as loops whose iterations execute in sequence. As a result, they do not exploit the parallelism of scientific applications. Dataflow computing is well suited to scientific applications, but traditional dataflow architectures do not reconcile the dataflow computing model with the features of scientific applications. Based on the computing features of scientific applications, we propose an optimization of dataflow architectures for scientific applications: the loop-in-pipeline optimization. The optimization takes advantage of the blocking and parallelism features of scientific applications and improves the context control logic of traditional dataflow architectures. It implements the loops of scientific applications in hardware and pipelines the context switching. The resulting loop-in-pipeline dataflow architecture streams contexts into the processing element (PE) array in a pipelined manner to improve the utilization of function units. However, traditional dataflow instruction mapping algorithms are not suited to loop-in-pipeline architectures.
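The utilization argument can be illustrated with a toy cycle-count model. This is only a sketch under simplifying assumptions that do not come from the paper: the mapped dataflow graph is treated as a linear pipeline of `pipeline_depth` single-cycle stages, each loop iteration is one context, and the function names are hypothetical. It is not a model of the actual hardware.

```python
def sequential_cycles(num_iterations: int, pipeline_depth: int) -> int:
    """Cycles when each iteration must drain before the next one starts."""
    return num_iterations * pipeline_depth


def pipelined_cycles(num_iterations: int, pipeline_depth: int) -> int:
    """Cycles when a new context enters the array every cycle (assumed)."""
    if num_iterations == 0:
        return 0
    # The first context needs pipeline_depth cycles; each later one adds one.
    return pipeline_depth + num_iterations - 1


def utilization(num_iterations: int, pipeline_depth: int, total_cycles: int) -> float:
    """Fraction of stage-cycles that do useful work."""
    if total_cycles == 0:
        return 0.0
    busy_stage_cycles = num_iterations * pipeline_depth
    return busy_stage_cycles / (pipeline_depth * total_cycles)


if __name__ == "__main__":
    n, d = 64, 8  # illustrative values, not measurements
    for label, cycles in (
        ("sequential", sequential_cycles(n, d)),
        ("pipelined", pipelined_cycles(n, d)),
    ):
        print(f"{label}: {cycles} cycles, utilization {utilization(n, d, cycles):.3f}")
```

In this model, sequential execution keeps utilization at 1/d regardless of the iteration count, while pipelined context entry approaches full utilization as the number of iterations grows relative to the pipeline depth.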
Based on the features of the loop-in-pipeline architectures, we propose a novel instruction mapping algorithm. LBC (load-ba