Dongli Research Big Data Discovery System (DRDS)

一种面向科学计算的数据流优化方法 (A Dataflow Optimization Method for Scientific Computing)

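ISSN: 0254-4164
Journal: 《计算机学报》 (Chinese Journal of Computers)
Classification: TP393 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049; [3] Institute of Electronics, Chinese Academy of Sciences, Beijing 100090
Funding: This work was supported by the National High Technology Research and Development Program of China (863 Program; 2015AAOIA301, 2012AA010901), the National Science and Technology Major Project on Core Electronic Devices, High-End Generic Chips and Basic Software (2013ZX0102-8001-001-001), and the National Natural Science Foundation of China (61332009, 61173007, 61204047, 61221062).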

Authors: 申小伟[1,2], 叶笑春[1], 王达[1], 张浩[1], 王飞[3], 谭旭[1,2], 张志敏[1], 范东睿[1], 唐志敏[1], 孙凝晖[1]

Keywords: instruction mapping, dataflow, loop-in-pipeline, scientific processing unit, high performance computing

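Chinese abstract (English translation):

Traditional dataflow architectures use multiple contexts to hide the latency of instructions waiting for their source operands, but this only partially improves the utilization of the execution units in a dataflow processor. For typical scientific applications such as Stencil, FFT, and matrix multiplication, execution-unit utilization in traditional dataflow architectures remains low. Kernels in scientific computing generally apply the same operations to different data; these operations can run in parallel because the data have no direct dependences on one another. Traditional dataflow architectures target general-purpose computing and usually use loops to apply the same operations to different data. In these loops the iterations execute sequentially, one after another, so traditional dataflow architectures do not exploit the parallelism of scientific computing to improve performance. As a result, when handling these regular scientific applications, traditional dataflow architectures fail to reconcile the dataflow computing model with the characteristics of scientific computing, even though dataflow computing is well suited to regular computation of this kind. Based on these characteristics, the paper proposes a dataflow architecture optimization method for scientific computing: loop-in-pipeline optimization. The method exploits the blocking and parallel-processing characteristics of scientific computing to improve the context control logic of traditional dataflow architectures: loops are implemented as hardware self-iteration, and the context-switching logic is pipelined so that contexts enter the execution-unit array in a pipelined fashion, which raises the utilization of the computing units. Instruction mapping algorithms designed for traditional dataflow architectures no longer suit the loop-in-pipeline architecture. By analyzing the structural features of the optimized architecture, the paper further proposes an improved instruction mapping algorithm, LBC (Load Balance Centric). LBC maps all instructions in the dataflow graph in depth-first order; for each instruction it computes the cost of every position in the execution-unit array and takes the position with the lowest cost as the best mapping position. LBC is centered on load balance across execution units and handles fixed-point and floating-point instructions separately, ensuring that the fixed-point and floating-point units on the execution units ...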
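The LBC procedure described in the abstract (depth-first traversal, a cost for every position in the execution-unit array, minimum-cost placement, separate accounting for fixed-point and floating-point instructions) can be illustrated with a minimal Python sketch. The abstract does not give the actual cost function, so the one used here (per-kind load on a position plus a weighted Manhattan distance to already-placed predecessors) is an assumption, as are the names `dfs_order`, `map_instructions`, and `comm_weight`. This is not the authors' implementation.

```python
from collections import defaultdict
from itertools import product


def dfs_order(graph, roots):
    """Return instructions in depth-first (pre-order) order from the roots."""
    order, seen = [], set()
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push successors reversed so the first successor is visited first.
        stack.extend(reversed(graph.get(node, [])))
    return order


def map_instructions(graph, kinds, roots, rows, cols, comm_weight=0.5):
    """Greedily place each instruction on the cheapest position of a rows x cols array.

    graph: dict mapping an instruction to its list of successor instructions
    kinds: dict mapping an instruction to "fixed" or "float"
    The cost (per-kind load plus weighted Manhattan distance to predecessors
    that are already placed) is an illustrative assumption, not the paper's.
    """
    preds = defaultdict(list)
    for src, dsts in graph.items():
        for dst in dsts:
            preds[dst].append(src)

    load = defaultdict(int)  # (position, kind) -> number of instructions placed
    placement = {}
    for inst in dfs_order(graph, roots):
        kind = kinds[inst]

        def cost(pos):
            distance = sum(
                abs(pos[0] - placement[p][0]) + abs(pos[1] - placement[p][1])
                for p in preds[inst]
                if p in placement
            )
            return load[(pos, kind)] + comm_weight * distance

        # Evaluate every position and keep the cheapest (first one wins ties).
        best = min(product(range(rows), range(cols)), key=cost)
        placement[inst] = best
        load[(best, kind)] += 1
    return placement


if __name__ == "__main__":
    graph = {"ld_a": ["mul"], "ld_b": ["mul"], "mul": ["add"], "add": []}
    kinds = {"ld_a": "fixed", "ld_b": "fixed", "mul": "float", "add": "float"}
    print(map_instructions(graph, kinds, ["ld_a", "ld_b"], rows=2, cols=2))
```

Because load is tracked per instruction kind, a fixed-point and a floating-point instruction can share a position without penalizing each other, which mirrors the abstract's point about handling the two kinds separately. In a pre-order traversal an instruction may be placed before some of its predecessors, so the distance term only counts predecessors that already have a position.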

English abstract:

Traditional dataflow architectures hide the latency of waiting for operands through multiple contexts, but multiple contexts only partly improve the utilization of the function units in the processing elements. For typical scientific applications such as stencil, FFT, and matrix multiplication, these architectures remain inefficient because function-unit utilization stays low. The kernels of scientific applications usually perform the same operations on different data, and because the data are usually independent of each other, those operations can be performed in parallel. Traditional dataflow architectures are general purpose: they usually implement the same operations on different data as loops whose iterations execute in sequence. As a result, they do not exploit the parallelism of scientific applications. Dataflow computing is well suited to scientific applications, but traditional dataflow architectures do not reconcile the dataflow computing model with the features of those applications. Based on the computing features of scientific applications, this paper proposes an optimization of dataflow architectures for scientific applications: loop-in-pipeline optimization. The optimization takes advantage of the blocking and parallelism features of scientific applications and improves the context control logic of traditional dataflow architectures. It implements the loops of scientific applications in hardware and pipelines context switching, so that the loop-in-pipeline dataflow architecture streams contexts into the processing element (PE) array in a pipelined manner to improve the utilization of function units. However, traditional dataflow instruction mapping algorithms are not suited to loop-in-pipeline architectures. Based on the features of the loop-in-pipeline architectures, we propose a novel instruction mapping algorithm. LBC (load-ba

Projects associated with this journal paper

面向功能ECO的不等价逻辑抽取方法研究

Journal papers: 14; conference papers: 8

超并行高效能计算机体系结构与设计方法研究

Journal papers: 124; conference papers: 114

数据并行与线程并行合一的可伸缩处理器体系结构

Journal papers: 10

Journal papers from the same projects

Automatic Test Program Generation Using Executing-Trace-Based Constraint Extraction for Embedded Pro

Layout-Oblivious Compiler Optimization for Matrix Computations

Test Path Selection for Capturing Delay Failures Under Statistical Timing Model

全局图像特征分析与实时层次化消失点检测

采用旋转匹配的二进制局部描述子

A Two-tiered On-Demand Resource Allocation Mechanism for VM-Based Data Centers

RevivePath: Resilient Network-on-Chip Design Through Data Path Salvaging of Router

An Efficient Parallel Mechanism for Highly-Debuggable Multicore Simulator

基于传播引擎的指针引用错误检测

Leveraging the Error Resilience of Neural Networks for Designing Highly Energy Efficient Accelerator

Architecture Support for Task Out-of-order Execution in MPSoCs

Lifetime enhancement techniques for PCM-based image buffer in multimedia applications

Orchestrator: Guarding Against Voltage Emergencies in Multi-threaded Applications

基于弱隔离性的事务内存冲突分析

ZoneDefense: A Fault-Tolerant Routing for 2-D Meshes Without Virtual Channels

HMTT: A hybrid hardware/software tracing system for bridging the DRAM access trace’s semantic gap

BPM/BPM+: Software-based Dynamic Memory Partitioning Mechanisms for Mitigating DRAM Bank-/Channel-le

Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach

SmartCap: Using Machine Learning for Power Adaptation of Smartphone’s Application Processor

MIMS: Towards a Message Interface Based Memory System

Oblivious Integral Routing for Minimizing the Quadratic Polynomial Cost

A marker-free automatic alignment method based on scale-invariant features

MALK——面向共享存储多核系统高效处理大规模键值的MapReduce框架

多核系统共享内存资源分配和管理研究

A General-Purpose Many-Accelerator Architecture Based on Dataflow Graph Clustering of Applications

二维Mesh结构的片上网络中利用全局信息的路由算法

计算机系统模拟器研究综述

基于MIPS的异构内存虚拟化方法研究

基于Cache锁和直接缓存访问的网络处理优化方法

基于可行序的数据竞争检测

高可靠处理器微体系结构设计空间的快速搜索

编译队列监视下的Size-Speed动态编译调度

龙芯GS464E 处理器核架构设计

龙芯UNCACHE加速原理及其在系统图形性能优化中的应用

Online Updates on Data Warehouses via Judicious Use of Solid-State Storage

i2MapReduce: Incremental MapReduce for Mining Evolving Big Data

A signal degradation reduction method for memristor ratioed logic (MRL) gates

RISO: Enforce Noninterfered Performance With Relaxed Network-on-Chip Isolation in

A High-Performance and Cost-Efficient Interconnection Network for High-Density Servers

Nimble：一种适用于OpenFlow网络的快速流调度策略

Test-Quality Optimization for Variable n-Detections of Transition Faults

Economizing TSV resources in 3D Network-on-Chip design

Data Remapping for Static NUCA in Degradable Chip Multiprocessors

Reliability-Oriented Placement and Routing Algorithm for SRAM-Based FPGAs

多核程序交互理论及应用

异构平台上性能自适应FFT框架

Automatic tuning of sparse matrix-vector multiplication on multicore clusters

GreenDCN: A General Framework for Achieving Energy Efficiency in Data Center Networks

Joint virtual machine assignment and traffic engineering for green data center networks

面向低能耗的非精确异构多核上的运行时技术

基于顶点加权的介度中心近似算法研究

Memory Efficient Two-Pass 3D FFT Algorithm for Intel Xeon Phi Coprocessor

异构并行编程模型研究与进展

基于FPGA模拟片上多核处理器的新方法

基于神经网络预测模型的异构多核处理器调度

NUMA结构的高效实时稳定的垃圾回收算法

龙芯指令系统融合技术

FreeRider: Non-local Adaptive Network-on-Chip Routing with Packet-Carried Propagation of Congestion

BPM/BPM+: Software-based Memory Partitioning Mechanisms for Eliminating DRAM Bank-/Channel-level Int

基于扫描链的可编程片上调试系统

延迟存储：一种降低虚拟机退出开销的方法

片上多核处理器的区域共享的双粒度目录

An Elastic Architecture Adaptable to Various Application Scenarios

Prevention from Soft Errors via Architecture Elasticity

面向非规则三维片上网络的自适应可靠路由方法

编译队列监视下的Size-Speed动态编译调度算法

基于NUMA架构的解释器访存优化设计与实现

数据触发的基本块间弹性控制电路综合方法

基于资源配置等效性的数据中心能耗优化

基于软硬件协同设计的解释器指令分派方法

时分复用片上网络的设计与优化

Pragma Directed Shared Memory Centric Optimizations on GPUs

一种无目录的共享高速缓存一致性协议

移动设备应用程序的体系结构特征分析

一种基于Trace精度改进的内存系统模拟器优化方法

HDAS：异构集群上Hadoop＋框架中的动态亲和性调度

Parallel Incremental Frequent Itemset Mining for Large Data

影响非易失性内存系统性能的因素分析

二进制翻译系统中信号处理机制的研究

A survey of neural network accelerators

BDSim:面向大数据应用的组件化高可配并行模拟框架

Journal of Visual Communication and Image Representation

MACT：高通量众核处理器离散访存请求批处理机制

On-Chip Generating FPGA Test Configuration Bitstreams to Reduce Manufacturing Test Time

一款用于多媒体设计的异构多核系统芯片的可测试性设计

EOFDM：一种面向众核架构的最低能耗搜索方法

Corrigendum to “Fast and scalable lock methods for video coding on many-core architecture” [J. Vis.

VMM中Guest OS非陷入系统调用指令截获与识别

基于全局同步逻辑时间的访存依赖约减方法

基于数据流块的空间指令调度方法

An Efficient Network-on-Chip Router for Dataflow Architecture

LFF：一种面向大数据应用的众核处理器访存公平性调度机制

面向门级网表的VLSI三模冗余加固设计

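Journal information

《计算机学报》 (Chinese Journal of Computers)
Peking University Core Journal (2011 edition)

Supervising organization: Chinese Academy of Sciences
Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
Editor-in-chief: 孙凝晖
Address: 6 Kexueyuan South Road, Zhongguancun, Beijing
Postal code: 100190
Email: cjc@ict.ac.cn
Phone: 010-62620695

International standard serial number: ISSN 0254-4164
Domestic serial number: CN 11-1826/TP
Postal distribution code: 2-833

Awards:
"Double Benefit" journal of the China Periodical Matrix (中国期刊方阵“双效”期刊)

Indexed in:
Mathematical Reviews (online edition, USA); Abstract and Citation Database (Scopus, Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); China Science and Technology Core Journals (China); Peking University Core Journals (China), 2000, 2004, 2008, 2011, and 2014 editions

Times cited: 48433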