Research on an FPGA-Based High-Precision Scientific Computing Accelerator

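ISSN: 0254-4164
Journal: 计算机学报 (Chinese Journal of Computers)
Year: 2012
Pages: 112-120
Classification: TP302 [Automation and Computer Technology: Computer System Architecture; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science, National University of Defense Technology, Changsha 410073
Funding: National "863" High-Tech Research and Development Program of China (2008AA01A201); Key Program of the National Natural Science Foundation of China (60833004, 61125201)
Related project: Research on High-Performance Reconfigurable Algorithm Accelerator Architecture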

Keywords: quadruple precision floating-point arithmetic, LU decomposition, MGS-QR decomposition, FPGA, hardware accelerator, exascale computation

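Chinese abstract (translated):

This paper explores the capability and flexibility of FPGA platforms for accelerating high-precision scientific computing applications. First, it studies the vector inner product, the most common operation in scientific computing, and proposes an exact vector inner product algorithm based on fixed-point operations. Taking the quadruple-precision floating-point arithmetic of the IEEE 754-2008 standard as an example, a fully pipelined quadruple-precision multiply-accumulate unit (QPMAC) based on a fully unrolled method is designed on an FPGA platform: a two-level storage strategy stores the multiply-accumulate sum exactly; a carry-save accumulation strategy reduces the width of the fixed-point adder, simplifies carry handling, and shortens the critical path; and an accumulated-sum partitioning strategy is introduced to achieve pipelined throughput. Finally, a prototype accelerator for LU decomposition and MGS-QR decomposition is built on an XC5VLX330 FPGA chip to verify the performance of the QPMAC. Experimental results show that, compared with an OpenMP-based parallel algorithm running on an Intel quad-core processor, the accelerator integrating four QPMAC units achieves a 42x to 97x performance improvement, with higher result precision and lower energy consumption.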

English abstract:

In this paper we explore the capability and flexibility of FPGA solutions for accelerating high-precision scientific computing applications. First, we study the inner product operation, which occurs in almost all scientific and engineering applications, and propose an exact inner product algorithm based on exact long fixed-point operations. Taking IEEE 754-2008 quadruple-precision floating-point arithmetic as an example, we implement a fully pipelined Quadruple Precision Multiplication and Accumulation (QPMAC) unit on FPGA devices. We propose a two-level RAM bank scheme to store the exact fixed-point result, and use a carry-save accumulator scheme to minimize the width of the fixed-point adder and simplify the carry-resolution logic. We also introduce a partial-summation scheme that enhances the pipeline throughput of MAC operations by dividing the summation into 4 partial operations processed in 4 banks. To prove the concept, we prototype four QPMAC units on an XC5VLX330 FPGA chip and perform LU decomposition and MGS-QR decomposition. The experimental results show that our FPGA-based implementations achieve 42X-97X better performance, more precise results, and much lower power consumption than a parallel software approach based on OpenMP running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
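
As a rough illustration of the exact inner product idea described in the abstract, the Python sketch below accumulates every product in a long fixed-point integer and rounds only once at the end. It is not the paper's QPMAC design: it works on double-precision inputs rather than quadruple precision, Python integers stand in for the RAM-backed fixed-point accumulator, and the round-robin assignment of products to four partial sums is an assumption made purely for illustration. The names `exact_dot`, `naive_dot`, `product_to_fixed`, `SCALE_BITS`, and `NUM_BANKS` are hypothetical.

```python
from fractions import Fraction

# Every product of two finite doubles is an integer multiple of 2**-2148
# (the smallest subnormal double is 2**-1074), so this many fractional
# bits are enough to hold any such product exactly.
SCALE_BITS = 2 * 1074
NUM_BANKS = 4  # four partial sums; the round-robin mapping is an assumption


def product_to_fixed(a: float, b: float) -> int:
    """Return a*b exactly, as an integer scaled by 2**SCALE_BITS.

    Both inputs must be finite doubles.
    """
    na, da = a.as_integer_ratio()  # a == na / da, da is a power of two
    nb, db = b.as_integer_ratio()
    # log2 of a power of two is bit_length() - 1
    shift = SCALE_BITS - (da.bit_length() - 1) - (db.bit_length() - 1)
    return (na * nb) << shift


def exact_dot(xs, ys) -> float:
    """Dot product with exact accumulation and a single final rounding."""
    banks = [0] * NUM_BANKS
    for i, (a, b) in enumerate(zip(xs, ys)):
        banks[i % NUM_BANKS] += product_to_fixed(a, b)  # integer add, no rounding
    total = sum(banks)  # integer sum of the partial sums, still exact
    return float(Fraction(total, 1 << SCALE_BITS))  # the only rounding step


def naive_dot(xs, ys) -> float:
    """Ordinary floating-point dot product, rounding after every operation."""
    acc = 0.0
    for a, b in zip(xs, ys):
        acc += a * b
    return acc


if __name__ == "__main__":
    xs = [1e16, 1.0, -1e16]
    ys = [1.0, 1.0, 1.0]
    print(naive_dot(xs, ys))  # 0.0: the 1.0 is lost to rounding
    print(exact_dot(xs, ys))  # 1.0
```

Because integer addition is associative, splitting the accumulation across several partial sums does not change the result, which is what makes this kind of partitioning safe in a fixed-point accumulator and unsafe in a rounding floating-point one.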

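Projects associated with this journal paper

Research on High-Performance Reconfigurable Algorithm Accelerator Architecture

Journal papers: 45; Conference papers: 17

Research on Key Technologies for Cache-Coherent Networks-on-Chip

Journal papers: 75; Conference papers: 63; Patents: 12

Journal papers from the same project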

Reliability provision mechanism for large-scale de-duplication storage systems

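A fast placement algorithm for hierarchical FPGAs

A fine-classification view-independent face detection method and its hardware acceleration architecture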

A High Performance and Memory Efficient LU Decomposer on FPGAs

A bio-inspired fault-tolerance method for the Sobel operator

Timing-driven placement via preassigning cell

Mix storage architecture for block level continuous data protection

Vectorization of an H.264 decoder on the Loongson 3B

A fast routability-driven router for hierarchical FPGA

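A fine-grained parallel algorithm for Cholesky decomposition

Cache partitioning for the memory-access characteristics of sparse matrices

Optimization of the Level 2 BLAS library on the multi-core Loongson 3A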

Markov Clustering-Based Placement Algorithm for Hierarchical FPGAs

A Novel Cache Replacement Policy via Dynamic Adaptive Insertion and Re-Reference Prediction

A strictly non-blocking Clos-type WDM optical permutation network

Multi-core optimization for conjugate gradient benchmark on heterogeneous processors

Design and analysis of 2-Omega, a new conference network

Architecture and Circuit Optimization of Hardwired Integer Motion Estimation Engine for H.264/AVC

A fine-classification method and its hardware acceleration architecture for rotation invariant multi

Fine-grained parallel RNA secondary structure prediction using SCFGs on FPGA

FPGA design and implementation of QR decomposition for large matrices

Fast placement algorithm for hierarchical FPGAs

A nearly non-blocking three-stage Clos permutation network

FPGA accelerator for protein secondary structure prediction based on the GOR algorithm

A compiler optimization method for multi-fold data supply

Design and implementation of matrix multiplication on GPUs

An analytical placement technique for large-scale FPGAs

FPGA-specific custom VLIW architecture for arbitrary precision floating-point arithmetic

A fast routability-driven routing algorithm for hierarchical FPGAs

Peak Temperature Reduction by Physical Information Driven Behavioral Synthesis with Resource Usage A

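A centralized file-level continuous data protection method with delta storage

A timing-driven placement optimization method with preassigned cell positions

A wirelength-driven placement algorithm for hierarchical FPGAs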

Hierarchical Cache Directory for CMP

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors

A Unified Co-Processor Architecture for Matrix Decomposition

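A graph-model-based legalization algorithm for macro blocks

An analytical placement algorithm for large-scale FPGAs

Implementation of a global placement algorithm based on FastPlace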

Design and Implementation of the Parameterized Multi-Standard High-Throughput Radix-4 Viterbi Decode

High precision scientific computation accumulator on FPGA

MCC: A message and command correlation method for identifying new interactive protocols via session

Optimization of LAPACK functions on the Loongson 3A

Efficient implementation of FFT on the Loongson 3A processor

SAM: A fault-tolerant scalable address mapping method in last-level cache

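BLAS library optimization for the Loongson 3A architecture

A runtime program verification framework based on instrumentation and Boolean logic

Optimization of the Linux kernel load-balancing algorithm for CMT architectures

Dynamic lock cache optimization in the Java virtual machine

SIMD compiler optimization and analysis for the Loongson 3B

A bio-inspired distributed ordering method for reconfigurable multicellular arrays

Fast epoch reclamation: a fast garbage collection algorithm for lock-free programming

A hierarchical cache optimization design based on data access characteristics

A parallel web page parsing algorithm

SAM: a fault-tolerant scalable address mapping method for the last-level cache

A memory-efficient multi-pattern matching algorithm and architecture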

Efficient Hierarchical Algorithm for Mixed Mode Placement in Three Dimensional Integrated Circuit Chip Designs

Partition-Based Global Placement Considering Wire-Density Uniformity for CMP Variations

The Case of Using Multiple Streams in Streaming

MCC: A Message and Command Correlation Method for Identifying New Interactive Protocols via Session Analyses

Implementing quadruple-precision floating-point elementary functions with a custom VLIW architecture

The unified accelerator architecture for RNA secondary structure prediction on FPGA

A Bio-Inspired Fault-tolerant Hardware System Supporting Hierarchical Self-healing

CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications

Efficient parallel implementation of three-point viterbi decoding algorithm on CPU, GPU, and FPGA

Design and Implement of High Performance Crypto Coprocessor

Optimization schemes and performance evaluation of Smith-Waterman algorithm on CPU, GPU and FPGA

High-performance architecture for the conjugate gradient solver on FPGAs

From WiFi to WiMax: Efficient GPU-based parameterized transceiver across different OFDM protocols

FPGA implementation of an exact dot product and its application in variable-precision floating-point

Efficient Parallel Implementation of Three-point Viterbi Decoding Algorithm on CPU, GPU and FPGA

High performance sparse matrix-vector multiplication on FPGA

Parallel graph traversal for FPGA

An Efficient Parallel SOVA-Based Turbo Decoder

Transpose-free variable-size FFT accelerator based on-chip SRAM

FPGA Implementation of a Special-Purpose VLIW Structure for Double-Precision Elementary Function

TorusBFS: A Novel Message-passing Parallel Breadth-First Search Architecture on FPGAs

DLPF: a parallel deep learning programming framework based on heterogeneous architectures

Efficient graphics processing unit based layered decoders for quasicyclic low-density parity-check c

Urban Land Use and Land Cover Classification Using Remotely Sensed SAR Data through Deep Belief Netw

Affine-Transformation Parameters Regression for Face Alignment

An Efficient and Effective Convolutional Auto-Encoder Extreme Learning Machine Network for 3D Featur

Electronic tissue: a reconfigurable bio-inspired hardware structure with adaptive capability

Efficient Parallel Interference Cancellation MIMO Detector for Software Defined Radio on GPUs

An Efficient Parallel SOVA-Based Turbo Decoder for Software Defined Radio on GPU

Remote sensing image classification based on DBN model

PR-ELM: Parallel regularized extreme learning machine based on cluster.

An Efficient Robust Eye Localization by Learning the Convolution Distribution Using Eye Template

Remote sensing image classification based on the DBN model

VLIW coprocessor for IEEE-754 quadruple-precision elementary functions

CuSora: Real-time software radio using multi-core graphics processing unit

3D shape feature learning based on a convolutional auto-encoder

Instance-Specific Algorithm Selection via Multi-Output Learning

FPGA implementation of sparse matrix LU decomposition

GPU-based sparse matrix Cholesky decomposition

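Journal information

《计算机学报》 (Chinese Journal of Computers)
Peking University Core Journal (2011 edition)

Supervising body: Chinese Academy of Sciences
Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
Editor-in-chief: 孙凝晖 (Sun Ninghui)
Address: No. 6 Kexueyuan South Road, Zhongguancun, Beijing
Postal code: 100190
Email: cjc@ict.ac.cn
Phone: 010-62620695

International standard serial number: ISSN 0254-4164
Domestic unified serial number: CN 11-1826/TP
Postal distribution code: 2-833

Awards:
"Double-Effect" journal of the China Periodical Matrix (中国期刊方阵)

Indexed in (domestic and international databases):
Mathematical Reviews, online edition (USA); abstract and citation database (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); China Science and Technology Core Journals (China); Peking University Core Journals, 2000, 2004, 2008, 2011, and 2014 editions (China)

Citations: 48433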