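This paper explores the capability and flexibility of FPGA platforms for accelerating high-precision scientific computing applications. First, we study the vector inner product, the most commonly used operation in scientific computing, and propose an exact vector inner product algorithm based on fixed-point operations. Taking the quadruple-precision floating-point arithmetic of the IEEE 754-2008 standard as an example, we design on an FPGA platform a fully pipelined quadruple-precision floating-point multiply-accumulate unit (QPMAC) based on a fully unrolled approach: a two-level storage strategy stores the multiply-accumulate sum exactly; a carry-save accumulation strategy reduces the width of the fixed-point adder, simplifies carry handling, and optimizes the critical path; and an accumulated-sum partitioning strategy sustains pipeline throughput. Finally, we design a prototype accelerator for LU decomposition and MGS-QR decomposition on an XC5VLX330 FPGA chip to verify the performance of the QPMAC. Experimental results show that, compared with OpenMP-based parallel algorithms running on an Intel quad-core processor, an accelerator integrating four QPMAC units achieves a 42x to 97x performance improvement, together with higher result accuracy and lower energy consumption.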
In this paper, we explore the capability and flexibility of FPGA solutions for accelerating high-precision scientific computing applications. First, we study the inner product operation, which occurs in almost all scientific and engineering applications, and propose an exact inner product algorithm based on exact long fixed-point operations. Taking IEEE 754-2008 quadruple-precision floating-point arithmetic as an example, we implement a fully pipelined Quadruple Precision Multiplication and Accumulation (QPMAC) unit on FPGA devices. We propose a two-level RAM bank scheme to store the exact fixed-point result, and use a carry-save accumulation scheme to minimize the width of the fixed-point adder and simplify the carry-resolution logic. We also introduce a partial summation scheme that enhances the pipeline throughput of MAC operations by dividing the summation into 4 partial operations processed in 4 banks. As a proof of concept, we prototype four QPMAC units on an XC5VLX330 FPGA chip and perform LU decomposition and MGS-QR decomposition. The experimental results show that our FPGA-based implementations achieve 42X-97X better performance, more precise results, and much lower power consumption than a parallel software approach based on OpenMP running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
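The idea of an exact inner product over a long fixed-point accumulator can be illustrated with a minimal software sketch. It is not the QPMAC hardware design: it assumes binary64 (double-precision) operands instead of quadruple precision, uses Python's arbitrary-precision integers in place of the RAM-bank accumulator, and assigns products to the 4 partial sums round-robin, which is an assumption of this sketch. The names `FRAC_BITS`, `exact_product_fixed`, `exact_dot`, and `rounded_dot` are hypothetical.

```python
from fractions import Fraction
from typing import Sequence

# Assumption: binary64 inputs stand in for quadruple-precision operands.
# A finite double is n / 2**k with k <= 1074, so a product of two doubles
# needs at most 2 * 1074 fractional bits to be represented exactly.
FRAC_BITS = 2 * 1074


def exact_product_fixed(a: float, b: float) -> int:
    """Return a * b exactly, as an integer scaled by 2**FRAC_BITS."""
    na, da = a.as_integer_ratio()  # da is a power of two
    nb, db = b.as_integer_ratio()
    shift = FRAC_BITS - (da.bit_length() - 1) - (db.bit_length() - 1)
    return (na * nb) << shift


def exact_dot(xs: Sequence[float], ys: Sequence[float], banks: int = 4) -> Fraction:
    """Accumulate all products without rounding, using `banks` partial sums."""
    partial = [0] * banks
    for i, (a, b) in enumerate(zip(xs, ys)):
        # Round-robin bank assignment is an illustrative choice only.
        partial[i % banks] += exact_product_fixed(a, b)
    return Fraction(sum(partial), 1 << FRAC_BITS)


def rounded_dot(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Round the exact inner product to a double once, at the very end."""
    return float(exact_dot(xs, ys))


if __name__ == "__main__":
    xs = [1e20, 1.0, -1e20]
    ys = [1.0, 1.0, 1.0]
    print(sum(a * b for a, b in zip(xs, ys)))  # naive floating point: 0.0
    print(rounded_dot(xs, ys))                 # exact accumulation: 1.0
```

Because every product is held exactly and rounding happens only once at the end, the result does not depend on summation order. This is also why splitting the sum into independent partial sums leaves the final value unchanged.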