Design and Implementation of QGEMM for ARMv8 64-bit Multi-Core Processors
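  • ISSN: 0254-4164
  • Journal: Chinese Journal of Computers (《计算机学报》)
  • Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
  • Author affiliations: [1] College of Computer, National University of Defense Technology, Changsha 410073; [2] College of Information Science and Engineering, Hunan University, Changsha 410082; [3] Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073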
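  • Funding: This work was supported by the National High Technology Research and Development Program of China ("863" Program) (2012AA01A301) and the National Natural Science Foundation of China (61402495, 61303189, 61602166, 61170049, 61402496).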
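Abstract (translated from Chinese):

This paper presents the first design, implementation, and optimization of quadruple-precision general matrix-matrix multiplication (QGEMM) on ARMv8 64-bit multi-core processors, based on OpenBLAS. Because floating-point computation inevitably introduces rounding errors, double-precision matrix multiplication (DGEMM) cannot deliver satisfactory numerical results in some cases, so higher-precision or multiple-precision algorithms are needed for more accurate computation. Double-double arithmetic is a relatively effective and widely used approach. The paper uses the double-double data format to build a structure that stores quadruple-precision floating-point data. Building on the blocking algorithm for dense matrix computation in OpenBLAS, it adds the header and source files for the quadruple-precision data format and writes the core kernel of the proposed QGEMM in assembly. Using error-free transformation techniques, it adjusts and optimizes the algorithm flow inside the kernel to avoid the forced data dependencies caused by the renormalization step. By analyzing the algorithm's data dependencies, it designs the register allocation and rotation strategy and optimizes the instruction scheduling order, exploiting instruction-level parallelism to improve the actual performance of QGEMM. Because different algorithms use fused multiply-add (FMA) instructions to different degrees, the paper adopts the concept of an algorithm's theoretical peak performance, which differs from the machine's theoretical peak and better evaluates the actual efficiency of the proposed QGEMM. Numerical experiments show that the QGEMM implemented and optimized in assembly reaches up to 19.7 Gflops, an efficiency of 82.1% of the algorithm's theoretical peak performance for QGEMM on the ARMv8 64-bit multi-core platform. While meeting the accuracy requirements for the numerical results, it is about 5.8 times as fast as the unoptimized QGEMM written in C and the QGEMM in MBLAS, and 24 times as fast as the QGEMM that uses the long double data format as implemented by the GCC compiler. The experiments also show that the proposed QGEMM has good thread scalability for matrices of different sizes.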
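
The double-double structure and the error-free transformations mentioned in the abstract can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the paper's assembly kernel: the names `dd_t`, `two_sum`, `two_prod`, `dd_mul_add`, and `dd_example` are hypothetical, and the multiply-add shown here is a common "sloppy" double-double formulation that renormalizes once at the end, which may differ from the flow the authors actually use.

```c
#include <assert.h>
#include <math.h>   /* FP_FAST_FMA and fma() */

/* Hypothetical double-double value: represents hi + lo, with |lo| far smaller than |hi|. */
typedef struct { double hi, lo; } dd_t;

/* TwoSum: s = fl(a + b) and s + e == a + b exactly, for any magnitudes of a and b. */
static inline dd_t two_sum(double a, double b) {
    double s = a + b;
    double bb = s - a;
    double e = (a - (s - bb)) + (b - bb);
    return (dd_t){ s, e };
}

/* TwoProd: p = fl(a * b) and p + e == a * b exactly (barring overflow/underflow). */
static inline dd_t two_prod(double a, double b) {
    double p = a * b;
#ifdef FP_FAST_FMA
    /* With a hardware FMA, the rounding error of the product costs one instruction. */
    double e = fma(a, b, -p);
#else
    /* Fallback without FMA: Veltkamp/Dekker splitting into 26-bit halves. */
    const double splitter = 134217729.0;   /* 2^27 + 1 */
    double t = splitter * a;
    double ah = t - (t - a), al = a - ah;
    t = splitter * b;
    double bh = t - (t - b), bl = b - bh;
    double e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
#endif
    return (dd_t){ p, e };
}

/* Returns approximately c + a * b in double-double.
   The product is left unnormalized and folded into the sum, so only one
   renormalization (the final two_sum) sits on the dependency chain. */
static inline dd_t dd_mul_add(dd_t a, dd_t b, dd_t c) {
    dd_t p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;   /* cross terms; a.lo * b.lo is dropped */
    dd_t s = two_sum(c.hi, p.hi);
    s.lo += c.lo + p.lo;
    return two_sum(s.hi, s.lo);          /* renormalize once */
}

/* Example: (1 + 2^-60) * 1 - 1. In plain double, 1 + 2^-60 rounds to 1,
   so the same computation would return 0. */
static inline dd_t dd_example(void) {
    dd_t a = { 1.0, 0x1p-60 };
    dd_t b = { 1.0, 0.0 };
    dd_t c = { -1.0, 0.0 };
    dd_t r = dd_mul_add(a, b, c);
    assert(r.hi == 0x1p-60 && r.lo == 0.0);
    return r;
}
```

A GEMM inner kernel would apply an update like `dd_mul_add` to each element of a small register-resident block of C. How many of its operations map onto FMA instructions is one reason an algorithm's theoretical peak can differ from the machine's.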

English abstract:

In this paper, we present the first design, implementation, and optimization of quadruple-precision matrix-matrix multiplication (QGEMM) based on OpenBLAS for ARMv8 64-bit multi-core processors. Double-precision matrix-matrix multiplication (DGEMM) sometimes fails to give results as accurate as expected because of round-off errors and cancellation, so higher or multiple precision is required. An efficient and widely used way to achieve quadruple precision is double-double arithmetic. Each element in the proposed QGEMM is stored as a structure consisting of two double-format floating-point numbers that together represent one double-double number. Building on the GEMM blocking algorithm of OpenBLAS, we implement QGEMM by adding header files, source files and, in particular, an inner kernel written in assembly. Using error-free transformations, we optimize the algorithm flow in the inner kernel to avoid the renormalization step where it is not necessary. By analyzing the data dependencies, we design the register rotation and instruction scheduling to exploit instruction-level parallelism. Because algorithms utilize fused multiply-add (FMA) instructions differently, we use the concept of an algorithm's theoretical peak performance, which differs from the machine's theoretical peak performance, to better evaluate the efficiency of QGEMM. Experimental results show that our QGEMM achieves up to 19.7 Gflops, an efficiency of 82.1% of the algorithm's theoretical peak performance on the ARMv8 64-bit multi-core processor. With similar accuracy, our QGEMM runs 5.8 times faster than the unoptimized QGEMM based on OpenBLAS and the QGEMM in MBLAS, both of which use double-double arithmetic and are written in C. Our QGEMM also runs 24 times faster than the QGEMM implementation that uses the long double format of the GCC compiler. The numerical tests also show that our QGEMM has good scalability under varying thread counts across a range of matrix sizes.
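
The efficiency figure relates measured performance to the algorithm's theoretical peak rather than to the machine's. Taking the two reported numbers at face value, the implied algorithm peak can be backed out as a rough check. The symbols $\eta$, $P_{\text{measured}}$, and $P_{\text{alg}}$ are notation introduced here for illustration, not taken from the paper:

```latex
\eta = \frac{P_{\text{measured}}}{P_{\text{alg}}}
\quad\Longrightarrow\quad
P_{\text{alg}} = \frac{P_{\text{measured}}}{\eta}
\approx \frac{19.7\ \text{Gflops}}{0.821}
\approx 24.0\ \text{Gflops}
```

The machine's theoretical peak is not stated in the abstract, so no efficiency relative to it is derived here.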

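Journal information
  • Chinese Journal of Computers (《计算机学报》)
  • Peking University Core Journal (2011 edition)
  • Supervising body: Chinese Academy of Sciences
  • Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
  • Editor-in-chief: Sun Ninghui
  • Address: No. 6 Kexueyuan South Road, Zhongguancun, Beijing
  • Postal code: 100190
  • Email: cjc@ict.ac.cn
  • Telephone: 010-62620695
  • International Standard Serial Number: ISSN 0254-4164
  • China domestic serial number: CN 11-1826/TP
  • Postal distribution code: 2-833
  • Awards:
  • "Dual-Benefit" journal of the China Periodical Matrix (中国期刊方阵)
  • Indexed in domestic and international databases:
  • Mathematical Reviews, online edition (USA); Scopus abstract and citation database (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); China Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)
  • Citation count: 48433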