Double-precision general matrix multiplication (DGEMM) is one of the core functions of the BLAS library; most level-3 BLAS routines perform their core computation by calling DGEMM. This paper optimizes DGEMM for the Loongson-3A in three ways. First, to exploit the Loongson-3A's 128-bit memory access instructions, theoretical analysis is used to determine the best loop unrolling scheme. Second, because the Loongson-3A cache uses a random replacement policy, address interleaving is applied to reduce cache conflict misses. Third, because the memory bandwidth of the Loongson-3A is limited, tasks are partitioned so that threads share data, which reduces the volume of memory accesses. In both single-core and multi-core runs, the optimized DGEMM is more than twice as fast as Goto-BLAS, the highest-performing open-source BLAS library.