Based on the characteristics of the Loongson 3A architecture and of the Level 2 BLAS routines, this paper extracts parallelism at the instruction level, the memory level, and the thread level, summarizes a set of suitable optimization methods, and analyzes them quantitatively. Experiments show that these optimizations improve the single-threaded performance of Level 2 BLAS functions by more than 20% and yield a speedup of about 2.5 with multiple threads. These results should be of some help to future system software optimization on multi-core Loongson processors.