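The general-purpose mathematical library PETSc (Portable, Extensible Toolkit for Scientific Computation) is a foundational module of high-performance computing and one of the basic algorithm libraries in supercomputer computing environments; its performance directly affects the efficiency of the high-performance numerical applications that call it. Targeting Sunway TaihuLight, the world's first 100-petaflops heterogeneous supercomputer, we select two typical PETSc examples according to actual research needs, ex5 (solving a linear system on a single node) and ex19 (solving the 2D driven cavity problem on multiple nodes), for experimental study. Analysis of the runs shows that the hotspots are mainly seven core functions of the PETSc library. For these seven core functions (mainly vector and matrix operations), we propose and implement heterogeneous parallel algorithms, and we propose corresponding performance optimizations based on the machine's heterogeneous architecture. Experiments on the supercomputer show that the parallel core functions achieve a speedup of up to 16.4 on a single node with 4 master cores and 256 slave cores. In the multi-node case, with an input size of 16384, 8192 nodes achieve a speedup of 32 relative to 256 nodes, and the speedup grows nearly linearly with the number of heterogeneous processors, indicating that the parallel algorithms for the PETSc core functions scale well on Sunway TaihuLight.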
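The multi-node figures quoted above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the helper name `relative_scaling` is hypothetical, and it assumes the usual definition of relative parallel efficiency (measured speedup divided by the ratio of node counts). It takes the node counts and the speedup of 32 from the abstract and is not part of the authors' implementation.

```python
def relative_scaling(base_nodes, scaled_nodes, measured_speedup):
    """Return (ideal_speedup, efficiency) for a run scaled from
    base_nodes to scaled_nodes.

    ideal_speedup is the node-count ratio; efficiency is the measured
    speedup divided by that ratio (1.0 means perfectly linear scaling).
    """
    ideal_speedup = scaled_nodes / base_nodes
    efficiency = measured_speedup / ideal_speedup
    return ideal_speedup, efficiency


# Figures from the abstract: 8192 nodes versus a 256-node baseline,
# with a reported relative speedup of 32.
ideal, eff = relative_scaling(256, 8192, 32.0)
print(f"ideal speedup: {ideal:.1f}, relative efficiency: {eff:.2f}")
```

Under this definition the reported speedup equals the node-count ratio, which is consistent with the near-linear scaling claimed for the multi-node runs.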
Large-scale scientific and engineering computations, such as hydrodynamic simulation, numerical weather forecasting, seismic data processing, genetic engineering, and the solution of high-dimensional differential equations, face significant performance challenges. Meanwhile, High Performance Computing (HPC) platforms have developed rapidly in recent years. The emergence of multi-core processors and heterogeneous computing platforms has dramatically improved the performance of high-performance applications. To fully utilize the computing power of HPC systems, it is necessary to develop specific methodologies that optimize application performance for the underlying system architecture. The Sunway TaihuLight supercomputer is presently ranked in the TOP500 list as the fastest supercomputer in the world, with a LINPACK benchmark rating of 93 petaflops. It uses a total of 40,960 Chinese-designed SW26010 many-core 64-bit RISC processors. The Portable, Extensible Toolkit for Scientific Computation (PETSc), an indispensable module of high-performance computing, is one of the basic algorithm libraries used by many high-performance applications, and it is widely applied to partial differential equations, sparse linear algebra, and related problems. The performance of PETSc therefore directly affects the efficiency of the applications that invoke it. In this paper, we select two typical PETSc examples according to actual research needs, ex5 (solving a linear system on a single node) and ex19 (solving the 2D driven cavity problem on multiple nodes), and run them on the Sunway TaihuLight supercomputer. By analyzing the experimental results, we identify seven core functions, covering vector and matrix operations, as the hotspots. First, for each core function, we study in depth its characteristics, the difficulties in parallelizing it, and optimizations for its bottlenecks. 
And then, we determine an appropriate heterogeneous parallel model for these functi