Existing GPU-accelerated implementations of the Smoothed Particle Hydrodynamics (SPH) method are almost exclusively based on the simplified Euler equations; GPU implementations of the complete Navier-Stokes equations are rare, and their difficulties, optimization strategies, and achievable speedups are only vaguely described in the literature. Moreover, the CPU-GPU cooperation scheme strongly affects the overall efficiency of a heterogeneous platform, so GPU acceleration models deserve further study; in particular, which acceleration model is the most efficient for a real application code, especially one based on the Navier-Stokes equations. The goal of this paper is to efficiently accelerate petaPar, our in-house SPH code based on the Navier-Stokes equations, on heterogeneous platforms. We first analyze the computing features of the Euler and Navier-Stokes equations from the perspective of their mathematical formulations and summarize the difficulties of accelerating the Navier-Stokes equations on the GPU. The Euler equations involve only simple scalar and vector calculations and yield a typical compute-intensive, lightweight kernel well suited to the GPU. The complete Navier-Stokes equations, by contrast, involve complicated constitutive models and a large amount of tensor computation, and therefore face the problems of big kernels on the GPU, such as heavy memory traffic, insufficient cache, low occupancy, and register spilling. We optimize the Navier-Stokes particle-interaction kernel by reducing particle properties, extracting operations from the interaction kernel into the update kernel, exploiting particle reusability, and maximizing GPU occupancy; the implementation is described in Section 5.1.
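To make the "extract operations into the update kernel" strategy concrete, the following is a minimal CUDA sketch under assumed names and data layout (a flattened 3x3 stress tensor per particle; none of this is petaPar's actual API). In the SPH momentum equation the pairwise term contains σ_i/ρ_i², which depends only on particle i, so it can be computed once per particle instead of once per neighbor pair:

```cuda
// Hypothetical per-particle precomputation kernel: O(N) work.
// Moving this out of the interaction kernel removes a division and a
// 9-component tensor scaling from every pairwise evaluation, shrinking
// the interaction kernel's register footprint and memory traffic.
__global__ void precomputeStressTerm(int n,
                                     const float* __restrict__ rho,
                                     const float* __restrict__ sigma,   // flattened 3x3 tensors
                                     float* __restrict__ sigmaOverRho2) // sigma_i / rho_i^2
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float inv = 1.0f / (rho[i] * rho[i]);
    for (int c = 0; c < 9; ++c)            // all 9 tensor components
        sigmaOverRho2[9 * i + c] = sigma[9 * i + c] * inv;
}
```

The interaction kernel then merely loads the precomputed term for each particle, which directly targets the register-spilling and occupancy symptoms of the big-kernel problem described above.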
Meanwhile, we investigate three GPU acceleration models: hot-spot acceleration (running only the hotspots on the GPU), full-GPU acceleration (finishing the whole computing process on the GPU), and peer-to-peer cooperation (treating the CPU and GPU as equivalent processors). The three models are analyzed in terms of development cost, application scope, and theoretical speedup, and the communication optimization strategies of the peer-to-peer model are addressed in detail. Because communication particles are distributed discontinuously, extracting, inserting, and deleting them on the GPU are in essence parallel operations over discontinuous memory, which severely hurt CPU-GPU synchronization; this problem has not been addressed in the literature. We solve it by improving the particle indexing rule: particles are ordered not only by cell index but also by cell type, as described in Section 5.2.3. The three acceleration models are implemented and analyzed for both the Euler and the Navier-Stokes equations. Test results show that under the three models the Euler equations achieve speedups of 8x, 33x, and 36x, and the Navier-Stokes equations achieve speedups of 6x, 15x, and 20x, respectively; in both cases full-GPU acceleration surpasses the theoretical speedup limit of hot-spot acceleration.
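As a minimal sketch of the improved indexing rule, the composite sort key below places the cell type in the high bits and the cell index in the low bits (the names, the 64-bit key layout, and the two-type cell classification are illustrative assumptions, not petaPar's actual implementation):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdint>

// Assumed two-way classification of cells; a real code may need more
// types, e.g. one border type per neighboring process.
enum CellType : uint32_t { INNER = 0, BORDER = 1 };

// Composite 64-bit sort key: cell type in the high bits, cell index in
// the low bits, so ordering by key groups particles first by type.
__host__ __device__ inline uint64_t makeKey(uint32_t type, uint32_t cell)
{
    return (static_cast<uint64_t>(type) << 32) | cell;
}

// Sorting by the composite key gathers all BORDER particles into one
// contiguous block, so extracting them for CPU-GPU exchange becomes a
// single linear copy instead of a scattered gather over discontinuous
// memory.
void reorderParticles(thrust::device_vector<uint64_t>& keys,
                      thrust::device_vector<int>&      particleIds)
{
    thrust::sort_by_key(keys.begin(), keys.end(), particleIds.begin());
}
```

After the sort, the offset and length of the BORDER block can be located with a binary search over the keys (e.g., thrust::lower_bound), after which one contiguous copy covers all communication particles.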