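The growing computational demands of scientific and engineering applications have driven the rapid development of heterogeneous computing. However, the lack of shared memory between the CPU and accelerators makes heterogeneous programming harder: programmers must explicitly specify how data moves between devices. The global arrays (GA) model, built on the aggregate remote memory copy interface (ARMCI), provides an asynchronous, one-sided-communication, shared-memory programming environment for distributed-memory systems, but the complexity of extending ARMCI prevents GA from being implemented quickly on a new platform in a way that exploits that platform's characteristics. CoGA is a heterogeneous extension of the GA model that aims to provide a global array structure for heterogeneous systems composed of CPUs and Intel Xeon Phi (MIC) coprocessors, hiding the details of data transfer and thereby simplifying heterogeneous programming. CoGA manages CPU and MIC memory through the symmetric communication interface (SCIF) on MIC, and uses the characteristics of SCIF remote memory access to optimize data transfer performance between the CPU and MIC. Finally, tests of data transfer bandwidth, communication latency, and sparse matrix multiplication demonstrate that CoGA is effective and practical in simplifying programming and optimizing data transfer performance.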
The increasing demand for computational performance has led to the rapid development of heterogeneous computing. However, heterogeneous programming is more complicated because there is no shared memory between the CPU and accelerators. Programmers must also distinguish between local and remote data accesses and explicitly transfer data between computing devices. Global arrays (GA) provide an asynchronous, one-sided, shared-memory programming environment for distributed-memory systems, but creating an efficient and scalable implementation of GA for a new system is challenging because of the complexity of the communication library inside GA. In this paper, we present CoGA, an extension of GA to heterogeneous systems consisting of CPUs and Intel many integrated core (MIC) coprocessors. CoGA, which is built on top of the symmetric communication interface (SCIF), provides a shared-memory abstraction between the CPU and MIC and simplifies programming by allowing programmers to access shared data regardless of where the referenced data is located. Furthermore, CoGA takes advantage of SCIF remote memory access to optimize data transmission performance between the CPU and MIC. Evaluations of data transmission bandwidth, communication latency, and a sparse matrix-vector multiplication problem show that CoGA is practical and effective.