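This paper studies the Compute Unified Device Architecture (CUDA). It analyzes the distinctive features of the CUDA architecture and its memory access mechanism, summarizes the memory access problems common in CUDA parallel programs, and proposes memory access optimization strategies for unaligned accesses to global memory and access conflicts in shared memory. Finally, the optimizations are tested with a histogram equalization algorithm, and program execution times before and after optimization are compared. The experimental results show that the optimizations greatly shorten the execution time of CUDA programs, and that the more pixels the image has, the greater the benefit.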
We analyze the distinctive features of CUDA (Compute Unified Device Architecture) and its memory access mechanism, summarize the typical memory access problems in CUDA parallel programs, and present optimization strategies for non-coalesced accesses to global memory and bank conflicts in shared memory. Using a histogram equalization algorithm as the test case, we compare the execution times of the original and optimized programs. The experimental results show that the more pixels the image has, the greater the benefit of the optimization.