As demand for high-performance computing grows, on-chip many-core processors are becoming the direction of future processor architecture. The fast Fourier transform (FFT), an important application in high-performance computing, places heavy demands on both computing power and communication bandwidth. Implementing an efficient and scalable FFT algorithm on a many-core platform is therefore a challenge shared by algorithm designers and architecture designers. This paper optimizes and evaluates a 1-D FFT algorithm on the Godson-T many-core processor. Using optimizations such as hiding the matrix transpose and overlapping computation with communication, the optimized 1-D FFT achieves more than a threefold performance improvement while reducing L2 cache storage overhead by almost one third. An experimental analysis of on-chip network congestion further shows that, for memory-bandwidth-bound applications such as FFT, increasing the access bandwidth of the L2 cache relieves the pressure that bursts of reads and writes place on the on-chip network and the L2 cache, which further improves program performance and scalability.
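To make the idea of a hidden matrix transpose concrete, the following is a minimal sketch of the standard transpose-based (four-step) decomposition of a 1-D FFT. It is not the Godson-T implementation, whose details are not given here. The function names `dft` and `four_step_fft` and the parameters `rows` and `cols` are hypothetical, and the naive `dft` helper merely stands in for an optimized small-FFT kernel. The input of length `rows * cols` is viewed as a row-major matrix; instead of transposing it explicitly, the sketch reads columns with a stride and scatters the results to their final positions.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT, a stand-in for an optimized small-FFT kernel."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, rows, cols):
    """1-D DFT of x (length rows * cols) via column DFTs, twiddles, row DFTs.

    x is treated as a row-major rows-by-cols matrix, M[a][b] = x[cols*a + b].
    No explicit transpose is performed: columns are read with stride `cols`
    and the output is scattered to index p + rows*q.
    """
    n = rows * cols
    if len(x) != n:
        raise ValueError("len(x) must equal rows * cols")

    y = [[0j] * cols for _ in range(rows)]
    for b in range(cols):
        # Step 1: length-`rows` DFT down column b, read with a stride.
        col = dft(x[b::cols])
        for p in range(rows):
            # Step 2: multiply by the twiddle factor exp(-2*pi*i*b*p/n).
            y[p][b] = col[p] * cmath.exp(-2j * cmath.pi * b * p / n)

    out = [0j] * n
    for p in range(rows):
        # Step 3: length-`cols` DFT along row p.
        row = dft(y[p])
        for q in range(cols):
            # Scatter to the final position; this replaces the last transpose.
            out[p + rows * q] = row[q]
    return out
```

On a many-core machine the column and row transforms are independent of one another and could be distributed across cores, which is where overlapping data movement with computation would come into play; the sketch above is sequential and only illustrates the indexing.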