谱聚类是数据挖掘领域最常用的聚类算法之一,但对于如何利用多核CPU与资源有限的众核加速器设计并实现一个在异构单节点上能够处理大规模数据集的高效谱聚类算法,目前尚无理想的解决方案.PSCH(parallel spectral clustering for hybrids)算法是专为CPU-GPU异构计算环境设计的并行T近邻(T-nearest-neighbors,TNN)谱聚类算法,通过分块计算相似性矩阵打破了GPU设备内存的限制,所能处理的数据集规模仅受限于CPU主存的容量.PSCH算法中使用CUDA设计实现双缓冲轮转4段流水机制,通过重叠计算与传输在打破存储瓶颈的同时保证了高计算性能.PSCH算法采用隐式重启动Lanczos方法(implicitly restarted Lanczos method,IRIM)在异构硬件上计算稀疏特征矩阵的特征分解,减轻了特征分解步骤的计算瓶颈.PSCH算法在配有一块GTX 480GPU的单节点上能够对百万以上规模的数据集进行聚类,并对实验中的4个数据集取得了相对于使用16进程的MPI并行谱聚类PSC算法2.0~4.5倍的性能.
Spectral clustering is one of the most popular clustering algorithms in the data mining field.However,this algorithm suffers from the storage and computational bottlenecks heavily when dealing with large-scale datasets.Current work focuses on improving the spectral clustering on both algorithm and implementation levels.But how to design an efficient spectral clustering algorithm,which can handle million scale datasets on a single node with multicore CPU and manycore accelerators,is still an unsolved problem.A parallel spectral clustering using T-nearest-neighbors(TNN)and its implementation for CPU-GPU heterogeneous computing environment,named parallel spectral clustering for hybrids(PSCH),is proposed in this paper.It breaks the GPU device memory limitation by partitioning the TNN similarity matrix into blocks,so the dataset scale only subjects to the size of the host memory.In PSCH,the 4-stage pipeline mechanism with dual rotating buffers is designed to compute the TNN similarity matrix using CUDA,which keeps all the CPU,GPU,and PCIe bus busy to achieve high performance gains while breaking the device memory limitation.The implicitly restarted Lanczos method(IRIM)on GPU is employed for the eigen-decomposition of the sparse TNN similarity matrix,alleviating the computational bottleneck of the eigensolver.The results show that PSCH is highly-efficient at exploring the GPU memory bandwidth and hybrid CPU-GPU computation power.PSCH is able to cluster million scale datasets on a single node equipped with one GTX 480 GPU and achieve 2.0~4.5times performance gains compared with the MPI parallel spectral clustering implementation PSC using 16 processes for 4datasets.