Population stratification is a problem in genetic association studies because it can highlight loci that underlie population structure rather than disease-related loci. Principal component analysis (PCA) has been shown to be an effective way to correct for population stratification. However, the conventional PCA algorithm is time-consuming when applied to large datasets. We developed a graphics processing unit (GPU)-based PCA software package named SHEsisPCA (http://analysis.bio-x.cn/SHEsisMain.htm) that is highly parallel, reaching a maximum speedup of more than 100-fold relative to its CPU version. A clustering algorithm based on X-means was also implemented to detect population subgroups and to obtain matched cases and controls, thereby reducing genomic inflation and increasing statistical power. A study of both simulated and real datasets showed that SHEsisPCA ran at an extremely high speed with almost no loss of accuracy. Therefore, SHEsisPCA can help correct for population stratification much more efficiently than conventional CPU-based algorithms.
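To make the PCA step concrete, the following is a minimal CPU-only sketch in Python with NumPy. It is not SHEsisPCA's implementation: the function names, the per-SNP standardization by the binomial standard deviation, and the simulated two-population data are all assumptions chosen for illustration. It shows how the leading principal components of a genotype matrix can expose population structure.

```python
import numpy as np

def genotype_pcs(genotypes, n_components=2):
    """Top principal components of an (individuals x SNPs) matrix coded 0/1/2.

    Hypothetical helper for illustration; not part of SHEsisPCA.
    """
    g = np.asarray(genotypes, dtype=float)
    # Estimated allele frequency of each SNP.
    p = g.mean(axis=0) / 2.0
    # Drop monomorphic SNPs, which carry no information and would divide by zero.
    keep = (p > 0.0) & (p < 1.0)
    g, p = g[:, keep], p[keep]
    # Center each SNP and scale it by its binomial standard deviation (assumed normalization).
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    # Thin SVD; left singular vectors times singular values give the PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def simulate_two_populations(n_per_pop=50, n_snps=500, seed=0):
    """Simulate genotypes for two populations with shifted allele frequencies (toy data)."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    labels = np.repeat([0, 1], n_per_pop)
    return np.vstack([pop_a, pop_b]), labels

genotypes, labels = simulate_two_populations()
pcs = genotype_pcs(genotypes, n_components=2)
# Compare the mean PC1 score of each simulated population.
print(pcs[labels == 0, 0].mean(), pcs[labels == 1, 0].mean())
```

In an association analysis, scores like these would typically be included as covariates or used to group individuals into subpopulations. The GPU parallelization and the X-means clustering described above are not reproduced in this sketch.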