针对气相色谱-质谱联用(GC-MS)数据处理过程复杂且计算量大、处理时间过长而严重拖延实验进度的问题,以多样本保留时间对齐为例,设计了基于分布式平台Sector/Sphere的GC-MS数据处理并行框架,实现了多样本并行对齐算法。首先分布式计算所有样本的相似度矩阵;然后依据层次聚类原理将原样本集划分为小样本集,分布式对齐各小样本集内部的样本;最后以各小样本集的平均样本作为对齐依据合并各样本集的对齐结果。实验结果表明:多样本并行对齐算法的错误率为2.9%,由4台PC组成的集群处理大量样本时,最高加速比达到3.29;能够在保证较高正确率的前提下提升计算速度,解决处理时间过长的问题。
Processing gas chromatography-mass spectrometry (GC-MS) data is complex and computationally intensive, and the long processing time seriously delays experimental progress. To address this problem, taking the retention time alignment of multiple samples as an example, a parallel framework for processing GC-MS data was designed on the distributed platform Sector/Sphere, and a parallel multi-sample alignment algorithm was implemented. First, the similarity matrix of all samples is computed in a distributed manner. Then, the original sample set is divided into small sample sets according to hierarchical clustering, and the samples within each small set are aligned in a distributed manner. Finally, the alignment results of the small sets are merged, using the average sample of each set as the alignment reference. The experimental results show that the error rate of the parallel alignment algorithm is 2.9% and that, when a large number of samples is processed on a cluster of four PCs, the maximum speedup reaches 3.29. The algorithm therefore increases computing speed while maintaining high accuracy, and resolves the problem of excessive processing time.
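
The three stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: a local thread pool stands in for Sector/Sphere, cosine similarity stands in for the unspecified similarity measure, average linkage is used for the hierarchical clustering, and a simple integer retention-time shift (against the first sample of each set) stands in for the actual alignment method. All function names (`cosine`, `shift`, `best_shift`, `similarity_matrix`, `cluster`, `align_group`, `parallel_align`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length intensity profiles (assumed measure)."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0


def shift(profile, k):
    """Shift a profile by k points (positive = later), padding with zeros."""
    n = len(profile)
    if k >= 0:
        return [0.0] * k + profile[: n - k]
    return profile[-k:] + [0.0] * (-k)


def best_shift(profile, reference, max_shift=5):
    """Integer shift in [-max_shift, max_shift] that best matches the reference."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda k: cosine(shift(profile, k), reference))


def similarity_matrix(samples, workers=4):
    """Stage 1: pairwise similarities, computed by a pool of workers."""
    n = len(samples)
    sim = [[1.0] * n for _ in range(n)]
    pairs = list(combinations(range(n), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(lambda p: cosine(samples[p[0]], samples[p[1]]), pairs)
        for (i, j), s in zip(pairs, values):
            sim[i][j] = sim[j][i] = s
    return sim


def cluster(sim, k):
    """Stage 2a: average-linkage agglomerative clustering down to k sets."""
    groups = [[i] for i in range(len(sim))]

    def linkage(g, h):
        return sum(sim[i][j] for i in g for j in h) / (len(g) * len(h))

    while len(groups) > k:
        a, b = max(combinations(range(len(groups)), 2),
                   key=lambda ab: linkage(groups[ab[0]], groups[ab[1]]))
        merged = groups.pop(b)      # b > a, so index a stays valid
        groups[a].extend(merged)
    return groups


def mean_profile(profiles):
    """Point-wise average of several equal-length profiles."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def align_group(samples, members):
    """Stage 2b: align every member of one set to the set's first sample."""
    ref = samples[members[0]]
    return {i: shift(samples[i], best_shift(samples[i], ref)) for i in members}


def parallel_align(samples, k=2, workers=4):
    groups = cluster(similarity_matrix(samples, workers), k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        aligned = list(pool.map(lambda g: align_group(samples, g), groups))
    # Stage 3: merge the sets by aligning each set's average sample
    # to the average sample of the first set.
    means = [mean_profile(list(g.values())) for g in aligned]
    result = {}
    for group, mean in zip(aligned, means):
        offset = best_shift(mean, means[0])
        for i, profile in group.items():
            result[i] = shift(profile, offset)
    return [result[i] for i in range(len(samples))]
```

A call such as `parallel_align(samples, k=2)` takes a list of equal-length intensity profiles and returns them in the original order after alignment. Because each small set is aligned independently, the work in stages 1 and 2 can be spread across workers, and only the per-set average samples need to be compared in the final merge.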